IDENTIFICATION OF GENES AND THEIR REGULATION THAT DETERMINE A
PHENOTYPE: A SYSTEMATIC APPROACH
By
Ming Wu

A DISSERTATION
Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
Computer Science
2012

ABSTRACT
IDENTIFICATION OF GENES AND THEIR REGULATION THAT
DETERMINE A PHENOTYPE: A SYSTEMATIC APPROACH
By
Ming Wu
Research in computational systems biology focuses on establishing the complex relationship and interactions between genes and how they work together to render a particular phenotype. This involves the development and application of systematic approaches to study
the biological regulation in the context of a network in which genes are regulating each
other. Our research aim to develop novel approaches to identify genes and their regulation
that determine a phenotype, which involves the reverse engineering of regulatory mechanisms
through identiﬁcation of condition speciﬁc genes and interactions, as well as the systematical
modeling and simulation to reconstruct context-dependent regulatory networks.
Chapter 1 introduces the fundamental approaches in systems biology. Data mining techniques have been developed to identify genes and interactions from gene expression data,
while systems modeling integrate current knowledge to develop a functional context to address the complexity that arises in biological systems. We provide examples to demonstrate
the practical aspects and biological relevance of the methodologies. Chapter 2 introduces
and discusses the multi-layer approach that is able to reconstruct condition-speciﬁc genes
and their regulation through an integrative analysis of large scale information of gene expression, protein interaction and transcriptional regulation. In Chapter 3 we explore a dynamic
feature of gene network— the switch-like behavior, wherein we show that gene switches have
speciﬁc pattern of gene expression which can be uncover by mining microarray data. This
study demonstrates that one can capitalize on genome-wide expression proﬁling to capture

dynamic properties of a complex network, thereby predicting gene switches that could be
important for a phenotype and can participate in cell fate decision. In Chapter 4 the cancer
phenotype is studied using systems modeling of the human metabolic network. We develop
a novel approach to simulate context dependent metabolic states that upon perturbation of
the gene(s) that modulate metabolic functions, can determine whether the gene is involved in
confering a phenotype. The approach is then applied to predict therapeutic microRNAs for
human hepatocellular cancer. Chapter 5 provides a brief summary of the implications of the
research towards a systematic understanding of gene network as well as a future perspective
of the ﬁeld.

ACKNOWLEDGMENT

First I would like to thank my advisor Dr.Christina Chan, and other members in the
research group. I would like to thank members in my guidance committee: Dr. Rong Jin,
Dr. Yuehua Cui, and Dr. C. Titus Brown, and my collaborates Dr. S. Patrick Walton, Li
Liu, Amanda Portis Malefyt, HyunJu Cho, Dr. Xuerui Yang, Dr. Linxia Zhang, and Xiao
Pan. Finally I am very grateful for the great support from my wife and my Mom during the
PhD study.

iv

TABLE OF CONTENTS

List of Tables . . . . . . . . . . . . . . .
List of Figures

. . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

vii

. . . . . . . . . . . . . . . . . . viii

Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.1 The paradigm shift and computational systems biology . . . . . . . . . . . .
1.2 Learning from data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.1 Current approaches to identify genes/biomarkers based on expression
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.2 Current approaches to identify condition-dependent gene interactions
1.2.3 The reconstruction of transcription regulatory network . . . . . . . .
1.3 Modeling based on mechanism . . . . . . . . . . . . . . . . . . . . . . . . . .
1.3.1 Modeling of gene network: An overview of systems theory . . . . . .
1.3.2 Example: discrete dynamic modeling of insulin signaling in liver cells

5
10
13
34
35
39

Chapter 2
2.1
2.2

2.3

2.4

2.5

Identiﬁcation of condition speciﬁc genes and interactions — a
multi-layer approach . . . . . . . . . . . . . . . . . . . . . . . . . .
The problem in current approaches to identify speciﬁc genes and interactions
Basic ideas to improve the speciﬁcity of prediction . . . . . . . . . . . . . . .
2.2.1 Incorporating diverse conditions for the identiﬁcation of genes . . . .
2.2.2 Integrating multiple conditions to identify regulatory relationships . .
The multi-layer approach and the proof-of-concept applications . . . . . . . .
2.3.1 Layer I. Identiﬁcation of candidate genes . . . . . . . . . . . . . . . .
2.3.2 Layer II. Identiﬁcation of the potential gene regulatory relationships .
2.3.3 Layer III. Inference of TF activity and transcriptional regulation . . .
Applications on human breast cancer . . . . . . . . . . . . . . . . . . . . . .
2.4.1 Layer I: Genes identiﬁed for breast cancer . . . . . . . . . . . . . . .
2.4.2 Layer II: network for ER positive breast cancer . . . . . . . . . . . .
2.4.3 Layer II and III: the Trop2 network . . . . . . . . . . . . . . . . . . .
Comparison of computational approaches to reconstruct gene network . . . .

1
2
4

Chapter 3

Identiﬁcation of novel targets by exploring gene switches . . .
v

44
45
47
48
51
52
53
59
62
65
65
65
70
75
78

3.1
3.2

3.3

Gene switches play essential role in cell fate decision, and could be good
biomarkers and targets for cancer . . . . . . . . . . . . . . . . . . . . . . . .
How to identify gene switches . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2.1 Simulation of kinetic models of gene switch . . . . . . . . . . . . . . .
3.2.2 Data mining to identify gene switches . . . . . . . . . . . . . . . . . .
3.2.3 A Proof-of-Concept application of the E2F-Rb network . . . . . . . .
3.2.4 A Proof-of-Concept application to Yeast microarray data . . . . . . .
Identify characteristic signatures of human breast cancer . . . . . . . . . . .

79
82
83
85
88
91
95

Chapter 4 Identiﬁcation of genes (microRNAs) that determine a phenotype107
4.1 Systems modeling of a metabolic state . . . . . . . . . . . . . . . . . . . . . 109
4.1.1 Reconstruction of global human metabolic network . . . . . . . . . . 110
4.1.2 Modeling and simulation based on human metabolic network . . . . . 111
4.2 Prediction of the metabolic state-change upon perturbations . . . . . . . . . 114
4.2.1 Reconstructing context-dependent metabolic network . . . . . . . . . 114
4.2.2 Simulating phenotypes based on metabolic network model . . . . . . 117
4.3 A novel approach to reconstructing context-dependent networks . . . . . . . 118
4.4 Prediction of therapeutic microRNA based on condition speciﬁc metabolic
network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.4.1 Prediction of the metabolic state in liver cancer cells upon perturbation
of gene expression induced by miRNAs . . . . . . . . . . . . . . . . . 125
Chapter 5

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

Bibliography . . . . . . . . . . . . . . .

vi

. . . . . . . . . . . . . . . . . . . 145

LIST OF TABLES

Table 1.1

The average correlation between the expression level of the transcription factors and their target genes . . . . . . . . . . . . . . . . . . .

18

Table 2.1

Computational approaches for gene network reconstruction . . . . .

77

Table 4.1

The essential metabolic enzymes for human liver cancer that can be
targeted by miRNAs . . . . . . . . . . . . . . . . . . . . . . . . . . 133

Table 4.2

The 5 miRNAs ranked at the top in our analysis and their potential
target genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

vii

LIST OF FIGURES

Figure 1.1

Incorporation of information from multiple sources . . . . . . . . . .

22

Figure 1.2

Schematic representations of diﬀerent models to estimate activity of
transcription factors . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

Distributions of the correlation achieved by three diﬀerent TF activity
measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

27

Box plots of the average correlation between activity of the transcription factors and the expression of the target genes . . . . . . . . . .

28

Figure 1.5

Simulations of the combinatorial transcription regulation . . . . . .

30

Figure 1.6

Examples of combinatorial transcription regulation in expression data 33

Figure 2.1

Proof-of-concept example of applying Relief on multiple conditions .

55

Figure 2.2

Comparing multiple conditions in yeast data . . . . . . . . . . . . .

57

Figure 2.3

Application of Relief on the integrated dataset with multiple conditions 58

Figure 2.4

The ROC curves for inferring TF-gene regulatory relationships . . .

61

Figure 2.5

Network reconstruction of the GAL pathway . . . . . . . . . . . . .

64

Figure 2.6

The regulatory network for ESR1 . . . . . . . . . . . . . . . . . . .

67

Figure 2.7

The mRNA-fold change of Trop2 upon FI treatment . . . . . . . . .

71

Figure 2.8

The regulatory network for TACSTD2 . . . . . . . . . . . . . . . . .

72

Figure 1.3

Figure 1.4

viii

Figure 2.9

The TROP2 mRNA expression levels in diﬀerent cell types . . . . .

74

Figure 3.1

Dynamics of gene switches and bimodality in their expression proﬁles

80

Figure 3.2

Schematic representation of a phase plane of a gene switch . . . . .

86

Figure 3.3

Proof-of-concept example: simulation of the E2F-RB network . . . .

90

Figure 3.4

Mining approach to identify switches in the E2F-RB network . . . .

92

Figure 3.5

A proof-of-concept application on the yeast dataset . . . . . . . . .

93

Figure 3.6

Identiﬁcation of potential gene switches for breast cancer . . . . . .

97

Figure 3.7

The expression proﬁles of Trop2 in breast cancer . . . . . . . . . . . 100

Figure 3.8

A switch-like behavior in Trop2 expression . . . . . . . . . . . . . . 102

Figure 4.1

A pipeline for systems biology applications of human metabolic network115

Figure 4.2

Reconstructing context speciﬁc metabolic network . . . . . . . . . . 123

Figure 4.3

A pipeline for predicting therapeutic miRNAs for human liver cancer 127

Figure 4.4

The ROC curve of the prediction based on a test-set . . . . . . . . . 129

Figure 4.5

The ﬂux change in response to miRNA perturbations on the reaction
catalyzed by lactate dehydrogenase . . . . . . . . . . . . . . . . . . 132

Figure 4.6

The metabolic pathways and processes that could be aﬀected by the
essential enzymes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

ix

Chapter 1
Introduction
A central question in biological science is the gene(genotype)-phenotype relationship, i.e.
how do the genes and their regulation determine a phenotype.
The genotype used to deﬁne by a single genetic trait in Mendel’s experiments. This concept has been extended to a large variety of genetic endowment of living organisms at the
molecular level, involving gene and gene network upon genetic and epi-genetic regulations.
In genetics, the genotype could be mutants (e.g. SNPs), insertion/deletions, or duplications
in a genomic region, or some modiﬁcations on the chromatin. In cell biology, as we focus
on throughout this dissertation, a particular genotype could be a signaling molecule, a transcription factor, a pathway, or a gene network involving molecules that can aﬀect a cellular
process.
The phenotype represents the biological observations to be investigated. Depending on
the research question and the experimental design, a phenotype could be a particular cell
type, a decision of cell fate, a biological process, or some physiological or disease states.
Research in molecular biology focuses on establishing the relationships between genes,
interactions of genes and their functions in regulating the processes by which cells respond
1

to external or internal signals to determine a phenotype, i.e. to answer the question: what is
the underlying molecular mechanism that governs a particular state of a biological system?
The question has been studied previously by the identifying and analyzing the function
of individual genes and proteins. However, the recent development of high-throughput techniques has driven the need to ﬁnd global changes at the “omics” level, in order to elucidate
how the genes interact to regulate biological processes. The ability to routinely study thousands of genes’ expression has shifted the paradigm in the biological research community,
suggesting that genes and their interactions should be evaluated in the context of the whole
network. To gain a better understanding of the complexity of biological regulation has raised
new challenges in analyzing the data, designing the experiments to validate these analysis
and the processing of the information. Thus, computational and systems biology approaches
are rapidly being developed to analyze and integrate omics data to obtain a coherent and
mechanistic snapshot of cellular regulation.

1.1

The paradigm shift and computational systems biology

Since the advent of molecular cell biology, researchers have studied biological phenomenon
mainly by analyzing the function(s) of individual genes and proteins, and the change(s) they
exhibit in diseased states. This reductionist approach helped discover many of the underlying biological principles [1]. However, researchers subsequently found that the relationship
between genotype and phenotype is more complicated then can be ascribed to a change
in a single gene [2], and the behavior of the diﬀerent components in the biological system
2

cannot be captured in isolation. These observations, together with the recent availability
of ’omics’ data, have revolutionized the previous view of single gene-phenotype correlation
by demonstrating the importance of the inter-relationships between genes, as Linus Pauling
said “Life is a relationship among molecules and not a property of any molecule”. This has
intensiﬁed the investigation of protein function in the context of complex biological systems,
and initiated a more systematic perspective of biological processes [3].
In the last decades biologists have become increasingly aware of the importance of functional context, the community has been enthusiastic about the paradigm shift from cataloguing molecular characterizations to understanding the functional activities of genes and
proteins speciﬁc to a phenotype. More and more experiments are guided by models that
serve as a basis for generating hypothesis and interpreting results (e.g. regulatory schemes
or pathway diagrams). Nevertheless, biological processes consist of many interacting components, exceeding the human capacity to systematically analyze them, thus requiring computational methods to reduce the complexity and thereby enhance their accessibility [4].
Thus, a central idea in computational systems biology is to construct mathematical formulation of data-driven or hypothesis-generating models to help reveal the function of biological
systems [2].
Computational systems biology gathers gene expression, interaction, or perturbation data
to build a speciﬁc network that regulates a biological process, and studies how the design
features of the network specify biological decisions. One of the main focuses has been predominantly on analyzing expression data to understand the regulation of gene expression
and its functional activity for a phenotype, wherein a “systems modeling” or “systematic
approach” is to address the following two questions:
3

• How does one identify the genes and interactions that are speciﬁc for a phenotype/condition,
given the large amount of gene expression changes that can be measured in highthroughput techniques?

• How do genes interact with each other to determine a phenotype?

The two questions are attempting to systematically establish the relationship between
gene expression and phenotype from two diﬀerent aspects, i.e., learning from data to identify
important genes, and modeling based on mechanism to understand a phenotype. Therefore,
there are two distinct ﬁelds of computational modeling in systems biology that are being
developed to address these two questions: data mining and in-silico simulations. Data
mining, or “top-down” modeling, aims at discovering patterns from large amount of highthroughput biological data to help generate hypotheses and make predictions, by learning
data driven models to establish a statistical relationships between genotype and phenotype.
Alternatively, simulation-based analysis, or “bottom-up” systems modeling tests hypotheses
with in-silico experiments and predict network behavior by integrating current knowledge
of molecular mechanisms.
In this chapter I introduce these two diﬀerent computational modeling approaches in
systems biology and explain the importance of identifying the genes and interactions that
are speciﬁc to a phenotype.

1.2

Learning from data

Data mining, or “top-down” modeling, learns data driven models to establish statistical
relationships between genotype and phenotype. Data mining techniques have been widely
4

applied on gene expression data to identify genes and interactions that are important to a
phenotype.

1.2.1

Current approaches to identify genes/biomarkers based on
expression data

Identiﬁcation of important genes based on gene expression microarray data, also known as
“biomarker discovery”, “analysis of diﬀerentially expressed genes”, is essentially a feature selection problem in data mining. The gene expression data is collected from samples of diﬀerent phenotypes, or, diﬀerent “annotations”. The expression levels across all samples of each
gene are investigated to discover genes that are relevant to particular phenotype/conditions,
or other target annotations. These genes can be used as biomarkers that could be valuable
in disease prediction, drug target discovery, or further analysis to reconstruct the regulatory
network that underlies the manifestation of the phenotype.
Problem statement: Given a gene expression matrix M obtained from a series of preprocessing steps involving transformation and normalization of the raw data, in which there
are m genes (features) across n samples, m >> n in most cases, and xij is the expression
level of gene i and sample j. The problem becomes how to identify the genes (features) that
govern any one of the phenotypes in the samples.
In the ﬁeld of data mining and machine learning, feature selection usually aims to ﬁnd
the features that provide better classiﬁcation, thus the conventional categorization of feature
selection techniques is based on the relationships between the selection approach and the
classiﬁcation model: [5, 6]
• Filter methods assess the relevance of features by analyzing the data matrix, which is
5

a preprocessing step, independent of the classiﬁer, that will be used in the subsequent
learning problem.
• Wrapper methods assess the features with classiﬁers, thereby the features are evaluated
based on a particular classiﬁer, e.g. SVM. The methods attempt to select the most
discriminant subset of features by minimizing the prediction error of the classiﬁer
chosen.
• Embedded methods search in the combined space of feature subsets and the model
hypothesis set in the classiﬁers.
However, in the application of feature selection to biology to identify speciﬁc genes, the
biological question asked may not be necessarily based on the classiﬁcation accuracy of the
phenotypes, but in most cases it is to understand the phenotypes, e.g.:
• What is the most important diﬀerence between two phenotypes in terms of gene expression?
• What are the characteristic genes for cancer?
Thus ﬁlter approaches are the most widely applied in the literature to identify diﬀerentially expressed genes [7]. Univariate approaches dominate the ﬁeld since the output is
a ranking of the genes that are intuitive and easy to interpret and validate. Multivariate
approaches have been developed to account for potential interactions between genes’ expression. Filter approaches in feature selection can be separate into two classes [7]:
• Ranking approaches: deﬁne a certain scoring function to estimate the relevance indices
(scores) for each gene and select genes that are ranked at the top.
6

• Space searching approaches: features are selected by optimization of a predeﬁned cost
function.

Ranking approaches Ranking approach ﬁrst deﬁne a scoring function S(x) to quantify
the diﬀerence in gene expression between diﬀerent groups, score the genes and rank them
based on the scores estimated. Usually a higher score indicates an important gene that is
expressed diﬀerentially. The choices for the scoring functions include parametric and nonparametric models.
The approaches that are mostly widely used are parametric models which assumes
a Gaussian distribution. One could use t-statistics, fold change or ANOVA to quantify
the divergence or distance between the sample distributions from diﬀerent groups (phenotypes/conditions) for a given gene. For example, the commonly used scoring function for
gene expression [8] that applies Welch t-statistics is:

S=

x1 − x2
¯
¯
2
2
σx /n1 + σx /n2
1
2

(1.1)

in which x1 , x2 are the gene expression in the samples from two groups/phenotypes, and
n1 ,n2 are the number of samples in these two groups. There are modiﬁcations of the standard
t-test to address the potential problem of the small sample size and noise in the array data,
either by adjusting the estimation of variances or applying prior degrees of freedom and
variances in a Bayesian framework [9, 10].
Non-parametric approaches do not assume a target distribution but compares and combines the ranking of the gene expression level across diﬀerent samples. In these approaches,
a “ranking” is assigned to samples in diﬀerent groups based on one gene’s expression level.
7

For each gene, the score is computed by taking the sum or average of the ranking of those
samples in a particular group. The scoring functions for the Wilcoxon rank sum [11] and
Rank product [12] are shown as follows:
k
S=

Rj , k = min (n1 , n2 )

(1.2)

j=1
j=1
Rj 1/n )
S=(
n

(1.3)

in which Rj is the ranking of sample j based on the expression level of the gene being
investigated.
There are other types of scoring functions based on information theory to account for
more complex relationships (e.g. non-linear, non-monotonous) between gene expression and
phenotypes , e.g.:
S=
xi c

P (xi , c) log

P (xi , c)
P (xi )P (c)

(1.4)

in which P (xi , c) is the joint probability gene xi and group/phenotype c.
After scoring and ranking the genes, in many applications researchers might set a threshold to select genes. A threshold could be a pre-deﬁned fold-change diﬀerence between gene
expression in the diﬀerent groups (2-fold change is usually used as a threshold in earlier studies), or based on the selection of genes that minimize the training sample mis-classiﬁcation
rate if the goal is to classify the phenotypes, or the statistical signiﬁcance of the score. The
statistical signiﬁcance tests estimate the possibility that a particular score would have been
obtained by chance. P-value is commonly used, which represents the probability of achieving
a test statistics that is at least as extreme as the one observed in the data. Although there
8

are standard asymptotic methods based on a pre-deﬁned distribution of the test statistics,
gene expression data might not follow the distribution and the sample size of the data is
usually not suﬃciently large for such analysis. Therefore p-values are estimated empirically
by applying permutation test on the data, i.e. running tests (computing scores) which are
identical to the original scoring approaches but the target feature (e.g. the grouping of
samples) is permuted diﬀerently for each test, whereby a null distribution of the scores can
be generated to estimate p-values. Another issue of concerned in these tests is the multiple hypothesis testing problem [13] due to the large number of genes that are simultaneously
tested. The threshold should be adjusted to control the error rate (e.g. type I error: the false
positive rate). P-value with Bonferroni correlation or False Discovery Rate are commonly
applied to address the problem [14].

Space search approaches Instead of scoring and ranking individual genes, space search
approaches aim to ﬁnd the subgroup of features (genes) which optimize a pre-deﬁned cost
function. Cost functions in data mining are usually deﬁned to maximize the information
content of the features while minimizing the redundancy, e.g.:

I(xi , c)
M IQ = max
i∈Ωm (1/|Ωs |) j I(xi , xj )

(1.5)

The function computes the Mutual information quotient [15], in which the I(xi , c) is the
mutual information between gene xi and class label/phenotype c, and I(xi , xj ) is the mutual
information between the gene candidate xi ∈ Ωm to be added into the feature subset and
the genes xj ∈ Ωs that have been included in the subset. Ωm is the entire gene set and Ωs
is the current selection of a subset of features/genes.
9

Although these feature selection approaches are widely applied in the identiﬁcation of
genes that are speciﬁc for a phenotype, there are issues in the assumptions implicit in these
approaches that are inconsistent with the biological processes. Many ranking approaches
assume features are independent, but the genes that regulate each other and their expression
should not all be independent. Space search approaches account for potential relationships
between features but biologically the “redundant” genes should not be removed if they are
also speciﬁc or relevant to a phenotype. We therefore suggest applying the Relief method
that optimizes a distance function to seperate samples from diﬀerent groups and results in
a ranking of genes, which can address some of these problems. The details of our approach
are discussed in Chapter 2.

1.2.2

Current approaches to identify condition-dependent gene interactions

To uncover the complex gene regulatory network that renders a phenotype or disease state
of a biological system, not only does one need to identify speciﬁc genes but also identify the
interactions that are speciﬁcally functioning to render the phenotype. These interactions can
be discovered based on gene expression data to suggest some speciﬁc association, regulatory
relationship or functional relationship between genes. This step could be consider another
feature selection problem, i.e. given all possible interactions, e.g. the complete graph connecting all gene pairs, or based on some known biological network, how does one select the
interactions that are speciﬁc to a phenotype?
Similar to the ﬁlter approaches introduced in the last section to identify speciﬁc genes,
the methods to identify speciﬁc interactions can either deﬁne a scoring function to rank
10

interactions, or search the space of possible network structures that can best explain the
(in)dependencies within the expression proﬁles.
A scoring function that could be used to quantify the potential association between two
gene expression proﬁles is Pearson correlation:

cov(X, Y )

R(X, Y ) =

var(X)var(Y )

(1.6)

for gene X and gene Y : Nevertheless the dependencies between two genes’ expression could
be more complex than a linear relationship, thus researchers have used mutual information
to account for all diﬀerent types of dependencies:

P (x, y) log

I(X, Y ) =
x∈X y∈Y

P (x, y)
P (x)P (y)

(1.7)

The probabilities in the equation are meant to be estimated from frequency counts of discrete
variables, however gene expression measurements are usually continuous variables. One
may discretize the variables or approximate the probabilistic densities with non-parametric
approaches such as Parzen windows [16].
After calculating all the pair-wise scores one needs to select the features/interactions that
are “important”. Algorithms have been developed to reduce the number of candidates and
thereby identify important gene interactions, including:
• MRNET (maximum relevance/minimum redundancy, MRMR) [17] removes redundant
genes connecting to a target gene. For example, gene X1 and X2 both have high mutual
information with gene Y , thus one could connect X1 − Y and X2 − Y . Nevertheless, if
X1 and X2 are correlated, then they are “redundant” with respect to their connections
11

to Y , thus the algorithm will delete one of the connections to Y . When there are many
other genes that can be connected to Y , the algorithm selects from the least redundant
set of genes the ones that have the largest mutual information with gene Y .

• CLR(Context Likelihood of Relatedness) [18] suggests that the connections between
diﬀerent genes should apply diﬀerent thresholds. CLR computes mutual information
M I(X, Y ) between genes X and Y , and generates a distribution of mutual information to compute z-scores to evaluate the M I(X, Y ), i.e. to estimate the M I(X, Y )
relative to all possible M Is involving gene X or Y . In other words, is M I(X, Y ) much
higher than most of the other potential “background interactions” to gene X and gene
Y (M I(X, ∗) and M I(Y, ∗) in which ∗ represents any other gene that could interact
with X or Y. Finally, whether or not gene X and gene Y should be connected is determined by a hypothesis test of ZXY which determines whether the mutual information
between X and Y is signiﬁcantly higher than the background.

• ARACNE(Algorithm for the Reconstruction of Accurate Cellular Networks) [19] attempts to mathematically deﬁne and separate the direct from the indirect eﬀects. If
one knows the expression of genes X and Y are correlated, genes Y and Z are correlated as well, one would expect that X and Z could be correlated. ARACNE thus
compares the triplets: M I(X, Y ), M I(Y, Z), M I(X, Z) in order to remove indirect
eﬀects. Based on intuition that an indirect eﬀect could be smaller than the direct
eﬀects, ARACNE deﬁnes a “Data Processing Inequality” to test all gene triplets such
that for each triplet (X, Y, Z), the connection corresponding to the lowest mutual information M I = min{M I(X, Y ), M I(Y, Z), M I(X, Z)} is elimiated if it is lower than
12

a threshold:
M I ≤ M I2 (1 − )
where M I2 is the second lowest mutual information for the triplet and the

(1.8)
is a

pre-deﬁned parameter.
Although any of these methods can eﬀectively reduce the number of candidates, the
assumption of MRNET is questionable with respect to biology, since there could be “redundant” interactions that are eﬀectively functional under a condition to provide a robust
design of a biological function (e.g. diﬀerent isozymes). CLR removes “false correlations” in
the network by eliminating promiscuous cases wherein a gene weakly co-varies with a large
number of other genes. This is more reasonable since such promiscuity can arise when the
assayed conditions are inadequately or unevenly sampled, which could be real in biological
experiments [18]. From another perspective, the many “indirect eﬀects” is a major problem
in biological studies of regulation of gene expression. ARACNE provides a heuristic deﬁnition help focus on the direct interactions, thus has raised more interest and had some success
in the community.

1.2.3

The reconstruction of transcription regulatory network

Overall, the basic idea of “reconstruction of gene regulatory network” is the same as inferring gene interactions — to identify pair-wise relationship between the genes, or more
speciﬁcally, to determine whether a gene (or its product) directly controls the expression of
another. By learning the dependencies contained within the expression proﬁles, researchers
are attempting to reconstruct a network that depicts the global regulatory network of genes.
The development and application of the theory and tools for network inference, or so-called
13

“reverse engineering” tasks, have been predominantly based on data mining/learning techniques. Many approaches mentioned above have been used in network reconstruction. There
are other approaches that are not built upon the scoring of every single interactions but
searching in the space of possible network structures to identify the best structure to explain
the data. One example is the Bayesian Network (BN) [20], which aims to determine a directed acyclic graph that can represent and simplify the joint probability of the expression
of all the genes investigated, in which descendants are independent to each other given their
parent nodes so that the joint probability are separable based on such independencies learned
from data.
There are several in-depth reviews in the literature that introduce and compare these
diﬀerent modeling schemes and more recent developments for learning [21–23], including a
primer on regression methods [24]. In the review by Margolin et al. [25] they examine the
theoretical underpinnings behind current reverse-engineering algorithms that are based on
systems control theory (e.g. linear or non-linear regression model), probabilistic graphic
learning, and information theory.
However, these reviews and literature in this area have generally been concerned about
the practical aspect of data mining, and few have paid attention to the relevance of the
methodology to the biological problem. The focus on the diﬃculties in network reconstruction has been from the computational perspective. A major discussion point has been the
limited sample size available (in most cases less than a hundred samples) for identifying
pair-wise relationships between the genes (which could be hundreds of thousands pairs).
This results in an under-determined system requiring methods for dimensionality reduction,
i.e. clustering or module/pathway analysis [21, 26, 27]. Another challenge is the large search
14

space of possible regulatory schemes, which requires either advanced optimization strategies
or a priori information to reduce the computation time [22, 28, 29]. Nevertheless, in addition
to previous improvements to the “predictive power”or “computational eﬃciency”, it is important to understand how much biological information can be appropriately extracted from
expression data to deduce the rules of gene expression/transcriptional control.
From a biological perspective, instead of studying the physical network, many of these
reverse engineering methods are actually learning the inﬂuence network and trying to interpret the inﬂuence with gene expression control (the transcriptional regulation in most cases),
which results a transcriptional regulatory network but with a interwoven mixture of true and
promiscuous, direct or indirect eﬀects [25], thereby creating a considerable divide between
the inﬂuence network that is constructed and the real biological regulatory mechanisms. This
is due largely to the limitations of the dataset itself, and the presence of multiple, unobserved
levels of regulation leading to diﬃculties in the biological interpretations and undermining
the biological signiﬁcance of these inﬂuence networks and their further applications.
The reconstruction of gene network (inferring gene interactions) is widely applied to identify transcriptional regulatory network and transcriptional regulatory relationship between
transcription factor and genes, thus we attempt to dissect the information content in the
expression data from a biological perspective, and scrutinize the biological foundations of
the computational models, and critically analyze the underlying assumptions of most in silico learning approaches applied to expression data that confounds the interpretation of the
results [25], which are:

1. Statistical dependencies exist between the transcription factors and their target genes
with respect to both their expression levels.
15

2. Measurements of the relative amount of mRNA level in the microarray data are predictive of the activity of the regulatory molecules. This assumption can be further
sub-divided into three sub-types, as follows:
• Type 1 model assumes the expression level of a transcription factor correlates
with the activity of the transcription factor. (e.g. [30, 31])
• Type 2 model estimates the activity of a transcription factor based on the behavior
of its target genes. (e.g. [27, 32, 33])
• Type 3 model assumes co-expression implies co-regulation by the same transcription factor, and estimates the existence or activity of an uncharacterized cis-motif
by clustering analysis. (e.g. [34])
Using the yeast microarray data as an example, we combine information on the yeast transcriptional regulatory network and diﬀerent data-sets and -types to examine each of these
assumptions.

The association between transcription factor and their targets To illustrate the
ﬁrst assumption, we used yeast data to characterize the information in the expression data
that is used to infer interactions at the transcriptional level. The yeast dataset contains 255
conditions from environmental stress [35] and cell cycle [36] microarray experiments. These
datasets have been widely applied in previous studies to develop novel reverse-engineering
methods [34]. Actually the yeast environmental stress response dataset has been cited 2,260
times so far, and among the 1,000 of these citations that are related to computational studies,
there are 256 papers discussion about network reconstruction (citation data is provide by
googleScholar, http://scholar.google.com), which suggests the popularity of the dataset.
16

Many of the yeast transcription factors have been studied and their cis-regulatory modules
on gene promoters across the genome have been identiﬁed and are now available in public
databases such as YEASTRACT [37, 38], thus enabling the attainment of a putative transcriptional regulatory network based on known motifs on the gene promoters collated in
YEASTRACT.
Since the fundamental assumption is that the expression level of a gene depends on its
regulators, we calculate the correlation between the expression level of the transcription
factors and that of their target genes, where the target genes of a transcription factor in the
regulatory network are identiﬁed by corresponding cis-motifs on their promoters. As shown
in Table 1.1 on page 18, the average absolute correlation coeﬃcient of the expression data,
taking the absolute value since both positive and negative correlation represents perceptible
dependencies, is about 0.08 between the transcription factors and their target genes. This
appears to be negligible, even smaller than the background with a correlation of 0.19 between
any gene pair, suggesting that it is diﬃcult to directly identify TF-gene (a transcription factor
and its target gene) pairs based on the dependencies of their expression.

17

Table 1.1: The average Spearman correlation between the expression level of the transcription factors and the expression level
of their target genes, considering the conditions when the transcriptions factors are highly expressed (higher than their own
mean level for all available conditions). “1-to-1” considers genes with only one known TF that can bind to their promoters.

18

Next, we place a “1-to-1” constraint that considers only the TF-gene pairs in which the
target gene has no other type of known eﬀectors besides the transcription factor paired to it.
That is in contrast to an “n-to-1” relationship in which many transcription factors regulate
the expression of one gene. We identify 596 “1-to-1” TF-gene pairs in the yeast network.
This constraint assumes that the target genes of these 596 pairs are not regulated by multiple
transcription factors. We found that the average correlation of the expression data of these
pairs (0.16), albeit still lower than background noise (0.19), is about two times higher than
the overall TF-gene pairs (TABLE 1),. This suggests that the combinatorial regulation of
multiple transcription factors on a target gene plays an important role in the transcriptional
regulation, thereby complicating the TF-gene relationship and the use of correlation of their
expression proﬁles for inferring regulatory networks. Moreover, the correlation between
a TF-gene pair could be increased slightly when the samples containing lowly expressing
transcription factors are removed. A rationale for doing this is that low expression level may
suggest reduced control by the regulator [39].
Overall our results demonstrate a weak correlation in the expression exists between the
transcription factors and their target genes, thus making it diﬃcult to uncover transcriptional
regulation due to the high background. The high background could possibly be due to both
direct and indirect associations between the genes in the network. The result is consistent
with previous observations [40] that only a very small proportion of transcription factors’
mRNAs are signiﬁcantly correlated with the expression level of its target genes. Our results
also indicate that combinatorial regulation contributes to the reduced correlation between
the TFs and their target genes.
Besides the combinatorial eﬀect, there are many complex features on the binding sites
19

or DNA-TF interactions that could impair the association between transcription factors and
their target genes [41]. A transcription factor may not bind to all its targets with the same
binding aﬃnity, indeed many speciﬁcally interact with only a few targets depending on other
genomic features, i.e. DNA modiﬁcations, or due to stochasticity of the binding events [42].
Diﬀerences in the sequence of and around the binding site will also aﬀect the binding aﬃnity
[43]. Thus, to determine bindings that are functional remains a challenge [44–46]. Moreover,
there are other levels of regulation that cannot be obtained from the expression data and
transcriptional regulatory network. For example, the protein-DNA interactions may depend
on other protein co-factors in order to become functional, and sometimes the binding itself
requires adaptors [41, 47]. The mRNA level of the target genes may also be regulated at the
post-transcriptional level, through the coordination of diﬀerent rates of mRNA decay [48].

Incorporation of information from multiple sources It is increasingly evident that
the network reconstruction by microarray data alone is imperfect, in part due to the limited
sample size use to infer a very complex network, but more importantly because of the limitation of the information content of microarray, in which only the mRNA expression level is
measured. Incorporating biological knowledge on diﬀerent levels of regulation (e.g. protein
interactions) could improve the results. Many recent studies on reverse engineering have
attempted to integrate these information but focus on the technical utility of multi-source
information in providing priors to limit the search space, and enable further validation or
promote more intrinsic learning models [30, 49–51]. Here we show from a biological perspective, in an intuitive but quantitative manner, an enhancement in the TF-gene correlation by
incorporating other sources of information with the yeast expression data.
We consider the regulation of possible protein co-factors on the transcription factors,
20

which have been suggested to impact the binding and activation of transcription factors.
We impose an experimentally conﬁrmed yeast protein-protein interaction (PPI) network on
our transcription regulatory network. The physical interaction data is downloaded from
SGD http://www.yeastgenome.org/, and we compute for each transcription factor the
number of its potential interacting proteins—- given that the PPI is static and context
independent. We found that among all our 1-to-1 TF-target gene pairs (correlation of 0.16),
those transcription factors with a low number of possible co-factors exhibits signiﬁcantly
better TF-gene correlation (p < 0.05, average correlation coeﬃcient as high as 0.25), as
shown in Figure 1.1 on page 22.
In Figure 1.1 on page 22, part A, 596 1-to-1 TF-target gene pairs are considered. The
data is categorized into three groups according to the number of interaction partners of
each transcription factor in the yeast protein-protein interaction network. The group “Top
10%” consists of TF-target pairs in which the transcription factor has many interactions
(more than 97 possible cofactors). The group “Bottom 10%” consists of pairs where the
transcription factor has few protein interactions (no more than 2 cofactors). The others
(80%) are in the “Middle” group. The group “Bottom 10%” exhibits signiﬁcantly better
average TF-gene correlation (p < 0.05).
Since the regulatory mechanisms depend on the dynamics of TF-gene binding, incorporating more binding information as a constraint should better reveal the TF-gene relationship.
We searched the ChIP-Chip data and found a conditional ChIP-chip dataset [52] containing
23 transcription factors under H2 O2 (0.4nM ) treatment of yeast for which we also have
microarray expression proﬁles. We therefore can deﬁne conditional TF-gene pairs using actual binding proﬁles (i.e. in the CHIP-chip data a signiﬁcant p value indicates TF binding
21

Figure 1.1: Incorporation of information from multiple sources. A: Box plots of the average
correlation between the expression level of transcription factors and the expression of their
target genes. B: Box plots of the average correlation between the expression level of transcription factors and the expression of the target genes. For interpretation of the references
to color in this and all other ﬁgures, the reader is referred to the electronic version of this
dissertation.

22

on the gene promoters), rather than the sequence-level motif analysis which indicates only
the capability of TF binding instead of actual binding events. We compare the correlation
coeﬃcient of TF-gene pairs deﬁned by cis-motifs, or by non-conditional CHIP-chip (under
normal growth condition in YPD media), with that of gene pairs deﬁned by conditional binding information, the result is shown in part B of the Figure 1.1 on page 22. since CHIP-chip
data is not available for every transcription factor, only 99 pairs have binding information
and 20 of them are (condition-speciﬁc) conditional binding. The overall gene-TF (box on the
left) describe the average correlation of all 596 pairs under all conditions. Under the H2O2
condition, the box in the middle (YDP-CHIP) use the non-conditional binding (CHIP-chip
results under normal condition, 99 pairs) to determine the average correlation of the GeneTF pairs that actually bind, and the box on the right use conditional binding CHIP-chip
data (20 pairs). The average TF-gene correlation increased signiﬁcantly when conditional
binding data is available (box on the right, p < 0.01 compared with the box on the left).
The average correlation increased signiﬁcantly (coeﬃcient > 0.3) when conditional binding
data is available, conﬁrming the importance of context-speciﬁc information.

Estimation of the transcription factors’ activity Besides the TF-target gene association, learning transcriptional regulation from expression data relies on a second assumption,
that is, the mRNA measurements in the microarray data are predictive of the activity of
the actual regulatory molecules. As in the aforementioned correlation analysis, one simply
uses the expression level of a transcription factor as the identiﬁer for its activity, which we
deﬁned as Type 1 estimation. Since many transcription factors are largely reported as being
regulated by post-transcriptional modiﬁcations, simply equating mRNA level and protein
activity has been criticized [39]. Therefore, a diﬀerent type of estimation has been suggested
23

to represent the activity of a transcription factor, which is based on the behavior of its target
genes, which we call “Type 2” estimation. Instead of the mRNA level, Type 2 model uses the
expression level of the target genes to represent the transcription factor activity. Figure 1.2
on page 25 demonstrates the basic ideas in these diﬀerent type of models: Type 1 models
rely on the expression level of the transcription factors. Type 2 models compare the genes
with the transcription factor binding site (target genes) and genes without the binding site
(background) and use the diﬀerences between expression of the target genes and the background genes to represent the activity of the transcription factor. Type 3 models assume
that target genes expression are better correlated if the transcription factor is activated, and
use the target gene correlation to represent the activation of the transcription factor.

We apply the Type 2 estimation on the yeast expression data. We use the diﬀerence in
expression between a TF’s target genes and non-target genes as its activity level for a given
condition. We then compare the 1-to-1 TF-gene correlations with the results obtained using
the Type 1 estimation. As shown in Figure 1.3 on page 27, there is a signiﬁcant increase in
the average correlation coeﬃcient when using the Type 2 estimation. Genes with fewer TF
binding sites on their promoters has less uncertainty of their regulatory mechanism, thereby
these genes may contribute more to approximating the activity of its regulators. The TF
activity inferred from target genes can then be weighted by “1/n” where n is the total
number of TFs that are able to control a particular gene. Such weighted summation version
of Type-2 model provides better correlation (as high as 0.28, see Figure 1.3 on page 27).
The weighting reduces the contribution of genes with many diﬀerent cis-modules on their
promoters; presumably they are subjected to combinatorial regulatory eﬀects. Figure 1.4 on
page 28 shows that by applying the Type 2 estimation and incorporating context-speciﬁc
24

Figure 1.2: Schematic representations of diﬀerent models to estimate activity of transcription
factors.

25

information one could reach an average TF-gene correlation coeﬃcient of 0.5.

The Type-2 method provides better estimation of TF activity without assuming mRNA
level represents TF activity, however such a good character of Type-2, does not beneﬁt
much to the network reconstruction. To demonstrate this we simulate a kinetic model
d[Gene]/dt = (1 + [T F activity]/K) − 1 (with arbitrarily deﬁned eﬀective kinetic constant
K) for the transcription regulation of a single TF on its target gene, shown in the ﬁrst case
of Figure 1.5 on page 30. If the mRNA level would represent TF activity, which may not
be the case in many real biological systems, the mechanism by which a gene is regulated
by a single TF would be uncovered using Type-1 model (in the single inhibition case, as
shown in the responsive curve of the middle plot in Figure 1.5 on page 30). Type-2 model
may have better estimation of TF activity but could be diﬃcult to identify the regulatory
mechanisms, because the information used to estimate TF activity, in Type 2 model that is
the TF-target hypothetical relationship and the target gene expression, and the information
required to predict TF-target gene relationship, are overlapped in some way, although the
two pieces of information are not all the same—the TF activity is estimated from the overall
eﬀect of its potential targets, but a TF-gene relationship is established by testing whether or
not the change of the TF activity can explain the change of the expression of the particular
gene. In addition, the Type 2 estimation assumes the inﬂuence of a transcription factor
on all its targets is equal, which does not account for the variability in binding and the
function of a transcription factor on its various targets due to adapters and co-factors [53].
Another limitation is that the Type 2 and Type 1 estimations need a priori knowledge
about the transcriptional regulatory network, and such information is less often available in
other model organisms. Thus, the Type 3 estimation, which assumes co-expression genes
26

Figure 1.3: Distributions of the correlation achieved by three diﬀerent TF activity measures.
only “1-to-1” cases are considered. Although the average correlation does not improve much
in the Type 2 model, there are more genes whose expression level is better correlated with
their TFs. Type 1: The TF activity is represented by its expression level; Type 2-TFA: The
TF activity is represented by the diﬀerence between the mean expression value of its target
genes and the mean expression value of the (other) unrelated genes; Type 2-weighted TFA:
Genes with fewer TF binding sites on their promoters contribute more to approximating
the TF activity, TFA is then weighted by “1/n” where n is the total number of TFs that
are able to control a particular gene. The probability density is calculated using the kernel
smoothing density estimate function (ksdensity) in Matlab, with a Guassian kernel. A
histogram describing the frequencies of gene expression in diﬀerent categorical bins should
show a less continuous but similar distribution.

27

Figure 1.4: Box plots of the average correlation between activity of the transcription factors
and the expression of the target genes.596 1-to-1 TF-target gene pairs are considered. The
activity of transcription factors are estimated by Type 2 model. CHIP-chip data is not available for every transcription factor, so only 99 pairs have binding information and 20 of them
are conditional-speciﬁc binding. The overall gene-TF (box on the left) describes the average
correlation of all 596 pairs under all conditions. To estimate the activity of the transcription
factors, target genes of each transcription factor are determined by yeast transcriptional regulatory network, where the TF-gene interactions are based on motif/binding site. Under the
H2O2 condition, the box in the middle (YDP-CHIP) uses the unconditional binding (CHIPchip results under normal condition) to determine both the activity of transcription factors
and the average correlation of the Gene-TF pairs that actually bind, and the box on the
right uses conditional binding CHIP-chip data. The average TF-gene correlation increased
signiﬁcantly when conditional binding data is available (box on the right, p < 0.05 compared
with the box on the left).

28

are co-regulated by the same transcription factor, has been widely implemented to predict
cis-motifs or to estimate the activity of cis-motifs by co-expression analysis.
With the Type 3 model, one could use clustering analysis ﬁrst to identify the co-expressed
genes, followed by enrichment analysis on the promoters of genes within a same cluster, to
identify functional cis elements. However, the assumption that co-expression indicates coregulation is imperfect. Since co-expression may occur in situations other than TF-gene
associations [25], such as in a signaling cascade. Thus, one should be cognizant of the
assumptions that lead to the results, and further experiments are required to validate the
cis-elements uncovered in this manner.

Combinatorial regulation of transcription factors Most of aforementioned analysis
focus on the “1-to-1” pairs whereby the genes are likely regulated by only one transcription
factor, nevertheless there are many more genes (> 80%) with multiple promoter regions
that bind diﬀerent transcription factors, thereby complicating the regulatory mechanism
through TF cooperation. Models of TF binding and gene transcription have been extensively studied experimentally on prokaryotes with small-scale quantitative measurements of
numerous perturbations on a subset of the regulatory circuits [54]. A detailed thermodynamic binding and control model has been established [55] and successfully applied to many,
highly speciﬁc regulatory models in E.coli and Drosophila [31, 56], providing a quantitative
framework for studying the combinatorial regulation of transcription factors. Researchers
are now attempting to generalize the dynamic model to automate the procedure of learning the detailed mechanisms from high-throughput data. Questions have arisen on whether
or not the information in large-scale expression data is adequate to support these detailed
mechanistic models. Since current knowledge of combinatorial regulation in real eukaryotic
29

Figure 1.5: Simulations of the combinatorial transcription regulation. The interactions between transcription factors and between transcription factor and the initiation of the target
gene transcription are modeled with kinetic equations. Response curves are simulated and
plotted. For the regulation by single transcription factor, the x-axis is the activity of the
transcription factor and the y-axis is the expression level of the target gene. Both 3-D plots
and 2-D color maps are provided for the combinatorial regulation of two transcription factors, where the x- and y-axis represent the activity of the two transcription factors and the
z-axis/color represents the expression level of the target gene. Microarray data does not
directly provide activities of transcription factors, so the rightmost plots for each regulatory
scenario instead uses estimated activity (Type-2 model) of the transcription factors, showing what proﬁles that could be obtained from microarray analysis. Numerical simulation
of kinetic model is performed using the Runge Kutta method in Matlab and the plots are
generated with customized code in Matlab.
30

cell systems is still very limited, it is hard to assess predictions of combinatorial regulation.
We try to use simulation study to show some cases of combinatorial control as well as to
what extent diﬀerent mechanisms can be revealed by analyzing expression data. To address
the combinatorial control of two transcription factors we explore the mechanistic cooperation schemes and acquire kinetic equations from previous studies of theoretical modeling
(thermodynamic binding model) [55], then generate putative gene expression proﬁles. The
simulations show diﬀerent expression proﬁles depending on the cooperation mechanisms
(Figure 1.5 on page 30). These proﬁles demonstrate that the diﬀerences in the diﬀerent
cooperation schemes are so subtle that only a few speciﬁc perturbations/conditions would
capture the distinctive features, as depicted by the narrow transition (regions with signiﬁcant
changes of colors representing target gene expression in the 2-D response surface color-maps)
in the response surface curves in Figure 1.5 on page 30.

In Figure 1.6 on page 33 we show the clearest pattern that we have seen in the yeast
data with interpolation on the combinatorial regulation of two TFs on one gene. Compared
with Figure 1.5 on page 30, we show that without independent measurements of the actual
activity of the transcription factors, as well as enough number of perturbations to cover
all possible combinations of the two TF’s activity level, the subtle features in the diﬀerent
cooperation schemes are not suﬃciently distinctive, and will be further exacerbated by the
noise and limited array data measurements. Therefore, the “top-down” approaches have
serious limitations in their ability to learn these detailed mechanisms. Unlike the “bottomup” quantitative experiments performed for small systems, the high-throughput data involves
many layers of interconnected regulations making it diﬃcult to segregate the contributions
of each TF on the expression of its target genes. As observed by Gitter et al. [57] in the
31

systematic knock-out experiments [58], the overwhelming majority of its target genes would
not be aﬀected even if a transcription factor is knocked out.

In this section, we introduce brieﬂy the current approaches to identify genes and interactions that are speciﬁc to a phenotype. To identify genes, a problem is the selection of
genes that can distinguish a phenotype, thus many ﬁlter methods in feature selection can
be applied to address the problem. To identify interactions, or “reconstruct a network”, the
general idea is to determine whether or not a gene controls the expression of another. From
a biological perspective, we used a set of yeast microarray data as a working example to evaluate the fundamental assumptions implicit in learning the transcriptional regulation from
gene expression data. We show that the detailed transcriptional mechanism is overly complex for expression data in the conventional setting (samples of phenotypes and controls)
alone to reveal. A proper incorporation of multi-source biological knowledge, especially
context-speciﬁc information, is beneﬁcial for network reconstruction. The idea has been
implemented by integrating PPI (e.g. MATISSE [59]) or P-DNA data (e.g. MINDy, MARINA [19]) to provide a “reference” biological network to identify the interactions wherein
are speciﬁcally functioning in a phenotype. Furthermore, researchers have attempted to
combine the systematic design of high-throughput experiments with these computational
approaches to achieve better reconstruction of regulatory networks. A theoretic simulation
study by Yu et al. [20] shows that Dynamic Bayesian Network is both accurate and eﬃcient
in recovering the simulated molecular network when there is a large time-series dataset with
many (25-100) time points (although it is still diﬃcult to recover regulatory relationships
when a gene have multiple regulators, as shown by our yeast analysis). Sachs et al. [60]performed systematic perturbation experiments on a small network of protein signaling, with
32

Figure 1.6: Examples of combinatorial transcription regulation in gene expression data. In
these two examples, the genes have only two potential TF regulators based on TRN. The
space is interpolated based on the 255 datapoints.

33

ﬂow cytometry measurements of stimulatory/inhibitory interventional conditions on most of
the proteins in the network. Based on these perturbation data, they applied Bayesian Network and successfully reconstruct a detailed regulatory model for the RAS/MAPK signaling
in human primary immune cells. Therefore, combining systematic design of high-throughput
experiments with machine learning approaches can identify the causal/regulatory network
underlying a particular biological process.
Overall we suggest a conscientious inspection of both the biological assumptions underlying the mathematical formulations of the models, and the information contents in the
data in support of the statistical learning processes, which we believe is required in order to
achieve learning results with lasting biological signiﬁcance. This would help accelerate fruitful capitalization and continuation of computation in promoting our understanding of the
biological regulatory system. Unless coupled with interaction data, systematic perturbations
or time-series experiments, the “top-down” approaches have limitations in their ability to
learn detailed mechanisms and causal relationships. Thus in the next section I will introduce
the “bottom up” approach in systems biology that builds models based on the knowledge of
the regulatory mechanism of a gene network to provide speciﬁc hypothesis on why and how
a gene is involved in determining a phenotype.

1.3

Modeling based on mechanism

Systems biology integrates experimental and modeling approaches to develop a detailed
dynamic understanding of the functions and behaviors that are speciﬁc to a biological system.
The approach of systems modeling is diﬀerent from mining approaches that are attempting
to identify and characterize genes and interactions to distinguish phenotypes based on omics
34

data. The modeling approaches combines systems theory and biological information to
achieve a functional understanding of the system as dynamic processes [61].
The modeling study in systems biology integrates experimental results to develop a functional context to address the following complexity that arises in biological systems as dynamic
interacting networks:
• The large number of diﬀerent molecular components in cells.
• The enormous, heterogenous, direct or indirect interactions between the molecules.
• The complex functionality of molecular components. For example, the same component
could have very diﬀerent function in diﬀerent biological processes (multi-functional) or
organisms. A particular function can be accomplished by diﬀerent sets of components
(redundant).
• The dynamic behavior of biological systems. In many cases the dynamics are seldom
linear, and contain feedback circuits that are very complex. The diversity of biological
systems.

1.3.1

Modeling of gene network: An overview of systems theory

Modeling is a process which associates mathematical concepts with natural system(Wolkenhauer,
2001). Particularly, in gene regulation or cell signaling, one aims to connect the interaction
of the molecular components and the observable biological phenomenon with mathematical
variables and relationships so as to generate predictions and hypothesis. The basic modeling
approach can be categorized broadly into deterministic kinetic modeling, stochastic modeling
and qualitative discrete dynamic modeling. These are mathematical tools that can integrate
35

network information and formalize the causal structure of the signaling system with variables
and functions [62].

Simulation is a modeling experiment that can provide a temporal proﬁle representing the
dynamics of the system. For example, in a kinetic model, a simulation may apply numerical
integration to ﬁnd a solution of the diﬀerential equations. In a stochastic model, a simulation
is a single realization of the stochastic process in which the probability of the resultant state
of each variable corresponds to the distribution constrained by the stochastic equations.

The basic framework of modeling, which could be used to generalize many diﬀerent
models in systems biology, can be established with systems theory. The following derivations
in systems theory can be found in the review [63] . In systems theory, a system is deﬁned
by a set of objects and their relationships:

S ⊂ O1 × O2 × . . .

(1.9)

where the Oi are objects and × denotes the Cartesian product (a combination of objects in
the object set) that represents the relationship between objects. In a signaling system where
one observes that some stimulations cause some responses could be described as

S ⊆Ω×Γ

(1.10)

where the Ω and Γ represents the stimulus and response, respectively. For example, the
stimulus could be a set of diﬀerent ligands and the responses are the expression of speciﬁc
genes. There should be causal relationships between stimuli and responses, which can be
36

mapped:

σ :Ω → Γ
(1.11)
ω→γ
where Ω = {ω : I → U } and Γ = {γ : I → Y } deﬁnes each stimulus/response a temporal
sequence of events, that is, at any time t ∈ I, a stimulus u(t) of the system S results a y(t)
at time t. In practice, a time-course plot of the stimulation and response can visualize one
ω → γ mapping.
Thus we now have a “phenomenological” model from the stimulation to the response,
in the sense that we treat the internal system as a black-box and only consider the input
and output. There are many phenomenological models that have been applied to describe
physiological changes, such as diﬀerent bacterial growth curve under diﬀerent conditions
involving temperature, pH, nutrition and other factors in the medium. Unsatisﬁed with a
phenomenological description, systems biology usually looks for deeper mechanistic models,
which consider the transduction process within the cell at the molecular scale to reveal the
mechanistic design of the cell system. Thus the model should be extended with a state-space
X: Ω → X, X → Γ. The behavior of the system is encoded by the states x ∈ X and the
temporal evolution of x(t) = ϕ(t; t0 , x, ω) depends on the state-transition mapping:

ϕ:I ×I ×X ×Ω→X

(1.12)

where the state of the system at a particular time depends on the initial state x(t0 ), the
stimulation Ω , and probably the states at some other time.
A dynamic system S is ‘continuous-time’ if I is a set of real numbers: I = R+ . If I are
37

integers: I = Z+ , the system is a ‘discrete-time’ system. A dynamic model that determines
precise values x(t) are ‘deterministic’ and if the x(t) are random variables the model is
‘stochastic’. S is ﬁnite dimensional if X is a ﬁnite-dimentisional space (which means x(t)
are discrete variables with ﬁnite states). The X → Γ mapping could be used to model the
noise in the measurement of the system.
Therefore:
• The kinetic modeling approach which applies a set of diﬀerential equations to describe
the concentration changes of molecules essentially deﬁnes a deterministic continuoustime system S, where the mapping ϕ is represented as equations:

dx/dt = f (x(t), u(t))

(1.13)

• The Boolean Network modeling, is a method of discrete-dynamic modeling, deﬁnes a
discrete-time ﬁnite-dimensional system where the variables x(t) has ﬁnite states (e.g.
two states on/oﬀ only) and the mapping ϕ includes some state-transition rules. The
state of the system in the next time point is determined by the previous ones through
logic relationships between the variables.

x(t + 1) = F (x(t)) + G(u(t))

(1.14)

in which F represents the state-transition rules and the G deﬁnes the possible eﬀect of
the stimulation.
• The stochastic modeling approach deﬁnes a stochastic discrete-time ﬁnite-dimensional
system where the state change is discrete-time (similar to equation.6) but variables
38

represent random processes.
The mathematical concepts and techniques of modeling and simulation in systems dynamic theory are generic [64]. The systems approaches to manage complexity arising from
dynamic interactions are well developed, the general framework of modeling, diﬀerent type
of models and the simulation methods are not novel and have already been successfully applied to a wide range of processes in other areas, such as engineering, physics and chemistry.
For the systems modeling of gene networks, the major challenge is to build a model that is
consistent with our understandings of the biological mechanisms. In the next section I will
use one of my studies as an example to show how a gene network model can be developed
based on biological assumptions.

1.3.2

Example: discrete dynamic modeling of insulin signaling in
liver cells

The approach to build a systems model involving genes and interactions that determines a
phenotype can be decomposed to address the following two questions:
• How to model a phenotype?
• How to deﬁne the perturbations that determines or changes a phenotype?

To model a phenotype , we need to collect information how the phenotype is regulated.
As an example, the insulin signaling is a well-studied and complicated signaling network
in mammalian cells. It is composed of branched downstream signaling pathways and various feedback mechanisms, which could beneﬁt from modeling. In a separate study, our
group identiﬁed the involvement of a novel player, PKR, in the insulin signaling network of
39

HepG2 cells [65]. Research in our group has shown that as one of the downstream target
of insulin signaling, PKR is intricately involved in regulating the insulin signaling process
through a feedback mechanism [65]. Therefore, we model the phenotype of insulin signaling
in liver cells as a dynamic signaling process involving PKR, in which potential interactions
and components are collated from literature and the patterns of their dynamic activation
under insulin treatment deﬁnes a “phenotype”. Since the literature information and our
experimental approaches provides mostly qualitative information, we choose to use discrete
variables to represent the activation of components (the signaling molecules involved) and
specify the state transition rules to describe interactions between the components that determine a phenotype.
The signaling network can be formalized in terms of an oriented graph, where the vertices represent the elementary components involved in the process and the arcs describe the
regulatory interactions between those components. In the signaling network, each directed
arc reﬂects the direction of information ﬂow from the source vertex to its target in the signal transduction, and is labeled with a positive or negative sign which deﬁnes activation or
inhibition, respectively.
We associate each vertex in the signaling network with a discrete variable, which has
three states representing the activity of the protein —- 0: lower than control, 1: the control
state, 2: higher than control. Thus by deﬁnition every component starts at the “control”
state in the absence of insulin stimulation.
We deﬁne transition rules based on the activation/inhibition attributes on the arcs in
the signaling network. Two operations: shift up (+1) and shift down (−1) adapted from the
triple logic are applied in the model. If an activator is in a state “higher than control”, (e.g.
40

the kinase phosphorylation is increased), the state of its target will be shifted upwards. In
contrast, if the state of an inhibitor is higher than control, its target will be shifted downwards
in the next updating event. For some components, there may be multiple regulators that
are active in one updating event and the combinatory eﬀect is determined by comparing
the number of activating to inhibitory factors. The target is shift-up if there are more
activating factors, and vice versa [66, 67]. If the number of activating and inhibiting factors
equal, we assume the target remains at the control or current state. Finally, the state of a
component will decay if its regulators can no longer maintained their active state. This way
we construct a systems model for the insulin signaling in liver cells, whereby we can simulate
the evolving of the activity of the signaling proteins along with time to deﬁne a phenotype
to be investigated.

To deﬁne the perturbations that determine or change a phenotype , which is,
in this case the dynamics of the insulin signaling in liver cells, constraints are assigned to
components to mimic the perturbations on the network. If a protein is constantly inhibited
(in an experiment), we restrain the state of its corresponding variable in the model to be
always in either 0 (lower than control) or 1 (control state).
In the simulation each run starts with its own set of randomly generated initial states
and a simulation result represents the dynamic proﬁle of a single cell in the population. By
assuming that cells response independently to a signal, we can simulate a large number of
independent runs to mimic a population eﬀect and measure the average evolving proﬁle for
the population.
Time is modeled by regular intervals called time-steps. Since most components in our
network are kinases or phosphotases, and most reactions are protein phosphorylation and
41

dephosphorylation, we assume that the duration of the activation/inhibitions and the decay
processes in the signaling transduction are comparable and approximated by one time-step.
Since the reaction rates may be diﬀerent from cell to cell even for the same interaction, we
n
apply asynchronous updating of the state: Si = fi (S( j1 )( m1 ), S( j2 )( m2 ), . . . , S( jk )( ml )),
n
ml ∈ n − 1, n where Si is the state of component i at time-step n, and fi is the transition
function associated with i and its regulators j1 to jk , and the time-point corresponding to
the last change of the regulators can be either the last or current round of updates.
This is an example of how biological assumptions are applied to build a systems model of
gene network, more details of how this particular signaling network is analyzed by modeling
and experimentation can be found in our paper [68].Overall, this discrete dynamic model
provides an in silico model framework that integrates potential interactions and assesses the
contributions of the various interactions on the dynamic behavior of the signaling network.
Simulations with the model generated testable hypothesis on the response of the network
upon perturbation, which were experimentally evaluated to identify the pathways that function in our particular liver cell system. The modeling in combination with the experimental
results enhanced our understanding of the insulin signaling dynamics and aided in generating
a context-speciﬁc signaling network.
More importantly, by comparing model simulation with experiments, we found that even
in this well-studied signaling system, there are many components and interactions that are
not actually functioning, or not functioning in the way we thought they would be in our
system, since our current understandings of regulatory network are based on experiments
from diﬀerent research groups, on diﬀerent systems, under diﬀerent conditions. For example,
we found the AKT-PP2A feedback in the insulin signaling, in which PP2A can induce the
42

dephosphorylation of Akt, and thereby suppress the activity of Akt, is not functioning in
the HepG2 liver cells, although it has been suggested as one important regulatory module
in insulin signaling. These observations emphasize the importance and necessity in the
identiﬁcation of speciﬁc genes and their regulation to understand a phenotype.
In summary, systems modeling studies biological processes by systematically model
the gene network and its response to diﬀerent perturbations. The modeling studies provide a framework that can integrate experimental results to develop a functional context to
address the complexity that arises in biological systems as dynamic interacting networks.
The methods for modeling and simulation are generic but their biological applications have
many challenges, since many biological assumptions need to be considered to describe an
appropriate mathematical representation for the system of interest.

43

Chapter 2
Identiﬁcation of condition speciﬁc genes
and interactions — a multi-layer approach
An important topic in systems biology is the reverse engineering of regulatory mechanisms
through reconstruction of context-dependent gene networks. A major challenge is to identify
the genes and the regulations speciﬁc to a condition or phenotype of interest, given that regulatory processes are highly connected such that a speciﬁc response is typically accompanied
numerous collateral eﬀects.
In this chapter I will introduce and discuss the multi-layer approach that is able to
reconstruct condition-speciﬁc genes and their regulation through an integrative analysis of
large scale information of gene expression, protein interaction and transcriptional regulation.
Application on the yeast dataset correctly identiﬁed a context-speciﬁc network and the major
transcription factors that regulated at either the transcriptional or post-transcriptional level.
Current approaches have diﬃculty specifying these regulators. Further application on human
breast cancer identiﬁed Trop2 (TACSTD2) as a target gene, and discovered its regulation by
transcription factors CREB as well as NFkB, the latter regulated at the post-transcriptional
44

level. The predictions were further conﬁrmed through experimental studies.

2.1

The problem in current approaches to identify speciﬁc genes and interactions

The accumulation of high-throughput transcriptome data has driven the development and
application of computational approaches to infer networks, to elucidate gene regulation and
to identify targets. As introduced in chapter 1, there are many approaches developed and
applied to identify genes and interactions. Initial network inference methods based on gene
expression data were successful with prokaryotes [18]. However higher eukaryotes systems,
with their higher number of genes, provided many more candidate genes (several hundred)
and interactions (more than a thousand) (e.g. [69,70]. This generated numerous hypotheses,
and with this sheer number most of the candidates cannot be investigated or validated
through experiments, making it diﬃcult to assess the utility of the proposed approach on
these systems.
Further, a majority of the candidates are often related to general processes that are not
speciﬁcally responsive to the condition being investigated. With this large number of possible
candidates literature search to “manually“ characterize many of these predicted candidates
are typically performed to identify potential targets for further experimental investigation
based on one’s expertise. Alternatively, GSEA [71] and GO (http://www.geneontology.
org/) annotations have been applied in many reverse engineering studies to interpret the
results, which provide an understanding of the “general” processes involved rather than the
direct molecular mechanisms.
45

To address the problem of large number of candidates that are hard to validate and
characterize to answer biological questions, several studies integrated gene expression data
with interaction networks constructed from protein-protein interaction (PPI) and proteinDNA (P-DNA) interaction information [59, 72]. In these studies statistical methods were
applied on the gene expression data of the pair-wise interactions to identify active subnetworks or modules. In contrast, ARANCE based on gene expression data alone was able
to successfully identify a transcriptional molecular interaction network by removing indirect
interactions [19]. These approaches are based on structure learning methods that aim to infer “functional interactions” in accordance with certain presumed mathematical deﬁnitions,
i.e. diﬀerential expression, correlation or mutual information between the gene expression.
A correlation between two genes could suggest “regulating”, “being regulated” or indirect
relationship. However an absence of a correlation does not preclude a possible regulatory
relationship, e.g. post-transcriptionally regulated interactions. Thus subsequent methods
were developed to uncover these potential post-transcriptional interactions, given either a
transcription factor and target gene pair along with a list of potential regulators or a transcription factor and its target genes [73, 74].

We note that the problem of generating too many hypotheses is because the predictions
are not speciﬁc enough. For example, in biological systems a speciﬁc and direct response
to a perturbation usually is accompanied by a cascade of collateral eﬀects on many genes
and dozens of regulatory modules in the network. For example, Stephanopoulos et al. [69]
showed that in an experiment that knocked-out the GAL80 gene in yeast, and comparing the transcriptomes before and after the treatment showed that such modulation of a
single pathway eventually caused a global eﬀect throughout the whole bio-molecular inter46

action network. The speciﬁc response of the knock-out experiment is the activation of the
galactose-processing pathways by eliminating GAL80’s repression on the GAL4 transcription
factor, nevertheless the repression of this one pathway resulted in hundreds of diﬀerentially
expressed genes and dozens of activated modules, making it diﬃcult to identify the essential
“trigger”, i.e. the speciﬁc pathway in response to the perturbation. Therefore, in a typical
network inference or module analysis, many of the genes and modules identiﬁed would likely
be such “side-eﬀects” or collateral response rather than direct and speciﬁc eﬀects. For example (Xia Yang et al., 2010) in identifying the activity of the p450 gene in human liver, network
analysis (clustering and network reconstruction based on correlation) identiﬁed more than
5000 diﬀerentially expressed traits spanning many general functional modules including immune response, cell cycle, lipid metabolism, macromolecule biosynthesis, etc. This provides
a “rough measurement” of the overall inﬂuence but is ineﬀective in identifying the speciﬁc
pathway that regulates these eﬀects and in guiding the experimental design for further indepth functional studies. Therefore, the speciﬁc responses are usually concealed by many
less speciﬁc eﬀects, which are diﬃcult to distinguished based on current network analysis
methods.

2.2

Basic ideas to improve the speciﬁcity of prediction

In our network reconstruction framework, we propose to integrate microarray data from a
diverse set of conditions to provide a common context (more and better controls) for the
expression behaviors of genes, and apply advanced feature selection technique, to identify
the target genes (in layer I) that are most speciﬁc to the condition being investigated. From
the target genes, conditional gene regulations (in layer II), and conditional transcription
47

factor activity (in layer III) are then determined. Incorporating these diverse conditions for
comparison in the feature selection of genes and interactions reduces the false positive rate
and enhances the speciﬁcity.

2.2.1

Incorporating diverse conditions for the identiﬁcation of genes

Identiﬁcation of conditional speciﬁc genes or ranking of the changes in TF activity can be
modeled as a process of “feature selection”, as introduced in Chapter 1, which is to select the
best features (genes, TFs) that can determine a class (phenotype/condition). Traditional
methods to identify candidate genes typically compare two set of samples, e.g. treated and
untreated condition.
In contrast to traditional methods, we propose to integrating multiple conditions of gene
expression data and applying a context sensitive algorithm to identify the genes that specifically respond to the condition being investigated. Therefore, the computational approach
that we are looking for, which is to identify features (genes and TFs) that can distinguish
one phenotype from all the other phenotypes, should fulﬁll the following requirements from
the biological aspect:

1. The approach aims to weight and rank genes according to their “importance”.

2. The approach should account for the fact that features (genes) are not all independent.
Gene expression is controlled by a complex regulatory network, thus there are intrinsic
relationships between genes.

3. There are intriguing relationships between phenotypes. There are some phenotypes
that may have transcriptomes similar to our phenotype of interest, these phenotypes
48

could be more important to compare with to understand the unique changes in our
conditions.

Therefore, the approach should adopt a learning model that fulﬁlls these requirements:

1. It should be a feature selection process to weight genes/features: i.e. it is a mapping
from original feature space X to a new feature space X , by scaling each dimension
with weight w. There should be some constraints on ws so as to ensure a unique
solution.
f : x → wx,

w 2 = c, w ≤ 0
2

(2.1)

2. To account for the fact that features (genes) are not all independent, we should evaluate
each feature in the context of other features and samples, which suggests a local learning
model.

3. There are some phenotypes that may have transcriptomes similar to our phenotype of
interest. Thus we try to ﬁnd “nearest neighbors” for the samples by comparing transcriptomes. For a sample xn , its nearest neighbor of the same phenotype is Hxn , while
its nearest neighbor from a diﬀerent phenotype is Mxn . We can deﬁne a “margin”:

ρn = d(xn , Mxn ) − d(xn , Hxn )

(2.2)

We try to ﬁnd a mapping so that the distance between diﬀerent phenotypes d(xn , Mxn )
are as large as possible, while samples from our phenotype d(xn , Hxn )should be as close
49

as possible. So the problem can be formularized as maximize the margin:

max
n

ρn (w), with respect to w, s.t. w 2 = c, w ≤ 0
2

(2.3)

Based on our modeling of this learning problem, there is an algorithm family called
“Relief” that can solve this optimization problem, which we applied in the identiﬁcation of
genes and transcription factors for a phenotype.
The ReliefF algorithm The basic procedure of Relief algorithm (Kira and Rendell, 1992)
is shown as follows:

Algorithm: Relief (theoretical computing time complexity O(m ∗ n ∗ f ))
Input: n learning samples X, with f features, sampling parameter m
Output: for each feature Fi a quality weight Wi
Initiate: For i = 1 to N : Wi = 0
For l = 1 to m:
Randomly pick a sample xk ;
ﬁnd its nearest hit H and nearest miss M ;
For i = 1 to f : Wi = Wi − dif (i, xk , H)/m + dif (i, xk , M )/m
Return W
Relief estimates the quality of features through a nearest neighbor comparison to account
for the “local” context of features (gene expressions in our case). The algorithm selects
neighbors from the same condition (hit) and diﬀerent condition (miss) based on feature
vectors (the transcriptome of a sample). The function dif (i, S1 , S2 ) calculates the diﬀerence
between the values of feature i in two samples I1 and I2 . m is the number of samples
randomly sampled from a dataset of n total samples and f features. For each iteration, the
weight of features are updated with respect to whether the feature diﬀerentiates two samples
from the same condition (undesired property), and whether it diﬀerentiates samples from
diﬀerent condition (desired property).
50

When Relief is used to identify condition speciﬁc genes, the data matrix E (n genes ×m
samples) contains n feature vectors, and the feature vector for each gene is the expression of
the gene in diﬀerent samples. When Relief is used to identify condition speciﬁc TF activity
change, the data matrix A (k TFs ×m samples) contains k feature vectors, the feature vector
for each TF is the summation of the expression of its target genes in diﬀerent samples, i.e.
A = B × E, in which B (k TFs ×n genes) is the TF-Gene binding matrix based on proteinDNA binding information.

2.2.2

Integrating multiple conditions to identify regulatory relationships

Traditional approaches to determine gene regulatory relationship are commonly based on
mutual information (or correlation) between pairs of genes [19, 72], with comparison between a condition of interest and a reference condition. Instead of using a single “untreated”
reference condition, we suggest incorporating multiple conditions to provide a better reference pool, and compute the diﬀerence between the conditional and unconditional mutual
information (M I):

MI(gene pair|condition) − MI(gene pair in all conditions)

(2.4)

This idea is essentially an extension of the two-way relationship (gene x and gene y) to
a three way dependencies (gene-pair x1 , x2 and y) in calculating the mutual information
from gene expression, such that the regulators of y, genes x1 and x2 , could be readily distinguished. This approach, also called causal-ﬁltering, was initially suggested by Bontempi et
51

al. [75], in which they compute the conditional mutual information between x1 and x2 given
y, M I(x1 , x2 |y), and the unconditional one M I(x1 , x2 ). When x1 and x2 are the regulators
of y, one have M I(x1 , x2 |y) − M I(x1 , x2 ) > 0. In our case, since the eﬀectors/target genes
y for the condition of interest is identiﬁed by Relief and actually determines the phenotype,
thereby the three way dependencies (gene xs and y) are similar to conditional dependencies
(gene xs and the condition of interest), thus we compute the diﬀerences between conditional
mutual information and unconditional mutual information for genes on paths to the target
genes in the reference network (PPI and P-DNA network), and each gene pair consists a
potential “regulator” and a “target gene” in the network. Positive scores suggest a potential
causal relationship while zero scores suggest an indirect or downstream eﬀect and thus are
removed from the network. Any node with zero score is ﬁltered out unless it is a potential
transcription factor of the targets or it serves as an intermediate step on the path of a causal
(positive-scored) node to the targets.

2.3

The multi-layer approach and the proof-of-concept
applications

We take an alternative approach to identifying targets that are speciﬁcally responsible for the
condition of interest, and which may not necessary be hubs in a network [76]. Our approach
separates the network reconstruction into multiple levels, each addressing a speciﬁc biological
question wherein a particular scenario of biological regulation is modeled. The overall aim
is to provide more speciﬁc regulatory hypotheses by incorporating interaction data, however
in a very diﬀerent fashion from prior approaches. In contrast to prior approaches [59, 72, 77]
52

wherein an interaction network is ﬁrst constructed and the gene expression is then superimposed onto the network, we ﬁrst identify the most speciﬁc genes involved and then build
the interactions from there up, with the interactions obtained from PPI and P-DNA information. We separate the reconstruction of condition speciﬁc gene network into three layers
(layer I, II and III). Layer I aims to identify the genes that have distinct expression pattern
under the condition of interest, from which the conditional network is built. Once these
“speciﬁc” genes are identiﬁed, the network is expanded in layer II based upon known and
predicted interactions with these genes, obtained from the PPI and protein-DNA networks.
A ﬁltering approach based on mutual information is applied to the physical interaction network to reconstruct the regulatory pathway from the candidate genes. In layer III we infer
the transcription factor (TF) activity to identify the major regulators in the gene network,
accounting for post-transcriptional regulation. The multiple layers of learning with their distinct biological assumptions capture diﬀerent biological features in the regulation to achieve
reconstruction of condition-speciﬁc gene networks that accounts for both transcriptional and
post-translational molecular interactions. We establish the accuracy of our methodology
against synthetic datasets as well as a yeast dataset.

2.3.1

Layer I. Identiﬁcation of candidate genes

We propose that integrating multiple conditions of gene expression data and applying a context sensitive algorithm Relief identiﬁes the genes that speciﬁcally respond to the condition
being investigated. Unlike previous applications of pair-wise comparison with Releif [78], we
apply Relief to a diverse set of conditions. To model this, we generate a simulated dataset of
100 genes with 5 diﬀerent conditions (conditions A−E) plus a reference condition (condition
53

ref ), each condition has 10 samples. In each of these 5 conditions there are 10% genes that
are uniquely changed (speciﬁc responses) and another 20% are shared changes (overlapping
responses to represent collateral responses). The changes are simulated with a normal distribution of the mean and variance, diﬀerent from the unchanged (control) condition. The
goal of the gene selection/feature selection is to identify the 10 genes that change uniquely
for a given condition (condition A). Diﬀerent methods are applied to identify the 10% of the
genes that change uniquely for a given condition. Traditional approach to identify diﬀerentially expressed genes compares condition A with the reference condition, while we suggest
applying ReliefF on condition A against all the other conditions that are available. We
compare two diﬀerent approaches: traditional approach to identify diﬀerentially expressed
genes based on t-test, and the ReliefF algorithm; under two diﬀerent scenarios: with only
condition A and reference control, and integrating all the conditions that are available and
comparing against condition A. The ROC curves in Figure 2.1 on page 55 shows that Relief
applied on multiple conditions performs the best.

The “MEGA” yeast microarray dataset from the Audrey Gasch lab (http://gasch.
genetics.wisc.edu/datasets.html), which combines 500 yeast experiments from a variety of conditions, provides an integrated gene expression “reference pool” for analyzing yeast
data. We evaluate our approach by reconstructing a gene network for the “AltCarb” condition, i.e. adding extra alternative carbon source, in which the essential regulatory pathway is
known a priori- namely the GAL pathway. When adding extra carbon sources into the yeast
media, if the carbon source is galactose, the GAL pathway, also known as the “Leloir pathway” will turn ON, if it is other types of carbon source, such as glucose, the Leloir pathway
is turned OFF leading to “glucose repression” [79]. A successful gene selection procedure
54

Figure 2.1: Proof-of-concept example of applying Relief on multiple conditions A: We suggest
applying ReliefF on condition A against all the other conditions. B: The ROC curves for
identiﬁcation of speciﬁc genes for the condition of interest (condition A). The AUC (area
under curve): ReliefF with multiple conditions > ReliefF with 2 conditions > t-test with 2
conditions > t-test with multiple conditions.

55

is expected to identify the entire GAL pathway-the four GAL genes (GAL1, GAL2, GAL7,
GAL10), which are the essential transporters and enzymes for this condition.

In contrast to traditional analysis where conditions are compared with a same, untreated
reference (Figure 2.2 on page 57), our algorithm analyzes an integrated reference pool as a
representation of the diversity of gene expression behaviors to identify the most speciﬁc genes
for a target condition (Figure 2.3 on page 58). The scoring of the genes for the “AltCarb”
condition by Relief is shown in the ﬁgure. The four GAL genes are ranked as the top 4 genes
by Relief, and their scores (0.3-0.4) are much higher than the rest of the other genes (majority
of which scored less than 0.2). Therefore, our approach is able to re-discover all four speciﬁc
genes in the yeast Leloir pathway, which constitutes the core metabolic processing by yeast
in response to changes in the carbon sources.

To compare with a traditional approach, we also performed a per-gene permutation-based
t-test between the galactose treated and the control (untreated) yeast samples, which is a
standard method often applied to microarray analysis to uncover diﬀerential expression. This
analysis identiﬁed 236 genes (p < 0.01) with GAL7, GAL2, GAL1, GAL10 ranked at 17th,
53th, 87th and 124th, respectively, based upon their p-value. Many of the genes identiﬁed are
related to more general processes of stress or environmental changes, e.g. MSN2/4 for general
stress responses, which are not directly associated with the condition under investigation.
Therefore, our approach identiﬁed the most likely eﬀectors, and the integration of multiple
seemingly unrelated conditions, in fact, eﬀectively reduced the number of non-speciﬁc gene
candidates.
56

Figure 2.2: Comparing multiple conditions in yeast data. The samples “MEGA” yeast
microarray dataset from are plotted in 2-D with their ﬁrst two Principle Components. The
condition “AltCarb” is used as an example. A: A traditional treated/untreated analysis
compares the condition of interest with untreated yeast samples. The boundary is shown with
a black line, and the genes that contribute to such boundary can be identiﬁed with feature
selection approaches, e.g. t-test, ReliefF. B: We plot other conditions such as hyperoxide
stimulation, heat stress, etc. The samples of these conditions are similar to samples in the
“AltCarb” condition as compared with the same “untreated” reference samples, thus the
gene lists that are identiﬁed for these diﬀerent conditions based on such comparison could
be similar.

57

Figure 2.3: Application of Relief on the integrated dataset with multiple conditions A: The
integrated dataset provides better coverage of the sample space, and ReliefF compares “AltCarb” samples with all other samples in diﬀerent conditions to achieve better speciﬁcity.
Nearest Neighbors of AltCarb condition used in the ReliefF procedure are shown in green.
The closer a sample is to the “AltCarb”, the more important it is in the ReliefF comparisons. B: The score of the yeast genes provided by the ReliefF analysis correlates with the
importance or relevance of the gene to the speciﬁc condition. The genes behave distinctively
in the conditions being investigated and are scored the highest to illustrate their importance.

58

2.3.2

Layer II. Identiﬁcation of the potential gene regulatory relationships

To explore the regulatory mechanisms of the candidate gene “eﬀectors” identiﬁed in layer I for
a given condition, we integrate PPI and P-DNA or transcription regulatory network (TRN)
(from motif search, ChIP-chip data, or literature information if available) data of the same
system (denoted as the reference network), and apply mutual information on the expression
data to identify potential regulators of these target genes. We incorporate expression data of
multiple conditions to provide a better reference pool, and compute the diﬀerence between
the conditional and unconditional mutual information in Layer II to identify the factors (i.e.
genes or proteins) that directly (physically) interact with and causally regulate the target
genes. The information on the physical interactions is derived from PPI and protein-DNA
interaction data, and the potential causal factors are inferred based on the conditional mutual
information computed from the gene expression data of these interactions.
We establish the accuracy of our methodology against a yeast dataset of conditional TF
binding. There are 34 ChIP-chip datasets for yeast samples treated with H2O2 [52]. We
used the TF-gene binding indicated in these conditional ChIP-chips experiments as the true
TF-gene regulatory relationships, and applied our method in layer II to infer conditional
regulatory information for these 34 TFs. Figure 2.4 on page 61 shows the ROC curves
for inferring the TF-gene regulatory relationships. Incorporating information of potential
transcriptional regulatory network (TRN) based on binding motifs on the gene promoters
and literature information (solid lines in diﬀerent colors) signiﬁcantly reduce the false positive
rate as compared with approaches that do not take advantage of such information, shown in
59

dotted lines (e.g. mutual information only based approaches). Our approach of incorporating
multiple conditions enhances the speciﬁcity, which is shown by the red solid and dotted lines
(corresponding to with and without TRN) as compared with the traditional setting denoted
by the green solid and dotted lines.

We applied Layer II on the yeast “AltCarb” condition to reconstruct a condition speciﬁc
network, which resulted in 27 nodes and 88 interactions. The GAL switch genes identiﬁed
in Layer I are used as the targets in this layer to retrieve their potential regulators. Many
known regulators of the GAL pathway, such as GAL4 and GAL80 are identiﬁed in the
network because their expression level (e.g. GAL80) or their interacting partners’ (e.g.
GAL80 as the interacting partner of GAL4) expression level are correlated with the target
genes under this condition, leading to a potential regulatory pathway to the target genes.
For comparison, we applied ARACNE, another network reconstruction methods currently
available based on mutual information, on the same yeast data for the same condition. The
top ranked network module that is identiﬁed in ARACNE is the interaction “INH1-APA2”
and consists of 2 nodes, in which the INH1gene is an inhibitor of ATPase and APA2 is
involved in catabolism of bis(5’-nucleosidyl) tetraphosphates (based on annotation in SGD,
http://www.yeastgenome.org/). There is no evidence in the current literature to suggest
their involvement in response to the AltCarb condition, and further these two genes do not
physically or directly interact with the GAL genes. To ﬁnd the 4 GAL genes in the network
required the threshold in ARACNE to be relaxed to allow 2050 nodes and 11,530 edges
in the resultant network. In this large network the known, major regulator of the GAL
switch, GAL4 and GAL80, were not connected to the four GAL genes. This suggests that
many “collateral responses” may be stronger than the initial GAL response. Therefore, by
60

Figure 2.4: The ROC curves for inferring TF-gene regulatory relationships. The true
TF-gene regulatory relationships were extracted from conditional ChIP-chip data on yeast
(with binding p-value¡0.001) for the samples treated with H2O2. The prediction of TFgene regulatory relationships are based on mutual information (MI) between TFs and their
target genes. The traditional setting is M I(condition of interest) − M I(a ref condition),
shown in green dotted lines; while we propose to use a variety of conditions as reference:
M I(condition of interest) − M I(multiple conditions as ref s), shown in red dotted lines.
We apply the same approaches but incorporate the information of Protein-DNA interaction,
shown in solid lines, green: traditional setting compared with reference condition, red: compared with a variety of conditions as reference. Further, we use the sum of the target gene
expression as a feature of TF activity for a given condition, and apply Relief to identify the
TFs and genes that have distinct activity and expression proﬁle for the condition of interest
(H2O2). Those TF-gene pairs with signiﬁcant changes (top 30) on both TF activity and
gene expression are elevated to the top of the list of potential TF-gene regulatory relationship based on MI measurement of the multiple condition setting. The result is shown in blue
solid lines.

61

integrating the PPI and P-DNA interactions, and combining layers I and II, our approach
eﬀectively reduced the number of hypothesis, focused on the most speciﬁc candidates, and
identiﬁed potential “causal” and “direct” regulators.
Nevertheless, similar to many previous network reconstruction approaches, layer II is
based on statistical dependencies between gene expression, assuming a correlation between
the expression of the regulator and its targets, which may not necessarily hold in all cases,
especially when the transcriptional regulation involves a transcription factor that requires
post-transcriptional modiﬁcations or co-factors to be activated [80]. Thus we add another
layer in our framework, layer III, to address this challenge by accounting for the TF activity.

2.3.3

Layer III. Inference of TF activity and transcriptional regulation

In eukaryotes, post-transcriptional modiﬁcations or cofactors are required for many TFs to be
activated to regulate their target gene expressions. Although such protein-level information
cannot be directly measured in microarrays, the target genes’ expression can reﬂect such
regulatory events. Therefore, given potential target genes predicted from motif search, ChIPchip data or literature information, we can infer the changes of TF activity.
We obtained target genes for each TF from the yeast P-DNA (or so-called TRN) network,
and use the summation of its target gene expression level as the feature of a TF in the sample.
With these features we can apply ReliefF to identify TFs that show distinct activity (i.e.
features) in the condition of interest as compared with all the other conditions. Those TFgene pairs for which both TF activity and gene expression changed signiﬁcantly (e.g. within
the top 30 of genes and within top 30 of the TFs scored in ReliefF) are raised to the top of
62

the list of potential regulatory interactions identiﬁed in layer II.

Layer III is applied to the yeast H2O2 condition, and the results are shown by the blue
lines in Figure 2.4 on page 61. With the inference of the TF activity, the ROC shows a further
increase in the speciﬁcity (achieve a false positive rate of less than 0.01) of the predicted
relationships that are ranked at the top 10%.

We also applied layer III to identify the TFs that regulate the expression of the GAL genes
under the AltCarb (adding extra carbon source) condition. We use the top 3 predicted TFs
to reconstruct an essential regulatory network for the GAL system, based on known PPI and
P-DNA binding information. The results shown in Figure 2.5 on page 64 demonstrate that
our approach can re-discover the true network that includes the regulators GAL4, GAL80
and IMP2, the three speciﬁc TFs which regulates the GAL pathway for galactose utilization
and glucose repression [79]. In contrast, previous approaches based on TF expression level
(TYPE 1, approaches based on mutual information or correlation, e.g. Bayesian Network),
diﬀerences in target and non-target gene expressions (TYPE 2), or correlations within target
genes (TYPE 3) (as deﬁned in Chapter 1) did not correctly identify the essential regulatory
network (Figure 2.5 on page 64). In particular, GAL4 is post-transcriptionally regulated
by its protein interaction with GAL80 [79], whose activity changes can be captured by our
approach but not the other approaches compared. A GSEA based approach (e.g. MARINA)
may be able to ﬁnd the enrichment of the target gene groups for these TFs, however, it gives
a lower speciﬁcity since many other TFs are identiﬁed to be more enriched for the AltCarb
condition than the 3 GAL regulators.
63

Figure 2.5: Network reconstruction of the GAL pathway. We estimate the activity of the 25
transcription factors that can bind to GAL genes in the transcriptional regulatory network
(TRN) based on motif search and literature evidences, and use the top 3 TFs predicted to
reconstruct the essential regulatory networks, with the interactions extracted from the TRN
(green lines) and the PPI (blue lines). We compare diﬀerent approaches in estimating the
TF activity A: TYPE 1: TF activity is determined by its expression level, e.g. correlation
or mutual information based approaches; B: TYPE 2: TF activity is determined by the
expression level of their potential target genes in the TRN; C: TYPE 3: TF activity is
implicated by the co-expression of their target genes; D: Our new approach: use the target
gene expression information, i.e. sum of the target expressions and integrate a wide range of
conditions to determine the TF activity. The true network includes regulators GAL4, GAL80
and IMP2, shown by the nodes colored in magenta in the ﬁgures, which are speciﬁc TFs
regulating the GAL pathway for galactose utilization and glucose repression in the AltCarb
condition. Nodes colored in grey are non-speciﬁc TFs for the AltCarb condition.

64

2.4

Applications on human breast cancer

2.4.1

Layer I: Genes identiﬁed for breast cancer

We apply our multi-layer inference approach on an integrated human gene expression dataset
(ArrayExpress E-TABM-185) to identify potential biomarkers or targets and their regulators
for ER (estrogen receptor)+ and ER- breast cancer subtypes. The dataset integrates 5897
microarray experiments on diﬀerent human disease, which contains more than 1000 breast
cancer samples. Applying diﬀerential expression analysis is problematic with such an integrated dataset. A per-gene permutation-based t-test results in thousands of diﬀerentially
expressed genes for ER+ breast cancer. This high number of genes is hard to validate,
while very few genes are identiﬁed for ER- breast cancer due to the heterogeneity among the
samples in this subtype.
Layer I of our novel approach identiﬁed candidate biomarkers for both ER+ and ERbreast cancer, some of which are well-known targets for these subtypes. Expression proﬁle of
the predicted genes shows distinctive patterns in the breast cancer samples. The well-known
target ESR1 (estrogen receptor alpha) is correctly identiﬁed as the top feature for ER+
breast cancer, while the TACSTD2 (tumor-associated calcium signal transducer 2) gene is
identiﬁed for both ER+ and ER- breast cancer, which we recently discovered could be a
potential target for both ER+ and ER- breast cancer subtypes [80].

2.4.2

Layer II: network for ER positive breast cancer

We then reconstruct a regulatory network for ESR1 to assess whether our approach can
recapitulate the transcriptional regulators of the estrogen receptor. Human PPI and P65

DNA information is incorporated and layer II and layer III are applied. Finally we infer
a regulatory network for TACSTD2 to identify potential regulators for this novel target.
Figure 2.6 on page 67 shows the regulatory network inferred for ESR1, the nodes are sized
by scores computed in layer II based on the diﬀerences between conditional and unconditional
mutual information, genes with zero or negative scores (predicted not to regulate the target
gene in the condition) without any positively scored interacting partners were ﬁltered out.
Based on the binding sites predicted by DECODE in the GeneCards data collection, 7
transcription factors (TP53, JUN, AHR, TFAP2C, GATA3, FOXO3, REST) are selected as
potential regulators of ESR1. From BioGrid we extract a list of proteins that may physically
interacts with ESR1 or any of these transcription factors, which results in a large interaction
network. Our inference approach is then applied to determine which of these components may
be the eﬀective “causal factor” of ESR1’s speciﬁc expression pattern in breast cancer, results
in a conditional gene network with 143 proteins in this ﬁgure. Of all the 7 transcription
factors, 3 (TP53, JUN, GATA3) have positive “scores” in the causal ﬁltering approach,
suggesting they can regulate ESR1 expression in cancer cells, and such regulation may also
be aﬀected by (but may not depend on) their interacting partners, especially for those of
which have higher scores. The 4 other transcription factors are predicted to possibility aﬀect
ESR1 but not through direct transcriptional regulation, i.e. the regulation/activity of the TF
depends on a co-factor, which are interacting partners to the TFs as shown in the network
(Figure 2.6 on page 67).

ESR1 has been extensively studied in breast cancer and a literature search conﬁrmed that
many of the predicted regulators are correct. For example, the transcription factors TP53,
JUN, GATA3 that are predicted by our approach to directly regulate ESR1 expression,
66

Figure 2.6: The regulatory network for ESR1. Potential transcription factors that can bind
on ESR1 gene are colored in yellow. The causal impact (score) is represented by the size of
the nodes in the network.

67

has been experimentally studied and shown to be major transcriptional regulators of the
estrogen receptors. GATA3 binds to two cis-regulatory elements on the ESR1 gene and is
required for RNA polymerase II to be recruited to the ESR1 promoter [81]. This is consistent
with the conditional network constructed by our approach in which GATA3 has the highest
score (causal impact) among the transcription factors. TP53 has been shown to bind to
the promoter of ESR1 and regulate ESR1 expression in both ER+ (MCF-7) [82] and ER(MDA-MB-468) breast cancer cells [83]. Finally, it has been previously conﬁrmed that the
transcriptional activation complex that was recruited to the ER promoter involves JUN [82],
and JUN regulates ER transcription [84] [85].
The transcription factor TFAP2C has been shown to bind to the ESR1 promoter and
regulate its expression [86] [87] [88], and silencing TFAP2C reduces signiﬁcantly ESR1
expression and the estrogen response [89]. However, the activation of TFAP2C requires
multiple co-activators, p300 or CBP, whose recruitment depends on the adaptor protein
CITED2 [90]. Although the involvement of CITED2 in the activation of TFAP2 has not
been shown in breast cancer cells, the knockout or mutation of CITED2 signiﬁcantly diminished the TFAP2C transcriptional activity in liver cells [90, 91]. and in patients with heart
defects [92]. These results support the possibility that CITED2, predicted by our conditional network to aﬀect ESR1 expression through TFAP2C, may be an important regulator
of ESR1 expression. Note that TFAP2C by itself has less of a direct “causal” impact based
on the diﬀerences in its conditional and unconditional mutual information with ESR1, and
thus suggestive of post-transcriptional activation.
Similarly, in our network FOXO3 has no direct “causal” impact on ESR1 (i.e., no differences in conditional and unconditional mutual information), but its interacting partners,
68

SIRT3, shows a diﬀerence, based on the scores computed in layer II. However, based on the
literature FOXO3 was shown to bind the ESR1 promoter to regulate ESR1 expression [93].
Interestingly, SIRT3, one of the partners of FOXO3 that we predicted to directly impact
ESR1, was previously suggested to be a co-activator that increases FOXO3a dependent gene
expression [94].
In the network, the transcription factors, REST and AHR, are shown to have no impact
on ESR1 expression although based on the binding sites on the ESR1 promoter these TFs
could potentially regulate ESR1. To the best of our understanding from the literature, there
is no experimental evidence to date that indicate either of the two TFs directly binds the
promoter of ESR1 or regulates its expression. However, there is evidence showing that REST
may be involved in activating estradiol (E2) stimulation [95] but its expression is higher in
ER- than in ER+ patients [95]. Thus its role in E2 stimulation may require potential
cofactors, interaction partners or modiﬁcations, which is consistent with a zero score in layer
II.
There are many proteins that could directly interact with ESR1 and several were predicted to regulate its expression in our breast cancer conditional network for ESR1. From
the literature we found experimental evidence to support many as transcriptional regulators
of ER. For example, the 2 highest scored proteins are DDX17 and FLII. These factors are
transcriptional co-activator of the ESR TF, and can bind to the ESR protein to enhance ESR
activation of its own expression. DDX17 has been conﬁrmed as an estrogen receptor alpha
coactivator [96], while FLII is required for the recruitment of the SWI/SNF chromatin remodeling complex to enhance ER-mediated transcription of its target genes [97]. The largest
family (MED21, MED12, MED7, MED6, MED16) among these proteins that interact with
69

ESR1 are the components of the mediator complex, which is a co-activator that serves as
a scaﬀold for the assembly of a functional pre-initiation complex of RNA polymerase II, to
enhance ER transcription and function [98]. Given that ESR1 can regulate its own mRNA
expression [99], these co-factors of ER protein function may play a role in the regulation
of ESR1 expression. The known targets for the pathogenesis of breast cancer—- BRCA1 is
correctly identiﬁed in our network to interact directly and functionally with ESR1 to inhibit
estradiol (E2)-stimulated ESR1 transcriptional activity [100].

2.4.3

Layer II and III: the Trop2 network

Given that our approach successfully identiﬁed many transcriptional regulators of ESR1.
Inspired by the positive results obtained for ESR1, we applied this approach to identify the
potential transcriptional regulators of Trop2. We previously identiﬁed Trop2 as an important biomarker for breast cancer [80], however, there is no information currently available on
the regulation of TROP2. The “causal” network inferred for Trop2 is shown in Figure 2.8
on page 72. Based on this network we predicted CREB1 is a likely transcription factor that
regulates TROP2 expression. Our experiment (Figure 2.7 on page 71) shows that the activation of CREB1 by FI (Forskolin-IBMX) induces an up-regulation of Trop2 expression level
in the breast cancer cells, which supports our model prediction that CREB1 is a regulator
of Trop2.
Although the human protein-DNA binding information is far from completed as compared
with yeast, we obtained potential target genes of TFs for which binding motif is known,
including TP53, JUN, REST for ESR network and NFkB1, CREB, Evi1 for Trop2 network.
The protein-DNA interaction is obtained from TargetMine database [101] based on both
70

Figure 2.7: The mRNA-fold change of Trop2 in human mammary epithelial cell line and
the diﬀerent breast cancer cell lines upon FI treatment. Quantitative real-time PCR was
performed to measure Trop2 mRNA expression levels in MCF10A, MCF7, and MDA-MB231. The untreated cells (controls) and cells treated with 10µM forskolin and 100µM IBMX
(FI) for 1 day (n = 3) are shown. **: p < 0.01, ***: p < 0.001.

71

Figure 2.8: The regulatory network for TACSTD2. Seven transcription factors are predicted
to bind to TACSTD2 based on motif search, including NFKB1, EVI1, CREB1, ATF6,
NKX2-2, PAX4, and SOX5 (colored in yellow). Their interacting proteins are colored in
blue. The causal impact (score) is represented with the size of the nodes in the network. Of
the transcription factors that could regulate TACSTD2 only CREB1 shows a causal impact
(a positive score), and has the highest score among all the proteins in the network that is
connected to TACSTD2.

72

binding motif search on the gene promoter and literature curation. Application of Layer III
on these TFs identiﬁed that in the ESR network, TP53 activity changes the most, and in
the Trop2 network. NFkB1 activity changes more than CREB1 and Evi1. This predicts a
potential regulatory role of TP53 on ESR, and NFkB1 on Trop2, and their activity could be
controlled at the post-transcriptional level. It has been shown that TP53 transcriptionally
regulates ESR1 expression and the regulation relies on many protein cofactors [82], which
supports this prediction. For Trop2, our approach predicted another potential transcriptional
regulator: NFkB1, which is likely regulated at the post-transcriptional level, given that
the activity of NFkB1 is primarily controlled by its cytosol-to-nucleic translocation [102].
Experiment in human breast cancer cell lines (MCF10A and MDA-MB-231) shows that
Trop2 gene expression is down-regulated within 2 hours upon either the inhibition of NFkB’s
protein activity, or reducing its translocation by inhibiting IKK (Figure 2.9 on page 74),
which conﬁrms a regulatory role of NFkB on the transcription of Trop2 gene. This would
be diﬃcult to identify with current approaches, for example, applying GSEA on the known
TFs shows similar enrichment scores for almost all of the TFs, suggesting its lower speciﬁcity
when applied on large datasets with diverse conditions. These results are similar to what is
observed when GSEA is applied to the yeast dataset, where many of the enriched TFs show
a zero p-value while GAL4 was not detected. These results support that our approach can
identify TFs whose activity is regulated at the post-transcriptional level.

73

Figure 2.9: The TROP2 mRNA expression levels in diﬀerent cell types. MCF10A and
MDA-MB-231 were treated with IKK inhibitor VII and NF B activation inhibitor IV for 2
hrs, respectively. The TROP2 mRNA levels were measured by quantitative real-time PCR
(n=3). * indicates p < 0.05, **: p < 0.01, ***: p < 0.001. P value was compared to control.

74

2.5

Comparison of computational approaches to reconstruct gene network

There have been numerous computational approaches developed to reconstruct context dependent gene network based on gene expression proﬁles. The ﬁeld has been studied and
reviewed (e.g. [80]) extensively but predominantly from the perspective of computational
assumptions and methodologies. We attempt to clarify the biological problems and hypothesis that can be solved or predicted by these diﬀerent approaches. In Table 2.1 on page 77
we summarize and compare our approach with diﬀerent computational approaches that are
currently applied on gene network reconstruction. Approaches based on Bayesian network
or correlation network (e.g. in module analysis) infers a “functional network” where the
connections predict statistical inﬂuences (i.e. correlation) between gene expression, which
does not necessary provide clear information on the regulatory relationships and mechanisms. ARACNe [19] and MINDy [73] aim to study transcriptional regulation, which is built
on the correlation between the mRNA level of the TF and its target genes. MARINA [74]
and NCA [103, 104] can account for post-transcriptional regulation and infers TF activity
by incorporating known protein-DNA interactions. MARINA applies GSEA to select the
active TFs, where the gene sets contain the targets genes of the TFs, while NCA requires
many samples, to solve its complex parametric model. In contrast, the layer approach accounts for diﬀerent levels of transcriptional regulation to identify condition speciﬁc targets
and regulations. The applications of the layer approach on yeast and human breast cancer
data demonstrate that by integrating multiple conditions, better speciﬁcity in target identiﬁcation and network reconstruction can be achieved, which helps generate more speciﬁc
75

biological hypotheses on the target genes and their regulations.

76

Table 2.1: Computational approaches for gene network reconstruction.

77

Chapter 3
Identiﬁcation of novel targets by exploring gene switches
The learning approaches to identify genes and their regulation based on gene expression, e.g.
network reconstruction introduced in Chapter 1 and the multi-layer approach discussed in
Chapter 2, are able to ﬁnd important genes and interactions for a phenotype, such as the
Trop2 gene and its transcriptional regulators for human breast cancer. It is straightforward
to conﬁrm if an interaction exists in the condition, but one has to design more experiments
to understand the functional role of a gene and the regulation in a phenotype. Although
these computational approaches can predict candidate genes/biomarkers, they are not able
to predict how and why these genes are important for the phenotype in a biological sense.
To answer how and why a gene is important for a phenotype, one has to understand
how genes are regulating each other to determine a phenotype, which requires information
on the regulatory mechanisms. Therefore, in this chapter, we explore a functional module
of gene network—-the gene switches, with systems modeling, wherein we show that gene
switches can generate speciﬁc pattern of gene expression which can be identiﬁed by mining
microarray data. Our mining approach demonstrates that one can capitalize on genome-wide
78

expression proﬁling to capture dynamic properties of a complex network, thereby to predict
gene switches that could be important for a phenotype because they can participate in cell
fate decision.

3.1

Gene switches play essential role in cell fate decision, and could be good biomarkers and targets for
cancer

Given the complexity of gene regulatory networks, knowledge of the properties of individual
components in the network are not suﬃcient to elucidate the cell physiology. Thus systems
biology has evolved to uncover “emergent properties” that arise from the intricate interactions of gene networks. One such emergent property, “switch-like behavior” or “bistability”,
describes a dynamic feature of a particular gene [105] to preferentially toggle between two
steady-states. Multiple steady states are often observed in chemical and biochemical reactions (reviewed by [106]) and are characterized by a non-linear response. Bistability happens
to be a special case involving two steady-states, giving rise to a “switch-like behavior”. In
biochemical reactions, such “bistable” behavior shows a sharp sigmoid function or a hysteresis structure (see examples in Figure 3.1 on page 80), whereby the state of the variable ﬂips
between high and low levels. Such an “all-or-none” state transition usually depends on a
threshold, i.e., the concentration of the stimulator or regulator. Hysteresis depends further
on the previous state of the system.

79

Figure 3.1: Dynamics of gene switches and bimodality in their expression proﬁles A: A synthetic genetic switch that contains
a repressed positive feedback is stimulated by an inhibitor of the repressor. B: The stimulation-response curve of the genetic
circuit. C: The histogram of the steady-state gene expression level of 100 random sample simulations of the genetic switch
shown in A. D: A synthetic genetic toggle switch that contains double negative feedbacks. E: The state space of the gene
expression. Each trajectory (blue lines) is the response curve with respect to a particular initial condition. The red arrows
are the two attractor-states. F: The histogram of the steady state gene expression level.

80

The expression level of a gene switch does not change gradually but rather has two distinct
steady-states: HIGH or LOW, ON or OFF, ALL or NONE. The ability of switches to convert
a graded signal into a binary response ensures that a cell responds in a decisive manner or
unambiguously commit to a speciﬁc program [107]. Furthermore switches have been noted
for their noise-ﬁltering capacity. Endogenous noise are typically lower for fully repressed or
induced expression states than in a gene where the state changes continuously [108] [109].
Bistable behavior of gene switches have been reported to play pivotal roles in many
important aspects of cell physiology, including cell fate decisions, cell cycle control, and
cellular responses to environmental stimulation [110] [106]. E.coli lac operon is a famous gene
switch that uses a hysteretic feedback to decide between glucose and lactose utilization [111].
Many bistable systems have been discovered in bacteria, including the genetic transformation
in Bacilius subtillis and sporulation in many bacterial species [112]. In mammalian systems,
gene switches and bistability have been postulated as the underlying mechanism for cellular
diﬀerentiation, but rarely has this been conﬁrmed experimentally, until recently with the
work on neutrophil diﬀerentiation [113]. Another interesting observation is that cells have
“memory”, and hysteresis has been shown to govern short-term memory in lymphoid cells,
preserving information of past encounters with antigen [24]. Thus, the discovery of gene
switches in cellular responses has become a milestone in molecular biology and prompt
strong interest in understanding the function and design of gene networks [111].
Switches play a central role in cell decision, and the ability to predict whether switches
can occur without a priori detail information of the network would be signiﬁcant. For
instance, the ability to identify which genes are turned on or oﬀ in cancer versus normal cells
would have a tremendous impact on identifying the most pertinent molecular signatures
81

or targets for drug therapy. Therefore a major challenge confronting the ﬁeld, which we
address in this study, is how to eﬀectively identify gene switches or bistable states by mining
high-throughput data. An approach that could predict switches based on high-throughput
data not only provides candidates of biomarkers but also associates the candidates with a
potential biological function/mechanism for the phenotype.

3.2

How to identify gene switches

Despite the importance of gene switches, identifying multiple steady-states, and in particular
switches, has been diﬃcult. Our understanding of gene switches has been mostly based on
simulations of generic feedback circuits and well-characterized biological modules [114–117].
Theoretical studies of feedback circuits have elucidated general principles of network dynamics, but they usually lack solid evidence to associate these principles with real physiological
processes in cells. Few studies have succeeded in demonstrating functional roles of actual
switches in biological systems by coupling detailed kinetic modeling with rigorous experimentation [118] [24]. This is because well-characterized models with equations and kinetic
parameters are diﬃcult to obtain for real, complex biological systems, in part because current
techniques are not able to quantitatively measure reaction constants at the single-cell resolution for all the network components. Alternatively, researchers in synthetic biology have
designed artiﬁcial gene networks with speciﬁc functions and implemented the interactions
by manipulating or bringing together exogenous genetic components [119] [120] [121]. Thus
current methods of experimentally studying switches have been limited to well-characterized
or synthetic small modules.
Previous computational approaches addressed this question by analyzing the network
82

topology.

These studies assume that bistability requires particular feedback structures

[107, 122], and discovered dynamic features by searching for these structures (e.g. positive feedbacks) in protein-protein interaction and protein-DNA interaction networks [123].
However, these feedback structures do not ensure switch-like behavior. From modeling and
simulations of genetic circuits, positive feedback (or even feedback itself) has been shown to
be neither necessary [124] [110] nor suﬃcient [125] to ensure switch-like behavior. Furthermore, it is less likely that one can uncover a dynamic property from static networks.
Alternatively, we theorize that the dynamic “behavior” of a switch could be identiﬁed
by analyzing the gene expression proﬁles from a wide range of conditions. We propose a
top-down mining approach to identify gene switches from microarray gene expression data.
Taking advantage of the tremendous amount of expression data, our approach aims to identify bimodality, which we hypothesize is an essential characteristic of a gene switch.

3.2.1

Simulation of kinetic models of gene switch

A gene switch has two steady states, which will produce a bimodal distribution in its expression proﬁle when sampled across diﬀerent conditions. Figure 3.1 on page 80 show the
gene network topology of two typical regulatory circuits that exhibit bistable behavior. The
ordinary diﬀerential equations (ODEs) for the synthetic systems are as follows:

• Positive self-feedback:
A2
1
A
dA
= p[
][
][1 −
] − deg(A)
2 1 + R2
dt
2.5
1+A
R(i) = 10[1 −
83

Ki
]
1 + Ki

(3.1)

(3.2)

A represents the expression level of Gene A, pA2 /[1 + A2 ] describes the self-binding
and activation of the transcription, and 1/1 + R2 is the eﬀect of the repressor, in which
R depends on the stimulation —the concentration of the inhibitor i. deg(A) is a linear
function for the degradation of A. The model is constructed by [126] for a mammalian
cell system.

• Double-negative feedback:
dA
a
=
− deg(A)
dt
1 + B2

(3.3)

a
dB
=
− deg(B)
dt
1 + A2

(3.4)

A,B represents the expression level of Gene A and B, respectively. a is a parameter
about the strength of the cross-repression of the two genes. deg is a linear function for
degradation. The model is constructed by [119] in E.Coli

Figure 3.1 on page 80 part A shows the positive self-feedback transcriptional system under
the control of a transcriptional repressor. Part D in the ﬁgure shows the double-negative
feedback system, also known as a toggle switch, which produces mutually exclusive activation
of two genes. Both circuits have been synthesized and implemented in cell systems [126] [119]
to conﬁrm their switching behavior. Simulations based on the kinetic models of these systems
[126] [119] conﬁrm the on/oﬀ and toggle-like switching behavior in their response curve and
state space. By simulating random samples from a wide range of conditions with diﬀerent
initial states, this unique feature of two distinct steady-states of gene switches results in
a gene expression histogram proﬁle containing two modes Figure 3.1 on page 80. This
bimodality is observed despite the noise (20% Gaussian noise) imposed on the parameters.
84

3.2.2

Data mining to identify gene switches

A challenge in experimentally identifying gene switches is their population eﬀect. In single
cell experiments, if obtainable, the response curves would represent individual cell measurements, and a gene that switches will exhibit a steep jump between the steady states.
However many biological measurements (RT-PCR, Western Blotting), including microarray
analysis, provide the population-average. In fact, even with single cell measurements, individual clones can contribute cell-cell variances, with diﬀerences in the protein expression
levels across diﬀerent cells. In Figure 3.2 on page 86 the cell-cell variance is modeled by
a Gaussian distribution in the protein expression and diﬀerent cells in a clone would then
respond diﬀerently to stimulation, leading to a continuous change in the averaged response
curve. This explains, in part, the diﬃculty in identifying switches through standard experiments.

85

Figure 3.2: Schematic representation of a phase plane of a gene switch. A: Single cell dose response experiments should be able
to measure the response curve and uncover the switch-like behavior. B: Experimental measurements of the average expression
level of a cell population will mask the switching behavior. A Gaussian distribution is plotted to represent the cell-cell variances
in the population. Diﬀerent cells, according to their initial gene expression level, could have diﬀerent response curve (blue
trajectories). The averaging of the variation in the responses results in a seemingly graded response. C: Experiments across
a range of diﬀerent conditions allowing for the sampling of a large state space recover the switch-like behavior. Each sample
could fall in the neighborhood of a possible steady state (points on the blue trajectory). The steady states (on/oﬀ states of
the gene switch) are the dense regions of the possible response curves in the state space, i.e. the samples occurs at higher
frequencies in these states, which results in a bimodal distribution in the observed proﬁles.

86

We proposed that an unbiased sampling across a range of diﬀerent conditions could
address this issue and help reveal the dynamic feature of gene switches. In Figure 3.2 on
page 86 we show analytically, potential response-curves (the blue trajectories) in the whole
state space of a gene switch. Each sample within the system would asymptotically approach
one of the two possible steady states (dark blue region). Since the on/oﬀ states are the
steady states which most cells will concentrated in upon stimulation, the samples will have
higher probability of staying in these states, leading to a bimodal distribution in the observed
expression proﬁles.
To identify bimodality researchers have used the DIP statistics [127] or Gaussian mixturemodel to [128](http://www.astro.lsa.umich.edu/~ognedin/gmm/) identify bimodal
distributions. Since what we need is not only a quantity for bimodality, but also an explicit
separation of the conditions into two categories corresponding to two expression levels, we
choose the Gaussian mixture model. The Expectation Maximization algorithm is implemented to separate the data distribution into two Gaussian models. The criterion used to
assess the ﬁtting is the Akaike information criterion from information theory.

AIC = 2k − 2 log(L)

(3.5)

Where k represents the number of parameters, and L is the goodness of ﬁt, deﬁned by the
likelihood of observing the data given the model (one or two Gaussian in this case). We
use the diﬀerences between AICs of the two models, the “∆AIC”, to compares the ﬁt with
a unimodal vs. a bimodal distribution. ∆AIC provides a measure for an “unconditional”
bimodality in which the proﬁles show bimodal but the condition for the “switch” is yet to
be investigated. When a particular condition is speciﬁed (e.g. the breast cancer phenotype),
87

the separation D can be used to identify if there is a distinct state for the condition, or, a
“conditional speciﬁc” bimodality:

D=

µ1 − µ2
2
2
(σ1 + σ1 )/2

(3.6)

Where (µ1 , σ1 ) are the mean and deviation of samples in the speciﬁed condition, while
(µ2 , σ2 ) are the mean and deviation for all other samples. We perform theoretical analysis
and provide proof-of-concept applications on both synthetic and yeast microarray datasets.

3.2.3

A Proof-of-Concept application of the E2F-Rb network

The E2F-Rb network is a well-characterized system in mammalian cell fate determination,
whereby the Retinoblastoma (Rb) protein regulates the transcriptional factor, E2F, to control the restriction point for the G1-S transition in cell cycle [129]. A simpliﬁed kinetic model
was constructed for the E2F-Rb system [118], in which two genes Myc and CycD (Cyclin D:
Cdk4,6) are activated by suﬃcient growth signal (serum) to induce E2F activation, which
then directs the synthesis of downstream factors, such as CycE for DNA replication. The
E2F self-activation and CycE-mediated E2F activation constitute two positive feedbacks in
the system. It then was experimentally [118] conﬁrmed that the level of E2F switches ON or
OFF for cell-proliferation and cell-cycle arrest, respectively, suggesting E2F acts as a gene
switch, while CycD and Myc do not show such switch-like behaviors.
We perform simulations based on the kinetic model [118] to generate a synthetic geneexpression dataset. The stimulation-response curve of a single-cell is shown in Figure 3.3
on page 90, and conﬁrms a graded response for Myc/CycD and bistable dynamics for E2F.
The downstream factor CycE, controlled by E2F, also shows a switch-like response. Intro88

ducing a distribution in the expression level to represent cell-cell variation within a clone,
and averaging multiple simulations shows that population averaging for any one condition
disguises the switch-like behavior and is indistinguishable from a graded responses, which is
consistent with previous RT-PCR experiments [118].

89

Figure 3.3: Proof-of-concept example: simulation of the E2F-RB network. A: Simulation of the kinetic model based on
a ﬁxed initial condition represents measurement at the single cell-resolution of the system. The response curves of serum
stimulation on the diﬀerent genes in the model are plotted. B: Assign a Gaussian distribution with small variances on the
initial gene expression level of the untreated cells to represent the cell-cell variation in a clone. Simulation-results are computed
by averaging the responses of 100 cells in a clone.

90

We then simulate 100 cell clones, each clone with a random initial condition, and measure
the steady state expression level of the network components for each clone. In this way we
synthetically generate 100 “microarrays” for 100 diﬀerent conditions. It is clear genes that
have two steady states, i.e. E2F and CycE (eﬀector of the gene switch), exhibit two distinct
modes in their expression proﬁles (Figure 3.4 on page 92). Each gene’s ∆AIC value is
calculated from the synthetic expression data and the switches exhibit higher ∆AIC values
than the non-switches. Thus ∆AIC can be used to rank and help uncover genes that are
bistable.

3.2.4

A Proof-of-Concept application to Yeast microarray data

We apply our mining approach to an integrated yeast microarray dataset containing 500
yeast experiments and calculate the ∆AIC value for each gene in the dataset. With such
a large set of conditions, the ∆AIC value is fairly robus. A histogram of the ∆AIC value
among the yeast genes is shown in Figure 3.5 on page 93. Most genes have low ∆AIC, and
their expression appear unimodal. However, a few genes have high ∆AIC values and clearly
show bimodality.
The genes with high ∆AIC values have distinct states under diﬀerent conditions. By
collating and comparing those conditions under the two distinct expression states, one can
potentially identify the phenotypes in which the genes are functioning. Given that a phenotypic ontology is not available, it is diﬃcult to compare conditions. Nevertheless, one
approach is to categorize conditions by the type of perturbations, e.g. heat shock (with
diﬀerent temperature and length of time), hypo-osmotic shock (diﬀerent time points), and
extra carbon sources (diﬀerent carbon source), etc., and check if one of the two states of a
91

Figure 3.4: Mining approach to identify switches in the E2F-RB network. Sampling of 100
clones under diﬀerent, randomly generated initial conditions. The simulation results are
shown as histograms of the expression level of the diﬀerent genes, together with their ∆AIC
values.

92

Figure 3.5: A proof-of-concept application on the yeast dataset. A: A histogram representing
the distribution of AIC value among the yeast genes. Most genes have small ∆AIC and
exhibit an uni-modal expression proﬁle. A relatively small number of genes have high ∆AIC
and show bimodality in their expression. B: A negative correlation between bimodality
(described by ∆AIC ) and expression noise (as described by the coeﬃcient of expression
variation) in the yeast genes. C: The four GAL genes show bimodality and one of their
modes are enriched within the same condition/category.

93

putative switch is enriched within a category of conditions. Using this approach, we correctly
uncovered genes that have switch-like behavior, namely GAL1, GAL2, GAL7 and GAL10
(Figure 3.5 on page 93). These genes all have ∆AIC values that rank among the top 5% and
show bimodality, with one of their two modes containing conditions from the same category,
i.e. “adding extra carbon sources”. The bimodal proﬁles show that by adding 2% (weight
to volume) extra carbon sources into the media, with the exception of galactose as the extra carbon source, the expression of these four genes shut down. It has been reported that
these four genes function in the same pathway for galactose utilization, i.e., the well-known
“GAL genetic switch” (review: [79]). The addition of alternative carbon sources results in
“glucose-repression” of the GAL pathway. During this process, the high level of glucose or
other carbon sources (other than galactose) induces the formation of the repressor complex
(protein Mig1p and Cyc8-Tup1) and upon its binding to speciﬁc upstream repressing sequences (URSG) on the GAL promoters, it prevents the activation of these four GAL genes
by the transcription factor GAL4, thereby turning oﬀ the galatose utilization pathway.

Current knowledge on the existence and functional machinery of other gene switches is
limited. However we show next that by integrating information of the regulatory network
and proteomic data, the genes with high ∆AIC obtained from our analysis could be possible
switches or at least important genes with respect to the phenotypes. We calculate the ∆AIC
values of transcription factors in the yeast transcriptional regulatory network (based on
binding motif data), and observe that the leaf-nodes — genes that are only regulated by one
factor and are not regulating any other transcription factors —- tend to have signiﬁcantly
lower ∆AIC value (average ∆AIC = 135 ± 9 compared with overall average ∆AIC =
223 ± 58 for transcription factors, p < 0.01). These genes which have few regulators and
94

do not transcriptionally control transcription factors are less likely to have feedbacks at the
transcriptional level, and therefore switching dynamics. Thus the dynamic property we infer
of the molecular components within a network is contingent on the network organization.
Next, we analyze single-cell proteomic data that includes noise in the protein expression
measurements. We ﬁnd a weak but signiﬁcant negative correlation between the ∆AIC value
of a gene and its coeﬃcient of variation, which captures the noise of its protein expression
(Figure 3.5 on page 93). This suggests that genes with higher ∆AIC value, showing bimodality, tend to express relatively lower levels of noise. This observation that genes with
lower expression noise under normal conditions are more tightly controlled highlights their
importance in the network, and is consistent with previous suggestions that gene switches
have noise-ﬁltering capacity [108] [109].

3.3

Identify characteristic signatures of human breast
cancer

Since the state of gene switches in the genetic network governs the phenotype [130], we
postulate that recognizing speciﬁc gene switches will enable one to identify biomarkers or
molecular signatures that would be better drug targets for treating a disease. We demonstrate the utility of our mining approach in human breast cancer by analyzing a paired
breast cancer/normal tissue expression dataset against the integrated human gene expression dataset.
We analyze the paired breast cancer/normal tissue expression dataset (GSE15852) (Pau
Ni et al, 2010) against the integrated human gene expression dataset [131] to identify char95

acteristic signatures of human breast cancer. First, we calculate the separation value D [132]
for the top 10% ranked genes by ∆AIC to examine whether the expressions of these genes
are bimodal when comparing the breast cancer (1119) samples against all other phenotypes
(4,777 samples for 300 conditions). Biologically this indicates whether a gene potentially
shows bistability and could be involved in the “switching” or transition to a breast cancer
phenotype. D > 2 has been suggested to indicate whether the separation into two Gaussian
distributions or modes is distinctive [132]. Considering the large amount of noise in the
microarray data, we accept separation values of greater than or close to 2 (¿1.8) to indicate
bimodality, which results in 17 genes showing distinct bimodality in breast cancer.
Next, an independent microarray dataset (GSE15852) with 43 paired breast cancer samples of diverse histopathological characteristics is analyzed to test if the 17 genes are expressed
diﬀerently and show distinct bimodality in breast tumor as compared to normal breast tissues. Comparing such “local” expression proﬁle (paired normal and cancer conditions) with
the “global” expression proﬁle (across various conditions) identiﬁed that of these 17 genes,
12 genes (ESR1, SPDEF, IRX5, ERBB3, ERBB2,CRABP2,RAB25, FXYD3, TACSTD2,
DSP, AGR2, CDH1) show bimodality in both datasets (Figure 3.6 on page 97 shows the
ﬂow chart of the procedure). One type of genes is bimodal within the breast cancer samples,
herein denoted Type 1, with estrogen receptor-alpha (ESR1) having the highest separation.
The other type of gene switch shows predominantly one modality within the breast cancer
samples, herein denoted as Type 2, and is where we ﬁnd the TACSTD2 (a.k.a. Trop2) gene
having the highest separation value within this group.
Many of the genes that show Type 1 bimodal behavior also exhibit the biomdality within
the breast cancer samples (Figure 3.6 on page 97). Known therapeutic targets for breast can96

Figure 3.6: Identiﬁcation of potential gene switches for breast cancer. We analyzed the
integrated dataset to search for bimodality in the gene expression proﬁles. Genes are ranked
based on their ∆AIC calculations. The top 10% are selected to compute the separation
D with respect to breast cancer. An independent dataset is then used to further examine
candidate genes.

97

cer, such as ESR1, ERBB2 (HER2) and ERBB3 (HER3), are identiﬁed as showing bimodality
in their gene expression level in breast cancer. Their bimodality in the cancer samples represents well-known subtypes in breast cancer, i.e. ER+/ER- and HER2+/HER2- subtypes.
ESR1 (estrogen receptor alpha) is a well-known transcription factor involved in the development and progression of breast cancer. Previous immunohistochemical analysis showed a
bimodal distribution in estrogen receptors (ER) expression —- the majority of breast cancer
patients express either ER-negative (low expression) or unambiguously ER-positive (high
expression), of which (≈ 80%) are ER+, while moderate ER immunostaining is rarely observed [133]. This supports our discovery of bimodality of the ESR1 gene expression within
the breast cancer samples. It has been a decade since researchers attempted to explore the
mechanism underlying such an all-or-none expression pattern of estrogen receptors. It was
previously reported that the ESR promoter activity is increased by co-transfection of the
wild-type ESR expression vector, suggesting a positive contribution of ESR to its own expression [99]. A recent study uncovered that miR-375 is involved in a forward feedback loop
that regulates ESR1 expression, whereby ESR1 enhances miR-375 expression and miR-375
targets and reduces the expression level of RASD1 (ras dexamethasone-induced 1) gene,
which is a transcriptional inhibitor of ESR1 [134]. These studies provide evidence of a potential positive-feedback (with a double-negative circuit) induced bistability of the ESR1
expression, where the topology is similar to a toggle-switch design. ERBB2 and ERBB3
interact with each other and are known to be transcriptionally regulated by ESR1 [135]. A
recent study [136] identiﬁed a positive feedback of ERBB2 through the transcription factor
c-Jun, which could provide a potential explanation for the bimodality observed for ERBB2.

The molecular characterization of the Type 1 genes (e.g. ESR, HER2) suggests the
98

development of therapies for ER+/PR+ and HER2+ would be eﬀective for these breast
cancer subtypes, however ≈ 15 − 20% of the breast cancer tissues expressing low levels
of these biomarkers (i.e. triple negative subtype) have poor prognosis and few treatment
options. Moreover, patients that are responsive to commonly used drugs, such as tamoxifen
(estrogen antagonist) and trastuzumab (anti-HER2 agent), eventually acquire resistance
to the drugs. ≈ 30% of tamoxifen-responsive tumors become resistant [137] [138], and the
resistance invariably ensues at some point with trastuzumab. Given the increase in resistance
to drugs that target the ESR receptor alternative therapeutic targets are needed.
The second type of potential gene switch, herein denoted as Type 2, shows unimodal
behavior in the breast cancer tissue (Figure 3.6 on page 97) and is diﬀerentially expressed
in almost all the paired breast tumor/normal tissues as compared with non-breast cancer
samples. The top gene showing this type of switching behavior is Trop2. Type 2 gene
switches uncovered by our analysis show a distinct state in the breast cancer samples, and
could be a potential biomarker or drug target that does not rely on the ESR receptor. We
characterized the Trop2 gene, and found it to be distinctively expressed at higher levels
in almost all of the breast cancer samples, ER+/-, HER2+/- subtypes. We conﬁrm that
the expression of Trop2 gene is high in breast cancer cell lines MCF7 and MDA-MB-231 as
compared with non-cancer cells (Figure 3.7 on page 100).
Our network reconstruction in Chapter 2 identiﬁed CREB and NFkB as potential transcription factors that regulate the expression of Trop2. We also observe a signiﬁcant increase
in the correlation between the expression level of CREB and Trop2 in the paired breast cancer dataset. The correlation coeﬃcients in the normal breast tissue are 0.15, 0.06, 0.03 for
the three CREB probes in the Aﬀymetrix array, and the correlation coeﬃcients in the breast
99

Figure 3.7: The expression proﬁles of Trop2 in breast cancer. A: The scatter plot shows the
gene expression level of Trop2. x-axis indicating the expression level, the values in y-axis
are randomly generated to reduce the overlap between samples. Subtypes of breast cancers
are determined by their expression levels of ESR1, PR, and Her2. B: The Trop2 mRNA
levels in human mammary epithelial cell line, MCF10A, in breast cancer cell lines, MCF7
and MDA-MB-231, and in primary rat astrocytes were measured by quantitative real-time
PCR (n = 3). **: p < 0.01, ***: p < 0.001.

100

tumor tissues are 0.46, 0.21, 0.31, respectively.
To assess the possible switching behavior of Trop2, we performed ﬂow cytometry to probe
the Trop2 protein level at single-cell resolution. For both MCF10A and MCF7 breast cell
lines the Trop2 protein level shows a bimodal distribution in their cell population (Figure 3.8 on page 102), which is a property of a bistable system. We stimulated the cells
with FI (Forskolin and IBMX) to induce cAMP, which is an activator of CREB [139], and
measured the Trop2 levels. Both Trop2 mRNA and protein levels increased signiﬁcantly
upon stimulation, thereby supporting a possible transcriptional regulation by CREB. Upon
activation of Trop2 by FI, a decrease in one of the modes with a concomitant increase in the
other mode, instead of a gradual increase in the protein level, (Figure 3.8 on page 102) is
indicative of a switching behavior. The activation essentially increases the number of cells
with Trop2 levels at the ON state and decreases the cells with Trop2 at the OFF level. In
contrast, the expression of the Trop2 protein in primary rat astrocytes shows a unimodal
expression under the same test conditions. Furthermore stimulation of astrocytes by FI leads
to a non-signiﬁcant change in the protein level and with the cells predominantly remaining
in the OFF steady-states.

101

Figure 3.8: A switch-like behavior in Trop2 expression. Flow cytometry analysis of Trop2 expression in MCF10A, MCF7 and
primary rat astrocytes (Black lines). The cells were treated with 10µM forskolin and 100µM IBMX (FI) for 1 day (Red lines)
and the two modes of Trop2 in MCF10A, MCF7, and primary astrocyte cell population are pointed out by the blue arrows.
Note the primary astrocytes have only one mode.

102

Our mining approach uncovered a unique expression pattern of Trop2 in breast cancer,
and experiments conﬁrmed Trop2 show bimodal behavior in breast cancer cell lines. Trop2
(Trop2) is a cell surface glycoprotein, ﬁrst discovered to be highly expressed in trophoblast
cells that become invasive and metastasized to form the outer layer of blastocyst in embryo
development [140]. Recent studies, along with our analysis of breast cancer samples, found
Trop2 to be highly expressed in a variety of epithelial cancers and show low to no expression in
normal somatic cells. High expression of Trop2 in squamous-cell carcinoma [141], pancreatic
[142], colorectal [143] and gastric [144] cancers have been associated with poor prognosis and
higher incidence of metastasis and death. Trop2 was identiﬁed as an oncogene in colorectal
cancer cells [145]. Although not essential for cell proliferation under normal condition, ectopic
expression of Trop2 enhances anchorage-independent cell growth, promotes tumorigenesis
and metastasis in colon cancer cells. Knock-down or inhibition of the protein reduces the
invasiveness of aggressive colon cancer cells [145]. In our analysis we also found Trop2 to be
highly expressed in many colon cancer samples and shows bimodality, however the percentage
of colon cancer samples with Trop2 at the ON state (∼ 60%) are less than in breast cancer
(∼ 99%), suggesting Trop2 could be a better target for breast cancer.

In previous microarray analysis of breast tumors, [146] Huang et al. studied “aggregate
patterns of gene expression” with respect to lymph node status and recurrence, and identiﬁed “metagenes” that could predict the outcomes of the patients. Trop2 is found among
the “metagenes” in their list; however the list consists of more than a hundred genes with
potential predictive value. In contrast, we ﬁnd the Trop2 gene to be the top gene in the
list that shows the Type 2 behavior. Interestingly, the distinctive HIGH/LOW expression
level of the Trop2 gene has been implicated as a marker for stem cell characteristics in
103

prostate basal cells [147] and hepatic oval cells [148]. The prostate basal cells and hepatic
oval cells, considered progenitor cells, show HIGH expression of the Trop2 gene and maintain self-renewal capability [147] [148], and thereby implicating a potential role of Trop2 in
cancer initiating stem cells. Although Trop2 has been reported to be associated with cancer,
the regulatory mechanism of Trop2 remains unclear. Combining computational prediction
(network reconstruction in Chapter 2) and experimental analysis, we found that CREB and
NFkB could regulate Trop2 in breast cancer cells.
Since the completion of our study of network reconstruction and gene switches in breast
cancer, two research papers have been published in ONCOGENE [149, 150] that provide
experimental evidence in human cancer tissues to support Trop2 as an oncogene. These
independent experimental studies conﬁrm the eﬀectiveness of our approach in predicting
speciﬁc genes for a phenotype and potential targets for disease. Our discovery of the transcriptional regulators of Trop2 complements the network biology study of Trop2 in [150].
We identiﬁed novel regulators of Trop2 that have not been discovered as yet based on previous approaches of network reconstruction. Our discovery of NFkB as a regulator of Trop2,
together with the evidences provided in [150] that Trop2 regulates NFkB expression, suggest a potential positive feedback structure that supports the “switching” behavior of Trop2
expression.

In this chapter we apply systems modeling to deﬁne a speciﬁc pattern that is emerged
from the complex interactions in gene network—-the gene switches, and explore the pattern
by mining gene expression data, to be able to provide more speciﬁc predictions on gene
function. Our mining approach demonstrates that in the absence of a priori knowledge of
the speciﬁc network architecture, one can capitalize on genome-wide expression proﬁling to
104

capture dynamic properties of a complex network. To the best of our knowledge, this is
the ﬁrst attempt in applying mining approaches to explore gene switches on a genome-scale
and this is a ﬁrst case a single cell level bimodality and bistability can be predicted from
microarray data.

Researchers recognized that “genetic switches” behave in a discrete manner, but this
feature is usually lost in biochemical analysis of large cell populations due to the diﬃculty in
distinguishing between changes in the proportion of cells and their expression level in the two
states [113]. For example, it is hard to determine from population measurements whether
the expression level of a gene increases gradually by 70%, or whether 70% of the cells are
“switched” ON. In this study, we provide an alternative approach to identifying possible
gene switches by capitalizing upon the large amount of available microarray data. The large
sample set enables the characterization of the state space by uncovering the presence of the
two attractor-states where the majority of the samples should fall. Thus, if an ON/OFF
switch behavior exists in a system the state space will show bimodality or bistability, which
are relatively stable with respect to perturbations [151]. It has been suggested that bistability
or multiple steady states [130] exists in large gene networks [152] [153], and these attractorstates represent diﬀerent phenotypes [130]. Thus, by sampling across diﬀerent conditions,
which are less aﬀected by population averaging, one can reveal this dynamic feature of
regulatory networks. In this sense, our study provides a diﬀerent perspective that takes
advantage of the large integrated set of expression data, and suggests a mechanism-based
framework to perform the meta-analysis. This approach of integrating microarray data from
a diverse set of conditions provides a common “context” of gene behaviors, whereby one can
obtain a better understanding of the speciﬁc function of a gene for a particular condition
105

under investigation.
By applying the computational analysis on human microarray data, we uncovered a
unique expression pattern of Trop2 in breast cancer, and experiments conﬁrmed Trop2 show
bimodal behavior in breast cancer cell lines, further, our perturbation study suggest a potential bistable mechanism is involved. Therefore, not only does our computational approach predict biomarkers/targets, but also it can suggest the mechanism how and why the
biomarker/target could be functionally important.

106

Chapter 4
Identiﬁcation of genes (microRNAs) that
determine a phenotype
Network reconstruction based on gene expression and interaction data in chapter 2 identiﬁes
genes and regulation that are speciﬁcally functioning in a phenotype and the mining approach in chapter 3 identiﬁes potential gene switches. These learning approaches treat the
genes as “features” and the phenotype as “labels” (a particular annotation on samples) to
ﬁnd features that are able to diﬀerentiate between diﬀerent labels/annotations, by comparing
the transcriptomes that represent diﬀerent phenotypes/labels. Therefore, these approaches
essentially identify genes and interactions that distinguish a phenotype. However, one of
the key goals in systems biology is to understand how complex molecular and cellular outcomes arise from the dynamic interactions at the detailed mechanistic level, wherein the
“phenotypes” are not independent labels/annotations for samples but involve many diﬀerent aspects to deﬁne a biological state, which is reﬂected by the complex molecular and
cellular outcomes. For example, a “cancer” phenotype represents very special genetic states,
gene regulatory mechanisms, cellular behaviors, as well as speciﬁc metabolic states. In this
sense, the learning approaches are limited by treating phenotypes as “labels”/annotations.
107

To predict genes that determine a phenotype, it is necessary to model the phenotype as
a complex biological state and perform simulation studies to perturb the state, that is, as
introduced in Chapter 1, to address the following two questions:
• How does one model a phenotype?
• How does one deﬁne the perturbations that determine or change a phenotype?
In this chapter, we study the cancer phenotype by modeling the human metabolic network, thereby “cancer” is no longer merely a label/annotation but modeled as an entire
metabolic state. We developed a novel approach to simulate context dependent metabolic
states upon perturbation of gene expression, which is then applied to predict microRNAs
that can inhibit cancer growth.
We model the phenotype in terms of metabolic states because:
1. Metabolism is crucial to cell growth and proliferation. Deﬁciencies or alterations
in metabolic functions are known to be involved in many human diseases. For example, the pathogenesis of diabetes results from malfunction in the regulation of
metabolic pathways, leading to alterations in insulin signaling, oxidative metabolism,
and lipid/fatty acid metabolism [154]. Dysregulation of the metabolic system is also
implicated in carcinogenesis [155]. Most cancer cells have higher glycolytic rates, the
so-called “Warburg eﬀect” [156, 157] [158]. A recent study of breast cancer further
uncovered alterations in glucose metabolism mediated by phosphoglycerate dehydrogenase (PHGDH) enzyme [159], whose expression was found to be associated with poor
prognosis [160]. Since metabolism plays an essential role in cell growth and proliferation, genes regulating metabolism have been used as drug targets in the treatment of
108

cancer [161] [162] and other diseases involving metabolic disorders [163] [164], including diabetes, atherosclerosis and fatty liver disease. Thus, understanding the human
metabolic system is important and provides a complementary approach to study and
identify potential treatments for complex human diseases.

2. It is a complex system that is well-studied in characterizing many of its components/interactions and much of the detailed mechanistic information is available for
systems modeling. A (steady) state of the system can be deﬁned by metabolic ﬂuxes
distributed in the network and reliable techniques are available (experimental and computational metabolic ﬂux analysis) to measure the state. Current reconstructions of
the global human metabolic network provide a computational platform to integrate
knowledge gained over the past 50 years of research on human metabolism [165], thus
enable a systems modeling approach to study in silico the global eﬀect of perturbations
on the network to generate hypotheses and help understand the mechanisms underlying
the genotype-phenotype relationship.

4.1

Systems modeling of a metabolic state

To model a phenotype as a metabolic state, we need to characterize the components and
interactions in the system by the reconstruction of global human metabolic network, and we
need a simulation approach that can be used to deﬁne a metabolic state.
109

4.1.1

Reconstruction of global human metabolic network

The global human metabolic network has been manually curated based on an extensive
collection and evaluation of the genomic and bibliomic data. The ﬁrst two installation of
the network were released in 2007: the Edinburgh Human Metabolic Network [166] and the
human Recon 1 [165], each contains a list of human reactions, metabolites and gene-proteinreaction relationships. The Gene-Protein-reaction (GPR) represents functional relationships
between genes/proteins (e.g. enzymes) and the corresponding reactions they catalyze or
control. For example, in human Recon 1, the genes are ﬁrst mapped to their transcripts,
accounting for alternative splicing. Then, based on Boolean rules of OR and AND, the
transcripts are mapped to proteins. The proteins are then mapped to reactions by Boolean
rules based on current knowledge of their eﬀects on the reactions.
The two networks (Edinburgh Human Metabolic Network and the human Recon 1), developed independently by diﬀerent research groups, consist of many diﬀerent genes and reactions. The Edinburgh Human Metabolic Network contains more genes and metabolites, but
was not compartmented in its initial release. Compartmentalization requires assignments
of metabolic reactions into diﬀerent cellular organelles (cytoplasm, nucleus, endoplasmic
reticulum, mitochondria, lysosome, peroxisome, and Golgi apparatus) and accounts for the
transportation and exchange of metabolites between organelles. Human Recon 1 is a compartmented network which could be used in reconstructing predictive models for systems
biology studies, therefore, most of the recent applications have been based on Recon 1. An
overview of the publications thus far that used Recon 1 is reviewed by Bordbar and Palsson [167]. Notably, in 2010, the compartmentalization of the Edinburgh Human Metabolic
Network was completed and its current release is a compartmented, and more complete
110

human metabolic network [168].
The reconstruction of the global human metabolic network uses a bottom-up approach.
Researchers begin by compiling reactions of cellular metabolism to build a network through
the collection of gene annotations, enzymes and pathway information from genome (e.g.
NCBI, Ensembl) and pathway (e.g. KEGG, ExPASy) databases. Researchers then reﬁne
the network by manually collating literature evidences, including journal articles, reviews
and textbooks on metabolic functions, biomass composition, growth conditions and genereaction associations. The constructed draft network is converted to biochemical models
to evaluate the basic functionality, and simulations are performed to check for consistency
with the current knowledge. The whole process runs iteratively to incorporate as much
information and minimize gaps and inconsistencies. The protocol for the reconstruction
process is available in [169].
The major diﬀerence in a metabolic network as compared with other biological network,
e.g. Protein-Protein Interaction, Protein-DNA network, is that the metabolic network represents a biochemical system that is charge-balanced, mass-balanced and compartmentalized,
which not only provides information about whether there is an interaction, but also how it
happens and what it is produced as a biochemical reaction, thus can be directly converted
into mathematical equations based on the biochemical reactions for model predictions.

4.1.2

Modeling and simulation based on human metabolic network

A reconstructed human metabolic network can be represented by a system of stoichiometric reactions. This system of reactions can be modeled as ordinary diﬀerential equations,
111

however the reaction rate constants and metabolite concentrations are typically diﬃcult to
obtain, thereby limiting their applicability to small well-studied networks. However, since
the stoichiometry of metabolic reactions are not organism or context-dependent but is ﬁxed
by mass balance, one could apply Constraint Based Modeling (e.g. Flux Balance Analysis, FBA [170]) to simulate the state of the system without detailed kinetic data, assuming
that the ﬂux distributions based on the stoichiometric mass balance are at steady state or
pseudo-steady state.
Mathematical representation of reaction network and Constraint Based Modeling:
• Reactions: S (Stoichiometric Matrix), with m compounds (rows) and n reactions
(columns). The stoichiometric coeﬃcients are negative for the substrates of each reaction, and positive for the products.
• Flows: v (n by 1 vector) on all reactions
• Concentrations: X (m by 1 vector) of all compounds
Thus we have
dX/dt = Sv

(4.1)

Assuming pseudo-steady state, the time derivative is zero, therefore: Sv = 0. So the ﬂux
distribution v that satisﬁes this equation is in the null space of S.
In the human metabolic network, n > m results in an under-determined system that does
not have a unique solution. Adding constraints permit a “feasible” solution to the system of
equations, for example, a “ﬂux capacity” constraint determines the upper and lower bounds
of the ﬂux through a reaction. Imposing mass balance and capacity constraints will deﬁne
the space of feasible steady-state ﬂux distributions of the network. Geometrically the space
112

looks like a “ﬂux cone” in the null space of S. A visualization of the “ﬂux cone” is shown
in [171] to demonstrate the way a solution space could be narrowed by the steady-state and
capacity constraints. Further, in FBA we deﬁne an objective function Z, which is a linear
function of ﬂuxes. An objective function could be

Z = cT V

(4.2)

in which c is a column vector to assign weights to each reaction, cT is the transpose of
the vector c, and V is the ﬂux vector through all reactions. Optimization of the objective
function Z identiﬁes a unique (or multiple) set of ﬂux conﬁgurations within the ﬂux cone.
The constrained linear optimization problem can be solved by linear programming [172] [171].

The form of the objective functions, constraints and the optimization problems can vary
depending on the biological applications, which are diﬀerent variants of the Constraint Based
Modeling. For example, the Flux Sensitivity Analysis (FSA) estimates the objective ﬂux
change in response to perturbations in some reactions of interest [173]. The Flux Variability
Analysis (FVA) explores the solution space to exam the maximum/minimum ﬂuxes for each
reaction. Further, current approaches to reconstruct context dependent metabolic networks
are essentially diﬀerent variants of the Constraint Based Modeling. A detailed review of the
algorithms in the Constraint Based Modeling is provided in [174].
113

4.2

Prediction of the metabolic state-change upon perturbations

Metabolic genes could be diﬀerentially expressed under diﬀerent conditions in diﬀerent cell
types, resulting in diﬀerent metabolic states of the cell. Changes in gene expression drive
changes in metabolic ﬂuxes which manifest cellular phenotypes. This is the central idea to
associate gene expression with metabolic network to determine a metabolic state for a phenotype (i.e. reconstruct context-dependent metabolic network) and study how perturbations
in gene expression can change the phenotype.
A summary of the pipeline for the systems modeling and its applications based on human
metabolic network is shown in Figure 4.1 on page 115. Since the global human metabolic networks (the human Recon 1 and Edinburgh Human Metabolic Network) are generic metabolic
networks that collate information from all types of human cells, the reconstruction of a
context-dependent (i.e. condition/cell-type/tissue/organ speciﬁc) network is required prior
to in silico analysis of the particular system under investigation. Once the context-dependent
reconstruction is obtained, one can simulate the metabolic phenotypes under diﬀerent perturbations to identify essential gene targets or pathways, or predict cellular responses to
diﬀerent treatments.

4.2.1

Reconstructing context-dependent metabolic network

Similar to the reconstruction of the global human metabolic networks, one could follow the
protocol (Thiele and Palsson, 2010a) to collate literature evidences and gene annotations,
and manually identify the context-dependent reactions to reconstruct a condition-speciﬁc
114

Figure 4.1: A pipeline for systems biology applications of human metabolic network.

115

metabolic network. This has been applied in [175] [176] to achieve comprehensive reconstructions of hepatic and neuronal cells. However, the process requires manual evaluation of
thousands of papers, and curates thousands of genes and metabolites, which is lengthy, and
requires tremendous eﬀort and labor. Therefore, many studies have focused on automating the reconstruction of a cell-type or tissue-speciﬁc metabolic network, by incorporating
high throughput gene expression data rather than manual curation of the cell/tissue-speciﬁc
network.

Assuming changes in gene expression drives changes in the metabolic states; the basic
idea in automated reconstruction is to identify active reactions by incorporating conditionspeciﬁc gene expression proﬁles. Most reconstruction approaches start by analyzing the
gene expression data to determine if a gene is “present” (highly expressed) or “absent”
(low expression level) for the condition being investigated, then selects the active reactions
according to their corresponding gene/enzymes’ expression level. For example, the Gene
Inactivity Moderated by Metabolism and Expression (GIMME) algorithm, developed by
Becker et al. [177], uses gene expression data to determine active and suppressed genes
with an expression threshold, and determines active reactions based on the state (active or
suppressed) of the corresponding enzymes. “Inactive reactions” are removed unless they are
required for a desired functionality (according to a predeﬁned objective function). Another
approach to study human tissue-speciﬁc metabolic states was developed by Shlomi et al.
[178]. The approach does not require an objective function but matches active reactions
with expression data by solving a network ﬂux to maximize the number of enzymes that are
highly expressed and catalyze ﬂux-carrying reactions. For a better quality reconstruction,
Jerby et al. [179] developed a Model Building Algorithm (MBA), which determines the active
116

“core reactions” with multiple sources of information including literature, transcriptome and
proteomic data. The MBA then reconstructs a consistent network (no gaps or zero-ﬂux
reactions) with all the pre-deﬁned core reactions (evidences obtained from both literature
and data), adding as many of the likely active reactions (evidences obtained only from
high throughput data), and as few of the other reactions. Nevertheless, manual curation is
necessary with this approach to collate and analyze the literature and high-throughput data.

4.2.2

Simulating phenotypes based on metabolic network model

Constraint based modeling and simulation based on a condition-speciﬁc human metabolic
network can be used to predict the ﬂux distribution in the network for that speciﬁc metabolic
phenotype. Such in silico analysis have been used to generate hypotheses on the cell growth,
ATP production, or the states of speciﬁc metabolic functions upon perturbation [180]. For
example, modeling and simulations of a metabolic model of human kidney reconstructed
with the GIMME algorithm were used to evaluate the metabolic phenotypes associated with
the side-eﬀects of a drug treatment [181]. The side eﬀect of a particular drug is determined
by its oﬀ-targets, which are the enzymes/genes that are not the therapeutic targets but
nevertheless are predicted to bind and be inhibited by the drug. FBA was performed on the
perturbed network where the reactions catalyzed by the oﬀ-target enzymes were inhibited
by the drug, to evaluate the systematic consequences of the drug and determine if the
treatment leads to deﬁciency in metabolic functions [181]. This systematic approach has
also been applied to metabolic disorders of the liver [179] and cancer cells [182] [183], and
was able to correctly identify many of the genes essential for the metabolic disorders [179],
interpret metabolic state-changes (e.g. Warburg eﬀects [183]), and predict drug targets for
117

the metabolic system [182]. In the cancer study [182], gene expression in cancer cell lines is
analyzed to identify highly expressed metabolic enzyme-encoding genes in cancer, which are
used in MBA algorithm to reconstruct a “cancer metabolic network model”. FBA is then
applied to predict metabolic states (cell proliferation) across diﬀerent gene knock-downs.
Speciﬁcally, in each prediction one could turn oﬀ the reaction associated with the gene that
is knock-down and apply a FBA on the constrained model to see if the cell proliferation
(represented by an objective function) is reduced. The genes predicted to be important for
cancer metabolism are conﬁrmed to be highly essential in a shRNA gene knockdown dataset
which lists experimental identiﬁed cancer growth-supporting genes. FBA is also applied on
non-cancer cells (reconstructed metabolic network with expression data of normal cells) to
determine genes important for normal cells’ metabolism. Genes that only aﬀect cancer cells
are predicted to be drug targets. Many known targets of FDA-approved metabolic anticancer
drugs are re-discovered in this approach. Folger et al. [182] further simulated double gene
knockdowns to explore combinations of synthetic lethal drug targets.

4.3

A novel approach to reconstructing context-dependent
networks

Applications of the metabolic model (i.e. to predict a metabolic state upon perturbation on
gene expression) require reconstruction of context-dependent networks, thus computational
approaches to automate the generation of condition-speciﬁc models are of great interest.
Current approaches assume the reactions are either absent or present according to the ONs
and OFFs state of the genes, which could be problematic since a small change in the ex118

pression analysis could result in very diﬀerent lists of present and absent genes/reactions,
which thereby aﬀects the reconstruction. For example, a gene expressed a little bit lower
than the median expression level in the transcriptome of a phenotype may be classiﬁed as
absent/OFF, which would eliminate the reaction it associates with. However a gene is not
necessarily entirely OFF when expressed in low levels, and a reaction may not necessarily
be completely abandoned just because an enzyme catalyzing the reaction is expressed at
a low level. The assumption to associate genes and reactions is questionable, thus current
approaches have relied on the stoichiometric constraints to mitigate the potential inconsistencies. A recent study by Colijn et al. [184] associated gene expression levels with the
constraints on the reactions, i.e. the lower a gene is expressed, the lower the ﬂux that could
be conducted through its corresponding reactions:

max Z = cT V , subject to SV = 0, xj ≤ Vj ≤ yj
V

(4.3)

In which xj and yj are determined by gene expression level. Although it could be more
sensitive to changes (noises) in gene expression by directly associating gene expression level
with ﬂux bounds, the approach is less dependent on the discretization of gene expression and
the present/absent call of the reactions. It has been applied to bacteria to predict metabolic
modulators and responses to diﬀerent drugs [184].
Inspired by this study [184], we develop a novel approach to reconstruct human contextdependent metabolic network to associate gene expression changes with constraints on the
reactions. The model is based on two assumptions that diﬀer from previous approaches:
1. The information on the context-dependent gene expression, e.g. microarray data, tissue
speciﬁc gene database, studies of gene expression in the literature, provide information
119

on the relative gene expression state change between phenotypes, rather than
the “expression level/state” of a gene. A single “expression value” cannot indicate a
gene’s expression level (or ON/OFF state) without a comparison involving diﬀerent
phenotypes, because for diﬀerent genes, the amount of expression that is required
to turn them ON should be very diﬀerent. For example, signaling molecules could
have large impact when they are expressed in a small amount, while housekeeping
genes may require a constitutively large amount of expression to support cell survival,
thus the expression level of these two types of genes (i.e. signaling molecules: LOW,
housekeeping: HIGH) within the samples for a given phenotype do not correlate with
their ON/OFF states.
2. The state-change of the genes/enzymes, as indicated by the context information, determines the state-change of reactions, by modulating the maximum capacity of a reaction
tunnel. This is diﬀerent from previous approaches that associate gene expression with
reaction states by either removing a reaction when the corresponding gene expression
is low, or correlating the ﬂux through the reaction with gene expression level. Our
assumption is based on the enzyme kinetics (Michaelis-Menten equation) which shows
that the maximum reaction rate (i.e., the maximum capacity of ﬂux through a reaction tunnel) is positively correlated with the concentration/activity of the gene/enzyme
that catalyzes the reaction:
Vmax = Kcat [E]

(4.4)

In which the Kcat is a reaction constant and [E] is the concentration of the enzyme.
Although the constant Kcat is less available and could be diﬀerent in diﬀerent enzymes, our ﬁrst assumption compares the maximum capacity of ﬂuxes in two conditions
120

Vnew /Vref , thus we have:
Vnew
[Enew ]
=
Vref
[Eref ]

(4.5)

Where Kcat cancels (assuming the enzyme function does not change, i.e. no signiﬁcant
mutations occurred, which holds in perturbation applications of a given cell type or a
given patient in a relatively short time-scale). Thus, the upper bounds of the reactions
in the context dependent network can be computed by Vref based on a reference
network, and the [Enew ]/[Eref ] represents the state-change (fold change) in gene
expression.
Our approach is designed to reconstruct metabolic network that is speciﬁc for a phenotype/condition, by incorporating context information of the gene expression on the particular
phenotype of interest. The process is shown in Figure 4.2 on page 123. Based upon a reference metabolic network that have been deﬁned (e.g. the generic human metabolic network)
and the context information, we estimate the activity change of each reaction to adjust
the upper bounds (i.e. the maximum capacity) on the reaction tunnel, which results in a
new metabolic network model that is speciﬁc to the condition. There are two steps in the
algorithm:
1. Compute the states of the genes under the condition of interest: the context
information is used to determine the change of gene expression between the reference
condition and the condition of interest. For example, based on microarray data, the
state of a gene changes when it is diﬀerentially expressed, and the amount of changes
δ(Sg ) can be computed based on the “log fold-change” obtained from microarray analysis. Qualitative information from the literature or tissue speciﬁc dataset directly
determines the state of a gene under a condition, but the amount of state-change is
121

computed by subtracting the reference state from the state under the condition, in
order to determine the state of the reactions in the next step. In these cases the state
of the genes is represented by discrete variables Sg = {−1, 0, 1} (i.e. low, median,
high), and the state-changes (plus 1: unregulated; minus 1: downregulated) depend on
the context information. When we start from the generic human metabolic network,
the reference condition is deﬁned by having all the genes initially in state 1 and the
initial bounds are computed with Flux Variance Analysis.
2. Compute the states of reactions: based on the GPR (Gene-Protein-Reaction)
information in the network, genes that are components of an enzyme complex have
the “AND” relationship, thus the change of the activity of these complexes are deﬁned by δ(Sc ) = min{δ(Sg1 ), δ(Sg2 ), . . . , δ(Sgj )} in which δ(Sgj ) are the statechange of gene component j in the complex c. Complexes/genes that are diﬀerent isozymes for the same reaction have the “OR” relationship, thus the amount
of state-change on a reaction is then deﬁned by the state-changes of the isozymes:
δ(Sr ) = max{δ(Sc1 ), δ(Sg2 ), . . . }. Finally, the upper bound of a reaction is changed:
Vnew = Vold ∗ 2δ(Sr ) since the state-changes are deﬁned as the “log fold change”.

122

Figure 4.2: Reconstructing context speciﬁc metabolic network. A gene expression phenotype is associated with a metabolic
state. The context information provides the state-change of the gene expression between two phenotypes. The state-change
of each reaction is then determined by the state-change of the genes that regulate the reaction.

123

We apply this novel approach to model metabolic network of human liver cancer and
predict microRNAs that can inhibit cancer growth.

4.4

Prediction of therapeutic microRNA based on condition speciﬁc metabolic network

MicroRNA expression has been found to be deregulated in human cancer [185]. For example, Calin et al. [186] observed down-regulation of miRNAs mir-15 and mir-16 in most
lymphocytic leukemia patients, which suggests that miRNAcould be involved in cancer. He
et al. [187] were the ﬁrst to identify a potential non-coding oncogene, miR-17-92, which
promotes c-Myc-induced tumorigenesis in mice. In another study, Johnson et al. [188] discovered a tumor suppressor miRNA let-7 which inhibits expression of the oncogene RAS in
lung cancer cell lines.
A global decrease in miRNA levels has been observed in human cancers [189,190]. Knockout of the miRNA processing enzymes Drosha and Dicer enhances cancer cell growth in vitro
and their invasiveness in mice [191], which conﬁrms that the widespread reduction in miRNA
expression could promote tumorigenesis. Therefore, miRNAs may have an intrinsic function
in tumor suppression, and could be alternative therapeutic targets. Synthetic miRNA can
be introduced into mammalian systems [192], and a pioneer study of therapeutic miRNA
delivery of miR-26s in HCC (hepatocellular cancer) mice model successfully inhibited tumor
cell proliferation and induced cancer-speciﬁc apoptosis [193].
Although miRNAs may be good alternative targets for cancer treatment, it has been
diﬃcult to identify which miRNAs to target for a particular type of cancer, since the un124

derlying mechanisms of why and how miRNAs are involved in cancer are largely unknown.
Experimental evidences of how miRNAs regulate their targets in cancer cells have been limited. The expression of single targets, such as RAS [188], or E2F1 [194], has been shown
to be regulated by miRNAs in explaining their association with cancer. However, each
miRNA may regulate various target genes, and the same miRNA may have oncogenic or
anti-tumorgenic activity depending on the context or cell type in which the targets are expressed. Current computational studies focus primarily on analyzing miRNA expression
proﬁles to identify signatures that can separate particular cancer from normal samples [195],
or combining miRNA and gene expression data to identify a “context-speciﬁc” target in the
cancer samples [196] Nevertheless it remains unclear whether the altered miRNA expressions
are the cause or consequence of the carcinogenesis processes, and which miRNAs could be
good targets for treatment.
We propose a diﬀerent approach to tackling this problem, by integrating miRNA target
prediction, metabolic modeling and the context-speciﬁc gene expression data. We apply
our novel approach of metabolic network reconstruction to reconstruct a context-speciﬁc
metabolic system for human liver, and ﬂux analysis based on the system is used to predict
the miRNAs whose overexpression or delivery could inhibit cancer cell growth by inhibiting
its target metabolic genes.

4.4.1

Prediction of the metabolic state in liver cancer cells upon
perturbation of gene expression induced by miRNAs

Information of human liver speciﬁc gene expression is obtained from a curated dataset in
the literature [178]. Our approach is then applied to reconstruct a liver cancer speciﬁc
125

metabolic system from the human generic metabolic network, and use the system to predict
the growth and proliferation rate of cancer cells. Cancer cells are assumed to modulate
their metabolic functions to be able to support their rapid growth and proliferation, thus a
biomass production function was deﬁned as their objective function based on experimentally
measured DNA/RNA/Amino acids/Lipid composition in cancer cells [183]. We use FBA
to optimize this biomass function in the liver speciﬁc metabolic network, and the biomass
production rate reﬂects the maximum rate of cell growth and proliferation that can be
achieved.
Metabolic gene targets of each microRNA are obtained from TargetScan, which is based
on sequence complementarity and conservation of the targeting sites within vertebrates [197],
and only those 153 conserved miRNA family with conserved binding sites (across mammalians) are included in our analysis. The gene targets are assumed to be inhibited (50%
knock down) upon overexpression or therapeutic delivery of their miRNA regulator, since
miRNA would bind to these target mRNAs by base pairing and forms RNA-induced silencing complex which then causes inhibition of protein translation and/or degradation of the
mRNAs [198]. A conditional speciﬁc metabolic system is constructed upon overexpression of
each miRNA with their targets turned oﬀ and simulated to obtain their maximum achievable
biomass production rate F , which is compared with the rate F0 in “wild-type” liver cancer
without the miRNA up-regulation. A score F0 − F is computed for each miRNA to indicate
their ability to reduce cancer growth. The miRNAs are ranked by this score. The procedure
is shown in Figure 4.3 on page 127.

126

Figure 4.3: A pipeline for predicting therapeutic miRNAs for human liver cancer.

127

To assess the accuracy of the prediction, we search in the literature for the 153 miRNA
families and collate a test set including 41 miRNAs that have been experimentally studied
in liver cancer, in which there are in vivo or in vitro evidences showing 23 of these inhibits
liver cancer growth/metastasis if over-expressed while 18 do not. Based on our scoring and
ranking of miRNAs, we plot the ROC curve for the predictions of these 41 miRNAs in the
test set (the black line in Figure 4.4 on page 129), and the result shows above 82% accuracy
(Area Under Curve). To compare, we build a similar prediction system but using the GIMME
approach for network reconstruction that was previously applied to human metabolism (the
green line in Figure 4.4 on page 129). The accuracy of GIMME is 64%, only slightly better
than random (23/41 ≈ 56%), and is much lower than the predictions based on our novel
approach, for example, with the same cut-oﬀ with a false positive rate of 0.33, our approach
achieves 0.91 precision (true positive rate) which is much higher than GIMME (0.57).
Current studies in the role of microRNAs in liver cancer (reviewed in [199]) have been
focused on their targets in the signaling process, involving the apoptosis pathway (e.g. Bclw, Ras), cell cycle progression and migration/invasion signaling (e.g. CDK, cyclins, PI3K
signaling PTEN, c-Met, FOS). From an alternative perspective, our predictions are based
on metabolic system thus the results suggest that miRNAs could regulate cancer growth by
modulating metabolic functions directly. Thus we further use the reconstructed metabolic
system for liver cancer to explore the potential mechanisms by which miRNAs inhibit cancer
growth.

Metabolic functions modulated by miRNAs to inhibit cancer growth To study
the metabolic functions that miRNAs modulate to inhibit cancer growth, we average the ﬂux
change induced by the top 50 miRNAs (predicted to inhibit cancer growth) and by the bot128

Figure 4.4: The ROC curve of the prediction based on a test-set. Black line shows the
prediction result from our approach of network reconstruction, and the gree line shows the
result when we apply GIMME.

129

tom 50 (predicted not to inhibit cancer growth) ranked based on their scores, and compared
them to identify the subsystems that change the most. We found that the largest ﬂux changes
occured for the production of nucleotides and amino acids, the glycolysis/gluconeogenesis
system, Citric Acid Cycle and pyruvate metabolism, as well as the transport processes across
the cell membrane, mitochondrial and peroxisomal membranes. These are essential metabolic
functions that support the biosynthetic processes and energy requirements of cells. Tumors
speciﬁc metabolic machinery regulates these processes to facilitate cell growth and proliferation [157], and we show that these miRNAs could alter processes that inhibit cancer
growth.

One of the most profound cancer biochemical phenotype is the Warburg eﬀect [156].
Cancer cells metabolize glucose at high rates, and shift the ﬂux downstream of glucose from
mitochondrial tricarboxylic acid cycle to rapid anerobic glycolysis, thereby producing vast
amounts of lactate that is secreted from the cancer cells [157]. The Warburg eﬀect is known
as a metabolic adaptive response in cancer cells to satisfy the high demand of the molecular
building blocks of the cell, i.e. nucleotides, fatty acids, lipids and amino acids as well as ATP
for facilitating proliferation [157, 183]. An indicator of the Warburg eﬀect is the excessive
lactate production, thus we tested the ﬂux change in response to miRNA perturbations on
the reaction catalyzed by lactate dehydrogenase which concerts pyruvate into lactate. The
result (Figure 4.5 on page 132) shows that the ﬂux is decreased signiﬁcantly (on average by
50%) by the overexpression of the miRNAs that scored higher (top 50), while the bottom 50
miRNA have less impact on this ﬂux, which suggests that miRNAs could inhibit liver cancer
growth by mitigating the Warburg eﬀect to alter the cancer metabolic phenotype.
130

miRNA target metabolic enzymes that are essential for cancer growth Each
miRNA could have multiple targets, but there is no signiﬁcant correlation between the
number of targets in metabolism and their ability to inhibit cancer growth, nor is there
signiﬁcant diﬀerence in the number of targets between miRNAs predicted to inhibit cancer
growth and those that are not tumor suppressors. We hypothesize that there are some
important enzymes, upon targeting, whereby the cell growth/proliferation is reduced. We
apply our reconstruction and simulation approach to estimate the biomass production change
upon knocking down each metabolic gene individually, and identiﬁed 48 genes (among more
than 1900 metabolic genes) that are predicted to be essential for the growth and proliferation
of liver cancer cells, in which 24 of them can be targeted by miRNAs. We show these genes
in Table 4.1 on page 133. We search for genome-wide pooled shRNA screen data [200] that
is available in breast cancer, pancreatic and ovarian cancer cell lines and found that many
genes in our list, including PYGB, GBE1, SCD, Enolase, pyruvate kinase and solute carrier
family proteins have been identiﬁed as essential genes in more than 2 cancer cell lines wherein
their knockdown by shRNA signiﬁcantly reduced the survival and proliferation rate of the
cells. Since there are no liver cell lines in these studies, we further looked into a collection of
liver cancer gene signatures in the Liverome database [201] to determine if these genes are
diﬀerentially expressed in liver cancer tissues and/or in highly invasive liver cancer cell lines.
We found most of these genes have been identiﬁed as gene signatures for liver cancer and/or
for the invasiveness/metastasis of liver cancer, based on previous gene expression analysis
(Table 4.1 on page 133).

131

Figure 4.5: The ﬂux change in response to miRNA perturbations on the reaction catalyzed
by lactate dehydrogenase.

132

Table 4.1: The essential metabolic enzymes for human liver cancer that can be targeted by miRNAs

133

Although the evidences we found suggest these genes could be important for liver cancer,
they have not been fully explored — according to text mining studies in the abstracts of
pubmed literatures [201], there are only one enzyme (enolase) that have been experimentally
studied and associated with liver cancer. However, some of them have been studied in other
types of cancers to demonstrate their functional role in tumor growth. The glycolytic pathway is directly responsive to the Warburg eﬀect in cancer, thus the key glycolytic enzymes,
Enolases and pyruvate kinases in our list, have been shown to promote cancer invasion and
proliferation [202–204]. Knockdown of pyruvate kinase PKM2 reverses the Warburg eﬀect
and suppresses tumorigenesis in mice model [202]. The high expression of Enolase correlates
with poor prognosis in breast cancer and a decrease in its expression in tamoxifen-resistant
breast cancer cells signiﬁcantly augments the eﬀectiveness of tamoxifen treatment [203]. The
Pentose Phosphate Pathway uses glucose to generate ribose rings which are essential for the
synthesis of DNA and RNA. Both of the two key enzymes which can catalyze the production
of ribose-5-phosphate are on our list of essential genes: RPE and RPIA. Our simulation
predicts the down regulation of either of them reduces DNA/RNA synthesis, which is consistent with a recent study in pancreatic cancer [205] that showed knockdown of either of
the two enzymes (or both) reduced glucose ﬂux into nucleotide production and suppressed
tumor growth. The fatty acid metabolism and lipid synthesis are also very important in cell
proliferation as they are primary components of cellular membrane. Enzyme SCD activity
is involved in the synthesis of unsaturated fatty acids, and is shown as a major factor that
could promote the oncogenic process and suggested to be a therapeutic target for prostate
cancer [206–209]. The other enzymes HADH and PLA2G4A have been less studied in cancer
metabolism. In addition to these three pathways, there are ﬁve genes involved in glycogenol-

134

ysis/glycogenesis predicted to be essential genes for liver cancer, as well as multifunctional
genes DLD and solute carrier family proteins. These processes have not been explored in liver
cancer thus far in the literature, and could be interesting targets to study experimentally.
Figure 4.6 on page 136 summarizes the metabolic pathways and processes that could be
aﬀected by the essential enzymes we identiﬁed which are targeted by the miRNAs identiﬁed
in our analysis.

We check the miRNAs that target these enzymes that could be essential for liver cancer.
There is a strong association between targeting these enzymes and the miRNA’s ability to
inhibit liver cancer growth. 90% of the miRNAs ranked at the top 50 based on their score
target at least one of these essential enzymes, while none of the miRNAs at the bottom 50
target any of these enzymes. Among the 90% of the miRNAs ranked at the top 50, most of
them (80%) inhibit cancer growth more than by knocking-down any of the single essential
enzymes that they potentially target. These observations suggest that miRNAs modulate
metabolic function by simultaneously targeting metabolic enzymes that are essential for cell
growth and proliferation. Table 4.2 on page 138 shows the 5 miRNAs that have the highest
score and their essential metabolic gene targets predicted in our analysis, as well as previous
studies showing their signaling targets and involvement in liver cancer. There are a few
studies that explore the regulation of cancer metabolic functions by miRNAs (see review
[210]) but currently they focus only on glycolysis, in which essential enzymes PGM1, ENO1
are shown to be regulated by miRNA-29a, miR-17-92, nevertheless it is unclear whether the
miRNAs directly target these enzymes. The possible interactions between miRNA, signaling
process, transcription factors and the expression of metabolic genes complicated the study in
exploring function roles of miRNA in cancer. We provide an alternative modeling approach
135

Figure 4.6: The metabolic pathways and processes that could be aﬀected by the essential
enzymes.

136

that focus on direct metabolic targets of miRNAs, but the limitation is that there could be
indirect eﬀects potentially responsive to signaling and transcription regulations mediated by
miRNAs. Future studies in miRNAs will be aiming to incorporate more complicated gene
networks in regulating the expression of metabolic genes, as well as account for the potential
interactions between miRNA, signaling, transcription and metabolic systems to achieve a
systematic understanding of cancer metabolism.

137

Table 4.2: The 5 miRNAs ranked at the top in our analysis and their potential target genes

138

In summary, we developed a novel approach to simulate context dependent metabolic
states upon perturbation of gene expression, which is able to incorporate the metabolic
network, gene expression states to predict the steady state ﬂux distributions in the cancer
metabolic system. A condition speciﬁc metabolic system was constructed for human liver
cancer (HCC) and overexpression of each miRNA was simulated to predict their eﬀect on
reducing cancer cell growth. Compared with experimental evidences that we collected from
the literature, our approach achieved 80 percent accuracy in predicting miRNAs that can
suppress metastasis and progression of liver cancer. Our approach can be used as a framework
to explore the mechanism by which miRNAs modulate metabolic functions to aﬀect cancer
growth. We analyze the metabolic functions altered by miRNAs and identify the essential
metabolic genes that are targeted by the miRNAs. We suggest that miRNAs modulate
metabolic function by directly targeting metabolic enzymes that are essential for cell growth
and proliferation.

139

Chapter 5
Conclusion
The recent advent of high-throughput technology has enabled the global analysis of genes,
proteins, and their interactions, driving the development and application of computational
approaches to study gene regulation on the genome scale, by reconstructing in silico the
regulatory interactions of gene networkS. The challenge is to determine:
• What are the essential entities (genes, proteins) in a network that confer a phenotype?
• How do these entities work together to regulate biological processes?
We focus on the analysis of gene expression. Our research aim to develop novel approaches to identify genes and their regulation that are involved in conferring a phenotype.
Given the complexity of gene regulatory networks, knowledge of the properties of individual
components in the network are not suﬃcient to elucidate the cell physiology. As reported in
Chapters 2-4, the research has focused on 2 aspects of this problem:

Regulatory processes are highly connected such that a speciﬁc response is typically accompanied numerous collateral eﬀects. As also pointed out by Kholodenko
et al. in [211]
140

“A ‘local’ perturbation that is initially conﬁned to a particular network node
can propagate and cause widespread “global” changes in the network and thereby
mask immediate connections and routes. This issue is particularly pertinent to
large omics data sets, because even in response to a single local perturbation,
the omics snapshots of the cellular state arise from a plethora of interactions
spreading through cell networks.”

To identify genes and interactions that are speciﬁc to a phenotype/condition, not only
should we ﬁlter out the “false positives”, which is the key goal of many previous studies
in statistical analysis and data mining of biological data, but also, more importantly, to
identify the speciﬁc responses rather than the many general/collateral eﬀects. Therefore,
in our network reconstruction framework, we design a multi-layer approach that is able to
reconstruct condition-speciﬁc genes and their regulation through an integrative analysis of
large scale information of gene expression, protein interaction and transcriptional regulation.
We propose to integrate microarray data from a diverse set of conditions to provide a common
context (more and better controls) for the expression behaviors of genes, and apply advanced
feature selection technique, to identify the target genes that are most speciﬁc to the condition
being investigated. From the target genes, conditional gene regulations, and conditional
transcription factor activity are then determined. We show that incorporating these diverse
conditions for comparison in the feature selection of genes and interactions enhances the
speciﬁcity of the predictions.

Systematic understanding of a biological network. As a complex system, there are
“emergent properties” that arise from the intricate interactions of gene networks, by which
cells process external and internal signals to determine a phenotype, i.e. a state of the system.
It is the interactions of the genes and proteins that determine a phenotype, thus the challenge
is to understand how complex molecular and cellular functions and responses arise from these
141

dynamic interactions. We have been studying an important dynamic feature in the gene
regulatory network—the “switch-like behavior”, and propose a top-down mining approach
to exploring gene switches on a genome-scale level. Our mining approach demonstrates that
one can capitalize on genome-wide expression proﬁling to capture dynamic properties of a
complex network. In another study, we aim to understand systematically the regulation
of human metabolic functions. A novel approach is proposed to tackle this challenge by
integrating metabolic modeling and context-speciﬁc gene expression data to simulate context
dependent metabolic states upon perturbation of gene expression (e.g. induced by miRNA).

Overall these explorations in the ﬁeld of computational systems biology aim to identify
genes and their regulation that determines a phenotype. Applications on human breast cancer identiﬁed Trop2 as a target gene, a potential gene switch, and regulation of this gene by
transcription factors, CREB as well as NFkB. Studies on the network reconstruction of human metabolic system predicted therapeutic miRNAs for human liver cancer, and suggested
miRNAs could be implicated in the metabolic regulation of cancer. Future studies in systems
biology could combine reverse engineering approaches based on the omics data and systems
modeling of gene network to achieve a coherent mechanistic picture of biological regulation. For example, we have thus considered only the direct targets of miRNAs in metabolic
system as discussed in Chapter 4, but there could be indirect targets whose expression is
altered by TFs that are regulated by miRNAs. Therefore one could further incorporate the
transcriptional regulatory network to describe regulations at this level. miRNA expression
proﬁles and gene expression proﬁles upon perturbation of miRNAs, if available, could be
integrated to identify the lowly expressed miRNAs for a given cancer type as well as infer
the miRNA-target relationships. Through the reverse-engineering of a miRNA-target net142

work representing an integrated picture of transcriptional regulation, gene expression and
miRNA expression, one would be able to determine which genes are likely the true (direct
or indirect) targets of a miRNA under the condition of interest, and how the target genes
could respond to the over-expression or delivery of the miRNA. This way, one could have
a more realistic model of target genes’ expression that respond to the over-expression of
a particular miRNA, instead of assuming all predicted gene targets to be inhibited upon
over-expression or therapeutic delivery of their miRNA regulators. The reverse engineering
approach can be integrated into the systems modeling of metabolic network to account for
the layer of transcriptional regulation, thus to achieve a better model in predicting the condition speciﬁc metabolic states upon perturbation of miRNAs. Furthermore, there are other
layers of regulation of gene expression and enzyme activity, including post-transcriptional
and post-translational processes. Many genes involved in post-transcriptional regulation of
the metabolic genes or enzymes could also aﬀect the metabolic states, which are excluded
from current applications, and there are genetic changes that could be important in cancer phenotypes. Although it is diﬃcult to have a comprehensive model due to limitations
of our current knowledge of biological processes, nonetheless we could incorporate more
information to provide a more comprehensive model of the regulatory network given the
increasing amount of transcriptome data, protein interaction data, genomic data and other
high-throughput data available.

143

BIBLIOGRAPHY

144

BIBLIOGRAPHY
[1] P Nurse. Reductionism and explanation in cell biology. Novartis Found. Symp., 213:93–
101; discussion 102–105–93–101; discussion 102–105, 1998.
[2] G D Vladutiu. Heterozygosity: an expanding role in proteomics. Mol. Genet. Metab.,
74(1-2):51–63, October 2001.
[3] Hui Ge, Albertha J M Walhout, and Marc Vidal. Integrating ’omic’ information: a
bridge between genomics and systems biology. Trends Genet., 19(10):551–560, October
2003.
[4] Yuri Lazebnik. Can a biologist ﬁx a radio?–or, what i learned while studying apoptosis.
Cancer Cell, 2(3):179–182, September 2002.
[5] Isabelle Guyon and Andr Elisseeﬀ. An introduction to variable and feature selection.
J. Mach. Learn. Res., 3:1157–1182, March 2003.
[6] Yvan Saeys, Iaki Inza, and Pedro Larraaga. A review of feature selection techniques
in bioinformatics. Bioinformatics, 23(19):2507–2517, October 2007.
[7] Cosmin Lazar, Jonatan Taminau, Stijn Meganck, David Steenhoﬀ, Alain Coletta, Colin
Molter, Virginie de Schaetzen, Robin Duque, Hugues Bersini, and Ann Nowe. A
survey on ﬁlter techniques for feature selection in gene expression microarray analysis.
IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM,
pages –, February 2012.
[8] Sandrine Dudoit, Yee Hwa Yang, Matthew J. Callow, and Terence P. Speed. Statistical
methods for identifying diﬀerentially expressed genes in replicated cdna microarray
experiments. STATISTICA SINICA, 12:111–139, 2002.
[9] A D Long, H J Mangalam, B Y Chan, L Tolleri, G W Hatﬁeld, and P Baldi. Improved
statistical inference from dna microarray data using analysis of variance and a bayesian
145

statistical framework. analysis of global gene expression in escherichia coli k12. J. Biol.
Chem., 276(23):19937–19944, June 2001.
[10] P Baldi and A D Long. A bayesian framework for the analysis of microarray expression
data: regularized t -test and statistical inferences of gene changes. Bioinformatics,
17(6):509–519, June 2001.
[11] Lin Deng, Jian Pei, Jinwen Ma, and Dik Lun Lee. A rank sum test method for
informative gene discovery. In KDD ’04, pages 410–419, New York, NY, USA, 2004.
ACM.
[12] Rainer Breitling, Patrick Armengaud, Anna Amtmann, and Pawel Herzyk. Rank
products: a simple, yet powerful, new method to detect diﬀerentially regulated genes
in replicated microarray experiments. FEBS Lett., 573(1-3):83–92, August 2004.
[13] Sandrine Dudoit, Juliet Shaﬀer, and Jennifer Boldrick. Multiple hypothesis testing in
microarray experiments. U.C. Berkeley Division of Biostatistics Working Paper Series,
pages –, August 2002.
[14] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical
and powerful approach to multiple testing. Journal of the Royal Statistical Society.
Series B (Methodological), 57(1):289–300, 1995.
[15] Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol, 3(2):185–205, April 2005.
[16] Emanuel Parzen. On estimation of a probability density function and mode. The
Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
[17] Patrick E Meyer, Kevin Kontos, Frederic Laﬁtte, and Gianluca Bontempi. Informationtheoretic inference of large transcriptional regulatory networks. EURASIP J Bioinform
Syst Biol, pages 79879–79879, 2007.
[18] Jeremiah J Faith, Boris Hayete, Joshua T Thaden, Ilaria Mogno, Jamey Wierzbowski,
Guillaume Cottarel, Simon Kasif, James J Collins, and Timothy S Gardner. Largescale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression proﬁles. PLoS Biol, 5(1):e8–e8, January 2007.
[19] Adam A Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky,
Riccardo Dalla Favera, and Andrea Califano. Aracne: an algorithm for the reconstruc146

tion of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7 Suppl 1:S7–S7, 2006.
[20] Jing Yu, V. Anne Smith, Paul P. Wang, Er J. Hartemink, and Erich D. Jarvis. Using
bayesian network inference algorithms to recover molecular genetic regulatory networks. pages –, 2002.
[21] Michael Hecker, Sandro Lambeck, Susanne Toepfer, Eugene van Someren, and Reinhard Guthke. Gene regulatory network inference: data integration in dynamic models-a
review. BioSystems, 96(1):86–103, April 2009.
[22] Wei-Po Lee and Wen-Shyong Tzou. Computational methods for discovering gene networks from expression data. Brief. Bioinformatics, 10(4):408–423, July 2009.
[23] Timothy S Gardner and Jeremiah J Faith. Reverse-engineering transcription control
networks. Phys Life Rev, 2(1):65–88, April 2010.
[24] Debopriya Das, Matteo Pellegrini, and Joe W. Gray. A primer on regression methods
for decoding cis-regulatory logic. PLoS Comput Biol, 5(1):e1000269–e1000269, January
2009.
[25] Adam A Margolin and Andrea Califano. Theory and limitations of genetic network
inference from microarray data. Ann. N. Y. Acad. Sci, 1115:51–72, December 2007.
[26] Benjamin de Bivort, Sui Huang, and Yaneer Bar-Yam. Dynamics of cellular level
function and regulation derived from murine expression array data. Proceedings of the
National Academy of Sciences of the United States of America, 101(51):17687–17692,
December 2004.
[27] Jan Ihmels, Gilgi Friedlander, Sven Bergmann, Ofer Sarig, Yaniv Ziv, and Naama
Barkai. Revealing modular organization in the yeast transcriptional network. Nat.
Genet, 31(4):370–377, August 2002.
[28] D. Marbach, R. J. Prill, T. Schaﬀter, C. Mattiussi, D. Floreano, and G. Stolovitzky.
Revealing strengths and weaknesses of methods for gene network inference. Proceedings
of the National Academy of Sciences, 107(14):6286–6291, March 2010.
[29] Kuang Lin and Dirk Husmeier. Modelling transcriptional regulation with a mixture
of factor analyzers and variational bayesian expectation maximization. EURASIP J
Bioinform Syst Biol, pages 601068–601068, 2009.
147

[30] Eran Segal, Michael Shapira, Aviv Regev, Dana Pe’er, David Botstein, Daphne
Koller, and Nir Friedman. Module networks: identifying regulatory modules and their
condition-speciﬁc regulators from gene expression data. Nat. Genet, 34(2):166–176,
June 2003.
[31] E Segal, T Raveh-Sadka, M Schroeder, U Unnerstall, and U Gaul. Predicting expression patterns from regulatory sequence in drosophila segmentation. NATURE,
451(7178):535–U1–535–U1, January 2008.
[32] Chiara Sabatti and Gareth M. James. Bayesian sparse hidden components analysis for
transcription regulation networks. Bioinformatics, 22(6):739–746, March 2006.
[33] Jason Ernst, Oded Vainas, Christopher T Harbison, Itamar Simon, and Ziv BarJoseph. Reconstructing dynamic regulatory maps. Mol. Syst. Biol, 3:74–74, 2007.
[34] Michael A Beer and Saeed Tavazoie. Predicting gene expression from sequence. Cell,
117(2):185–198, April 2004.
[35] A P Gasch, P T Spellman, C M Kao, O Carmel-Harel, M B Eisen, G Storz, D Botstein, and P O Brown. Genomic expression programs in the response of yeast cells to
environmental changes. Mol. Biol. Cell, 11(12):4241–4257, December 2000.
[36] P T Spellman, G Sherlock, M Q Zhang, V R Iyer, K Anders, M B Eisen, P O Brown,
D Botstein, and B Futcher. Comprehensive identiﬁcation of cell cycle-regulated genes
of the yeast saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell,
9(12):3273–3297, December 1998.
[37] Pedro T. Monteiro, Nuno D. Mendes, Miguel C. Teixeira, Soﬁa d’Orey, Sandra Tenreiro, Nuno P. Mira, Helio Pais, Alexandre P. Francisco, Alexandra M. Carvalho, Artur B. Lourenco, Isabel Sa-Correia, Arlindo L. Oliveira, and Ana T. Freitas. Yeastractdiscoverer: new tools to improve the analysis of transcriptional regulatory associations
in saccharomyces cerevisiae. Nucl. Acids Res., 36(suppl1):D132–136–D132–136, January 2008.
[38] Miguel C Teixeira, Pedro Monteiro, Pooja Jain, Sandra Tenreiro, Alexandra R Fernandes, Nuno P Mira, Marta Alenquer, Ana T Freitas, Arlindo L Oliveira, and Isabel
S-Correia. The yeastract database: a tool for the analysis of transcription regulatory
associations in saccharomyces cerevisiae. Nucleic Acids Res, 34(Database issue):D446–
451–D446–451, January 2006.
148

[39] Yanxin Shi, Itamar Simon, Tom Mitchell, and Ziv Bar-Joseph. A combined expressioninteraction model for inferring the temporal activity of transcription factors. pages
82–97, Singapore, 2008. Springer-Verlag.
[40] Markus J. Herrgard, Markus W. Covert, and Bernhard . Palsson. Reconciling gene
expression data with known genome-scale regulatory network structures. Genome
Research, 13(11):2423–2434, November 2003.
[41] Bryan J Venters and B Franklin Pugh. How eukaryotic genes are transcribed. Crit.
Rev. Biochem. Mol. Biol, 44(2-3):117–141, June 2009.
[42] Eric H. Davidson. Genomic Regulatory Systems: In Development and Evolution. Academic Press, January 2001.
[43] Sergey V Nuzhdin, Anna Rychkova, and Matthew W Hahn. The strength of
transcription-factor binding modulates co-variation in transcriptional networks. Trends
Genet, 26(2):51–53, February 2010.
[44] Zeba Wunderlich and Leonid A Mirny. Diﬀerent gene regulation strategies revealed by
analysis of binding motifs. Trends Genet, 25(10):434–440, October 2009.
[45] Xiao-yong Li, Stewart MacArthur, Richard Bourgon, David Nix, Daniel A Pollard,
Venky N Iyer, Aaron Hechmer, Lisa Simirenko, Mark Stapleton, Cris L Luengo Hendriks, Hou Cheng Chu, Nobuo Ogawa, William Inwood, Victor Sementchenko, Amy
Beaton, Richard Weiszmann, Susan E Celniker, David W Knowles, Tom Gingeras,
Terence P Speed, Michael B Eisen, and Mark D Biggin. Transcription factors bind
thousands of active and inactive regions in the drosophila blastoderm. PLoS Biol,
6(2):e27–e27, February 2008.
[46] Duygu Ucar, Andreas Beyer, Srinivasan Parthasarathy, and Christopher T Workman.
Predicting functionality of protein-dna interactions by integrating diverse evidence.
Bioinformatics, 25(12):i137–144–i137–144, June 2009.
[47] Robert J White and Andrew D Sharrocks. Coordinated control of the gene expression
machinery. Trends Genet, 26(5):214–220, May 2010.
[48] Maria J Amorim, Cristina Cotobal, Caia Duncan, and Juan Mata. Global coordination
of transcriptional control and mrna decay during cellular diﬀerentiation. Mol Syst Biol,
6:380–380, 2010.
149

[49] AJ Hartemink, DK Giﬀord, TS Jaakkola, and RA Young. Combining location and
expression data for principled discovery of genetic regulatory network models. pages
449, 437–449, 437, 2002.
[50] Adriano V Werhli and Dirk Husmeier. Reconstructing gene regulatory networks with
bayesian networks by combining expression data with multiple sources of prior knowledge. Stat Appl Genet Mol Biol, 6:Article15–Article15, 2007.
[51] Shao-Shan Carol Huang and Ernest Fraenkel. Integrating proteomic, transcriptional,
and interactome data reveals hidden components of signaling and regulatory networks.
Sci Signal, 2(81):ra40–ra40, 2009.
[52] Christopher T Harbison, D Benjamin Gordon, Tong Ihn Lee, Nicola J Rinaldi, Kenzie D Macisaac, Timothy W Danford, Nancy M Hannett, Jean-Bosco Tagne, David B
Reynolds, Jane Yoo, Ezra G Jennings, Julia Zeitlinger, Dmitry K Pokholok, Manolis
Kellis, P Alex Rolfe, Ken T Takusagawa, Eric S Lander, David K Giﬀord, Ernest
Fraenkel, and Richard A Young. Transcriptional regulatory code of a eukaryotic
genome. Nature, 431(7004):99–104, September 2004.
[53] Yongping Pan, Chung-Jung Tsai, Buyong Ma, and Ruth Nussinov. Mechanisms of
transcription factor selectivity. Trends Genet, 26(2):75–83, February 2010.
[54] Douglas F Browning and Stephen J Busby. The regulation of bacterial transcription
initiation. Nat. Rev. Microbiol, 2(1):57–65, January 2004.
[55] Lacramioara Bintu, Nicolas E Buchler, Hernan G Garcia, Ulrich Gerland, Terence Hwa,
Jan Kondev, and Rob Phillips. Transcriptional regulation by the numbers: models.
Curr. Opin. Genet. Dev, 15(2):116–124, April 2005.
[56] Lacramioara Bintu, Nicolas E Buchler, Hernan G Garcia, Ulrich Gerland, Terence
Hwa, Jan Kondev, Thomas Kuhlman, and Rob Phillips. Transcriptional regulation by
the numbers: applications. Curr. Opin. Genet. Dev, 15(2):125–135, April 2005.
[57] Anthony Gitter, Zehava Siegfried, Michael Klutstein, Oriol Fornes, Baldo Oliva, Itamar
Simon, and Ziv Bar-Joseph. Backup in gene regulatory networks explains diﬀerences
between binding and knockout results. Mol. Syst. Biol, 5:276–276, 2009.
[58] Zhanzhi Hu, Patrick J Killion, and Vishwanath R Iyer. Genetic reconstruction of a
functional transcriptional regulatory network. Nat. Genet, 39(5):683–687, May 2007.
150

[59] Igor Ulitsky and Ron Shamir. Identiﬁcation of functional modules using network topology and high-throughput data. BMC Systems Biology, 1(1):8–8, January 2007.
[60] Karen Sachs, Omar Perez, Dana Pe’er, Douglas A. Lauﬀenburger, and Garry P. Nolan.
Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, April 2005.
[61] Olaf Wolkenhauer and Mihajlo Mesarovic. Feedback dynamics and cell function: Why
systems biology is called systems biology. Mol Biosyst, 1(1):14–16, May 2005.
[62] Steven S. Andrews and Adam P. Arkin. Simulating cell biology. Current Biology,
16(14):R523–R527–R523–R527, July 2006.
[63] O Wolkenhauer, S N Sreenath, P Wellstead, M Ullah, and K-H Cho. A systemsand signal-oriented approach to intracellular dynamics. Biochem. Soc. Trans., 33(Pt
3):507–515, June 2005.
[64] Mihajlo D. Mesarovic, M. and Yasuhiko Takahara. General Systems Theory: Mathematical Foundations. Academic Press, February 1975.
[65] Xuerui Yang, Aritro Nath, Michael J Opperman, and Christina Chan. The doublestranded rna-dependent protein kinase diﬀerentially regulates insulin receptor substrates 1 and 2 in hepg2 cells. Mol. Biol. Cell, 21(19):3449–3458, October 2010.
[66] Fangting Li, Tao Long, Ying Lu, Qi Ouyang, and Chao Tang. The yeast cell-cycle
network is robustly designed. Proc. Natl. Acad. Sci. U.S.A., 101(14):4781–4786, April
2004.
[67] Maria I Davidich and Stefan Bornholdt. Boolean network model predicts cell cycle
sequence of ﬁssion yeast. PLoS ONE, 3(2):e1672–e1672, 2008.
[68] Ming Wu, Xuerui Yang, and Christina Chan. A dynamic analysis of irs-pkr signaling
in liver cells: a discrete modeling approach. PLoS ONE, 4(12):e8040–e8040, 2009.
[69] Gregory Stephanopoulos, Hal Alper, and Joel Moxley. Exploiting biological complexity for strain improvement through systems biology. Nat Biotech, 22(10):1261–1267,
October 2004.
[70] Florence Jaﬀrezic and Gwenola Tosser-Klopp. Gene network reconstruction from microarray data. BMC Proceedings, 3(Suppl 4):S12–S12, 2009.
151

[71] Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R
Golub, Eric S Lander, and Jill P Mesirov. Gene set enrichment analysis: a knowledgebased approach for interpreting genome-wide expression proﬁles. Proc. Natl. Acad.
Sci. U.S.A, 102(43):15545–15550, October 2005.
[72] Trey Ideker, Owen Ozier, Benno Schwikowski, and Andrew F Siegel. Discovering
regulatory and signalling circuits in molecular interaction networks. Bioinformatics,
18 Suppl 1:S233–240–S233–240, 2002.
[73] Rui-Sheng Wang, Guangxu Jin, Xiang-Sun Zhang, and Luonan Chen. Modeling posttranscriptional regulation activity of small non-coding rnas in escherichia coli. BMC
Bioinformatics, 10(Suppl 4):S6–S6, 2009.
[74] Celine Lefebvre, Presha Rajbhandari, Mariano J Alvarez, Pradeep Bandaru, Wei Keat
Lim, Mai Sato, Kai Wang, Pavel Sumazin, Manjunath Kustagi, Brygida C Bisikirska,
Katia Basso, Pedro Beltrao, Nevan Krogan, Jean Gautier, Riccardo Dalla-Favera, and
Andrea Califano. A human b-cell interactome identiﬁes myb and foxm1 as master
regulators of proliferation in germinal centers. Mol Syst Biol, 6:377–377, June 2010.
[75] Gianluca Bontempi and Patrick E Meyer. Causal ﬁlter selection in microarray data.
ICML 2010, pages –, 2010.
[76] Katia Basso, Adam A Margolin, Gustavo Stolovitzky, Ulf Klein, Riccardo Dalla-Favera,
and Andrea Califano. Reverse engineering of regulatory networks in human b cells.
Nat. Genet., 37(4):382–390, April 2005.
[77] Han-Yu Chuang, Eunjung Lee, Yu-Tsueng Liu, Doheon Lee, and Trey Ideker. Networkbased classiﬁcation of breast cancer metastasis. Molecular Systems Biology, 3(1):–,
October 2007.
[78] Yuhang Wang and F. Makedon. Application of relief-f feature ﬁltering algorithm to
selecting informative genes for cancer classiﬁcation using microarray data. pages 497–
498. IEEE, August 2004.
[79] Marta Rubio-Texeira. A comparative analysis of the gal genetic switch between notso-distant cousins: Saccharomyces cerevisiae versus kluyveromyces lactis. FEMS Yeast
Research, 5(12):1115–1128, 2005.
[80] Ming Wu, Li Liu, and Christina Chan. Identiﬁcation of novel targets for breast cancer
by exploring gene switches on a genome scale. BMC Genomics, 12:547–547, 2011.
152

[81] Jrme Eeckhoute, Erika Krasnickas Keeton, Mathieu Lupien, Susan A. Krum, Jason S.
Carroll, and Myles Brown. Positive cross-regulatory loop ties gata-3 to estrogen receptor expression in breast cancer. Cancer Research, 67(13):6477–6483, July 2007.
[82] Stephanie Harkey Shirley, Joyce E Rundhaug, Jie Tian, Noirin Cullinan-Ammann,
Isabel Lambertz, Claudio J Conti, and Robin Fuchs-Young. Transcriptional regulation of estrogen receptor-alpha by p53 in human breast cancer cells. Cancer Res.,
69(8):3405–3414, April 2009.
[83] Mozhgan Rasti, Rita Arabsolghar, Zahed Khatooni, and Zoherh Mostafavi-Pour. p53
binds to estrogen receptor 1 promoter in human breast cancer cells. Pathology Oncology
Research: POR, pages –, June 2011.
[84] V Doucas, G Spyrou, and M Yaniv. Unregulated expression of c-jun or c-fos proteins
but not jun d inhibits oestrogen receptor activity in human breast cancer derived cells.
EMBO J., 10(8):2237–2245, August 1991.
[85] L M Smith, S C Wise, D T Hendricks, A L Sabichi, T Bos, P Reddy, P H Brown, and
M J Birrer. cjun overexpression in mcf-7 breast cancer cells produces a tumorigenic,
invasive and hormone resistant phenotype. Oncogene, 18(44):6063–6070, October 1999.
[86] E R Schuur, L A McPherson, G P Yang, and R J Weigel. Genomic structure of the
promoters of the human estrogen receptor-alpha gene demonstrate changes in chromatin structure induced by ap2gamma. J. Biol. Chem., 276(18):15519–15526, May
2001.
[87] L A McPherson and R J Weigel. Ap2alpha and ap2gamma: a comparison of binding
site speciﬁcity and trans-activation of the estrogen receptor promoter and single site
promoter constructs. Nucleic Acids Res., 27(20):4040–4049, October 1999.
[88] George W Woodﬁeld, Yizhen Chen, Thomas B Bair, Frederick E Domann, and
Ronald J Weigel. Identiﬁcation of primary gene targets of tfap2c in hormone responsive
breast carcinoma cells. Genes Chromosomes Cancer, 49(10):948–962, October 2010.
[89] George W Woodﬁeld, Annamarie D Horan, Yizhen Chen, and Ronald J Weigel. Tfap2c
controls hormone response in breast cancer cells through multiple pathways of estrogen
signaling. Cancer Res., 67(18):8439–8443, September 2007.
[90] Jose Braganca, Jyrki J Eloranta, Simon D Bamforth, J Claire Ibbitt, Helen C Hurst,
and Shoumo Bhattacharya. Physical and functional interactions among ap-2 transcription factors, p300/creb-binding protein, and cited2. J. Biol. Chem., 278(18):16021–
16029, May 2003.
153

[91] Jose Braganca, Tracey Swingler, Fatima I. R. Marques, Tania Jones, Jyrki J. Eloranta,
Helen C. Hurst, Toshihiro Shioda, and Shoumo Bhattacharya. Human creb-binding
protein/p300-interacting transactivator with ed-rich tail (cited) 4, a new member of
the cited family, functions as a co-activator for transcription factor ap-2. Journal of
Biological Chemistry, 277(10):8559–8565, March 2002.
[92] Silke Sperling, Christina H Grimm, Ilona Dunkel, Siegrun Mebus, Hans-Peter Sperling,
Arno Ebner, Raﬀaello Galli, Hans Lehrach, Christoph Fusch, Felix Berger, and Stefanie
Hammer. Identiﬁcation and functional analysis of cited2 mutations in patients with
congenital heart defects. Hum. Mutat., 26(6):575–582, December 2005.
[93] Shangqin Guo and Gail E Sonenshein.
Forkhead box transcription factor
foxo3a regulates estrogen receptor alpha expression and is repressed by the
her-2/neu/phosphatidylinositol 3-kinase/akt signaling pathway. Mol. Cell. Biol.,
24(19):8681–8690, October 2004.
[94] Kristi Muldoon Jacobs, J Daniel Pennington, Kheem S Bisht, Nukhet Aykin-Burns,
Hyun-Seok Kim, Mark Mishra, Lunching Sun, Phuongmai Nguyen, Bong-Hyun Ahn,
Jaime Leclerc, Chu-Xia Deng, Douglas R Spitz, and David Gius. Sirt3 interacts with
the daf-16 homolog foxo3a in the mitochondria, as well as increases foxo3a dependent
gene expression. Int. J. Biol. Sci., 4(5):291–299, 2008.
[95] Michael W. Bronson, Sara Hillenmeyer, Richard W. Park, and Alexander S. Brodsky.
Estrogen coordinates translation and transcription, revealing a role for nrsf in human
breast cancer cells. Mol Endocrinol, 24(6):1120–1135, June 2010.
[96] M Watanabe, J Yanagisawa, H Kitagawa, K Takeyama, S Ogawa, Y Arao, M Suzawa,
Y Kobayashi, T Yano, H Yoshikawa, Y Masuhiro, and S Kato. A subfamily of rnabinding dead-box proteins acts as an estrogen receptor alpha coactivator through the
n-terminal activation domain (af-1) with an rna coactivator, sra. EMBO J., 20(6):1341–
1352, March 2001.
[97] Kwang Won Jeong, Young-Ho Lee, and Michael R Stallcup. Recruitment of the swi/snf
chromatin remodeling complex to steroid hormone-regulated promoters by nuclear receptor coactivator ﬂightless-i. J. Biol. Chem., 284(43):29298–29309, October 2009.
[98] Yun Kyoung Kang, Mohamed Guermah, Chao-Xing Yuan, and Robert G Roeder. The
trap/mediator coactivator complex interacts directly with estrogen receptors alpha and
beta through the trap220 subunit and directly enhances estrogen receptor function in
vitro. Proc. Natl. Acad. Sci. U.S.A., 99(5):2642–2647, March 2002.
154

[99] C G Castles, S Oesterreich, R Hansen, and S A Fuqua. Auto-regulation of the estrogen
receptor promoter. J. Steroid Biochem. Mol. Biol., 62(2-3):155–163, June 1997.
[100] S Fan, J Wang, R Yuan, Y Ma, Q Meng, M R Erdos, R G Pestell, F Yuan, K J
Auborn, I D Goldberg, and E M Rosen. Brca1 inhibition of estrogen receptor signaling
in transfected cells. Science, 284(5418):1354–1356, May 1999.
[101] Li Chen, George Wu, and Hongkai Ji. hmchip: a database and web server for exploring
publicly available human and mouse chip-seq and chip-chip data. Bioinformatics,
pages –, March 2011.
[102] T. D. Gilmore. Introduction to nf-—[kappa]—b: players, pathways, perspectives.
Oncogene, 25(51):6680–6684, October 2006.
[103] Simon J Galbraith, Linh M Tran, and James C Liao. Transcriptome network component analysis with limited microarray data. Bioinformatics, 22(15):1886–1894, August
2006.
[104] James C. Liao, Riccardo Boscolo, Young-Lyeol Yang, Linh My Tran, Chiara Sabatti,
and Vwani P. Roychowdhury. Network component analysis: Reconstruction of regulatory signals in biological systems. PNAS, 100(26):15522–15527, December 2003.
[105] Uri Alon. An Introduction to Systems Biology: Design Principles of Biological Circuits.
Chapman and Hall/CRC, July 2006.
[106] John J Tyson, Reka Albert, Albert Goldbeter, Peter Ruoﬀ, and Jill Sible. Biological
switches and clocks. J R Soc Interface, 5 Suppl 1:S1–8–S1–8, August 2008.
[107] James E. Ferrell and Wen Xiong. Bistability in cell signaling: How to make continuous
processes discontinuous, and reversible processes irreversible. Chaos, 11(1):227–236,
March 2001.
[108] William J Blake, Mads KAErn, Charles R Cantor, and J J Collins. Noise in eukaryotic
gene expression. Nature, 422(6932):633–637, April 2003.
[109] Hunter B Fraser, Aaron E Hirsh, Guri Giaever, Jochen Kumm, and Michael B Eisen.
Noise minimization in eukaryotic gene expression. PLoS Biol, 2(6):e137–e137, April
2004.
[110] Joseph R Pomerening. Uncovering mechanisms of bistability in biological systems.
Curr. Opin. Biotechnol, 19(4):381–388, August 2008.
155

[111] Benno Muller-Hill. The Lac Operon: A Short History of a Genetic Paradigm. Walter
de Gruyter, September 1996.
[112] Wiep Klaas Smits, Oscar P. Kuipers, and Jan-Willem Veening. Phenotypic variation
in bacteria: the role of feedback regulation. Nat Rev Micro, 4(4):259–271, April 2006.
[113] Hannah Chang, Philmo Oh, Donald Ingber, and Sui Huang. Multistable and multistep
dynamics in neutrophil diﬀerentiation. BMC Cell Biology, 7(1):11–11, 2006.
[114] Tetsuya Kobayashi, Luonan Chen, and Kazuyuki Aihara. Modeling genetic switches
with positive feedback loops. J. Theor. Biol, 221(3):379–399, April 2003.
[115] Patrick B Warren and Pieter Rein ten Wolde. Chemical models of genetic toggle
switches. J Phys Chem B, 109(14):6812–6823, April 2005.
[116] Ozlem Demir and Isil Aksan Kurnaz. An integrated model of glucose and galactose
metabolism regulated by the gal genetic switch. Comput Biol Chem, 30(3):179–192,
June 2006.
[117] Anna Ochab-Marcinek. Predicting the asymmetric response of a genetic switch to
noise. J. Theor. Biol, 254(1):37–44, September 2008.
[118] Guang Yao, Tae Jun Lee, Seiichi Mori, Joseph R. Nevins, and Lingchong You. A
bistable rb-e2f switch underlies the restriction point. Nat Cell Biol, 10(4):476–482,
April 2008.
[119] Timothy S. Gardner, Charles R. Cantor, and James J. Collins. Construction of a
genetic toggle switch in escherichia coli. Nature, 403(6767):339–342, January 2000.
[120] Emery Conrad, Avraham E Mayo, Alexander J Ninfa, and Daniel B Forger. Rate
constants rather than biochemical mechanism determine behaviour of genetic clocks.
J R Soc Interface, 5 Suppl 1:S9–15–S9–15, August 2008.
[121] J Christopher Anderson, Christopher A Voigt, and Adam P Arkin. Environmental
signal integration by a modular and gate. Mol. Syst. Biol, 3:133–133, 2007.
[122] J E Ferrell and E M Machleder. The biochemical basis of an all-or-none cell fate switch
in xenopus oocytes. Science, 280(5365):895–898, May 1998.
156

[123] Tetsuya Shiraishi, Shinako Matsuyama, and Hiroaki Kitano. Large-scale analysis of
network bistability for human cancers. PLoS Comput Biol, 6(7):e1000851–e1000851,
July 2010.
[124] Nick I. Markevich, Jan B. Hoek, and Boris N. Kholodenko. Signaling switches and
bistability arising from multisite phosphorylation in protein kinase cascades. The Journal of Cell Biology, 164(3):353–359, February 2004.
[125] Ertugrul M Ozbudak, Mukund Thattai, Han N Lim, Boris I Shraiman, and Alexander
Van Oudenaarden. Multistability in the lactose utilization network of escherichia coli.
Nature, 427(6976):737–740, February 2004.
[126] Beat P. Kramer and Martin Fussenegger. Hysteresis in a synthetic mammalian gene
network. Proceedings of the National Academy of Sciences of the United States of
America, 102(27):9517–9522, July 2005.
[127] P. M. Hartigan. Algorithm as 217: Computation of the dip statistic to test for
unimodality. Journal of the Royal Statistical Society. Series C (Applied Statistics),
34(3):320–325, January 1985.
[128] Alexander L Muratov and Oleg Y Gnedin. Modeling the metallicity distribution of
globular clusters. 1002.1325, pages –, February 2010.
[129] Joseph R. Nevins. The rb/e2f pathway and cancer. Human Molecular Genetics,
10(7):699–703, April 2001.
[130] Sui Huang, Gabriel Eichler, Yaneer Bar-Yam, and Donald E Ingber. Cell fates as
high-dimensional attractor states of a complex gene regulatory network. Phys. Rev.
Lett, 94(12):128701–128701, April 2005.
[131] Helen Parkinson, Misha Kapushesky, Nikolay Kolesnikov, Gabriella Rustici, Mohammad Shojatalab, Niran Abeygunawardena, Hugo Berube, Miroslaw Dylag, Ibrahim
Emam, Anna Farne, Ele Holloway, Margus Lukk, James Malone, Roby Mani, Ekaterina Pilicheva, Tim F Rayner, Faisal Rezwan, Anjan Sharma, Eleanor Williams, Xiangqun Zheng Bradley, Tomasz Adamusiak, Marco Brandizi, Tony Burdett, Richard
Coulson, Maria Krestyaninova, Pavel Kurnosov, Eamonn Maguire, Sudeshna Guha
Neogi, Philippe Rocca-Serra, Susanna-Assunta Sansone, Nataliya Sklyar, Mengyao
Zhao, Ugis Sarkans, and Alvis Brazma. Arrayexpress update–from an archive of
functional genomics experiments to the atlas of gene expression. Nucleic Acids Res,
37(Database issue):D868–872–D868–872, January 2009.
157

[132] Keith A. Ashman, Christina M. Bird, and Stephen E. Zepf. Detecting bimodality in
astronomical datasets. The Astronomical Journal, 108:2348–2348, December 1994.
[133] Laura C Collins, Maria L Botero, and Stuart J Schnitt. Bimodal frequency distribution
of estrogen receptor immunohistochemical staining results in breast cancer: an analysis
of 825 cases. Am. J. Clin. Pathol, 123(1):16–20, January 2005.
[134] Pedro de Souza Rocha Simonini, Achim Breiling, Nibedita Gupta, Mahdi Malekpour,
Mahmoud Youns, Ramesh Omranipour, Fatemeh Malekpour, Stefano Volinia, Carlo M
Croce, Hossein Najmabadi, Sven Diederichs, Ozgr Sahin, Doris Mayer, Frank Lyko,
Jrg D Hoheisel, and Yasser Riazalhosseini. Epigenetically deregulated microrna-375 is
involved in a positive feedback loop with estrogen receptor alpha in breast cancer cells.
Cancer Res, 70(22):9175–9184, November 2010.
[135] D Harari and Y Yarden. Molecular mechanisms underlying erbb2/her2 action in breast
cancer. Oncogene, 19(53):6102–6114, December 2000.
[136] Ali Naderi, Ji Liu, and Glenn D Francis. A feedback loop between bex2 and erbb2
mediated by c-jun signaling in breast cancer. Int. J. Cancer, pages n/a–n/a–n/a–n/a,
2011.
[137] Elizabeth A. Musgrove and Robert L. Sutherland. Biological determinants of endocrine
resistance in breast cancer. Nat Rev Cancer, 9(9):631–643, 2009.
[138] Rebecca B Riggins, Randy S Schrecengost, Michael S Guerrero, and Amy H Bouton.
Pathways to tamoxifen resistance. Cancer Lett, 256(1):1–24, October 2007.
[139] Linxia Zhang, Linsey C Seitz, Amy M Abramczyk, Li Liu, and Christina Chan. camp
initiates early phase neuron-like morphology changes and late phase neural diﬀerentiation in mesenchymal stem cells. Cell. Mol. Life Sci, 68(5):863–876, March 2011.
[140] M Lipinski, D R Parks, R V Rouse, and L A Herzenberg. Human trophoblast cellsurface antigens deﬁned by monoclonal antibodies. Proc. Natl. Acad. Sci. U.S.A,
78(8):5147–5150, August 1981.
[141] Dominic Fong, Gilbert Spizzo, Johanna M Gostner, Guenther Gastl, Patrizia Moser,
Clemens Krammel, Stefan Gerhard, Michael Rasse, and Klaus Laimer. Trop2: a
novel prognostic marker in squamous cell carcinoma of the oral cavity. Mod. Pathol,
21(2):186–191, February 2008.
158

[142] D Fong, P Moser, C Krammel, J M Gostner, R Margreiter, M Mitterer, G Gastl, and
G Spizzo. High expression of trop2 correlates with poor prognosis in pancreatic cancer.
Br J Cancer, 99(8):1290–1295, 2008.
[143] Takahiro Ohmachi, Fumiaki Tanaka, Koshi Mimori, Hiroshi Inoue, Katsuhiko Yanaga,
and Masaki Mori. Clinical signiﬁcance of trop2 expression in colorectal cancer. Clin.
Cancer Res, 12(10):3057–3063, May 2006.
[144] G Muhlmann, G Spizzo, J Gostner, M Zitt, H Maier, P Moser, G Gastl, M Zitt, H M
Mller, R Margreiter, D Ofner, and D Fong. Trop2 expression as prognostic marker for
gastric carcinoma. J. Clin. Pathol, 62(2):152–158, February 2009.
[145] Jianbo Wang, Ryan Day, Yiyu Dong, Steven J. Weintraub, and Loren Michel. Identiﬁcation of trop-2 as an oncogene and an attractive therapeutic target in colon cancers.
Molecular Cancer Therapeutics, 7(2):280–285, February 2008.
[146] Erich Huang, Skye H Cheng, Holly Dressman, Jennifer Pittman, Mei Hua Tsou,
Cheng Fang Horng, Andrea Bild, Edwin S Iversen, Ming Liao, Chii Ming Chen, Mike
West, Joseph R Nevins, and Andrew T Huang. Gene expression predictors of breast
cancer outcomes. The Lancet, 361(9369):1590–1596, May 2003.
[147] Andrew S. Goldstein, Devon A. Lawson, Donghui Cheng, Wenyi Sun, Isla P. Garraway,
and Owen N. Witte. Trop2 identiﬁes a subpopulation of murine and human prostate
basal cells with stem cell characteristics. Proceedings of the National Academy of
Sciences, 105(52):20882–20887, December 2008.
[148] Mayuko Okabe, Yuko Tsukahara, Minoru Tanaka, Kaori Suzuki, Shigeru Saito, Yoshiko
Kamiya, Tohru Tsujimura, Koji Nakamura, and Atsushi Miyajima. Potential hepatic
stem cells reside in epcam+ cells of normal and injured mouse liver. Development,
136(11):1951–1960, June 2009.
[149] M Trerotola, P Cantanelli, E Guerra, R Tripaldi, A L Aloisi, V Bonasera, R Lattanzio, R de Lange, U H Weidle, M Piantelli, and S Alberti. Upregulation of trop-2
quantitatively stimulates human cancer growth. Oncogene, pages –, February 2012.
[150] E Guerra, M Trerotola, A L Aloisi, R Tripaldi, G Vacca, R La Sorda, R Lattanzio,
M Piantelli, and S Alberti. The trop-2 signalling network in cancer growth. Oncogene,
pages –, May 2012.
[151] Daniel Kaplan and Leon Glass. Understanding Nonlinear Dynamics. Springer, April
1995.
159

[152] K. Kappler, R. Edwards, and L. Glass. Dynamics in high-dimensional model gene
networks. Signal Processing, 83(4):789–798, April 2003.
[153] Atsushi Mochizuki. An analytical study of the number of steady states in gene regulatory networks. J. Theor. Biol, 236(3):291–310, October 2005.
[154] Aleksej Zelezniak, Tune H. Pers, Simo Soares, Mary Elizabeth Patti, and Kiran Raosaheb Patil. Metabolic network topology reveals transcriptional regulatory signatures of
type 2 diabetes. PLoS Comput Biol, 6(4):e1000729–e1000729, April 2010.
[155] M G Vander Heiden, S Y Lunt, T L Dayton, B P Fiske, W J Israelsen, K R Mattaini,
N I Vokes, G Stephanopoulos, L C Cantley, C M Metallo, and J W Locasale. Metabolic
pathway alterations that support cell proliferation. Cold Spring Harbor Symposia on
Quantitative Biology, pages –, January 2012.
[156] O WARBURG. On the origin of cancer cells. Science, 123(3191):309–314, February
1956.
[157] Matthew G. Vander Heiden, Lewis C. Cantley, and Craig B. Thompson. Understanding the warburg eﬀect: The metabolic requirements of cell proliferation. Science,
324(5930):1029–1033, May 2009.
[158] Ralph J DeBerardinis, Julian J Lum, Georgia Hatzivassiliou, and Craig B Thompson.
The biology of cancer: metabolic reprogramming fuels cell growth and proliferation.
Cell Metab., 7(1):11–20, January 2008.
[159] Jason W Locasale, Alexandra R Grassian, Tamar Melman, Costas A Lyssiotis, Katherine R Mattaini, Adam J Bass, Gregory Heﬀron, Christian M Metallo, Taru Muranen,
Hadar Sharﬁ, Atsuo T Sasaki, Dimitrios Anastasiou, Edouard Mullarky, Natalie I
Vokes, Mika Sasaki, Rameen Beroukhim, Gregory Stephanopoulos, Azra H Ligon,
Matthew Meyerson, Andrea L Richardson, Lynda Chin, Gerhard Wagner, John M
Asara, Joan S Brugge, Lewis C Cantley, and Matthew G Vander Heiden. Phosphoglycerate dehydrogenase diverts glycolytic ﬂux and contributes to oncogenesis. Nat
Genet, 43(9):869–874, 2011.
[160] Sirkku Pollari, Sanna-Maria Kknen, Henrik Edgren, Maija Wolf, Pekka Kohonen, Henri
Sara, Theresa Guise, Matthias Nees, and Olli Kallioniemi. Enhanced serine production
by bone metastatic breast cancer cells stimulates osteoclastogenesis. Breast Cancer
Res. Treat., 125(2):421–430, January 2011.
160

[161] Natalie J Serkova, Jennifer L Spratlin, and S Gail Eckhardt. Nmr-based metabolomics:
translational application and treatment of cancer. Curr. Opin. Mol. Ther., 9(6):572–
585, December 2007.
[162] C M Galmarini, F Popowycz, and B Joseph. Cytotoxic nucleoside analogues: diﬀerent
strategies to improve their clinical eﬃcacy. Curr. Med. Chem., 15(11):1072–1082, 2008.
[163] Bei B. Zhang, Gaochao Zhou, and Cai Li. Ampk: An emerging drug target for diabetes
and the metabolic syndrome. Cell Metabolism, 9(5):407–416, May 2009.
[164] Joel P. Berger, Taro E. Akiyama, and Peter T. Meinke. Ppars: therapeutic targets for
metabolic disease. Trends in Pharmacological Sciences, 26(5):244–251, May 2005.
[165] Natalie C Duarte, Scott A Becker, Neema Jamshidi, Ines Thiele, Monica L Mo,
Thuy D Vo, Rohith Srivas, and Bernhard Palsson. Global reconstruction of the
human metabolic network based on genomic and bibliomic data. Proc. Natl. Acad.
Sci. U.S.A., 104(6):1777–1782, February 2007.
[166] Hongwu Ma, Anatoly Sorokin, Alexander Mazein, Alex Selkov, Evgeni Selkov, Oleg
Demin, and Igor Goryanin. The edinburgh human metabolic network reconstruction
and its functional analysis. Mol Syst Biol, 3:135–135, September 2007.
[167] A. Bordbar and B. O Palsson. Using the reconstructed genome?scale human metabolic
network to study physiology and pathology. Journal of Internal Medicine, 271(2):131–
141, February 2012.
[168] Tong Hao, Hong-Wu Ma, Xue-Ming Zhao, and Igor Goryanin. Compartmentalization
of the edinburgh human metabolic network. BMC Bioinformatics, 11:393–393, 2010.
[169] Ines Thiele and Bernhard Palsson. A protocol for generating a high-quality genomescale metabolic reconstruction. Nat Protoc, 5(1):93–121, 2010.
[170] Gregory N. Stephanopoulos, Aristos A. Aristidou, and Jens Nielsen. Metabolic Engineering: Principles and Methodologies. Academic Press, October 1998.
[171] Jeﬀrey D Orth, Ines Thiele, and Bernhard O Palsson. What is ﬂux balance analysis?
Nat Biotech, 28(3):245–248, March 2010.
[172] Yoshihiro Toya, Nobuaki Kono, Kazuharu Arakawa, and Masaru Tomita. Metabolic
ﬂux analysis and visualization. J. Proteome Res., 10(8):3313–3323, August 2011.
161

[173] Javier Delgado and James C Liao. Inverse ﬂux analysis for reduction of acetate excretion in escherichia coli. Biotechnology Progress, 13(4):361–367, January 1997.
[174] Jong Myoung Park, Tae Yong Kim, and Sang Yup Lee. Constraints-based genomescale metabolic simulation for systems metabolic engineering. Biotechnology Advances,
27(6):979–988, December 2009.
[175] Christoph Gille, Christian Blling, Andreas Hoppe, Sascha Bulik, Sabrina Hoﬀmann,
Katrin Hbner, Anja Karlstdt, Ramanan Ganeshan, Matthias Knig, Kristian Rother,
Michael Weidlich, Jrn Behre, and Herrmann-Georg Holzhtter. Hepatonet1: a comprehensive metabolic reconstruction of the human hepatocyte for the analysis of liver
physiology. Mol. Syst. Biol., 6:411–411, September 2010.
[176] Nathan E Lewis, Gunnar Schramm, Aarash Bordbar, Jan Schellenberger, Michael P
Andersen, Jeﬀrey K Cheng, Nilam Patel, Alex Yee, Randall A Lewis, Roland Eils,
Rainer Knig, and Bernhard Palsson. Large-scale in silico modeling of metabolic
interactions between cell types in the human brain. Nat. Biotechnol., 28(12):1279–
1285, December 2010.
[177] Scott A. Becker and Bernhard O. Palsson. Context-speciﬁc metabolic networks are
consistent with experiments. PLoS Comput Biol, 4(5):e1000082–e1000082, May 2008.
[178] Tomer Shlomi, Moran N Cabili, Markus J Herrgrd, Bernhard Palsson, and Eytan Ruppin. Network-based prediction of human tissue-speciﬁc metabolism. Nat. Biotechnol.,
26(9):1003–1010, September 2008.
[179] Livnat Jerby, Tomer Shlomi, and Eytan Ruppin. Computational reconstruction of
tissue-speciﬁc metabolic models: application to human liver metabolism. Mol Syst
Biol, 6:401–401, September 2010.
[180] Matthew A Oberhardt, Bernhard Palsson, and Jason A Papin. Applications of
genome-scale metabolic reconstructions. Mol. Syst. Biol., 5:320–320, 2009.
[181] Roger L Chang, Li Xie, Lei Xie, Philip E Bourne, and Bernhard Palsson. Drug oﬀtarget eﬀects predicted using structural analysis in the context of a metabolic network
model. PLoS Comput. Biol., 6(9):e1000938–e1000938, 2010.
[182] Ori Folger, Livnat Jerby, Christian Frezza, Eyal Gottlieb, Eytan Ruppin, and Tomer
Shlomi. Predicting selective drug targets in cancer through metabolic networks. Mol
Syst Biol, 7:–, June 2011.
162

[183] Tomer Shlomi, Tomer Benyamini, Eyal Gottlieb, Roded Sharan, and Eytan Ruppin.
Genome-scale metabolic modeling elucidates the role of proliferative adaptation in
causing the warburg eﬀect. PLoS Comput Biol, 7(3):e1002018–e1002018, March 2011.
[184] Caroline Colijn, Aaron Brandes, Jeremy Zucker, Desmond S. Lun, Brian Weiner,
Maha R. Farhat, Tan-Yun Cheng, D. Branch Moody, Megan Murray, and James E.
Galagan. Interpreting expression data with metabolic ﬂux models: Predicting mycobacterium tuberculosis mycolic acid production. PLoS Comput Biol, 5(8):e1000489–
e1000489, 2009.
[185] Stefanie Sassen, Eric A. Miska, and Carlos Caldas. Micrornaimplications for cancer.
Virchows Arch, 452(1):1–10, January 2008.
[186] George Adrian Calin, Calin Dan Dumitru, Masayoshi Shimizu, Roberta Bichi, Simona
Zupo, Evan Noch, Hansjuerg Aldler, Sashi Rattan, Michael Keating, Kanti Rai, Laura
Rassenti, Thomas Kipps, Massimo Negrini, Florencia Bullrich, and Carlo M Croce.
Frequent deletions and down-regulation of micro- rna genes mir15 and mir16 at 13q14
in chronic lymphocytic leukemia. Proc. Natl. Acad. Sci. U.S.A., 99(24):15524–15529,
November 2002.
[187] Lin He, J. Michael Thomson, Michael T. Hemann, Eva Hernando-Monge, David Mu,
Summer Goodson, Scott Powers, Carlos Cordon-Cardo, Scott W. Lowe, Gregory J.
Hannon, and Scott M. Hammond. A microrna polycistron as a potential human oncogene. Nature, 435(7043):828–833, June 2005.
[188] Steven M Johnson, Helge Grosshans, Jaclyn Shingara, Mike Byrom, Rich Jarvis, Angie
Cheng, Emmanuel Labourier, Kristy L Reinert, David Brown, and Frank J Slack. Ras
is regulated by the let-7 microrna family. Cell, 120(5):635–647, March 2005.
[189] Jun Lu, Gad Getz, Eric A Miska, Ezequiel Alvarez-Saavedra, Justin Lamb, David Peck,
Alejandro Sweet-Cordero, Benjamin L Ebert, Raymond H Mak, Adolfo A Ferrando,
James R Downing, Tyler Jacks, H Robert Horvitz, and Todd R Golub. Microrna
expression proﬁles classify human cancers. Nature, 435(7043):834–838, June 2005.
[190] Arti Gaur, David A Jewell, Yu Liang, Dana Ridzon, Jason H Moore, Caifu Chen,
Victor R Ambros, and Mark A Israel. Characterization of microrna expression levels
and their biological correlates in human cancer cell lines. Cancer Res., 67(6):2456–2468,
March 2007.
[191] Madhu S Kumar, Jun Lu, Kim L Mercer, Todd R Golub, and Tyler Jacks. Impaired
microrna processing enhances cellular transformation and tumorigenesis. Nat. Genet.,
39(5):673–677, May 2007.
163

[192] Naotake Tsuda, Kouichiro Kawano, Clay L Eﬀerson, and Constantin G Ioannides.
Synthetic microrna and double-stranded rna targeting the 3’-untranslated region of
her-2/neu mrna inhibit her-2 protein expression in ovarian cancer cells. Int. J. Oncol.,
27(5):1299–1306, November 2005.
[193] Janaiah Kota, Raghu R. Chivukula, Kathryn A. O’Donnell, Erik A. Wentzel, Chrystal L. Montgomery, Hun-Way Hwang, Tsung-Cheng Chang, Perumal Vivekanandan,
Michael Torbenson, K. Reed Clark, Jerry R. Mendell, and Joshua T. Mendell. Therapeutic microrna delivery suppresses tumorigenesis in a murine liver cancer model. Cell,
137(6):1005–1017, June 2009.
[194] Kathryn A O’Donnell, Erik A Wentzel, Karen I Zeller, Chi V Dang, and Joshua T
Mendell. c-myc-regulated micrornas modulate e2f1 expression. Nature, 435(7043):839–
843, June 2005.
[195] Yu Liang. An expression meta-analysis of predicted microrna targets identiﬁes a diagnostic signature for lung cancer. BMC Med Genomics, 1:61–61, 2008.
[196] Xionghui Zhou, Juan Liu, Changning Liu, S. Rayner, Fengji Liang, Jingfang Ju,
Yinghui Li, Shanguang Chen, and Jianghui Xiong. Context-speciﬁc mirna regulation
network predicts cancer prognosis. pages 225–243. IEEE, September 2011.
[197] Robin C. Friedman, Kyle Kai-How Farh, Christopher B. Burge, and David P. Bartel. Most mammalian mrnas are conserved targets of micrornas. Genome Research,
19(1):92–105, January 2009.
[198] David P Bartel. Micrornas: genomics, biogenesis, mechanism, and function. Cell,
116(2):281–297, January 2004.
[199] S. Huang and X. He. The role of micrornas in liver cancer progression. British Journal
of Cancer, 104(2):235–240, 2011.
[200] Judice L. Y. Koh, Kevin R. Brown, Azin Sayad, Dahlia Kasimer, Troy Ketela, and
Jason Moﬀat. Colt-cancer: functional genetic screening resource for essential genes in
human cancer cell lines. Nucl. Acids Res., pages –, November 2011.
[201] Langho Lee, Kai Wang, Gang Li, Zhi Xie, Yuli Wang, Jiangchun Xu, Shaoxian Sun,
David Pocalyko, Jong Bhak, Chulhong Kim, Kee-Ho Lee, Ye Jang, Young Yeom,
Hyang-Sook Yoo, and Seungwoo Hwang. Liverome: a curated database of liver cancerrelated gene signatures with self-contained context information. BMC Genomics,
12(Suppl 3):S3–S3, 2011.
164

[202] Heather R. Christofk, Matthew G. Vander Heiden, Marian H. Harris, Arvind Ramanathan, Robert E. Gerszten, Ru Wei, Mark D. Fleming, Stuart L. Schreiber, and
Lewis C. Cantley. The m2 splice isoform of pyruvate kinase is important for cancer
metabolism and tumour growth. Nature, 452(7184):230–233, March 2008.
[203] Shih-Hsin Tu, Chih-Chiang Chang, Ching-Shyang Chen, Ka-Wai Tam, Ying-Jan
Wang, Chia-Hwa Lee, Hsiao-Wei Lin, Tzu-Chun Cheng, Ching-Shui Huang, Jan-Show
Chu, Neng-Yao Shih, Li-Ching Chen, Sy-Jye Leu, Yuan-Soon Ho, and Chih-Hsiung
Wu. Increased expression of enolase alpha in human breast cancer confers tamoxifen
resistance in human breast cancer cells. Breast Cancer Res. Treat., 121(3):539–553,
June 2010.
[204] Michela Capello, Sammy Ferri-Borgogno, Paola Cappello, and Francesco Novelli.
-enolase: a promising therapeutic and diagnostic tumor target. FEBS Journal,
278(7):1064–1074, 2011.
[205] Haoqiang Ying, Alec C Kimmelman, Costas A Lyssiotis, Sujun Hua, Gerald C
Chu, Eliot Fletcher-Sananikone, Jason W Locasale, Jaekyoung Son, Hailei Zhang,
Jonathan L Coloﬀ, Haiyan Yan, Wei Wang, Shujuan Chen, Andrea Viale, Hongwu
Zheng, Ji-hye Paik, Carol Lim, Alexander R Guimaraes, Eric S Martin, Jeﬀery Chang,
Aram F Hezel, Samuel R Perry, Jian Hu, Boyi Gan, Yonghong Xiao, John M Asara,
Ralph Weissleder, Y Alan Wang, Lynda Chin, Lewis C Cantley, and Ronald A DePinho. Oncogenic kras maintains pancreatic tumors through regulation of anabolic
glucose metabolism. Cell, 149(3):656–670, April 2012.
[206] Vanessa Fritz, Zohra Benfodda, Genevive Rodier, Corinne Henriquet, Franois Iborra,
Christophe Avancs, Yves Allory, Alexandre de la Taille, Stphane Culine, Hubert Blancou, Jean Paul Cristol, Franoise Michel, Claude Sardet, and Lluis Fajas. Abrogation
of de novo lipogenesis by stearoyl-coa desaturase 1 inhibition interferes with oncogenic signaling and blocks prostate cancer progression in mice. Mol. Cancer Ther.,
9(6):1740–1754, June 2010.
[207] Daniel Hess, Jeﬀrey W. Chisholm, and R. Ariel Igal. Inhibition of stearoylcoa desaturase activity blocks cell cycle progression and induces programmed cell death in lung
cancer cells. PLoS ONE, 5(6):e11394–e11394, June 2010.
[208] R. Ariel Igal. Stearoyl-coa desaturase-1: a novel key player in the mechanisms of cell
proliferation, programmed cell death and transformation to cancer. Carcinogenesis,
31(9):1509–1515, September 2010.
[209] R. Ariel Igal. Roles of stearoylcoa desaturase-1 in the regulation of cancer cell growth,
survival and tumorigenesis. Cancers, 3(2):2462–2477, May 2011.
165

[210] Pankaj K. Singh, Kamiya Mehla, Michael A. Hollingsworth, and Keith R. Johnson.
Regulation of aerobic glycolysis by micrornas in cancer. Molecular and Cellular Pharmacology, 3(3):125–134, December 2011.
[211] Boris Kholodenko, Michael B Yaﬀe, and Walter Kolch. Computational approaches for
analyzing information ﬂow in biological networks. Sci Signal, 5(220):re1–re1, April
2012.

166