INFERRING REGULATORY INTERACTIONS IN TRANSCRIPTIONAL REGULATORY
NETWORKS
By
Sherine Awad Mahmoud

A DISSERTATION
Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of
Computer Science - Doctor of Philosophy
2013

ABSTRACT
INFERRING REGULATORY INTERACTIONS IN TRANSCRIPTIONAL REGULATORY
NETWORKS
By
Sherine Awad Mahmoud
Living cells are realized by complex gene expression programs that are moderated by regulatory proteins called transcription factors (TFs). The TFs control the differential expression of
target genes in the context of transcriptional regulatory networks (TRNs), either individually or
in groups. Deciphering the mechanisms of how the TFs control the expression of target genes is
a challenging task, especially when multiple TFs collaboratively participate in the transcriptional
regulation. Recent developments in biotechnology have been applied to uncover TF-target binding
relationships to reconstruct draft regulatory circuits at a systems level. Furthermore, to identify
regulatory interactions in vivo and consequently reveal their functions, TF single/double knockouts and over-expression experiments have been systematically carried out. However, the results
of many single or even double-knockout experiments are often non-conclusive, since many genes
are regulated by multiple TFs with complementary functions.
To predict the TF combinations that the knocking out of them are most likely to bring about
the phenotypic change, we developed a new computational tool called TRIM that models the interactions between the TFs and the target genes in terms of both the TF-target interaction’s function
(activation or repression) and its corresponding logical role (necessary and/or sufﬁcient). We used
DNA-protein binding and gene expression data to construct regulatory modules for inferring the
transcriptional regulatory interaction models for the TFs and their corresponding target genes. Our
TRIM algorithm is based on an HMM and a set of constraints that relate gene expression patterns
to regulatory interaction models. However, TRIM infers up to 2-TFs interactions. Inferring the

collaborative interactions of multiple TFs is a computationally difﬁcult task, because when multiple TFs simultaneously or sequentially control their target genes, a single gene responds to merged
inputs, resulting in complex gene expression patterns. We developed mTRIM to solve this problem
with a modiﬁed association rule mining approach. mTRIM is a novel method to infer TF collaborations in transcriptional regulation networks. It can not only identify TF groups that regulate the
common targets collaboratively but also TFs with complementary functions. However, mTRIM
ignores the effect of miRNAs on target genes.
In order to take miRNAs’ effect into considerations, we developed a new computational model
called TmiRNA that incorporates miRNAs into the inference. TmiRNA infers the interactions
between a set of regulators including both TFs and miRNAs and the set of their target genes. We
used our model to study the combinatorial code of Human Cancer transcriptional regulation.

To Prophet Mohammed (PUH)

iv

ACKNOWLEDGMENTS

I also would like to thank my committee members; Dr. Gregg Howe, Dr. Rong Jin, and Dr.
Titus Brown for their valuable comments and feedbacks.
Special thanks to Dr. Titus Brown for his great effort in one of the most beneﬁcal classes I ever
had (CSE891, Open problems in Bioinformatics). Thanks to Dr. Jason Pell (former colleague in
GED lab) for his help and support. I also would like to thank our helpful grad director Dr. Eric
Torng for his support. I also would like to thank Dr. Radha Hyder and Dr. Hassan Khalil for their
kind advice and support.
Many thanks to Dr. Rami Halloush (former colleague in Chen lab) for his generous help,
advice, and continous encouragements.
I would like to thank Dr. Yuhua Jiao for her support and advice. encouragements. I also would
like to thank Dr. Bruce Rosa and Jiajie Peng (former colleagues in Chen lab) for their help and
advice.
I am extremely grateful to my parents and brothers for their continuous support and patience.
Many thanks to my friends Shireen Ahmed, Ghada Abdel Samie and Azza Sallam for their
overseas support and encouragments. Many thanks to my friend Donna Fernandez for the great
time I spent with her in MSU, her generous advice, and encouragements.
Many thanks to my spirtual fathers and professors, Dr. Ahmed Sharaf Eldin and Dr Hamed
Nassar for their conitnous encouragments and care.
Finally, I would like to thank my generous funding of my PhD by the Egyptian government.

v

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

x

Chapter 1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.1 Experimental Methods For Understanding Transcriptional Regulation
1.2 Computational Methods For Understanding Transcriptional Regulation
1.3 Computational Models . . . . . . . . . . . . . . . . . . . . . . . . .
1.3.1 Hidden Markov Model (HMM) . . . . . . . . . . . . . . . . .
1.3.2 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . .
1.3.3 Association Rules Mining . . . . . . . . . . . . . . . . . . .
1.4 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.5 Dissertation Contribution . . . . . . . . . . . . . . . . . . . . . . . .
1.6 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

1
2
4
7
7
10
11
13
13
15

Chapter 2
Inferring 2-TF Regulatory Interactions
2.1 Introduction . . . . . . . . . . . . . . . . . . .
2.2 Concepts . . . . . . . . . . . . . . . . . . . . .
2.3 Methods . . . . . . . . . . . . . . . . . . . . .
2.3.1 Constructing Regulatory Modules . . .
2.3.2 Designing the HMM model . . . . . . .
2.4 Experimental Results . . . . . . . . . . . . . .
2.4.1 Data preparation . . . . . . . . . . . .
2.4.2 Evaluation criteria . . . . . . . . . . . .
2.4.3 Evaluation of TRIM on yeast data . . .
2.5 Application of TRIM on Arabidopsis data . . .
2.6 Conclusions . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

16
16
20
21
21
23
28
29
30
31
35
41

Chapter 3
Inferring K-TF Regulatory Interactions . . . . . . . . . .
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2.1 Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2.2 mTRIM Step 1: Inferring Individual Regulatory Interactions
3.2.3 mTRIM Step 2: Mining Multiple-TF Regulatory Interactions
3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . .
3.3.1 Data preparation . . . . . . . . . . . . . . . . . . . . . . .
3.3.2 Evaluation 1: Single TF Knock-outs . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

42
42
45
46
47
50
53
54
58

vi

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

3.4

3.3.3 Evaluation 2: Genetic Interaction . . . . . . . . . . . . . . . . . . . . . . . 58
3.3.4 Evaluation 3: Synthetic Transcriptional Regulatory Networks . . . . . . . . 59
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Chapter 4
Human Cancer Transcription Regulation . . . . . . . . . . . . . . . . .
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.1 Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.2 Inferring Regulatory Sequences . . . . . . . . . . . . . . . . . . . . . .
4.2.2.1 Candidate Generations . . . . . . . . . . . . . . . . . . . . . .
4.2.3 Constructing Regulatory Modules . . . . . . . . . . . . . . . . . . . . .
4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.1 Evaluating TmiRNA Regulatory Modules using Gene Ontology Functional
Enrichment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.2 Evaluating TmiRNA using Synthetic miRNA-TF Regulatory Networks .
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.

65
65
67
68
68
70
71
72

. 74
. 76
. 79

Chapter 5
Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 80
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
BIBLIOGRAPHY . . . . . . . . . . . . . .

vii

. . . . . . . . . . . . . . . . . . . 84

LIST OF TABLES

Table 1.1

TF-Target interactions can be modeled in terms of the TF’s functional role
as an activator (up-regulates the target gene’s expression) or a repressor
(down-regulates the target gene’s expression), and the logical role of the
TF as being necessary and/or sufﬁcient. The two categories (functional
and logical roles) can be combined. . . . . . . . . . . . . . . . . . . . . 14

Table 2.1

Constrains for the inferring TF-target interaction models in a 2-TF collaborative regulatory module. They are based on the TF’s expression direction and the response of the target gene. . . . . . . . . . . . . . . . . . 26

Table 2.2

An illustrative example of HMM training process. There are 10 observations of gene expression changes for a collaborative regulatory module
with 2-TF: 0 means no signiﬁcant gene expression change; 1 means signiﬁcant up-regulation and -1 means signiﬁcant down-regulation. . . . . . 28

Table 2.3

The emission probabilities of the ﬁrst four observations of gene expression changes in the illustrative example shown in Table 2.2. After processing these four observations, the model suggests that the most possible
regulatory interaction model for (T F1 , g) is RS with probability 81% and
for (T F2 , g) is AN with probability 68%. . . . . . . . . . . . . . . . . . . 28

Table 2.4

Yeast cell cycle time series data that are used for model training and testing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

Table 2.5

Summary of TRIM’s predictions for independent and two TFs regulatory
modules on yeast and Arabidopsis. . . . . . . . . . . . . . . . . . . . . . 32

Table 2.6

The design of the 6 experiments on Arabidopsis abiotic stress data. X:
testing data; T: training data . . . . . . . . . . . . . . . . . . . . . . . . . 38

Table 2.7

The prediction results of the 10 TFs on Arabidopsis abiotic stress data
with TRIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Table 3.1

Illustrative example of time-series gene expression data for the genes in
Figure 3.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

viii

Table 3.2

Illustrative example of regulatory interaction identiﬁcation on the TRN in
Figure 3.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

Table 3.3

Regulatory modules size for Harbison and Reimand datasets . . . . . . . 56

Table 3.4

The number and type of the regulatory interactions for individual TFs
predicted by mTRIM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Table 3.5

Number of the multiple-TF regulatory interactions identiﬁed by mTRIM. . 58

ix

LIST OF FIGURES

Figure 2.1

TRIM framework for inferring TF-target regulatory interaction models.
TRIM has two main steps: 1) a regulatory module construction step that
includes both a clustering of co-expressed genes and a topological analysis to classify independent and collaborative regulatory modules from
a given large-scale TRN, and 2) a HMM model to infer TF-target interaction models in the independent and collaborative regulatory modules
identiﬁed in step 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

Figure 2.2

An illustrative example of regulatory module construction. T F1 and T F2
regulate genes g1 and g2 , and g1 and g2 belong to the same gene expression cluster, so this regulatory module contains T F1 , T F2 , g1 and g2 . . . . 23

Figure 2.3

A simpliﬁed representation of the HMM model for the collaborative regulatory module with 2-TF, T F1 and T F2 , where each TF has four states
(i.e., AS, AN, RS and RN). Each state emits two possible outputs, active
or inactive. One can view a state as representing whether a speciﬁc interaction model of an individual TF-target relation is valid or invalid. In the
training process, if an active output is emitted from one of the four states
of T F1 , the model will transit to a state belonging to T F2 and vice versa. . 24

Figure 2.4

In-degree distribution of the yeast TRN for Reimand dataset. . . . . . . . 25

Figure 2.5

In the independent regulatory modules, the overlap between the single
TF knockout supported TF-target pairs (815) and TRIM prediction results
(833) for the functional roles (activation or repression) is 549, greater than
the overlap between knockout and the results of DREM (873), which is
456. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

Figure 2.6

A histogram of gene expression correlation of 26 TFs in the independent
regulatory modules of yeast using TRIM, DREM or all time-points after
applyng decay adjustments. These 26 TFs are the intersection between
the results of DREM and TRIM. . . . . . . . . . . . . . . . . . . . . . . 34

Figure 2.7

A histogram of gene expression correlation of 15 TFs in the 2-TF collaborative regulatory modules of yeast using TRIM, DREM or all time-points
after applying decay adjustments. These 15 TFs are the intersection between the results of DREM and TRIM. . . . . . . . . . . . . . . . . . . . 35

x

Figure 2.8

Figure 2.9
Figure 2.10

Figure 2.11

Distribution of gene expression correlation of the independent regulatory
modules. For interpretation of the references to color in this and all other
ﬁgures, the reader is referred to the electronic version of this dissertation.
Distribution of gene expression correlation of 2-TF collaborative regulatory modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The independent regulatory modules after decay adjustments. For interpretation of the references to color in this and all other ﬁgures, the reader
is referred to the electronic version of this dissertation . . . . . . . . . . .
Pearson correlation of target genes of (A) ATGL1, (B) WRKY53 and (C)
RD26 using different combinations of training and testing data with TRIM.

36
37

38
41

Figure 3.1

an illustrative example of regulatory interaction in a TRN, in which three
TFs collaboratively co-regulate two target genes . . . . . . . . . . . . . . 46

Figure 3.2

In-degree distribution of the yeast TRN for Harbison dataset. . . . . . . . 55

Figure 3.3

Evaluation of the 2-TF regulatory interactions using genetic interactions
on Harbison dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

Figure 3.4

Evaluation of the 2-TF regulatory interactions using genetic interactions
on Reimand dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

Figure 3.5

Parameter adjustments on the evaluation of the high-order regulatory interactions using genetic interactions. (a,d) Different gene expression cutoffs, (b,e) Different GI signiﬁcance cutoffs, (c,f) Different mTRIM cutoffs on Harbison dataset and Reimand dataset respectively. . . . . . . . . 61

Figure 3.6

A synthetic transcriptional regulatory network, in which the solid lines
represent transcriptional regulations and the dotted lines represent TFDNA bindings only (meaning binding but not regulation). . . . . . . . . . 62

Figure 3.7

Comparing mTRIM with DREM and TRIM on two sets of synthetic data
with different noise rates: Speciﬁcity . . . . . . . . . . . . . . . . . . . . 63

Figure 3.8

Comparing mTRIM with DREM and TRIM on two sets of synthetic data
with different noise rates: Precision . . . . . . . . . . . . . . . . . . . . . 63

Figure 3.9

Comparing mTRIM with DREM and TRIM on two sets of synthetic data
with different noise rates: Sensitivity. . . . . . . . . . . . . . . . . . . . . 64

Figure 4.1

A distribution of miRNAs that are active in different tissues. . . . . . . . 73

xi

Figure 4.2

In-degree distribution of the Human TRN . . . . . . . . . . . . . . . . . 74

Figure 4.3

Fitting gene expression Levels to normal distribution . . . . . . . . . . . 75

Figure 4.4

Functional enrichment evaluation of TmiRNA under different categories
(a) Biological Process, (b) Molecular Functions, and (c) Cellular Components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

Figure 4.5

Synthetic network used for evaluating TmiRNA.

Figure 4.6

Performance of TmiRNA on synthetic data with 10% noise rate. . . . . . 78

Figure 4.7

Performance of TmiRNA on synthetic data with 40% noise rate. . . . . . 78

xii

. . . . . . . . . . . . . 77

Chapter 1
Introduction
Genetic information encoded in a cell’s genome can determine the properties of a cell. In order
to understand how such information deﬁne distinct cell types and cellular states, we need to understand how such information is differentially and dynamically retrieved. Transcriptional factors
(TFs) and microRNAs (miRNAs) represent the most numerous factors that control the expression
of genomic information. Hence, understanding the transcription regulation and how a set of genes
are regulated in concert by a group of TFs is important to decipher such information.
Understanding transcription regulations is a key to understand cellular differentiation and how
an organism responds to a stimuli. In addition, numerous human disease have been associated with
mutations in transcriptional regulatory elements. It is also recently discovered that chromosomal
rearrangement involving regulatory elements result in variety of cancers. Moreover, many human
diseases are caused by complex interactions of multiple genes and not only caused by mutations in
single genes. Therefore, understanding transcriptional modulation can have therapeutic beneﬁts.
Furthermore, engineering transcriptional activators can be used as therapeutic agents in diseases
caused by loss of gene expression [9, 10].
In Section 1.1 and Section 1.2, we explain both the experimental and computational methods
used to study transcription regulation respectively. In Section 1.3, we provide a brief introduction
to the computational models used in this dissertation. In Section 1.4, we deﬁne our problem
and the focus of this dissertation. We brieﬂy summarize the contributions of the dissertation in
Section 1.5. Finally, in Section 1.6, we provide the organizational structure of the dissertation.
1

1.1

Experimental Methods For Understanding Transcriptional
Regulation

Information about a genetic network may be revealed experimentally by applying a directed perturbation to the network [4, 17, 61, 79]. This helps observing the steady-state expression levels of
every gene in the network in the presence of the perturbation [4, 17, 61, 79]. Perturbations may be
genetic, or biological. In genetic perturbation, the expression levels of one or more genes are ﬁxed
by deletion or over-expression. In biological perturbation, one or more non-genetic conditions are
altered, such as a change in growth media, a temperature increase [4, 17, 61, 79]. Single gene perturbation has the following advantages; 1) It is possible to observe directly the functional effect of
controlled changes, 2) it easy to interpret the data, 3) it leads to gene variants that have more easily
detectable effects, 4) introduces qualitatively different proteins and thereby uncovers the range of
gene function. For the above reasons, single gene perturbation has been proven to be successful
in extending our knowledge of pathway components and interactions [4, 61, 79]. However, if there
is no one-to-one correspondence between elements and functions, single perturbation experiments
will fail to reveal the functional structure of a system [4, 61, 79]. For example, if a gene A can
be activated by two routes, in one route, gene A is activated directly be gene C and in another
route, gene A is activated by gene B and gene C. In such case, only double knockout can lead to
the loss of function phenotype for gene C [4, 61, 79]. Hence, multiple perturbations is essential
for understanding moderately complex biological systems [4, 61, 79]. To verify the TF-target gene
relationships and to detect the TF functions in vivo, TF knockouts and/or over-expression experiments are usually carried out [34]. Knockout experiments can also detect whether a TF is necessary
to its targets. Over-expression experiments can also detect whether a TF is sufﬁcient to its targets.
However, single knockout or over-expression may not provide statistically signiﬁcant evidence
2

due to redundancy or confounding signals from indirect regulatory feedback [49]. For example,
it has been shown that approximately 73% (about 4,500) of the known genes of Saccharomyces
cerevisiae (yeast) are non-essential [32]. The results of the single knockout or over-expression
experiments are therefore often non-conclusive, as it is highly likely that multiple non-essential
genes can be involved. This has led to the development of automated experimental methods for
double-knockouts to provide more statistically signiﬁcant determination of the TF functions [29].
However, to systematically knockout (or over-express) all possible combinations of the TFs in the
whole genome is still challenging. Given an organism with k TFs, the total number of possible
double-TF combinations is [13]:
k ∗ (k − 1)/2

(1.1)

For complex organisms, k can be easily in the range of thousands. Instead of blindly trying
out all possible TF pairs for double-knockout experiments, one solution is to select the TF pairs
that are most likely to bring about the phenotypic change. To do so, we need to understand the
interaction models employed by the TFs to inﬂuence the regulatory patterns of the target genes in
the network. In other words, we need to uncover the models of TF-target regulatory interactions
of the TF pairs and the target gene in terms of the TF-target interactions’ directions (activation or
repression) and their corresponding logical roles (necessary and/or sufﬁcient).
A fundamental step in transcriptional regulation is a transcription factor binding DNA[19,20].
Recent advances in DNA sequencing and microarray technologies lead to a huge increase in the
amount of information that can be accumulated about this process [28, 30]. Particularly, ChIP-chip
and ChIP-seq experiments enabled genome-wide mapping of TF binding sites in a single experiment [28, 30, 73]. The knowledge of these binding sites is a key step in determining which gene is
regulated by which TFs. However, the binding sites alone are not sufﬁcient to infer transcriptional

3

regulation [28, 48].
In simple organisms, such as bacteria or yeast, transcriptional regulation usually occurs by
a TF binding in a promoter region, near the transcription start site of a gene [28]. However, in
more complex organisms, regulation occurs through long-range enhancers, often spanning many
tens of thousands of base pairs [28, 30]. In such case, it is not obvious which gene a speciﬁc
binding site might be regulating [28, 30]. This is because the enhancer regulation is independent
of the binding site orientation corresponding to the regulated gene and transcription factors bound
to a particular gene may regulate a different gene through an enhancer [28, 30]. Additionally,
binding sites in gene-rich locations of a genome have a large number of potential targets in its
neighborhood [28, 30]. Although we might have an indication of the maximum range at which
an enhancer is likely to function, any gene within that distance is considered a potential target
[28, 30]. In addition, the presence of a regulator at a promoter region indicates binding but not
function or regulation [6]. For the above reasons, binding information is not sufﬁcient to infer
transcriptional regulation [6, 28]. Therefore, binding data can be combined with expression data
from microarrays, and with sequence motif searches for transcription factor binding sites. When
two or all the three experiments agree, we expect to have a reliable assignment of a transcription
factor to a gene [58, 76].

1.2

Computational Methods For Understanding Transcriptional
Regulation

A signiﬁcant effort has been devoted to modeling and learning regulatory networks. These studies modeled regulatory interactions and dynamics with various degrees of details, ﬂexibility, and
assumptions. Such models can be hierarically classiﬁed based on the model of regulatory inter4

actions. Differential equations and stochastic models can be found on the highest level as they
provide detailed descriptions of regulatory systems and provide simulations of systems dynamics
[6, 23, 62]. However, these models require accurate estimations of a large number of parameters
and hence they are computationally demanding. At the next level lies Boolean networks. Boolean
networks simply considers genes as on or off gates. They also include logic interactions including AND, OR, XOR,..etc. Though, the assumption held by Boolean networks is simple, Boolean
networks can be learned with fewer data. In addition, Boolean networks are robust and easy to
interpret. Linear model are intermediate approaches in terms of complexity and robustness. Linear
models describe the expression level of a gene as a weighted sum of the levees of its predictors
[8, 67].
Reducing the dimensionality of the search space before inference can simplify the process of
learning and/or modeling regulatory networks. One approach to this reduction is to group coregulated genes into clusters where gene expression clustering is commonly used. However, coexpressed genes does not correspond to co-regulated genes specially under different experimental
conditions as genes are often regulated differently under different conditions. For this reason,
Biclustering was developed to ﬁnd co-regulated genes on the basis of coherence under subsets of
observed conditions [11].
In order to infer the causal relationship among different elements such as genes, proteins,
metabolites, neurons and so on, based upon multi-dimensional temporal data, Granger causality and dynamic Bayesian network inference are used and massively reported in research. The
dynamic Bayesian network inference performs better than the Granger causality when the data
size is short, while Granger causality proved better performance when the data size is long [85].
The previous brieﬂy explained approaches have been widely used to decipher the mechanisms
of how the TFs control the differential expression of a target gene in a TRN. In order to increase
5

the understanding of the regulatory mechanism, we need to decipher how multiple TFs collaborate
in regulating their targets. In the previous section, we have showed that systematically knockout
(or over-express) all possible combinations of the TFs in the whole genome is still challenging.
Connecting the expression patterns to the interaction model can help revealing the role of each TF
instead of blindly trying each individual TF [6, 30]. Deciphering the roles of TFs under different
conditions, help understanding the dynamic mechanisms of TRNs.
Several approaches have been studied to reveal the dynamic mechanism of TRNs. Dynamic
Regulatory Events Miner (DREM) integrates timeseries expression data and ChIP-chip or motif
information to infer an annotated global temporal map [24]. This global map uncovers the transcriptional regulatory events that can help discovering the timeseries expression patterns and the
factors controlling these events during a cells response to stimuli [24]. DREM takes a unique
approach by focusing its modeling of temporal regulatory interactions on bifurcation points. Bifurcation events occur when a set of genes share a similar expression path up to a certain time
point diverge [24]. This approach may lead to better insights regarding the system being studied,
this is because TFs may bind to different genes at different time points, and DREM can derive
dynamic maps that associate TFs with the genes they regulate and their activation time points.
These insights help identifying master and secondary regulators respectively that control the initial
response, and responsible for more speciﬁc pathways respectively. These insights can also explain
several aspects of the observed response, for example the condition-speciﬁc activity of factors and
the activation of certain network motifs [24]. Although this approach can identify master and secondary regulators, it cant infer the logical roles of the TFs; it also can’t infer the combinatorial
interactions of multiple transaction factors [24, 79]. Another approach used a computational approach to capture combinatorial effects of multiple transcription factors in transcription control.
Binding and expression data were used to identify regulatory modules including subsets of reg6

ulators and genes together with a regulatory program [79]. However, this approach reduced the
combinatorial functions of TFs to the properties of individual TFs and it did not consider the combinatorial interactions of TFs. In addition, the method uses an incremental approach that does not
study the functions of the TFs simultaneously because of the scalability issue introduced by the
greedy search.
However, the previous approaches do not take the effect of microRNAs (miRNAs) into considerations. Le et al. in [40] developed a protein interaction-based MicroRNA Modules (PIMiM) , a
new model that integrates sequence, expression and interaction data to identify modules of mRNAs
controlled by sets of miRNAs. PIMiM combines regression with network information to discover
these modules. PIMiM is used to infer condition-speciﬁc regulation of miRNAs and their targets.
Le et al. in [72] proposed an approach to infer the causal knockdown effects of each single miRNA
on the regulations of its targets, using expression proﬁles of miRNAs and mRNAs without taking
into consideration prior target information. However, the proposed approach considers the effect
of a single miRNA on each of its targets and it does not consider the level of regulations that a
group of miRNAs has on a particular gene.

1.3

Computational Models

In this section, we provide a brief background for the computational methods used in the dissertation.

1.3.1

Hidden Markov Model (HMM)

A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled
is assumed to hold the Markov process property with unobserved (hidden) states. A stochastic
7

process is said to have the Markov property if the conditional probability distribution of its future
states depends only upon the current state. Hidden Markov model is very popular machine learning
tool in bioinformatics [19, 22]. To formalize the notation of hidden Markov models, let p be the
path or the sequence of states. This path follows a simple Markov chain which means that the
probability of a state depends solely on the previous state [19, 22]. This chain can be represented
as shown in Equation 1.2.
akl = P(Πi = l|Πi−1 = k)

(1.2)

Where Πi is the i th state in the path and akl is the transition probability from state k to state
l. A begin state is used to model the beginning of sequences in Markov chains. The transition
probability a0k presents the probability for the model to transit from this begin state to state k, this
intrinsically means the probability that the Markov chain starts at state k. Similarly, we can use
an end state to model the ending of a state sequence to transit the sequence into an end state. In
addition, each state can produce a symbol from a distribution over all possible symbols. Hence, an
emissions probability ek (b) in Equation 1.3 represents the probability that symbol b is seen when
the model is in state k [19, 22].
ek (b) = P(xi = b|Πi = k)

(1.3)

There are three fundamental problems associated with hidden Markov models are introduced in
the following sub-sections.
Considering the underlying states to ﬁnd out what the sequence of observations means is called
decoding. Several approaches have been proposed for decoding. We will present here the most
common among them, the Viterbi algorithm. The Viterbi algorithm is a dynamic programming
algorithm for ﬁnding the most likely sequence of hidden states called the Viterbi path that results
in a sequence of observed events. If we have to choose only one path for our prediction, then the
8

one with the highest probability is the most probable path to be chosen [19, 22].

Π∗ = argmaxΠ P(x, Π)

(1.4)

Since many different state paths can produce the same sequence x, we need to compute of a
particular observation sequence x as in Equation 1.5.

P(x) = ∑ P(x, Π)

(1.5)

π

But the number of possible paths Π increases exponentially with the length of the sequence. This
makes the brute force enumeration for all possible paths inapplicable. This probability can be
computed using a similar approach to the Viterbi algorithm, by replacing the maximization steps
to summations [19, 22].
The aim of HMM learning is to determine the emission and transition probabilities from the
training samples. The forward-backward algorithm can be used to determine the model parameters.
The algorithm works basically by updating the weights in order to better explain the observed
training sequences. The algorithm ﬁrst computes the probability of Akl and Ek (b) by looking at the
probable path of the training sequence using the current akl and ek (b). Equation 1.6 and Equation
1.7 are used to update the values of the transition and emission probabilities respectively. The
process is repeated till a stopping criterion is reached [19, 22].

akl =

ek (b) =

Akl
∑l Akl

(1.6)

Ek (b)
∑b Ek (b )

(1.7)

9

1.3.2

Bayesian Inference

Bayesian models provide full joint probability distributions over both observed data and unobserved model parameters. Bayesian statistical inference is carried out using Bayes rule. For a
probabilistic model, the parameters of the model are to be inferred from data. Assume we would
like to infer parameters for a model M from a set of D. Suppose there is a probability distribution
over the parameters . If we condition on M , gives the version of Bayes theorem as in Equation 1.8.

P(Θ|D, M) =

P(D|Θ, M)P(Θ|M)
P(D|M)

(1.8)

Where P(Θ|M) is the prior probability which has to be chosen in a reasonable manner, P(Θ|D, M)is
the posterior probability, and P(D|Θ, M) is the likelihood probability [19, 22].
The EM algorithm is an efﬁcient iterative procedure to compute the Maximum Likelihood (ML)
estimate if there is some missing or hidden data. In ML estimation, we would like to estimate the
model parameter(s) for which the observed data are the most likely. Each iteration of the EM
algorithm consists of two processes: The E-step, and the M-step.
E-step: In this step, the missing data are estimated given the observed data and current estimation of the model parameters. This is achieved using the conditional expectation, explaining the
choice of terminology [19, 22]. M-step: in this step, the likelihood function is maximized under the
assumption that the missing data are known. The estimation of the missing data from the E-step is
used in behalf of the actual missing data. Convergence is assured since the algorithm is guaranteed
to increase the likelihood at each iteration [19, 22].
The fact that each iteration improves the likelihood of (Θ), in addition to the simplicity and
ease of implementation make the EM algorithm so useful. However, the rate of convergence can
become excruciatingly slow as approaching local optima. In addition, EM works best when the
10

percentage of missing information is small and the dimensionality of the data is not too large.
Hence, EM can require much iteration, and higher dimensionality can signiﬁcantly slow down the
E-step [19, 22].

1.3.3

Association Rules Mining

Association analysis is a useful methodology for discovering interesting relationships hidden in a
large datasets where the uncovered relationships are represented in the form of association rules or
frequent items. Formally, an association rule is an expression of the form A ⇒ B, where A, and B
are disjoint itemsets, such that A ∩ B =0. The signiﬁcance of an association rules can be measure
/
using the support and the conﬁdence of the rule. The support measures how often the rule occurs
in a dataset. While the conﬁdence measures how frequently items in B are found in transactions
that contains A. Formally,
Support, s(A ⇒ B) =

s(A ∪ B)
N

Con f idence, c(A ⇒ B) =

s(A ∪ B)
s(A)

(1.9)
(1.10)

where N is the number of transactions in the dataset. When a rule has a very low support, it means
it happens by chance and hence less interesting. On the other side, conﬁdence of a rule reﬂects the
reliability of the inference. If A⇒ B has a high conﬁdence, then it is more likely that B occurs in
the transactions that contain A. Association rules mining can be mathematically deﬁned as follows:
Given a set of transactions T, we need to ﬁnd all the rules with support ≥ ms, and conﬁdence ≥mc,
where ms and mc are the minimum support and minimum conﬁdence thresholds respectively and
are computed using Equation 1.9 and Equation 1.10 respectively. The basic strategy used by
many association rule mining algorithms is to divide the problem into two subtasks:

11

1. Frequent Itemset Generation: The goal of this step is to ﬁnd all the frequent itemsets. Frequent itemset is an itemset that satisfy the minimum support threshold.
2. Rule Generation: The goal of this step is to ﬁnd all the rules that have conﬁdence higher than
the minimum conﬁdence.
Apriori is one of the oldest algorithms that is widely used in mining association rules [1]. Apriori
applies the Apriori principle to prune the exponential space. Apriori principle states that if an
itemset is frequent, then all of its subsets must also be frequent, or if an item set is infrequent then
all its supersets must also be infrequent.
Apriori algorithms performs several passes over the data in order to mine association rules.
In the ﬁrst pass, the algorithm simply counts the number of items occurrence to determine the
frequent 1-itemsets, L1 . In pass k, the algorithm uses the Lk−1 -itemsets found in the k-1 pass, to
generate the k-candidate itemsets (Ck ) using an Apriori generator function. The Apriori generator
works as follows:
1. Join Lk−1 with Lk−1 .
2. Delete every c ∈ Ck such that any k − 1-subset of c is not in Lk−1
Then the database is scanned again to count the support of candidates in Ck .
Although many association rule mining algorithms use support and conﬁdence, support and
conﬁdence have some limitations. Many potentially interesting rule that have low support items
might be eliminated because of the given minimum support. The conﬁdence ignores the support
of the itemset of the rule consequent. Hence, various measures have been used to qualify the
signiﬁcance of an association rule. In addition, Apriori suffers from a number of inefﬁciencies.
One major trade-off of Apriori is that the candidate generation step generates large numbers of
subsets.
12

1.4

Problem Statement

We deﬁne a regulatory module R (TF; G; I) as a set of genes G regulated in concert by a group
of one or more TFs that govern the target genes’ behaviors via appropriate TF-target regulatory
interactions in I [7]. There are two types of regulatory modules. In an independent regulatory
module, the target genes are solely regulated by one TF. A TF’s regulatory interaction model can
be deﬁned in terms of two properties: the TF’s functional role as an activator or a repressor, and its
logical role as being necessary or sufﬁcient. Table 1.1 shows such a regulatory interaction model
as proposed by previous researchers [3, 7] . Note that the categories in the TF’s functional and
logical roles can be combined. A TF-target interaction model in R can be Activator Necessary
(AN), Activator Sufﬁcient (AS), or Activator Necessary and Sufﬁcient (ANS). Similarly for TFs
that are repressors, they can be RN, RS, or RNS [5]. A multiple regulatory interaction is deﬁned
as a group of TFs that collaborate to control the expression levels of the same target genes. The
directions of all the TFs in the group, therefore, form a transcriptional regulation pattern of the
target genes. In a multiple-TF regulatory interaction, a group of TFs collaborate to control the
expression levels of the same target genes.
This dissertation focuses on identifying TF groups that are collaboratively responsible for target gene expressions by uncovering the regulatory interactions in terms of their directions and
corresponding logical roles from gene expressions and TF-DNA binding data.

1.5

Dissertation Contribution

The following is a summary of the contributions of this dissertation:
1. We developed an HMM based model called TRIM that relate gene expression patterns to

13

Table 1.1: TF-Target interactions can be modeled in terms of the TF’s functional role as an activator (up-regulates the target gene’s expression) or a repressor (down-regulates the target gene’s
expression), and the logical role of the TF as being necessary and/or sufﬁcient. The two categories
(functional and logical roles) can be combined.
Role
Functional

Concept
Activator
Repressor
Necessary

Direction of
Logical

Sufﬁcient
Necessary and Sufﬁcient

Description
Response of target gene is inline with the expression
change of TF
Response of target gene is opposite to the expression
change of TF
Decreasing TF’s expression level leads to responses
opposite to its functional role
Increasing TF’s expression level leads to the responses consistent with its functional role
Increasing TF’s expression level leads to responses
consistent with its functional role, and decreasing the
TF’s expression level leads to the responses opposite
to its functional role

regulatory interaction models based on a set of constraints. TRIM infers the regulatory
interactions of one or two TFs. Although, multiple TFs can collaboratively regulate their set
of target genes, TRIM still solves a sub-problem where target genes are regulated by 2 TFs.
2. In order to infer k-TFs regulatory interactions, we developed mTRIM, an integration of an
EM-based Bayesian inference and a new association rule mining approach built on a set
of basic constraints that relate gene expression patterns to regulatory interactions. mTRIM
infers the regulatory interactions of k-TFs that collaboratively regulate a set of target gene
with no limitations on the size of k. Clearly, mTRIM has higher scalability than TRIM. In
addition, mTRIM gives an insight on the replacements between TFs.
3. To better study transcription regulation and to enhance the performance, we incorporated
miRNAs into the inference. We designed TmiRNA, a sequential mining approach that infer the interactions of k regulators including TFs and miRNAs and their set of target genes.
Instead of constructing regulatory modules based on binding network and gene expressions,
14

TmiRNA provides a new approach to construct regulatory modules based on regulatory interactions.

1.6

Dissertation Outline

In this chapter, we provided a brief background about transcriptional regulation. We brieﬂy described the the experimental methods that can be used to study transcriptional regulation. Furthermore, we explained some of the theoretical bounds of those experimental methods, and the
need to have new computational models to study transcriptional regulation. Then, we explained
brieﬂy the basics of the computational models used through out the dissertation. Based on this
solid foundation, we stated the main problem we aim to solve in this dissertation.
In Chapter two, we describe our ﬁrst approach (TRIM) to infer TFs interaction models. This
approach is limited to 2-TFs.
In Chapter three, we describe a new approach (mTRIM) to infer k-TFs interaction models, with
no bound on the size of k.
In Chapter four, we incorporate miRNA into the inference. We described TmiRNA, a new
sequential mining approach, to infer k-regulators interaction models, the regulators are a combination of TFs and miRNAs. We also describe how TmiRNA constructs regulatory modules based on
interaction models.
Finally, in Chapter ﬁve, we conclude our work and shed light to our future work.

15

Chapter 2
Inferring 2-TF Regulatory Interactions

2.1

Introduction

The complex gene expression programs in living cells are moderated by regulatory proteins called
transcription factors (TFs) that control the transcription of genes in the context of transcriptional
regulatory networks (TRNs) [56]. The TFs interact with their target genes in the TRNs to up or
down-regulate gene expression. They can act independently or collaboratively with other TFs,
leading to different TF-target interaction models that inﬂuence the regulation patterns of target
genes in different ways [22, 79].
Various experimental methods have been developed to unravel the complex regulatory mechanisms behind biological processes. Recent developments in biotechnology (e.g., chromatin immunoprecipitation, yeast one-hybrid and next-generation sequencing) have been used to indirectly
or directly uncover TF binding relationships [16, 58] to reconstruct draft regulatory circuits at a
systems level [22, 57, 79]. To verify the TF-target gene relationships and to detect the TF functions
in vivo, TF knockouts and/or overexpression experiments are usually carried out [29].
However, single knockout or overexpression may not provide statistically signiﬁcant evidence
due to redundancy or confounding signals from indirect regulatory feedback [28]. For example, it
has been shown that approximately 73% (about 4,500) of the known genes of S. cerevisiae (yeast)
are non-essential [30]. The results of the single knockout or overexpression experiments are therefore often non-conclusive, as it is highly likely that multiple non-essential genes can be involved.
16

This has led to the development of automated experimental methods for double-knockouts to provide more statistically signiﬁcant determination of the TF functions [73]. However, to systematically knockout (or overexpress) all possible combinations of the TFs in the whole genome is still
challenging. Given an organism with k TFs, the total number of possible double-TF combinations
is as in Equation 1.1. For complex organisms, k can be easily in the range of thousands.
Instead of blindly trying out all possible TF pairs for double-knockout experiments, one solution is to select the TF pairs that are most likely to bring about the phenotypic change. To do
so, we need to understand the interaction models employed by the TFs to inﬂuence the regulatory patterns of the target genes in the network. In other words, we need to uncover the models
of TF-target regulatory interactions of the TF pairs and the target gene in terms of the TF-target
interactions’ directions (activation or repression) and their corresponding logical roles (necessary
and/or sufﬁcient).
In this chapter, we design a set of constraints that relate gene expression patterns to regulatory interaction models, and propose an algorithm TRIM (Transcriptional Regulatory Interaction
Model Inference) [3] to systemically infer the regulatory interaction models between individual
TFs, as well as any two TFs, and their target genes, from wild-type time-series gene expression
data. Our TRIM algorithm is based on a hidden Markov model (HMM). Experimental results on
yeast data showed that TRIM outperformed the existing algorithms for inferring the regulatory interaction models of TFs and their target genes for individual TFs as well as pairs of TFs that are in
collaborative regulatory modules. In addition, on an individual Arabidopsis binding network, we
showed that the target genes’ expression correlations can be signiﬁcantly improved by incorporating the TF-target regulatory interaction models inferred by TRIM into the expression data analysis,
which may introduce new knowledge in transcriptional dynamics and bioactivation. We deﬁne a
regulatory module R(T F, G, I) as a set of genes G regulated in concert by a group of one or more
17

TFs that govern the target genes’ behaviors via appropriate TF-Target regulatory interactions in I
[61]. There are two types of regulatory modules. In an independent regulatory module, the target
genes are solely regulated by one TF. In a collaborative regulatory module, the target genes are
regulated by multiple TFs. In this chapter, we focus on collaborative regulatory modules with up
to 2-TFs. In the following chapters, we will show high order regulatory modules.
A TF’s regulatory interaction model can be deﬁned in terms of two properties: the TF’s functional role as an activator or a repressor, and its logical role as being necessary or sufﬁcient. Table 1.1 shows such a regulatory interaction model as proposed by previous researchers [61, 79].
Note that the categories in the TF’s functional and logical roles can be combined. A TF-target
interaction model in R can be Activator Necessary (AN), Activator Sufﬁcient (AS), or Activator
Necessary and sufﬁcient (ANS). For example, pheromone response elements are necessary and
sufﬁcient for basal and pheromone-induced transcription of the FUS1 gene of yeast[26]. Similarly
for TFs that are repressors, they can be RN, RS or RNS [4].
Yeang and Jaakkola [79] attempted to characterize the combinatorial regulatory models of multiple TF-target interactions using a heuristic approach to measure how well R ﬁts the associated
binding and gene expression data with a log likelihood function. The regulatory module’s likelihood is maximized with a greedy approach by incrementally adding genes to the module and
monitoring the predictions of the TF-target interactions for optimality. However, this incremental
approach does not study the functions of the TFs simultaneously because of the scalability issue
introduced by the greedy search. Their method also uses a p-value based approach to calculate
the signiﬁcance of the combinatorial property of a TF, determined by the gap of log likelihood
scores between their model and a model built on the randomized gene expression data based on
the entire time frame. However, as stated in Ernst et al [22], a TF usually functions at speciﬁc
“activation timepoints” instead of throughout the entire time frame. This means that the identiﬁca18

tion of TF-target interaction modules should be focused on such activation timepoints rather than
comparing with random gene expression data of the entire time frame. In this chapter, we include
a step to recognize the activation timepoints of the target genes in TRIM. Our experimental results
will show that the target genes’ expression correlation are indeed markedly improved by taking the
actual activation timepoints into consideration.
Another related algorithm is DREM, which was proposed to derive dynamic regulatory networks that associate TFs with target genes and their activation timepoints [22]. To uncover transcriptional regulatory events leading to the observed temporal expression patterns and the underlying factors that control these events during a cell’s response to stimuli, DREM integrates timeseries gene expression data and protein-DNA binding data to build a global temporal map. The
method mainly works by identifying bifurcation timepoints where the expression of a subset of
genes diverges from the rest of the genes. The bifurcation points are then annotated with the TFs
regulating these transitions, which result in a uniﬁed temporal map. The method can therefore
facilitate the determination of the time when TFs are exerting their inﬂuence, and assigns genes
to paths in the map based on their expression proﬁles and the TFs that control them. Unlike the
method by Yeang and Jaakkola [79], DREM’s ability to derive dynamic maps that associate TFs
with the genes they regulate and their activation timepoints has indeed led to better insights for
the regulatory module being studied. For example, one can identify master regulators that control
the initial response and secondary regulators that are responsible for speciﬁc pathways. Numerous aspects of the observed response, including the condition-speciﬁc activity of factors and the
activation of certain network motifs, can also be explained using DREM. However, unlike our
method, DREM does not infer the logical roles of the TFs; for example, whether a speciﬁc master
or secondary TF is necessary or sufﬁcient for regulating a set of target genes. Such knowledge
are essentially useful for understanding the complex regulatory mechanisms of many biological
19

processes. We will show in this chapter that by incorporating the knowledge of the regulatory
interaction model, we can signiﬁcantly improve the computation of gene expression correlation.

2.2

Concepts

The kernel of our TRIM method is an HMM, which is a stochastic model that assumes the Markov
property holds and all the states are unobserved (hidden). A stochastic process is said to have the
Markov property if the conditional probability distribution of its future states depends only upon
the current state, as shown in Equation 2.1:

p(x(t + 1)|x(0), ...., x(t)) = p(x(t + 1)|x(t))

(2.1)

An HMM consists of a set of hidden states and each state has a probability distribution over
the possible outputs [18]. In an HMM, a state ki transits to another state k j with a probability
P(k j (t + 1)|ki (t)). A state k can emit an output b with emission probability ek (b) = P(out put =
b|state = k) [18, 19, 51]. See Section 1.3.1 for more details about HMM.
In this chapter, each state refers to a possible TF-target interaction model, while the output
emitted by a state will indicate whether a particular interaction model is true [19].
By taking advantage of the HMM, our TRIM algorithm can consider the inﬂuence from multiple TFs simultaneously. Prior biological knowledge on the TFs such as gene perturbation experiments can also be incorporated in the setting of the initial emission probabilities, while the
time-dependent regulatory relations can be effectively captured by preserving the time dependency
within the HMM.
In this chapter, we assume that a TF-target interaction is consistent in the context of transcriptional control as long as the experimental conditions are unchanged. We also assume that
20

the activity of a TF is proportional to its mRNA abundance over time. Although these assumptions may be violated in practice, existing algorithms for inferring TF-target interaction models at
different levels of complexity [6, 22, 61, 79] have all been developed with these assumptions.

2.3

Methods

Given a large-scale TRN, we design our TRIM algorithm for inferring TF-target interaction models
systematically from large-scale TRNs. A TRN can be represented as a directed graph, in which
each node is a TF or a gene, and each edge represents a regulation relationship between a TF and a
target gene. The framework of TRIM, as shown in Figure 2.1, consists of two steps: the ﬁrst step
constructs the regulatory modules from the TRN; the second step infers the TF-target interaction
models in each of the regulatory modules.
Figure 2.4 shows that the genes in yeast, one of the most well studied eukaryotic organisms,
are regulated mostly by individual TFs (69.6%) or pairs of TFs (18.2%). Therefore, in this chapter,
we focus on independent regulatory modules and collaborative regulatory modules with two TFs
(we call these “2-TF collaborative modules”).

2.3.1

Constructing Regulatory Modules

Given a large-scale TRN, a TF may regulate multiple target genes simultaneously but with different
types of TF-target interactions. To construct regulatory modules, we extract the subnetworks using
gene expression clustering followed by graph partitioning (Figure 2.1 step 1).
We ﬁrst cluster all the target genes in a TRN based on their gene expression values with Cluster
3.0 (speciﬁcally, k-means), which uses Pearson correlation coefﬁcient for gene similarity metric
[20]. The clusters are then evaluated with Gene Ontology enrichment analysis using Bingo, and un21

Figure 2.1: TRIM framework for inferring TF-target regulatory interaction models. TRIM has
two main steps: 1) a regulatory module construction step that includes both a clustering of coexpressed genes and a topological analysis to classify independent and collaborative regulatory
modules from a given large-scale TRN, and 2) a HMM model to infer TF-target interaction models
in the independent and collaborative regulatory modules identiﬁed in step 1.
enriched clusters are discarded [47]. Genes in the same cluster are considered to be co-expressed.
And the co-expressed and co-regulated genes are usually weighed to be regulated by the same
TF(s) with a similar interaction model. So for the target genes that are regulated by the same single TF (or the same TF pair), we partition them based on whether they are in the same cluster to
construct independent regulatory modules (or collaborative regulatory modules). An illustrative
example is shown in Figure 2.2 T F1 and T F2 regulate genes g1 and g2 , and g1 and g2 belong to
the same gene expression cluster, so this regulatory module contains T F1 , T F2 , g1 and g2 .

22

Figure 2.2: An illustrative example of regulatory module construction. T F1 and T F2 regulate
genes g1 and g2 , and g1 and g2 belong to the same gene expression cluster, so this regulatory
module contains T F1 , T F2 , g1 and g2 .

2.3.2

Designing the HMM model

In the next step (Figure 2.1 step 2), we design a new HMM model [51] to infer the regulatory
interaction models for the TF-target interactions in every regulatory module detected above. For
the regulatory modules with two TFs, we run the HMM model directly. For the regulatory modules
with a single TF, we add a dummy TF with its expression value constantly zero. Some researchers
have pointed out that designing an HMM model is a sort of art [19]. In the following text, we
describe the details of how we design the structure, set the initial probabilities, and develop the
updating method for emission probabilities and transition probabilities for our HMM for 2-TF
collaborative regulatory modules.
Structure. To model 2-TF collaborative modules, our HMM consists of two TFs, T F1 and T F2 ,
where each TF has four states (i.e., AS, AN, RS and RN), as shown in Figure 2.3. Each state
emits two possible outputs, active or inactive. One can view a state as a representation of whether
a particular regulatory interaction model for an individual TF-target interaction is valid (active) or
invalid (inactive). In the training process, if one of the four states of T F1 emits an active output, the
23

HMM will transit to a state belonging to T F2 based on the constrains (to be introduced later), and
vice versa. Initial Probability. The initial emission probabilities of the states in the HMM are set
equally (if we do not have any prior information), or they can be determined by prior knowledge
obtained from TF perturbation experiments. In addition, we use uniform transition probabilities, if
there is an arrow in Figure 2.3

Figure 2.3: A simpliﬁed representation of the HMM model for the collaborative regulatory module
with 2-TF, T F1 and T F2 , where each TF has four states (i.e., AS, AN, RS and RN). Each state emits
two possible outputs, active or inactive. One can view a state as representing whether a speciﬁc
interaction model of an individual TF-target relation is valid or invalid. In the training process,
if an active output is emitted from one of the four states of T F1 , the model will transit to a state
belonging to T F2 and vice versa.

Emission and Transition probabilities. We train the HMM model with discretized gene expression data. We discretized the gene expression levels as up or down-regulation instead using
absolute absence or presence, by comparing the expression changes between two consecutive timepoints to determine whether there was increase (+1) / decrease (−1) [52]. For the regulatory
module with multiple target genes, all the gene expression changes are used sequentially for HMM
training.
In a mRNA transcriptional process, the expression change of a TF is usually earlier than the

24

Figure 2.4: In-degree distribution of the yeast TRN for Reimand dataset.
change of its targets. Therefore, we adopt the concept of time lagging in this HMM model training
process [59].
At timepoint n, the input to the HMM is a set of gene expression changes of the TFs from
timepoint n − l to n, where l is a time lag that is deﬁned by the user to capture the effects of the
TFs at the earlier timepoints (n − l, . . . , n) on the target gene at timepoint n. To update the emission
probabilities properly, we design a set of constraints that relate the gene expression patterns of a
TF and its target to the regulatory interaction model (see Table 2.1). The emission probability of
a state (active/non-active) can then be updated with Equation 2.2, where b and b are the outputs
of state k and Ek (b ) is the probability for emitting output b from state k [19]. The ﬁnal emission
probability of each state reﬂects the likelihood of the state being active or inactive.

ek (b) =

Ek (b)
∑b Ek (b )

(2.2)

In a HMM, a path is a sequence of states that follows the Markov chain of hidden states, in which

25

the probability of a state depends only on the probability of the previous state. In this way, we
are able to consider the regulatory effects of the two TFs simultaneously, unlike previous works.
The Viterbi algorithm is applied to effectively ﬁnd the most probable path [18, 51]. The ﬁnal path
contains two states with a state for each TF’s TF-target interaction model. For example, a path
“AS-RN” means that the ﬁrst TF is activator sufﬁcient and the second TF is repressor necessary
for the same target genes.
Note that the HMM is a probabilistic model which cannot assume combined events or none of
the events to occur. Therefore, the model described above does not include a state for “Neither” or
“Both Necessary and Sufﬁcient (N+S)”. To infer neither/N+S regulatory models, a post-processing
step is required. We use the distribution of coefﬁcient of variation (CV) of all the emission probabilities to determine whether a regulatory interaction model can be N+S or neither: if none of the
probabilities are signiﬁcant, the model outputs “neither”; if the probabilities of both N and S states
are signiﬁcant, and if there is a signiﬁcant difference between the probabilities of the two states,
TRIM outputs the more signiﬁcant state, otherwise our model outputs N+S.
Table 2.1: Constrains for the inferring TF-target interaction models in a 2-TF collaborative regulatory module. They are based on the TF’s expression direction and the response of the target
gene.
TF1
Up

TF2
Down

Target-gene expression
Up

Up

Down

Down

Up
Down
Up
Down
Up
Up
Down
Down

Up
Down
Up
Down
-

Up
Up
Down
Down
Up
Down
Up
Down
26

TF-target Interaction Model
T F1 is Activator Sufﬁcient or T F2 is Repressor Necessary
T F1 is Repressor Sufﬁcient or T F2 is Activator Necessary
At least one TF is Activator Sufﬁcient
At least one TF is Repressor Necessary
At least one TF is Repressor Sufﬁcient
At least one TF is Activator Necessary
T F1 is Activator Sufﬁcient
T F1 is Repressor Sufﬁcient
T F1 is Repressor Necessary
T F1 is Activator Necessary

We show an illustrative example of our HMM in Table 2.2. In this example, T F1 and T F2 regulate the same target gene g. For simplicity, no time lag is used in the example (l = 0). Given
the gene expression changes of the two TFs and the target gene, we can infer the regulatory interaction models for both of the TF-target pairs using the HMM as follows. We ﬁrst initialize the
active emission probabilities of all the states equally to 0.5. At time t0 , as none of the expression
changes is signiﬁcant, nothing is done. At time t1 , the down-regulation of both T F2 and g triggers the active emission probability of state AN of T F2 (Table 2.1 Row 10). The active emission
probability of state (T F2 , AN) is updated by adding the new frequency and then being normalized,
i.e., (0.5 + 1)/2 = 0.75, while the inactive emission probability of (T F2 , AN) is 0.25. Meanwhile,
the active emission probabilities of all other states are decreased to 0.25 and the inactive emission
probabilities of them are increased to 0.75. See Table 2.3 for the updated emissions. At time
t2 , the up-regulation of T F1 and the down-regulation of g trigger the state RS of T F1 (Table 2.1
Row 8). The active emission probability of (T F1 , RS) is updated to be 0.625 and the other active
emission probabilities are adjusted accordingly. At time t3 , state RS of T F1 and state AN of T F2
are triggered given the two-TF constraint in Table 2.1 Row 2. The active emission probability of
(T F2 , AN) is updated to (0.375 + 1)/2 = 0.688 and the active emission probability of (T F1 , RS)
to (0.625 + 1)/2 = 0.813. See Table 2.3 for the emission probabilities of the four observations.
The training process continues until all the gene expression observations are processed. For this
example, the RS’s active emission probability (0.675) is the highest amongst all the active probabilities of T F1 and the RS’s active emission probability (0.203) is the highest amongst all the active
probabilities of T F2 in the end. The ﬁnal output for the TF-target interaction model is subject to the
distribution of all the active emission probabilities in all the regulatory modules. In this example,

27

Table 2.2: An illustrative example of HMM training process. There are 10 observations of gene
expression changes for a collaborative regulatory module with 2-TF: 0 means no signiﬁcant gene
expression change; 1 means signiﬁcant up-regulation and -1 means signiﬁcant down-regulation.
Time
TF1
TF2
g

t0
0
0
0

t1
0
-1
-1

t2
1
0
-1

t3
1
-1
-1

t4
1
0
1

t5
1
1
-1

t6
0
-1
0

t7
1
0
1

t8
1
-1
0

t9
1
0
-1

we conclude that the most possible regulatory interaction model for (T F1 , g) is repressor sufﬁcient
(RS) and for (T F2 , g) is Neither.

Table 2.3: The emission probabilities of the ﬁrst four observations of gene expression changes in
the illustrative example shown in Table 2.2. After processing these four observations, the model
suggests that the most possible regulatory interaction model for (T F1 , g) is RS with probability
81% and for (T F2 , g) is AN with probability 68%.
AS

AN

RS

RN

AS

AN

t0
TF1
TF2

0.50
0.50

0.50
0.50

2.4

0.125
0.125

0.125
0.375

RN

0.25
0.25

0.25
0.25

0.813
0.062

0.062
0.062

t1
0.50
0.50

0.50
0.50

0.25
0.25

0.25
0.75

t2
TF1
TF2

RS

t3
0.625
0.125

0.125
0.125

0.062
0.062

0.062
0.688

Experimental Results

We evaluate the performance of TRIM using the yeast regulatory network. For comparison, we
applied DREM v3.0 on the same dataset. We did not compare TRIM with Yeang and Jaakkola [79]
since their objective is to build a reliable TRN, and the method does not output the TF function,
which is the focus of our work.

28

Table 2.4: Yeast cell cycle time series data that are used for model training and testing. .
Data type

Time-points

α-factor synchronization
CDC-28 heat-based synchronization
CDC-15 heat-based synchronization

Every 7 minutes for 2 hours
Every 10 minutes for 2.67 hours
Every 20 minutes for the ﬁrst hour, every
10 minutes for the next 3 hours, and then
every 20 minutes for the ﬁnal hour
Every 30 minutes for 6.5 hours

Elutriation synchronization

2.4.1

Data preparation

Using the same statistical approach in Reimand et al [57], we generated a large-scale yeast regulatory network with 2,230 regulatory relations between 268 TFs and 1,509 target genes by ﬁltering
the yeast ChIP-chip binding data [42] and the binding-cite predictions [21, 46]. The in-degree distribution (number of TFs per gene) in Figure 2.4 shows that 87.8% of the target genes are regulated
by one or two TFs, indicating that studying independent and 2-TF collaborative regulatory models
is sufﬁcient to cover the majority of the yeast TRN.
To train and evaluate TRIM, four widely used time-series microarray datasets from yeast cell
cycle studies were collected [68]. These datasets contain 73 timepoints in total. In these experiments, yeast cells were ﬁrst synchronized to the same cell cycle stage, released from synchronization, and then the total RNA samples were taken at even intervals for a period of time (see details
in Table 2.4).
To decide whether a gene is signiﬁcantly up or down regulated, we used a gene expression
change cutoff at 0.35. In this experiment, we used three of them (CDC15, CDC28 and elu) as the
training data and one (Alpha) as the testing data, since the Alpha dataset contains the most available
gene expression values after applying the cutoff. A time lag l = 2 was used in this experiment.
For evaluating the inferred regulatory interaction models for the TF-target interactions of the

29

independent regulatory modules, single TF knockout microarray data for yeast were collected [30].
A p-value cut-off at 0.05 was applied to determine whether a gene is signiﬁcantly affected by a TF
knockout [30].

2.4.2

Evaluation criteria

To test whether our method can correctly predict the TF-targets interaction models, we adopt a
similar approach as used in Segal et al [61] to examine the distribution of the differentially expressed genes in the modules. With the regulatory interaction models learned from training data
for the TF-target interactions in a 2-TF collaborative regulatory module, we can obtain the active
timepoints on the testing data at which a TF functions. If the inferred regulatory interaction model
is correct and the TF has the consistent function on the training and the testing data, then the gene
expression correlations on the active timepoints should be signiﬁcantly higher than on the whole
time frame of the testing data.
Mathematically, the set of activation timepoints Tx is deﬁned as follows: let ei be the gene
expression change (1, -1 or 0) of a TF at timepoint ti ; for each TF and its TF-target interaction
model x, timepoint ti is in Tx if and only if Indicator(ti ) = 1 or Indicator(ti+1 ) = 1. (See Equation
2.3).

Indicator(ti ) =



1













0

if x = sufﬁcient and ei = 1, or
x = necessary and ei = −1, or
x = (necessary and sufﬁcient) and |ei | = 1
otherwise
(2.3)
30

To compute the gene expression correlation, we group together the target genes regulated by
the same TF with the same interaction model. We then compute the gene expression correlation
on each group of the target genes over their activation timepoints, and normalize them. Mathematically, given a set of genes G and a set of timepoints T , if TF T Fk is predicted to be necessary for
genes in Gn , sufﬁcient for genes in Gs and both necessary and sufﬁcient for genes in Gb , then the
correlation score of all the genes regulated by T Fk is computed with Equation 2.4.

∑

COR(T Fk ) =

gi ,g j ∈Gn

C(gi , g j , Tn ) +

∑

gi ,g j ∈Gs

C(gi , g j , Ts ) +

∑

gi ,g j ∈Gb

C(gi , g j , Tb )

|Gn ∪ Gs ∪ Gb |
(2.4)

where Gx ⊆ G and Tx ⊆ T ; C(gi , g j , Tx ) is the Pearson correlation score between gi and g j (i = j)
at all the activation timepoints in Tx (x can be n, s or b).
For comparison, we applied the same approach on DREM prediction results. Since DREM did
not predict the effectiveness of TFs as necessary or sufﬁcient, we grouped the target genes regulated
by a TF based on the TF’s dependence (activation or repression). The Pearson correlation of the
same genes (grouped by TRIM prediction results) on all the timepoints were also computed.

2.4.3

Evaluation of TRIM on yeast data

In the yeast regulatory network, the 1,509 target genes were grouped into 50 clusters with Cluster
3.0 [20]. We then partitioned the yeast regulatory network into 1,051 independent regulatory modules and 275 collaborative regulatory modules with two TFs. Finally, by combining the regulatory
modules accordingly (see Section 2.3.1), the total number of the independent regulatory modules is
reduced from 1,051 to 640, while the number of 2-TF collaborative regulatory modules is reduced

31

Figure 2.5: In the independent regulatory modules, the overlap between the single TF knockout
supported TF-target pairs (815) and TRIM prediction results (833) for the functional roles (activation or repression) is 549, greater than the overlap between knockout and the results of DREM
(873), which is 456.
Table 2.5: Summary of TRIM’s predictions for independent and two TFs regulatory modules on
yeast and Arabidopsis.
Activator
Repressor
Unknown
A) Independent regulatory modules of yeast
Necessary
37
35
3
Sufﬁcient
68
50
17
Necessary & Sufﬁcient
228
179
216
Neither
218
B) 2-TF collaborative regulatory modules of yeast
Necessary
74
61
0
Sufﬁcient
121
61
0
Necessary & Sufﬁcient
6
5
0
Neither
212
C) 2-TF collaborative regulatory modules of of Arabidopsis
Necessary
180
136
0
Sufﬁcient
126
101
0
Necessary & Sufﬁcient
6
0
0
Neither
87
from 275 to 257. In the 640 independent regulatory modules, there are a total of 1,051 TF-target
pairs. TRIM was able to infer the TF-target interaction models for 833 pairs (see Table 2.5A). On
the same dataset, DREM was able to infer 873 TF-target relations with its p-value cut-off at 0.05.
We used the single TF knockout microarray data to directly verify our inferred TF-target interaction models for the independent regulatory modules. First, for the functional roles (activation
32

or repression) of TFs, our TRIM successfully predicted 549 out of 815 models with a successful
rate of 67.4%, which is clearly higher than the results of DREM (56.0%), as shown in Figure 2.5.
Second, for evaluating the inference performance of the logical roles of TFs, with knockout data,
we can only look at the “necessary” logical role: a TF is necessary for its target genes if the expression values of the target genes are signiﬁcantly changed when the TF is knocked out. In the single
TF knockout data, a total of 815 independent regulatory modules were considered as necessary.
Among them, TRIM predicted 682 pairs to be necessary or necessary and sufﬁcient with a success
rate of 83.6%. (We are unable to perform a similar comparison for DREM since DREM does not
predict necessary TFs).
In addition, the gene expression correlation scores of the TF-target pairs that are predicted by
both TRIM and DREM are shown in Figure 2.8 (see Section 2.4.2 for the score computation).
The median correlation score for TRIM (0.52) is clearly higher than that of DREM (0.46) and
than using all the timepoints (0.29). To further evaluate the performance of TRIM on independent
modules, we estimated the mRNA synthesis rate of each gene by applying decay adjustments [50]
on the training and testing data:

µg = eg (α + λg )

(2.5)

where µg is the synthesis rate of gene g; eg is the gene expression of g; α is a consistent variable
representing the growth rate of yeast; λg is the decay rate of each gene g, which could be estimated
with the mRNA half-life (Lg ) for each gene with equation λg = ln(2)/Lg , where the mRNA halflife data was obtained from [50]. For the growth rate of yeast (α), we used the same value as used
in [50], i.e., ln(2)/cell cycle length and the cell cycle length is set to 150 minutes. After applying
the decay adjustments, the median of the expression correlation scores of TRIM further increased

33

from 0.52 to 0.57, while the scores of DREM and all-timepoints are almost unchanged (0.40 and
0.31 respectively, see Figure 2.10, indicating that by focusing on the mRNA synthesis rates, TRIM
model is able to predict the TF-target interaction models more precisely. (the detailed correlation
values are shown in Figure 2.6).

In the 540 2TF-target pairs in the 257 2-TF collaborative

Figure 2.6: A histogram of gene expression correlation of 26 TFs in the independent regulatory
modules of yeast using TRIM, DREM or all time-points after applyng decay adjustments. These
26 TFs are the intersection between the results of DREM and TRIM.

regulatory modules, TRIM was able to infer the interaction models for 328 pairs. Table 2.5B
shows the summary of the number of interaction models predicted by TRIM for 2-TF modules.
On the same dataset, DREM was able to infer 440 TF-target pairs.
Figure 2.9 shows the expression correlation scores of the 2TF-target pairs that are predicted
by both TRIM and DREM. The median correlation score for TRIM (0.87) is signiﬁcantly higher
than that of DREM (0.54) and than using all the timepoints (0.27) We also applied similar decay
estimation approach on the 2-TF collaborative modules and the result is consistent with the single
TF module (see Figure 2.7).
In summary, the experimental results on the yeast data showed that TRIM can successfully
34

Figure 2.7: A histogram of gene expression correlation of 15 TFs in the 2-TF collaborative regulatory modules of yeast using TRIM, DREM or all time-points after applying decay adjustments.
These 15 TFs are the intersection between the results of DREM and TRIM.
predict TF-target interaction models for both independent and 2-TF collaborative modules.

2.5

Application of TRIM on Arabidopsis data

In order to maintain a stable intracellular environment, living cells utilize complex and specialized transcriptional regulatory systems to react against a variety of external perturbations, such as
temperature change, drought, UV, etc. Many of the adaptive
mechanisms contributing to cellular homeostasis operate through TRNs to regulate the expression of anti-stress genes [82]. The key step to understand the TRN behavior is therefore to explore
the roles of relevant TFs by using time-series gene expression data under different stress conditions. In this study, we applied TRIM to infer the roles (i.e. interaction models) of TFs in 2-TF
regulatory modules of Arabidopsis TRN under 8 different abiotic stress conditions. The objective
of this study is to infer the TF-target interaction models in Arabidopsis TRN with all the available

35

1
0.8
0.6
0.4
0.2
0

TRIM

DREM

All Time−points

(a)
Figure 2.8: Distribution of gene expression correlation of the independent regulatory modules. For
interpretation of the references to color in this and all other ﬁgures, the reader is referred to the
electronic version of this dissertation.
abiotic stress data, so that by looking at the gene expression patterns (as shown in Figure 2.11), we
are able to tell whether a TF is a general abiotic stress TF, a speciﬁc abiotic stress TF, or a TF that
does not function under abiotic stress. In addition, since our TRIM (like all the other current algorithms) assumes that TF-target interaction models will remain consistent under different abiotic
stress conditions, we also need to verify whether this assumption holds for the various biological
functions of the target genes.
We obtained Arabidopsis regulatory data from AtRegNet, which contains 11,355 direct interactions between TFs and target genes [53]. In AtRegNet, a direct interaction means either a TF binds
directly to the target gene (detected by electromobility shift assay, yeast one-hybrid, or ChIP), or
36

1

0.8

0.6

0.4

0.2

0

TRIM

DREM

All Time−points

(a)
Figure 2.9: Distribution of gene expression correlation of 2-TF collaborative regulatory modules
a TF directly regulates the target gene based on use of transgenic plants expressing an inducible
TF-GR fusion protein. We partitioned the Arabidopsis genes in AtRegNet into 50 clusters with
Cluster 3.0. A total of 8 abiotic stress data sets (Cold, Drought, Genotoxic , Heat, Osmotic, Salt,
UVB and Wounding) were collected from AtGenExpress to train and test our TRIM’s performance
in Arabidopsis [38]. The datasets were ﬁltered for signiﬁcant variability in mRNA expression using Bonferroni corrected p-values at 0.05. For each experimental run, three of the data sets were
reserved to evaluate the output of the model, while the remaining ﬁve were used for model training. In the experiment, six runs were conducted each with a different combination of training and
testing data (see detail in Table 2.6). A time lag l = 2 was used in this experiment.

Note that

in combining data from different abiotic stress experiments, we have assumed that the TF-target
37

1

0.8

0.6

0.4

0.2

0

TRIM

DREM

All Time−points

(a)
Figure 2.10: The independent regulatory modules after decay adjustments. For interpretation of
the references to color in this and all other ﬁgures, the reader is referred to the electronic version
of this dissertation
Table 2.6: The design of the 6 experiments on Arabidopsis abiotic stress data. X: testing data; T:
training data
Round
1
2
3
4
5
6

ID
DWG
WSG
CSG
CSH
DOG
DOU

Cold
T
T
X
X
T
T

Drought
X
T
T
T
X
X

Genotoxic
X
X
X
T
X
T

Heat
T
T
T
X
T
T

Osmotic
T
T
T
T
X
X

Salt
T
X
X
X
T
T

UVB
T
T
T
T
T
X

Wounding
X
X
T
T
T
T

interaction models will remain consistent under different abiotic stress conditions. This is also a
common assumption by previous researchers [38, 39]. In this study, we compare the results of the
TRIM model across multiple combinations of training and testing data. If our hypothesis is true,
38

the interaction model and subsequent correlation scores given by our model should be consistent
across all groups and the results would not be sensitive to how we grouped the data. Otherwise, it
means that the TF-target interaction model may vary under different conditions. In this study, we
found that the hypothesis is true for selected classes of TFs, e.g., developmental TFs.
On average, our TRIM identiﬁed 318 2-TF collaborative regulatory modules involving 32 TFs
(the prediction results is shown in Table 2.5C). 10 of these TFs had sufﬁcient data to be analyzed
by gene expression correlation. Detailed results are shown in Table 2.7. In all, we found that the
developmental TFs were the most consistent in terms of prediction and scoring, though their correlation scores were not always the highest on average. The only exception to this rule, APETALA2,
is involved in seed development and thus, we would expect it to be more differentially expressed,
since the production of seeds involves a reproductive decision which is likely to be stress sensitive. Other TFs, such as speciﬁc stress response genes and those associated with photosynthetic
processes, are less consistent.
We would also expect general stress response factors to show very consistent predictions and
score highly with the given data, yet they are absent from our dataset. However, we missed these
TFs because they are involved in more complicated regulatory modules with 3 or more TFs and
extending TRIM to k-TF modules. In the following chapter, we propose a new model (mTRIM)
that handles k-TF modules, that would capture their behavior.
In addition to grouping these TFs by function, we analyzed 3 individual TFs (ATGL1, WRKY53
and RD26) in depth. We show their Pearson correlation distributions in Figure 2.11. The TF with
the highest average correlation score, ATGL1, a developmental gene involved in trichome pattering, is an example of a case where our hypothesis held, as ∼80% of the interactions that could
be predicted were either AN or RN across all six experiments. The correlation scores were the
highest when the variable modules were predicted as either AN or RN. Alternatively, the TFs with
39

Table 2.7: The prediction results of the 10 TFs on Arabidopsis abiotic stress data with TRIM
High (> 0.6)
DWG, WSG,
CSG, CSH,
DOG
AT1G14350 FLP/MYB124N/A

Low(< 0.3)
N/A

Gene Function
Development,Trichome Pattering

N/A

AT1G24260 SEPAL
LATA3
AT2G20180 PIF1

DOG

N/A

N/A

N/A

AT4G36920 APETA
LA2
AT3G47640 PYE

DWG, CSG

AT4G23810 WRKY53

WSG, CSH,
DOG
WSG, DOG,
DOU
N/A

AT4G27410 RD26

DOG, DOU

AT1G09530 PIF3

DWG,WSG

DWG, WSG,
CSG, CSH
CSG, DOU

AT5G11260 HY5

DWG

DOG*

Development, Guard Cell
Differentiation
Development, Flower Development
Development, Regulation of
Seed Germination
Development, Seed Development
Stress Response, Iron Starvation
Stress Response, Response to
Chitin and Bacteria
Stress Response, Response to
Dessication
Photosynthetic, Red Signalling
/Anthrocyanian
Metabolism
Photosynthetic, Red Signalling (via PHYA)

Locus
Gene
AT3G27920 ATGL1

DWG, CSG,
CSH
DOG,DOU

the two lowest average scores, WRKY53 and RD26, biotic and drought stress associated regulators, showed very inconsistent prediction patterns. The highest Pearson correlation scores of RD26
occur when drought was included in the evaluation set. This is conﬁrmed by the fact that there is
little difference in the correlation whether we apply TRIM or use all timepoints to calculate the
score. Hence, RD26 is an example of condition sensitive TF which violates our assumption.
WRKY53, on the other hand, has low and similar correlation scores whether we use activation
timepoints predicted by TRIM or all the timepoints, which is true across all six runs. Therefore,
the issue here is most likely not sensitivity to particular grouping of data, but rather that the data is
uninformative for this particular case. This conclusion is not unexpected given WRKY53 responds
primarily to biotic rather than abiotic stress.
40

Figure 2.11: Pearson correlation of target genes of (A) ATGL1, (B) WRKY53 and (C) RD26 using
different combinations of training and testing data with TRIM.

2.6

Conclusions

Revealing the mechanisms of the transcriptional regulatory programs in TRNs is essential for understanding the complex control by which genes are expressed in living cells. In this chapter, we
model the interactions between the TFs and the target genes in terms of both the TF-target interaction’s function (activation or repression) and its corresponding logical role (necessary and/or
sufﬁcient). Based on the characterizations proposed by Yeang and Jaakkola [79], we deﬁne the
combinatorial regulatory interaction models for possibly multiple TF-target interactions in TRNs.
We used DNA-protein binding and gene expression data to construct regulatory modules for inferring the transcriptional regulatory interaction models for the TFs and their corresponding target
genes. Our TRIM algorithm is based on a HMM and a set of constraints that relate gene expression
patterns to regulatory interaction models. We have shown in this chapter how to apply TRIM to
infer the transcriptional regulatory interaction models for TFs in collaborative regulatory modules
involving two TFs. It can thus be used to help predict the phenotype of TF double-knockouts to
reduce the number of double knock-out or over-expression experiments needed.

41

Chapter 3
Inferring K-TF Regulatory Interactions

3.1

Introduction

In the previous chapter, we deﬁned an individual-TF regulatory interaction in terms of two properties: the TF’s functional role as an activator or a repressor, and its logical role as being necessary
or sufﬁcient (see Table 1.1) [61, 79]. The categories in the TF’s functional and logical roles are
combinable; they can be activator necessary (AN), activator sufﬁcient (AS), or activator necessary
and sufﬁcient (ANS).
We explained how recent biotechnology (such as ChIP [54] and yeast one-hybrid [15]) have
been applied to uncover TF-target binding relationships [16, 58] to reconstruct draft regulatory
circuits at a systems level [22, 57, 79]. For example, to identify regulatory interactions in vivo and
consequently reveal their functions, TF single/double knockouts and over-expression experiments
have been systematically carried out [29]. We also showed that high order genetic variations are
needed for precise inference of transcriptional regulations.
Considering the prohibitive costs and the tremendous number of possible combinations of
higher-order gene knockouts, it is currently impossible for researchers to examine all of possible gene knockout combinations experimentally. One solution to this problem is to select only
the TF groups that are most likely to bring about the phenotypic change. To do this, we need to
understand the regulatory interactions between multiple TFs to regulate their set of common target
genes. However, this is also a difﬁcult task, because when multiple TFs simultaneously or sequen42

tially control their target genes, a single gene responds to merged inputs, resulting in complex gene
expression patterns [5, 6]. The exhaustive approach requires enumerating all TF combinations,
which, given the high complexity of combinatorial, is simply impractical at the whole genome
level.
In the previous chapter, we described TRIM [3], a Hidden Markov model that relate gene
expression patterns to regulatory interactions, in order to solve a relatively simpler sub-problem
that considers only two TFs.
In this chapter, we extend the 2-TF regulatory interaction to multiple-TF regulatory interaction.
A multiple regulatory interaction is deﬁned as a group of TFs that collaborate to control the expression levels of the same target genes. The directions of all the TFs in the group, therefore, form
a transcriptional regulation pattern of the target genes. Similarly for TFs that are repressors, they
can be RN, RS or RNS [4]. In a multiple-TF regulatory interaction, a group of TFs collaborate to
control the expression levels of the same target genes. The directions of all the TFs in the group,
therefore, form a transcriptional regulation pattern of the target genes.
To predict regulatory interactions for all possible collaborative TFs, we propose an algorithm
called “mTRIM” (multiple Transcriptional Regulatory Interaction Mechanism) in this chapter. By
uncovering the regulatory interactions in terms of their directions (activation or repression) and corresponding logical roles (necessary and/or sufﬁcient) from gene expression and TF-DNA binding
data, mTRIM identiﬁes TF groups that are collaboratively responsible for target gene expressions.
Such inferences may provide high-quality candidate sets for further experimentally detecting the
collaborative functions of gene regulations that are largely unknown [5].
Yeang and Jaakkola [79] attempted to characterize the combinatorial regulation of multipleTF regulatory interactions using a heuristic approach to measure how well a regulatory module
ﬁts the associated binding and gene expression data with a log-likelihood function (See details
43

in Section 2.1). However, as stated in [22], a TF usually functions at speciﬁc “activation time
points” instead of throughout the entire time course, meaning that the identiﬁcation of regulatory
interaction modules should be focused on activation time-points rather than the entire time frame.
To derive dynamic regulatory networks that associate TFs with target genes at their activation
time-points, an algorithm called DREM was proposed [22]. he method mainly works by identifying bifurcation time-points where the expression of a subset of genes diverges from the rest of the
genes (See details in Section 2.1). However, DREM does not infer the logical roles of the TFs (i.e.,
whether a speciﬁc TF is necessary or sufﬁcient for regulating a set of target genes). Such knowledge is extremely useful for designing high-order genetic variation experiments to understand the
complex regulatory mechanisms of biological processes.
In this chapter, we demonstrate that by inferring the logical roles of the TFs, the target gene
co-expression correlation increased signiﬁcantly.
TRIM is an HMM based model which was developed to infer the collaboration of at most 2
TFs that regulate the same target genes [3]. In the HMM, the functions of a TF are hidden states.
The model starts with random priors, and then is iteratively trained using EM till convergence.
Since each possible function of a TF is a node in the HMM, there are four nodes (AS, AN, RS, and
RN) for each TF. With the design of HMM (and the limited training data), the number of TFs the
model can handle is greatly limited.
The enumeration of all TF combinations is clearly an NP problem. Therefore, we focused on
the most important biological problem (i.e., 2-TF combination) and therefore “hardcoded the problem in TRIM. In this paper, alternatively, we solve the efﬁcient problem by using association rule
mining algorithms which is capable to handle a large amount of data or high-level combinations.
In this chapter, we propose a new model mTRIM [2]for inferring regulatory interactions for
multiple TFs with an EM-based Bayesian inference approach [18, 71] and a modiﬁed bottom-up
44

association rule mining method. Experimental results evaluated with yeast genetic interactions, TF
knockouts and a synthetic dataset shows that our algorithm is signiﬁcantly better than the existing
ones.

3.2

Methods

mTRIM is developed to efﬁciently infer regulatory interactions for all possible collaborative TFs
in a TRN. The feasibility is achieved in two steps. First, an EM-based Bayesian inference approach is developed to identify all the signiﬁcant individual TF regulatory interactions, meaning
that individual TFs that can regulate the target genes independent to the existence of other TFs.
For the TFs which require collaborations with other TFs to drive the target genes, or are actually
non-deterministic (meaning lack of clear evidence of regulation), their p-values are insigniﬁcant.
They are considered as the inputs of the second step.
Second, in order to identify the collaboration of k TFs (k ≥ 2), i.e., k-TF regulatory interaction,
a bottom-up association rule mining approach is developed. While the signiﬁcant TF groups are
reported to the users, the insigniﬁcant ones are joined with each other to mine (k + 1)-TF regulatory interactions. It should be noted that unlike the conventional association rule mining which
seeks the longest possible patterns, mTRIM outputs the shortest signiﬁcant results, in that the goal
of mTRIM is to discover the smallest group of TFs that can regulate the target genes, so that biological experiments with high-order genetic variations can be subsequently carried out for the
understanding of the behavior of TRNs.

45

3.2.1

Concepts

A TRN can be represented as a directed graph in which each node is a TF or a gene, and each
edge pointing from a TF to a gene represents a regulation relationship between them. In many
organisms, in-depth transcriptome analysis has revealed the modular architecture of gene expression [33]. A regulatory module is a self-consistent regulatory unit R(T F, G, I) representing a set of co-expressed genes G = {g1 , g2 , . . . , gn } regulated in concert by a group of TFs in
T F = {t f1 ,t f2 , . . . ,t fm } that govern the target genes’ behaviors via regulatory interaction I [61].
An example of the regulatory module is shown in Figure 3.1.
A regulatory interaction I =< ht f1 , . . . , ht fi , . . . , ht fm >⇒ hg (which is the ﬁnal output of mTRIM)
is deﬁned as a set of TFs {t f1 , . . . ,t fm } co-regulating a set of genes {g1 , . . . , gn }, where ht fi is the
behavior of TF i; hg is the behavior of all the target genes in R, and hx ∈ {↑, ↓, −}, meaning upexpress, down-express and no change respectively. For example, if t f1 ↑ and t f2 ↓ always cause the
target genes g1 and g2 to be up-regulated, the regulatory interaction is < t f1 ↑,t f2 ↓>⇒ g ↑. For
individual regulatory interactions, I ∈ {AN, AS, RN, RS, ANS, RNS}. In this work, we assume that

Figure 3.1: an illustrative example of regulatory interaction in a TRN, in which three TFs collaboratively co-regulate two target genes

46

a regulatory interaction is consistent in the context of transcriptional control as long as the experimental conditions are unchanged. Note that binaries gene expression values are used in mTRIM,
since TF activity is not always proportional to its mRNA abundance [25].

3.2.2

mTRIM Step 1: Inferring Individual Regulatory Interactions

To solve a relatively easier problem of inferring the regulatory interactions for each individual TF
and to prepare input for multi-TF regulatory interaction inference, an EM-based Bayesian inference
algorithm has been developed [18, 71]. To deﬁne the probabilities, we followed the deﬁnitions in
[18]. Eq 3.2 represents the prior probability of the interaction model Im . Eq 3.3 represents the
probability of gene expression correlation between TFs and targets given the interaction model Im .
In the Bayesian model, the training dataset is a matrix that contains gene expression levels of TFs
and their targets, from which Γ(Im ) is estimated using Eq 3.4. And then, the likelihood is calculated
using Eq 3.3. The prior probabilities are randomly assigned initially, which are estimated using the
frequency of Im . In each iteration, the posterior probabilities and the frequency of Im are updated.
The iteration will continue till the posterior probabilities converge.
To deﬁne the probabilities, we followed the deﬁnitions in [18]. Eq 3.2 represents the prior
probability of the interaction model Im . Eq 3.3 represents the probability of gene expression correlation between TFs and targets given the interaction model Im . In the Bayesian model, the training
dataset is a matrix that contains gene expression levels of TFs and their targets, from which Γ(Im ) is
estimated using Eq 3.4. And then, the likelihood is calculated using Eq 3.3. The prior probabilities
are randomly assigned initially, which are estimated using the frequency of Im . In each iteration,
the posterior probabilities and the frequency of Im are updated. The iteration will continue till the
posterior probabilities converge. Let Pos be the posterior probability of a TF t fm to have a speciﬁc
regulatory interaction Im in regulatory module Rk , where Im ∈ {AN, AS, RN, RS} (ANS and RNS
47

will be discussed later). To infer Pos, both the prior probabilities Pri and the likelihood Lk of the
same TF need to be computed, given that:

Pos(t fm , Rk , Im ) = Pri(Im ) × Lk(t fm , Rk , Im )

(3.1)

where Pri(Im ) is the prior probability of regulatory interaction Im (deﬁned in Eq 3.2) and the
likelihood Lk(t fm , Rk , Im ) is deﬁned in Eq 3.3.
The prior probability Pri(Im ) captures how likely a given interaction Im exists given the background of all of the other TFs:

Pri(Im ) =

f re(Im )
|R| × |T F|

(3.2)

where f re(Im ) is the frequency of regulatory interaction Im in all of the regulatory modules, |R| is
the number of the regulatory modules, |T F| is the number of TFs, and Im ∈ {AS, RS, AN, RN}.
Given the deﬁnition of a regulatory interaction, the likelihood Lk(t fm , Rk , Im ) indicates how
likely t fm in Rk has regulatory interaction Im , which is deﬁned by the expression level changes of
the TF and its targets:
|G|

Lk(t fm , Rk , Im ) =

T −1
∑t=1 ∑n=1 Γ(Im )
|R|
|T F| T −1 |G|
∑r=1 ∑m=1 ∑t=1 ∑n=1 Γ(Im )

(3.3)

where T is the number of time-points in the training data, |G| is the number of genes in regulatory

48

module Rk , and Γ(Im ) is deﬁned as:

Γ(Im ) =


1






















if Im = AS and (t fm ↑ and g ↑), or
if Im = RS and (t fm ↑ and g ↓), or
(3.4)

if Im =AN and (t fm ↓ and g ↓), or
if Im = RN and (t fm ↓ and g ↑)

0

otherwise

An expectation-maximization (EM) algorithm is adopted to maximize the posterior probabilities Pos(t fm , Rk , Im ). The EM model is initialized with each TF assigned a random regulatory interaction. In the expectation step, we compute the likelihood of each TF to be a speciﬁc interaction
using Eq 3.3. Consequently, the posterior probabilities of interactions for every TF is updated with
Eq 3.1. As a result, each TF is assigned with the regulatory interaction with the highest posterior
|T F|

|G|

probability. In the maximization step, we maximize the scoring function S(Rk ) = ∑m=1 ∑n=1 Γ(Im )
for each regulatory module Rk , which measures how the interaction of each TF in Rk matches the
target gene expression changes. Note that in the iteration the priors are updated but the likelihoods
are constant.
Finally, in order to determine whether Im is “necessary and sufﬁcient” (ANS and RNS) or “no
decision”, the following strategy is adopted: if none of the posterior probabilities are signiﬁcant,
the output is “no decision”; if the probabilities of both N and S states are signiﬁcant, and there is
no signiﬁcant difference between them, the output is ANS or RNS depending on the target gene
expression direction; otherwise the output is the regulatory interaction with the highest posterior
probability.
An illustrative example is shown in Figure 3.1, in which t f1 , t f2 and t f3 regulate target genes g1

49

Table 3.1: Illustrative example of time-series gene expression data for the genes in Figure 3.1.
tf1
tf2
tf3
g1
g2

t0
↑
↑
↓
↑
↑

t1
↑
↑
↓
↑
↑

t2
↑
↑
↓
↑
↑

t3
↑
↑
↓
↑
↑

t4
↑
↑
↓
↑
↑

t5
↑
↑
↓
↓
↓

t6
↓
↑
↓
↓
↓

t7
↓
↑
↓
↓
↓

t8
↓
↑
↓
↓
↓

t9
↓
↑
↓
↓
↓

t10
↓
↑
↓
↓
↓

t11
↓
↑
↓
↓
↓

and g2 , and they all belong to the same regulatory module Rk . With the gene expression changes in
Table 3.1, we start with equal prior probabilities, i.e., Pri(AS) = Pri(RS) = Pri(AN) = Pri(RN) =
0.25, so Lk(t f1 , Rk , AN) = 12/26 = 0.461, (Eq 3.3). After 10 iterations, in the expectation step,
Pri(AN) is updated to 0.70 (Eq 3.2), hence Pos(t f1 , Rk , RS) = 0.70 × 0.461 = 0.323 (Eq 3.1). In
the maximization step, we have < t f1 ↓>⇒ g ↓, because the maximum posterior probability is
assigned to AN with p-value 0.05 (see Table 3.2 row 1).

3.2.3

mTRIM Step 2: Mining Multiple-TF Regulatory Interactions

Besides the individual TF regulatory interactions, a signiﬁcant portion of TFs collaboratively work
together to regulate the same target genes. In order to identify these multiple-TF regulatory interactions, a new association rule mining approach has been developed. Instead of using the concepts of
support and conﬁdence that are commonly used in a conventional association rule mining application [1], we deﬁne an afﬁnity scoring function (called A f nScore) according to the gene expression
agreement between the TF groups and their target genes, to meet the biological meaning of a
multiple-TF regulatory interaction.
Mathematically, A f nScore of each candidate regulatory interaction I =< ht f1 , ht f2 , . . . , ht fm >⇒

50

Table 3.2: Illustrative example of regulatory interaction identiﬁcation on the TRN in Figure 3.1.
I0
I1
I2
I3
I4

Regulatory Interaction
< t f1 ↓>⇒ g ↓
< t f1 ↑,t f2 ↑>⇒ g ↑
< t f2 ↑,t f3 ↓>⇒ g ↑
< t f1 ↑,t f3 ↓>⇒ g ↑
< t f1 ↑,t f2 ↑,t f3 ↓>⇒ g ↑

A f nScore
0.347
0.173
0.347
0.347

p-value
0.05
0.06
0.09
0.06
0.04

hg is calculated with:

A f nScore(I) =

P(ht f1 , ht f2 , . . . , ht fm , hg ) ∗ P(hg )
P(ht f1 , ht f2 , . . . , ht fm )

(3.5)

where P(x) is the number of times that x appears in the given time series gene expression dataset
divided by the product of the total number of time points and the total number of target genes.
The p-value of each candidate regulatory interaction is computed by considering the distribution
of A f nScore for the regulatory interactions with the same number of TFs. Only the candidate
interactions with p-values smaller than 0.05 are reported to the user. Speciﬁcally, if all the TFs in I
are up-regulated, the TFs are “sufﬁcient”; if they are all down-regulated, the TFs are “necessary”;
otherwise, each TF acts differently to drive the target genes to the same direction.
To identify all the signiﬁcant k-TF regulatory interactions, the new association rule mining
algorithm starts with an empty set Qk and all the insigniﬁcant (k − 1)-TF interactions saved in
Pk−1 (see pseudocode in Algorithm 1 line 1). For interactions I1 =< ht f1 , . . . , ht fk−1 >⇒ hg and
I2 =< ht f1 , . . . , ht fk−1 >⇒ hg in Pk−1 , we combine them and compose a new interaction I12 (line
3), if I1 and I2 are combinable. We deﬁne that I1 and I2 are combinable if and only if they satisfy
the conditions that hg = hg , ht fi = ht fi (for i = 1, 2, .., k − 2) and ht fk−1 = ht fk−1 . If none of the
(k − 1)-TF subsets of I12 is signiﬁcant (line 2-8), I12 is added to candidate set C and its A f nScore
is computed. Finally, we compute p-values for all of the k-TF candidates in C using t-test, report

51

all of the signiﬁcant regulatory interactions to the user, and save all the insigniﬁcant ones to Pk for
the identiﬁcation of the (k + 1)-TF regulatory interactions (line 9-17).
Algorithm 1 Procedure Generating Signiﬁcant Patterns
Input: Qk−1 : Set of signiﬁcant (k − 1)-TF regulatory interactions
Pk−1 : Set of insigniﬁcant (k − 1)-TF regulatory interactions
θ : p-value threshold
Output: Qk : Set of signiﬁcant k-TF regulatory interactions
Pk : Set of insigniﬁcant k-TF regulatory interactions
1: Qk ← 0; Pk ← 0; candidate set C ← 0;
/
/
/
2: for all I1 , I2 ∈ Pk−1 with I1 =<ht f1 ,..,ht fk−2 ,ht fk−1 >⇒ hg1 , I2 =<ht f1 ,..,ht fk−2 ,ht fk−1 >⇒ hg2 ,
and hg1 = hg2 do
3: I12 ← <ht f1 ,...,ht fk−2 ,ht fk−1 ,ht fk−1 > ⇒ hg1 ;
4: if none of the (k − 1)-subset of I12 is in Qk−1 then
5:
Compute A f nScore(I12 );
6:
C ← C ∪ {I12 };
7: end if
8: end for
9: for all I ∈ C do
10: Compute p-values pvalue(I);
11: if pvalue(I)<θ then
12:
Qk ← Qk ∪ {I};
13: else
14:
Pk ← Pk ∪ {I};
15: end if
16: end for
17: return Qk and Pk

For an illustrative example, there are 40 possible multiple-TF regulatory interactions in the
regulatory module shown in Figure 3.1. Using the time-series gene expression data in Table 3.1,
all the 2-TF regulatory interaction candidates are screened and their p-values are computed (see
Table 3.2 row 2-4). Since none of the 2-TF regulatory interaction candidates is signiﬁcant, a 3-TF
interaction I4 =< t f1 ↑,t f2 ↑,t f3 ↓>⇒ g ↑ is generated by merging I2 and I3 . The A f nScore of I4
is ((10/24) ∗ (10/24))/(12/24)) = 0.347 and its p-value is 0.04 (see Table 3.2 row 5). Based on I0
and I4 , we conclude that the target genes g1 and g2 are induced by the up-expression of t f1 and t f2
and the down-expression of t f3 , and the same target genes are repressed by the down-expression

52

of t f1 .
In terms of time complexity, consider a candidate k-TF regulatory interaction:
I =< ht f1 , . . . , ht fk >⇒ hg .
The algorithm computes A f nScore and p-values of all of the subsets, I − {t f j } (∀ j = 1, 2, ..., k).
If one of them is signiﬁcant, I is immediately pruned. Hence the time complexity is O(k) for each
candidate k-TFs regulatory interaction. Every merging operation requires at most k − 2 equality
comparisons. In the best-case scenario, it produces a viable candidate k-TF interaction. In the
worst case, the algorithm merges every pair of infrequent (k − 1)-TF candidates. Therefore, the
|T F|

|T F|

overall cost of merging candidates is between ∑k=2 (k − 2)|Pk | and ∑k=2 (k − 2)|Pk−1 |2 , where Pk
is the candidate set of k-TF regulatory interactions. To improve the algorithm efﬁciency, a hash
tree is constructed for the storage and quick access to all of the candidates. Because the maximum
|T F|

depth of the hash tree is k, the cost for populating the hash tree of candidates is O(∑k=2 k|Pk |).
During candidate pruning, it is required to verify whether the k − 1 subsets of every candidate k-TF
regulatory interactions are signiﬁcant. Since the cost for looking up an item in a hash tree is O(k),
the time complex of candidate pruning step is O(∑w k(k − 2)|Pk |).
k=2

3.3

Experimental Results

mTRIM was applied on two independently-constructed yeast transcriptional regulatory networks
(the Harbison dataset [27] and the Reimand dataset [57]) to identify regulatory interactions. For
performance comparison, DREM v3.0 [6] and TRIM [3] were both applied on the same datasets.
We did not compare mTRIM with Yeang’s method [79] because the latter’s objective is to build
a reliable TRN instead of predicting regulatory interactions. We evaluated these methods systematically with three independent sources: single TF knockouts [30] for individual regulatory

53

interactions, genetic interactions (GI) [14] for 2-TF regulatory interactions and synthetic data for
high-order regulatory interactions.
Using the EM-Based Bayesian inference approach, 658 signiﬁcant individual regulatory interactions were mined in the Harbison dataset and 164 signiﬁcant ones were mined in the Reimand
dataset (Table 3.4). The results show that while many individual TFs drive target genes’ behaviors,
it is clear that most of them (4,414 in the Harbison dataset and 1,539 in the Reimand dataset) are
“no decision”. It indicates that a large proportion of TFs need to work collaboratively with other
TFs.
Multiple-TF regulatory interactions were inferred with a new association mining algorithm.
In total, 670 regulatory interactions with multiple TFs were discovered (Table 3.5). The results
show that at most 6 TFs collaboratively regulate the same target genes. All the TF combinations
with more than 6 TFs are either insigniﬁcant or have a signiﬁcant subset. The whole experiments
ﬁnished in 30 minutes on a high performance computer cluster.

3.3.1

Data preparation

Yeast ChIP-chip binding data [27]1 was downloaded, and a p-value cutoff of 0.001 was applied
(the same threshold used in [22]) to obtain the Harbison dataset. It contains 169 TFs, 2,864 target
genes and 6,253 TF-DNA bindings. Next we applied the same statistical approach as in [57] to
ﬁlter the union of the yeast ChIP-chip binding data [42]2 and the binding-site predictions [21, 46]
3

to generate the Reimand dataset with 2,230 TF-DNA binding relationships between 268 TFs

and 1,509 target genes. To obtain the regulatory modules in the TRNs, all the target genes were
1 http://younglab.wi.mit.edu/regulatory_code
2 http://jura.wi.mit.edu/cgi-bin/young_public/navframe.cgi?s=

17&f=downloaddata
3 http://rulai.cshl.edu/SCPD

54

Figure 3.2: In-degree distribution of the yeast TRN for Harbison dataset.
clustered based on their gene expression values with Cluster 3.0 (speciﬁcally, k-means), which uses
Pearson correlation coefﬁcient for gene similarity metric [20], resulting in 50 clusters. The clusters
are then evaluated with Gene Ontology enrichment analysis using Bingo [47], and unenriched
clusters are discarded. To construct regulatory modules from the clustering results, the target genes
that are regulated by the same TFs were partitioned if they are not in the same cluster. Finally, 2,172
and 1,031 regulatory modules were obtained in the Harbison and Reimand networks respectively.
The distribution of genes and regulatory modules (Figure 3.2 reveals that many genes are bound by
multiple TFs). Table 3.3 shows the regulatory modules sizes before and after network partitioning
for both Harbison and Reimand datasets.
To identify the individual and collaborative regulatory interactions in the above datasets, three
widely used time-series microarray datasets (alpha, CDC28 and elu) from yeast cell cycle studies were collected [68] as training data. These datasets contain 49 time points in total. In these
experiments, yeast cells were ﬁrst synchronized to the same cell cycle stage, released from synchronization, and then the total RNA samples were taken at even intervals for a period of time
55

Table 3.3: Regulatory modules size for Harbison and Reimand datasets
Size

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Total
1
2
3
4
5
6
7
8
9
10
Total

Before Network After Network
Partition
Partition
Harbison Dataset
1512
988
599
504
323
296
176
153
102
94
34
34
26
23
31
29
10
8
17
15
11
8
9
2
3
3
4
3
0
0
3
2
1
1
1
1
2
2
2864
2172
Reimand Dataset
1076
630
275
253
87
82
39
36
17
16
7
6
6
6
1
1
0
0
1
1
1509
1031

(Table 2.4). In order to decide whether a gene is signiﬁcantly up or down regulated, a gene expression change cutoff of 0.35 was applied.

56

Table 3.4: The number and type of the regulatory interactions for individual TFs predicted by
mTRIM.
(a) Habrison Dataset
Activator Repressor
Necessary
194
184
Sufﬁcient
118
162
Necessary & Sufﬁcient
29
69
No Decision
4414
(b) Reimand Dataset
Activator Repressor
Necessary
22
43
Sufﬁcient
42
32
Necessary & Sufﬁcient
7
18
No Decision
1543
To evaluate the individual regulatory relations, single-TF knockout microarray data were collected [30], and a p-value cut-off of 0.05 (as used in [30]) was applied to determine whether a
gene is signiﬁcantly affected by a TF knockout. To evaluate the 2-TF regulatory interactions, we
downloaded the SGA genetic interaction dataset [14], which is composed of 1,711 queries crossed
to 3,885 array strains. Of the 1,711 queries, 1,377 are deletion mutants of non-essential genes and
334 are essential gene alleles. The SGA dataset contains 762,146 genetic interactions. Two genes
are genetically interacted if mutations in both of them produce a phenotype that is signiﬁcantly
different to each mutation’s individual effects. In a 2-TF regulatory interaction, if TFs collaboratively regulate the same target genes, the down-regulation of both TFs should have a signiﬁcantly
different phenotype as the down regulation of each individual TF. Therefore, such TF pairs should
have a signiﬁcant p-value in the GI dataset. To evaluate the high-order multiple-TF regulatory
interactions, a synthetic binding network were built, which contains 11 TFs, 17 target genes and
58 regulation/binding relationships. The network also contains two feed forward loops. Corresponding time-series gene expression data containing 500 time-points were randomly generated
with 10% or 40% noise rate.

57

Table 3.5: Number of the multiple-TF regulatory interactions identiﬁed by mTRIM.
Dataset
Harbison
Reimand

3.3.2

2-TF
350
95

3-TF
61
15

4-TF
82
7

5-TF
43
7

6-TF
10
0

Evaluation 1: Single TF Knock-outs

We used the single TF knockout microarray data to evaluate the performance of mTRIM on individual TF regulatory interaction predictions in terms of the identiﬁcation of “necessary” TFs (i.e.,
if the expression values of the target genes are signiﬁcantly changed when the TF is knocked out).
For the Harbison dataset, the prediction precision of mTRIM is 94.44%, higher than the results
of TRIM (82.50%). Using the Reimand dataset, mTRIM has a precision of 91.94%, signiﬁcantly
higher than the results of TRIM (61.54%). DREM is not compared since it does not predict “necessary” TFs.

3.3.3

Evaluation 2: Genetic Interaction

In a regulatory module with two TFs, if both TFs collaborate to regulate the same target genes,
the down-regulation of both TFs should have signiﬁcantly different phenotypes from the downregulation of each individual TF. Therefore, such TF pairs should have a signiﬁcant p-value in the
GI dataset. To this end, for the pairs of TFs that are predicted by mTRIM to work collaboratively,
we adopted the GI dataset [14] for evaluation. Figure 3.3 and Figure 3.4 show the Receiver Operating Characteristic curve (ROC) of mTRIM, TRIM and DREM on Harbison dataset and Reimand
dataset respectively. For Harbison dataset, the area under curve (AUC) of mTRIM is 0.81, much
higher than the AUC of DREM (0.51) and TRIM (0.75). For Reimand dataset, the AUC of mTRIM
is 0.80, higher than DREM (0.52) and TRIM (0.64). In addition, to explore whether the perfor-

58

Figure 3.3: Evaluation of the 2-TF regulatory interactions using genetic interactions on Harbison
dataset.
mance of mTRIM is sensitive to parameter settings, we altered its parameters systematically. For
the Harbison dataset, Figure 3.5 shows the AUC values with different gene expression cutoffs,
GI cutoffs, and p-value cutoffs of A f nScore respectively. Similarly, for Reimand dataset, Figure
3.5 shows the varying of the AUC values using different thresholds. These show that our method
is robust with the GI cutoff and p-value cutoff of A f nScore, although its performance gradually
decreases with the increase of gene expression cutoffs.

3.3.4

Evaluation 3: Synthetic Transcriptional Regulatory Networks

A synthetic transcriptional regulatory network was generated to evaluate the performance of mTRIM
in detecting high-order multiple-TF regulatory interactions (see Figure 3.6). The synthetic network
59

Figure 3.4: Evaluation of the 2-TF regulatory interactions using genetic interactions on Reimand
dataset.
has 28 nodes (11 TFs and 17 target genes) and 58 edges, in which the solid line represents a real
transcriptional regulation and 12 (20.69%) dotted lines represent TF-DNA bindings but no regulation. The dotted lines were added to the network in order to test the precision of mTRIM. For the
synthetic network, two time series gene expression datasets with 500 time-points were generated.
In order to test the robustness of mTRIM, we repeated the simulation test twice with different rates
of noises added to the simulated gene expression data sets.
A comparison between all the three algorithms (see Figure 3.7, Figure 3.8, and Figure 3.9
) indicates that the performance of mTRIM is constantly the best on precision, speciﬁcity and
sensitivity (equivalent to recall). Precisely, the precision of mTRIM is 87.5%, while the precisions
of DREM and TRIM are 62.5% and 66.67% respectively (Figure 3.8). The recall of mTRIM is
60

Figure 3.5: Parameter adjustments on the evaluation of the high-order regulatory interactions using
genetic interactions. (a,d) Different gene expression cutoffs, (b,e) Different GI signiﬁcance cutoffs,
(c,f) Different mTRIM cutoffs on Harbison dataset and Reimand dataset respectively.
signiﬁcantly higher than TRIM because it identiﬁed 4 out of 5 regulatory interactions with more
than two TFs, while TRIM, because of the scalability issue, cannot ﬁnd any regulatory interactions
with more than two TFs (see Figure 3.9). It also shows that mTRIM is less sensitive to the change
of the noise rates from 10% to 40% in the gene expression data than the other two algorithms.

3.4

Conclusion

Revealing the mechanisms of the transcriptional regulatory programs in TRNs is essential for understanding the complex control by which genes are expressed in living cells. The inference of
collaborative protein-DNA functions helps paving the critical path for new drug development. In
61

Figure 3.6: A synthetic transcriptional regulatory network, in which the solid lines represent transcriptional regulations and the dotted lines represent TF-DNA bindings only (meaning binding but
not regulation).
62

Figure 3.7: Comparing mTRIM with DREM and TRIM on two sets of synthetic data with different
noise rates: Speciﬁcity

Figure 3.8: Comparing mTRIM with DREM and TRIM on two sets of synthetic data with different
noise rates: Precision

63

Figure 3.9: Comparing mTRIM with DREM and TRIM on two sets of synthetic data with different
noise rates: Sensitivity.
this chapter, we identify the regulatory interactions between TFs and target genes with mTRIM, an
integration of an EM-based Bayesian inference and a new association rule mining approach built on
a set of basic constraints that relate gene expression patterns to regulatory interactions. mTRIM is
not limited by the number of TFs. The experimental results show that mTRIM is clearly better than
the existing algorithms. We compared in this chapter mTRIM, TRIM and DREM on three independent datasets (i.e., single TF-knockouts, genetic interactions, and synthetic data). Clearly mTRIM
has higher scalability than TRIM, since the latter can only handle up to 2-TF regulatory modules,
missing 31.3% regulatory modules in Harbison dataset and missing 14.4% regulatory modules in
Reimand dataset. Since it is difﬁcult to obtain the ground truth for algorithm performance evaluation, we generated two sets of synthetic data and used them to validate the experimental results.
In the next chapter, we would like to extend this work by including extra data. For example,
since miRNA can degrade the genes induced by certain TFs [36], we will consider miRNA-target
bindings aiming to enhance the performance and understand how different regulators collaborate
to regulate target gene expressions.

64

Chapter 4
Human Cancer Transcription Regulation

4.1

Introduction

As explained in previous chapters, understanding the regulation of gene expression is critical to
decipher the control of biological processes in cellular organisms. At the transcriptional level, the
main regulators are the transcription factors, that can regulate the global gene expressions behavior by activating or repressing their target genes. However, recent studies have revealed another
sort of regulation, that is microRNA (miRNA) regulation at post-transcriptional level [65, 72, 84].
This regulation takes place via mRNA cleavage or translational repression by binding to the 3
untranslated region of target mRNA. In general, a single miRNA can concurrently down-regulate
hundreds of target mRNAs [65, 72, 84]. Hence, understanding the regulatory mechanisms of the
two main regulators (TFs and miRNAs) are necessary to decipher the gene regulation mechanism.
It is necessary to understand the regulatory relationships between TFs, miRNAs and target genes.
However, the combined regulations of miRNAs and TFs are complicated due to several reasons.
The regulation mechanism does not only include the interactions between each regulator and the
target genes, but also involves the interactions between the regulators themselves, and one TF
might cancel the effect of another. In addition, the miRNA might block the effect of the TFs
on their targets. Besides, miRNAs often have overlapping targets and act combinatorially which
creates a complex regulatory networks [60].
Although biological experiments can help discover these complicated relationships, but the
65

process is extremely costly and time consuming. Hence, computational approaches may help understanding such complex relationships. Reverse engineering has been widely used in inferring
TF-controlled transcriptional regulation networks [6, 55, 63]. However, successful applications of
reverse engineering in inferring combinatorial networks involving TFs and miRNAs were rarely
seen except when small-scaled combinatory networks of miRNAs and TFs were mapped around
some selected genes [80]. The major obstacle is the lack of simultaneously measured miRNA
expression data and mRNA expression data. These parallel miRNA expression and miRNA expression expression datasets are continuously released to public [12] but for a set of samples for
example, various tumor samples [45, 80].
The co-regulation of TFs and miRNAs has been studied previously in [65, 84]. The authors in
[65, 84] provided a method to ﬁnd out the shared downstream targets between TFs and miRNAs.
The method then used a statistical tests to measure the signiﬁcance of the shared targets between
the regulators, and to prune out the insigniﬁcant co-regulating interactions.
A rule based approach is proposed in [74], that discover the regulatory interactions that include
miRNAs, TFs, and their targets based on the predicted target bindings. Le Bechec et al. in [41]
utilized target prediction databases to construct a regulatory network that involves miRNAs, TFs,
and mRNAs.
Recently, Huang et al. in [31] developed a web tool (mirConnX) for constructing the regulatory networks that include miRNA, TFs, and mRNAs. Their method used both predicted targets
and expression data to build the network. However, an edge in this network represents association, which may not necessarily indicate regulatory interaction. Zacher et al. in [81] proposed a
Bayesian inference approach that used expression data to infer the activity of miRNAs and TFs
individually, however, this approach does not consider the interactions between miRNAs and TFs.
Le et al. in [72] used Bayesian network learning to learn the network structure to construct
66

a regulatory network from multiple sources of data: gene expression proﬁles of miRNAs, TFs
and target genes. Then, the learned network is further learned to identify the interactions between
miRNAs and TFs. The method also applies a network motif ﬁnding algorithm to further infer the
network. Z. Liang et al. in [44] proposed mirAct, a web tool that uses the negative regulatory
relationship between miRNAs and their target genes to infer the miRNA activity based on geneexpression data. mirAct evaluates the activity change of a miRNA by determining the miRNA
activity in a sample, then it analyze the collective behavior of miRNA activity in different classes
of samples.
In this chapter, we extend the deﬁnition of a regulatory module R, to include a set of regulators
L that regulates a set of target genes via an interaction type I. The regulators includes a set of TFs
and a set of miRNAs. The interaction type I represents the way the regulators L collaborates with
the set of their target genes. We describe TmiRNA (Transcription Regulations and miRNA), a new
model to infer the interactions types between different regulators L and their set of target genes.

4.2

Methods

A regulatory network can be represented as a directed graph in which each node is a TF, a miRNA,
or gene, and each edge pointing from a TF or a miRNA to a gene represents a regulation relationship between them. A regulatory interactions is a self-consistent regulatory unit R(L, G, I)
representing a set of co-expressed genes G = {g1 , g2 , . . . , gn } regulated in concert by a group of
regulators L which includes a set of TFs and miRNAs L = {t f1 ,t f2 , . . . ,t fm , mir1 , mir2 , . . . , mirn }
that govern the target genes’ behaviors via regulatory interaction I [72].

67

4.2.1

Concepts

A regulatory interaction sequence S is a non-empty set of regulators L. A k-sequence is a set
of k regulators and their target gene, the target gene is considered a dummy item in the regulatory sequence and does not contribute to the value of k. In addition, any sequence that includes
only miRNA regulators and a target gene, is also excluded, as miRNA-targets regulatory relations are not the focus of our work. Speciﬁcally, a regulatory interaction sequence at tissues t
St =< ht f1 , . . . , ht fi , . . . , ht fm , amir1 , . . . , amirn >⇒ hg is deﬁned as a set of TFs {t f1 , . . . ,t fm } and a
set of miRNAs {mir1 , . . . , mirn } co-regulating a set of genes {g1 , . . . , gn }, where ht fi is the behavior
of TF i in tissues t; amiri is a miRNA that is bound to g and functioning in tissue t. St is expressed as
St = ht fi , if there is no miRNAs bound to g, or the miRNAs bound to g but not functioning in tissue
t. hg is the behavior of the target genes in t; hx ∈ {↑, ↓, −}, meaning up-express, down-express and
no change respectively.
For example, given three tissues, t1 , t2 , and t3 . t f1 , t f2 , mir1 are bound to gene g1 . mir1
functions in t2 , t3 only. t f1 is up-expressed in t1 , and t2 and down-expressed in t3 , and t f2 is upexpressed in all the tissues. Then we have three regulatory interaction patterns, Pt1 =< t f1 , ↑,t f2 , ↑
, g1 >, Pt2 =< t f1 , ↑,t f2 , ↑, mir1 , g1 >, and Pt3 =< t f1 , ↓,t f2 , ↑, mir1 , g1 >, and the three regulatory
sequences have g1 as a dummy item, i.e, St1 ={Pt1 , g1 },St2 ={Pt2 , g1 }, St3 ={Pt3 , g1 } .

4.2.2

Inferring Regulatory Sequences

In order to identify multiple-TF and miRNAs regulatory interactions, a modiﬁed sequential pattern
mining approach has been developed. A sequential pattern mining approach is deﬁned as follows:
Given a set of regulatory sequences Si , and given a user-speciﬁed threshold, sequential pattern
mining is to ﬁnd all frequent sub-regulatory sequences, i.e., the sub-regulatory sequence whose

68

occurrence frequency in the set of tissues is no less than a given threshold. Instead of using the
concepts of support that is commonly used in a conventional frequent mining application [1, 69],
we deﬁne a scoring function (called Rscore) according to the gene expression agreement between
the regulators groups and their target genes.

Rscore(s) = 2 ∗ ⊕(s) + (s)
where ⊕(s) and

(4.1)

(s) are calculated using in Equation 4.2 and Equation 4.3 respectively.

⊕(s) = Fs+ ∗ log

Fs+
e(+)

(4.2)

(s) = Fs− ∗ log

Fs−
e(−)

(4.3)

where e+ and e− are computed using Equation 4.4 and Equation 4.5 respectively and Fs+ and
Fs− are the frequency of s in the positive set and the negative set respectively. If the target gene
direction in s is up, the positive set is the set of all sequences that up-regulates their targets, and the
negative set is the set of all sequences that down-regulate or does not change their targets. If the
target gene direction in s is down, the positive set is the set of all sequences that down-regulates
their targets, and the negative set is the set of all sequences that up-regulate or does not change
their targets.

e(+) = Ns ∗ log

69

F+
B

(4.4)

e(−) = Ns ∗ log

F−
B

(4.5)

where F(+) and F(−) are the number of positive set and negative set respectively. B is the
background, the number of up-regulated, down-regulated and no-change genes in every tissue.
The key steps of our approach is candidate generations and is explained in details in the following sub-section.

4.2.2.1

Candidate Generations

In TmiRNA, each sequence contains only one element composed of a set of items. An item could
be a TF or a miRNA. A frequent k-sequence is a sequence with a p-value of the Rscore less than
the user-speciﬁed threshold. Let Pk denote the set of all frequent k-sequences,and Ck the set of
candidate k-sequences. Given a sequence S and a sub-sequence c, c is a contiguous subsequence
of s if any of the following conditions hold:
1. c is derived from s by dropping an item (not the target gene);
2. c is a contiguous subsequence of c , and c is a contiguous sub-sequence of s.
Canidate generation is generated in two steps:
1. Join Phase: We generate candidate sequences by joining Ck−1 with Ck−1 . A sequence s1
is joined with s2 if the subsequence obtained by dropping the last item of s1 is obtained by
dropping the last item of s2 , and the dummy item (target gene) of s1 is equal to the dummy
item of s2 . The candidate sequence generated by joining s1 and s2 is extended with the last
item in s2 .

70

2. Prune Phase: Compute the Rscore for every k-sequence using Equation 4.1. A k-sequence
s is signiﬁcant if the p-value of its Rscore within all k-sequences distribution is less than a
user-given threshold. For any signiﬁcant sequence s that has a k − 1 contiguous subsequence
in P, delete s, otherwise add s to P. For any insigniﬁcant sequence, it is added to candidate
set C.
The algorithm starts with all C1 -sequences. A ci in C1 possibly includes each T Fi and its target
gene. The Rscore of each ci ∈ C1 sequence is computed using equation 4.1. Then, the prune phase
is applied. Speciﬁcally, ∀ci ∈ C1 , if ci is signiﬁcant, it is added to P and is output to the user.
Otherwise, ci is added to C. Then, the join phase is applied to join Ck−1 with Ck−1 sequences. Any
signiﬁcant sequence c that does not have a k − 1 contiguous subsequence in P is reported to the
user and added to P, otherwise, it is deleted.
For counting the candidates, we followed the same approach used in [69]. Given a set of
candidates sequences C and a data-sequence S, the goal is to ﬁnd all sequences in C that are found
in S. A hash-tree data structure is used to reduce the number of candidates in C.

4.2.3

Constructing Regulatory Modules

Starting with the binding network as an initial regulatory network, each gene constructs a separate
regulatory network with its regulators (TFs and miRNAs). Then TmiRNA is used to ﬁnd signiﬁcant
regulatory sequences. A set of genes that have the same directions and regulated by the same set
of regulators are grouped into the same regulatory network. Each signiﬁcant regulatory interaction
that drive the same set of target genes G to the same direction is considered a replacement for the
target genes set G.

71

4.3

Experimental Results

We applied TmiRNA on human NCI-60 cancer cell lines to infer regulatory interactions. We downloaded series GSE5846 Gene Expression Omnibus. The series represents NCI-60 cancer cell lines,
which were proﬁled with their genome-wide expression patterns using Affymetrix HG-U133A
chips. The Total RNA sample of each of the NCI-60 cell lines was obtained before the treatment
of any anticancer compound. It contains 60 samples. In total, we got 12385 gene ensembles.
We used miRNA-targets binding dataset from a collection of seven algorithm/databases i.e.,
TargetScan[43] (release 5.2), PITA [37], Starbase [78], HitSensor[83], miR2Disease[35], miRecords
[77]and TarBASE [64]. The sequences of mature miRNAs were collected from the miRBase (release 19) and 3’ UTR sequences of mRNAs were collected from the UCSC Genome Browser
(http://genome.ucsc.edu). We used only the binding relation that has been predicted by at least two
prediction methods. As a result, we have 564934 relations. We also used binding data obtained
from [80]. The data contains 4289 relations including 2850 TF to gene relations, 1439 miRNA to
gene relations. To combine the data-set, and save balance between both sets, we used only genes
that have at least one TF connection. As a result, we have 77684 miRNAs to gene relations and
2850 TF to gene relations. See Figure 4.2 for TFs distributions in human TRN. We can see that
71.75% of genes are regulated by individual TFs, 21.79% of genes are regulated by two TFs, and
6.45% of the genes are regulated by three or four TFs. We used SAM in [75] to identify genes
with statistically signiﬁcant changes in expression in speciﬁc tissues. As the genes expressions ﬁts
Normal distribution (See Figure 4.3 for ﬁtting the gene expression levels distribution to Normal
distribution).
We used miTEA in [70] to determine whether a given miRNA functions in a speciﬁc tissue.
miTEA [70] is a framework for miRNA target enrichment analysis. miTEA uses a ranked list of

72

Figure 4.1: A distribution of miRNAs that are active in different tissues.
genes to ﬁnd miRNAs of which their targets are enriched in the top of the the ranked list. We
provided miTEA with a list of ranked genes based on their down-regulation, as identiﬁed by SAM,
miTEA reports the list of miRNAs that functions in tissue t. See Figure 4.1 for the distributions of
miRNAs that function in different tissues. We can see that 50.95% of miRNAs are active in four
tissues, 12.59% of miRNAs are not active in any of the given tissues, and 36.54 % of miRNAs are
active in 8 or more tissues.
Since there is no ground truth for evaluating our model. We used gene Ontology functional
enrichment and synthetic data to give an insight on the performance of our model.

73

Figure 4.2: In-degree distribution of the Human TRN

4.3.1

Evaluating TmiRNA Regulatory Modules using Gene Ontology Functional Enrichment

To evaluate the coherence of the regulatory modules constructed by TmiRNA, a functional analysis
consisted on Fishers exact (two-tailed) test implemented in DAVID v6.7 1 [66] was used to identify
the functional categories enriched among all target genes associated (FDR-adjusted P-value 0.05).
Categories that at least 75% of their genes are signiﬁcant at FDR-adjusted P-value 0.05 are considered enriched. We compare the biological meaningfulness of the regulatory modules constructed
by TmiRNA to those constructed by network partitioning as describe in Chapter two and three.
Speciﬁcally, in network partitioning, we cluster all the target genes in a TRN based on their gene
expression values with Cluster 3.0 (speciﬁcally, k-means), which uses Pearson correlation coefﬁcient for gene similarity metric [20]. Genes in the same cluster are considered to be co-expressed.
1 http://david.abccncifcrf.gov/

74

Figure 4.3: Fitting gene expression Levels to normal distribution

Figure 4.4: Functional enrichment evaluation of TmiRNA under different categories (a) Biological
Process, (b) Molecular Functions, and (c) Cellular Components.
And the co-expressed and co-regulated genes are usually weighed to be regulated by the same
TF(s) with a similar interaction model. For the target genes that are regulated by the same set TFs
and miRNAs, we partition them based on whether they are in the same cluster to construct their
regulatory modules.
Figure 4.4 shows a comparison between functional enrichment of the regulatory networks gen-

75

erated by TmiRNA, and regulatory networks generated by network partitioning, under 3 categories,
(a) biological process, (b) molecular functions, and (c) cellular components. Under biological
process category, 75% of regulatory modules of TmiRNA are enriched, higher than network portioning regulatory modules (12.90%). Under the molecular functions category, 62.50% of TmiRNAs regulatory modules are enriched, signiﬁcantly higher than network partitioning regulatory
modules (16.12%). Under cellular components category, 100% the regulatory modules formed
by TmiRNA are enriched while only 48.38% of the network partitioning modules are enriched.
Clearly, TmiRNA better clusters genes into biologically meaningful groups.

4.3.2

Evaluating TmiRNA using Synthetic miRNA-TF Regulatory Networks

A synthetic transcriptional regulatory network was generated to evaluate the performance of TmiRNA
detecting high-order multiple-TF regulatory interactions (see Figure 4.5). The synthetic network
has 61 nodes (21 TFs, 14 miRNAs, and 26 target genes) and 104 edges, in which the solid line
represents a real transcriptional regulation dotted lines represent TF-DNA bindings but no regulation. For the synthetic network, two time series gene expression data-sets with 500 time-points
were generated with a different (10% and 40%) noise rate. The networks contains 6 loops.
The synthetic network simulates the case where a set of k-TFs collaboratively regulates a target
k

gene, we design the gene expressions such that, if (hg = 0) ⇒ abs( ∑ ht fi = k) and if (ht fi = 0 for
i=1

i ∈ {1, 2, . . . , k}) ⇒ hg = 0. A comparison between all the TmiRNA and mTRIM algorithms (see
Figure 4.6 and Figure 4.7) indicates that the performance of TmiRNA is slightly better in terms
of precision and recall. This can be explained by the synthetic network design that considers the
miRNA-target binding as regulation as we cannot use miTEA on synthetic miRNAs to infer active
miRNAs in various tissues.

76

Figure 4.5: Synthetic network used for evaluating TmiRNA.

77

Figure 4.6: Performance of TmiRNA on synthetic data with 10% noise rate.

Figure 4.7: Performance of TmiRNA on synthetic data with 40% noise rate.

78

4.4

Conclusion

Considerable efforts have been considered to explore the transcriptional regulatory networks such
that the TFs are the main regulator. Other studies investigated the post-transcriptional regulatory
networks where miRNAs only are the main regulators. In the previous Chapter, we proposed
mTRIM, an approach to infer collaborations of TFs in transcriptional regulatory networks. However, mTRIM ignores the post-transcriptional regulation of miRNAs. In this work, we proposed
TmiRNA, a new approach to infer the regulatory interactions between multiple regulators, that
involves both TFs and miRNAs. TimRNA shows better results than mTRIM. In addition, gene
Ontology functional enrichment shows that the regulatory modules inferred by TmiRNA have better coherence.

79

Chapter 5
Conclusions and Future Work
In this chapter we conclude our work and outline directions for future work.

5.1

Conclusions

The overall goal of the approaches presented in this dissertation, is to infer the regulatory interactions between multiple TFs and the set of their target genes.
In chapter one, we provided a brief background about transcriptional regulations. We described
in short both the experimental and computational approaches used to study transcription regulation.
We provided a short introductory to the computational models used in the rest of the dissertation.
In chapter two, we deﬁned the regulatory interaction as a set of one or two TFs’ interactions
with their set of target genes. We described TRIM, an HMM based approach that infer the regulatory interactions of at most 2-TFs and the set of their target genes. TRIM showed promising results
on two organisms, yeast and Arapidopsis. Thus TRIM is a useful tool to help predict the phenotype
of TF double-knockouts to reduce the number of double knock-out or over-expression experiments
needed. However, it is common that a set of TFs (one or more) collaboratively regulate a set of
target genes. In order to infer the interactions between multiple TFs and their set of target genes,
we need to a new model that can address the scalability issue of combining multiple TFs.
In chapter three, we designed a new model mTRIM, a Bayesian inference and an association
rules based approach that infer the regulatory interactions between multiple TFs and their set of
80

target genes. mTRIM captures the least number of TFs that can derive their set of target genes to
a speciﬁc direction. Unlike TRIM, mTRIM has no limitations on the number of TFs it handles.
However, mTRIM does not take the effect of miRNAs on their targets into considerations.
In chapter four, we extended the deﬁnition of regulatory interactions to include not only the
TFs but the miRNAs as well. We presented TmiRNA, a sequential pattern mining approach that
infers the regulatory interactions of multiple TFs, miRNAs, and their set of target genes. TmiRNA
is based on an afﬁnity score that takes the negative hits of the regulators into account, which
helps capturing the interactions that are signiﬁcantly different from the negative set. However,
TmiRNA doesn’t utilize miRNA expressions which could potentially enhance the performance of
the inference.
Although the approaches detailed in this dissertation shows successful results and better performance when compared with competitor models, however these approached don’t address TFmiRNA regulation, the cases when a TF regulates a miRNA. These approaches also don’t address
miRNA-TF regulation, the cases when a miRNA down-regulates a TF.

5.2

Future Work

Although the approaches detailed in this dissertation infer multiple regulators interactions with
their set of targets, there is still plenty of room for improvements. The intended future work falls
into the following directions:
• The approaches detailed in this dissertation infer collaborations between transcription factors. However, the proposed approaches can’t infer the competence between TFs. TFs compete with each others. This competition is achieved via the binding of the TFs to overlapping
binding sites. The winner of this competition depends on some factors that include the TF
81

concentrations, the number of TFs, and the quantitative nature of the protein–protein and
protein–DNA interactions. We need to design new models that can take these factors into
considerations.
• In the presented approaches, we used static binding of transcription factors to their targets.
As a result, the temporal order of binding of transcriptional factors to their target has not been
examined. In our future work, we plan to integrate and analyze dynamic binding information
of transcription factors to their targets which helps revealing the hierarchical circuit in which
a combinatorial set of transcription factors target distinct sets of genes at different times.
• We also plan to study metabolic behaviours that can be controlled by gene networks. Microorganisms engage in continuous adaptations of their physiology as they are exposed to
changing environments. Gene networks that control metabolism should restore metabolic
functions upon environmental changes. Although metabolic networks are better understood
than their corresponding gene networks, the identity of the metabolites that regulate the activity of transcription factors of metabolic genes and the kinetics of reactions in the gene
network are harder to be determined. As a result, it is not yet understood which metabolic
behaviours can be controlled by gene networks as well as the functional limits of gene networks. Whether gene networks can optimise metabolic functions is an open question we
need to address in our future work.
• Finally, we need to use indirect biological evidences including multiple TF knockouts, metabolic
pathways, protein-protein interactions, etc., for biological validation.

82

BIBLIOGRAPHY

83

BIBLIOGRAPHY
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. Proc. of 20th Intl.
Conf. on VLDB, pages 487–499, 1994.
[2] S. Awad and J. Chen. Inferring transcription factor collaborations in gene regulatory networks. BMC Systems Biology, In Press, 2013.
[3] S. Awad, N. Panchy, S. Ng, and J. Chen. Inferring the regulatory interaction types of
transcription factors in transcriptional regulatory networks. Journal of Bioinformatics and
Computational Biology, 10(5):1250012, 2012.
[4] O. Babur, E. Demir, M. Gnen, C. Sander, and U. Dogrusoz. Discovering modulators of gene
expression. Nucleic Acids Research, 38:5648–5656, 2010.
[5] S. Balaji, M. Babu, M. Iyer, M. Luscombe, and L. Aravind. Comprehensive analysis of
combinatorial regulation using the transcriptional regulatory network of yeast. Molecular
Biology, 360:213–227, 2001.
[6] Z. Bar-Joseph, G. Gerber, T. Lee, N. Rinaldi, J. Yoo, F. Robert, B. Gordon, E. Fraenkel,
T. Jaakkola, R. Young, and K. Gifford. Computational discovery of gene modules and regulatory networks. Nature Biotechnology, 21:1337–1342, 2003.
[7] Z. Bar-Joseph, A. Gitter, and I. Simon. Studying and modelling dynamic biological processes
using time-series gene expression data. Nature Genetics Reviews, 13, 2012.
[8] R. Bonneau, D. Reiss, P. Shannon, M. Facciotti, L. Hood, N. Baliga, and V. Thorsson. The inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology
data sets de novo. Genome Biology, 7:R36, 2006.
[9] M. Cascante, L. Boros, B. Comin-Anduix, P. De Atauri, J. Centelles, and P. Le. Metabolic
control analysis in drug discovery and disease. Nature Biotechnology, 20:243 – 249, 2002.
[10] J. Chen, Z. Hu, M. Phatak, J. Reichard, J. Freudenberg, S. Sivaganesan, and M. Medvedovic.
Genome-wide signatures of transcription factor activity: Connecting transcription factors,
disease, and small molecules. PLoS Comput Biol, 9, 2013.

84

[11] G. Chena, P. Sullivanb, and M. Kosorok. Biclustering with heterogeneous variance. PNAS,
10:12253–12258, 2013.
[12] S. Cho, Y. Jun, S. Lee, H. Choi, S. Jung, Y. Jang, C. Park, S. Kim, S. Lee, and W. Kim.
mirgator v2.0: an integrated system for functional investigation of micrornas. Nucleic Acids
Research, 39:D158D162, 2011.
[13] W. Cookson, L. Liang, G. Abecasis, M. Moffatt, and M. Lathrop. Mapping complex disease
traits with global gene expression. Nature Genetics Reviews, 10:184–194, 2009.
[14] M. Costanzo, A. Baryshnikova, J. Bellay, et al. The genetic landscape of a cell. Science,
327:425–431, 2010.
[15] B. Deplancke, D. Dupuy, M. Vidal, and J. Walhout. A gateway-compatible yeast one-hybrid
system. Genome Research, 14(10b):2093–2101, 2004.
[16] B. Deplancke, A. Mukhopadhyay, W. Ao, M. Elewa, A. Grove, J. Martinez, R. Sequerra,
L. Doucette-Stamm, S. Reece-Hoyes, A. Hope, A. Tissenbaum, E. Mango, and M. Walhout.
A gene-centered c.elegans protein-dna interaction network. Cell, 125:1193–1205, 2006.
[17] D. Deutscher, I. Meilijson, S. Schuster, and E Ruppin. Can single knockouts accurately single
out gene function? BMC Systems Biology, 2, 2008.
[18] R. Duda, P. Hart, and D. Stork. Pattern Classiﬁcation. 2nd edition, 2001.
[19] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: probabilistic
models of proteins nucleic acids.
[20] M. Eisen, P. Spellman, P. Brown, and D. Botstein. Cluster analysis and display of genomewide expression patterns. PNAS, 95:14863–14868, 1998.
[21] I. Erb and E. Nimwegen. Statistical features of yeast’s transcriptional regulatory code.
Proceeding of International Conference on Computational and System Biology, 1:111–118,
2006.
[22] J. Ernst, O. Vainas, C. Harbison, I. Simon, and Z. Bar-Joseph. Reconstructing dynamic
regulatory maps. Molecular Systems Biology, 3(74):1–13, 2007.
[23] N. Friedman. Probabilistic models for identifying regulation networks. Bioinformatics,
19:II57, 2003.

85

[24] B. Futcher. Transcriptional regulatory networks and the yeast cell cycle. Current Opinion in
Cell Biology, 14:676–683, 2002.
[25] S. Gygi, Y. Rochon, R. Franza, and R. Aebersold. Correlation between protein and mrna
abundance in yeast. Molecular and Cellular biology, 19:1720–1730, 1999.
[26] D. Hagen, G. McCaffrey, and G. Sprague. Pheromone response elements are necessary and
sufﬁcient for basal and pheromone-induced transcription of the fus1 gene of saccharomyces
cerevisiae. Molecular and Cellular biology, 11(6):2952–2961, 1991.
[27] K. Harbison, B. Gordon, T. Lee, N. Rinaldi, K. Macisaac, T. Danford, N. Hannett,
J. Tagne, D. Reynolds, J. Yoo, E. Jennings, J. Zeitlinger, D. Pokholok, M. Kellis, A. Rolfe,
K. Takusagawa, E. Lander, D. Gifford, E. Fraenkel, and R. Young. Transcriptional regulatory
code of a eukaryotic genome. Nature, 431:99–104, 2004.
[28] A. Honkela, C. Girardot, E. Gustafson, Y. Liu, E. Furlongb, N. Lawrence, and M. Rattray.
Model-based method for transcription factor target identiﬁcation with limited data. PNAS,
107:7793–7798, 2010.
[29] S. Hoth, M. Morgante, J. Sanchez, M. Hanafey, S. Tingey, and N. Chua. Genome-wide gene
expression proﬁling in arabidopsis thaliana reveals new targets of abscisic acid and largely
impaired gene regulation in the abi1-1 mutant. Journal of Cell Science, 115:4891–4900,
2006.
[30] Killion P. Hu, Z. and V. Iyer. Genetic reconstruction of a functional transcriptional regulatory
network. Nature Genetics, 39:683–687, 2007.
[31] G. Huang, C. Athanassiou, and P. Benos. mirconnx: condition-speciﬁc mrna-microrna network integrator. Nucleic Acids Research, 39:W416W423, 2011.
[32] T. Ideker and V. Thorsson. Discovery of regulatory interactions through perturbation: Inference and experimental design. Paciﬁc Symposium of Biocomputing, 5, 2000.
[33] J. Ihmels, G. Friedlander, S. Bergmann, O. Sarig, Y. Ziv, and N. Barkai. Revealing modular
organization in the yeast transcriptional network. Nature Genetics, 31:370–377, 2002.
[34] R. Jansen. Studying complex biological systems using multifactorial perturbation. Nature
Genetics, 4:145–151, 2003.
[35] Q. Jiang, Y. Wang, Y. Hao, L. Juan, M. Teng, X. Zhang, M. Li, G. Wang, and Y. Liu.
miR2Disease: a manually curated database for microRNA deregulation in human disease.
Nucleic acids research, 37(suppl 1):D98–D104, 2009.
86

[36] J. Joung, K. Hwang, J. Nam, S. Kim, and B. Zhang. Discovery of microrna-mrna modules
via population-based probabilistic learning. Bioinformatics, 23:1141–1147, 2007.
[37] M. Kertesz, N. Iovino, U. Unnerstall, U. Gaul, and E. Segal. The role of site accessibility in
microRNA target recognition. Nature Genetics, 39(10):1278–1284, October 2007.
[38] J. Kilian, D. Whitehead, J. Horak, D. Wanke, S. Weinl, O. Batistic, C. D’Angelo,
E. Bornberg-Bauer, J. Kudla, and K. Harter. The atgenexpress global stress expression data
set: protocols, evaluation and model data analysis of uv-b light, drought and cold stress responses. Plant Journal, 50:347–363, 2007.
[39] J. Kreps, Y. Wu, H. Chang, T. Zhu, X. Wang, and J. Harper. Transcriptome changes for
arabidopsis in response to salt, osmotic, and cold stress. Plant Physiol, 130:2129–2141,
2002.
[40] H. Le and B. Ziv. Inferring transcription factor collaborations in gene regulatory networks.
Bioinformatics, 29:i89i97, 2013.
[41] A. Le Bchec, E. Portales-Casamar, G. Vetter, M. Moes, P. Zindy, A. Saumet, D. Arenillas,
C. Theillet, W. Wasserman, C. Lecellier, and E. Friederich. Mir@nt@n: a framework integrating transcription factors, micrornas and their targets to identify sub-network motifs in a
meta-regulation network model. BMC Bioinformatics, 12, 2011.
[42] T. Lee, N. Rinaldi, F. Robert, D. Odom, Z. Bar-Joseph, G. Gerber, N. Hannett, C. Harbison,
C. Thompson, I. Simon, J. Zeitlinger, E. Jennings, H. Murray, B. Gordon, B. Ren, J. Wyrick,
J. Tagne, T. Volkert, E. Fraenkel, D. Gifford, and R. Young. Transcriptional regulatory networks in saccharomyces cerevisiae. Science, 298:799–804, 2002.
[43] B. Lewis, C. Burge, and D. Bartel. Conserved seed pairing, often ﬂanked by adenosines,
indicates that thousands of human genes are microRNA targets. Cell, 120:15–20, 2005.
[44] Z. Liang, H. Zhou, Z. He, H. Zheng, and J. Wu. miract: a web tool for evaluating microrna
activity based on gene expression data. Nucleic Acids Research, 39:W139–W144, 2011.
[45] J. Lu, G. Getz, E. Miska, E. Alvarez-Saavedra, J. Lamb, D. Peck, A. Sweet-Cordero, B. Ebert,
R. Mak, A. Ferrando, J. Downing, T. Jacks, R. Horvitz, and T. Golub. Microrna expression
proﬁles classify human cancers. Nature, 435:834–838, 2005.
[46] K. MacIsaac, T. Wang, B. Gordon, D. Gifford, G. Stormo, and E. Fraenkel. An improved
map of conserved regulatory sites for saccharomyces cerevisiae. BMC Bioinformatics, 7:113,
2006.
87

[47] S. Maere, K. Heymans, and M. Kuiper. Bingo: a cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioninformatics, 21:3448–3449,
2005.
[48] M. Maienschein-Cline, J. Zhou, K. White, R. Sciammas, and A. Dinner. Discovering transcription factor regulatory targets using gene expression and binding data. Bioinformatics,
28:206–213, 2011.
[49] F. Markowetz. How to understand the cell by breaking it: Network analysis of gene perturbation screens. PLoS Comput Biol, 6, 2010.
[50] C. Miller, B. Schwalb, K. Maier, D. Schulz, S. Dumcke, B. Zacher, A. Mayer, J. Sydow,
L. Marcinowski, L. Dolken, D. Martin, A. Tresch, and P. Cramer. Dynamic transcriptome
analysis measures rates of mrna synthesis and decay in yeast. Molecular Systems Biology,
(458):1–13, 2010.
[51] S. Mukherjee and S. Mitra. Hidden markov models, grammars, and biology: a tutorial.
Journal of Bioinformatics and Computational Biology, 3:491–526, 2005.
[52] I. Ong, J. Glasner, and D. Page. Modelling regulatory pathways in e. coli from time-series
expression proﬁles. Journal of Bioinformatics, 18:S241–S248, 2002.
[53] S. Palaniswamy, S. James, H. Sun, R. Lamb, R. Davuluri, and E. Grotewold. Agris and
atregnet: A platform to link cis-regulatory elements and transcription factors into regulatory
networks. Plant Physiology, 140:818–829, 2006.
[54] P. Park. Chipseq: advantages and challenges of a maturing technology. Nature Reviews
Genetics, 10(10):669–680, 2009.
[55] J. Qian, J. Lin, N. Luscombe, H. Yu, and M. Gerstein. Prediction of regulatory networks: genome-wide identiﬁcation of transcription factor targets from gene expression data.
Bioinformatics, 19:1917–1926, 2003.
[56] P. Qiu. Recent advances in computational promoter analysis in understanding the transcriptional regulatory network. Biochemical. and Biophysical Research Communications,
309:495–501, 2003.
[57] J. Reimand, J. Vaquerizas, A. Todd, J. Vilo, and N. Luscombe. Comprehensive reanalysis of
transcription factor knockout expression data in saccharomyces cerevisiae reveals many new
targets. Nucleic Acids Research, 38:4768–4777, 2010.

88

[58] B. Ren, F. Robert, J. Wyrick, O. Aparicio, E. Jennings, I. Simon, J. Zeitlinger, J. Schreiber,
N. Hannett, E. Kanin, T. Volkert, C. Wilson, S. Bell, and R. Young. Genome-wide location
and function of dna binding proteins. Science, 290:2306–2309, 2000.
[59] W. Schmitt, R. Raab, and G. Stephanopoulos. Elucidation of gene interaction networks
through time-lagged correlation analysis of transcriptional data. Genome Res, 14(8):1654–
1663, 2004.
[60] M. Schulza, K. Panditb, C. Cardenasb, N. Ambalavananc, N. Kaminskib, and Z. Bar-Joseph.
Reconstructing dynamic microrna-regulated interaction networks. PNAS, 2013.
[61] E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller, and N. Friedman. Module
networks: identifying regulatory modules and their condition-speciﬁc regulators from gene
expression data. Nature Genetics, 34:166–167, 2003.
[62] E. Segal, B. Taskar, A. Gasch, N. Friedman, and D. Koller. Rich probabilistic models for
gene expression. Bioinformatics, 17:S243–52, 2001.
[63] E. Segal, R. Yelensky, and D. Koller. Genome-wide discovery of transcriptional modules
from dna sequence and gene expression. Bioinformatics, 19:i273–i282, 2003.
[64] P. Sethupathy, B. Corda, and A. Hatzigeorgiou. TarBase: A comprehensive database of
experimentally supported animal microRNA targets. RNA, 12(2):192–197, 2006.
[65] R. Shalgi, D. Lieber, M. Oren, and Y. Pilpel. Global and local architecture of the mammalian
micrornatranscription factor regulatory network. PLoS Comput Biol, 3:e131, 2007.
[66] B. Sherman, D. Huang, Q. Tan, Y. Guo, S. Bour, D. Liu, R. Stephens, M. Baseler, C. Lane,
and R. Lempicki. A gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. BMC Bioinformatics, 8, 2007.
[67] I. Shmulevich, E. Dougherty, S. Kim, and W. Zhang. Probabilistic boolean networks: a rulebased uncertainty model for gene regulatory networks. Bioinformatics, 18:261–274, 2002.
[68] P. Spellman, G. Sherlock, M. Zhang, V. Iyer, K. Anders, M. Eisen, P. Brown, D. Botstein,
and B. Futcher. Comprehensive identiﬁcation of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9:3273–
3297, 1998.
[69] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance
improvements. EDBT 96 Proceedings of the 5th International Conference on Extending
Database Technology: Advances in Database Technology, 1057:1–17, 1996.
89

[70] I. Steinfeld, R. Navon, R. Ach, and Z. Yakhini. mirna target enrichment analysis reveals
directly active mirnas in health and disease. Nucleic Acids Research, 41:e45, 2013.
[71] T. Thorne and M. Stumpf.
Inference of temporally varying bayesian networks.
Bioinformatics, 24:3298–3305, 2012.
[72] Le. Thuc, L. Liu, B. Liu, A. Tsykin, G. Goodall, K. Satouand, and J. Li. Inferring microrna
and transcription factor regulatory networks in heterogeneous. BMC Bioinformatics, 14,
2013.
[73] A. Tong and C. Boone. Synthetic genetic array analysis in saccharomyces cerevisiae.
Methods in Molecular Biology, 313:171–191, 2006.
[74] D. Tran, K. Satou, T. Ho, and T. Pham. Computational discovery of mir-tf regulatory modules
in human genome. Bioinformation, 8:371377, 2010.
[75] V. Tusher, R. Tibshirani, and G. Chu. Signiﬁcance analysis of microarrays applied to the
ionizing radiation response. PNAS, 98:51165121, 2001.
[76] A. Valouev, D. Johnson, A. Sundquist, C. Medina, E. Anton, S. Batzoglou, R. Myers, and
A. Sidow. Genome-wide analysis of transcription factor binding sites based on chip-seq data.
Nature Methods, 5:829–834, 2008.
[77] F. Xiao, Z. Zuo, G. Cai, S. Kang, X. Gao, and T. Li. miRecords: an integrated resource for
microRNA–target interactions. Nucleic acids research, 37(suppl 1):D105–D110, 2009.
[78] J. Yang, J. Li, P. Shao, H. Zhou, Y. Chen, and L. Qu. starBase: a database for exploring microRNA-RNA interaction maps from Argonaute CLIP-Seq and Degradome-Seq data.
Nucleic Acids Research, 39(suppl 1):D202–D209, 2011.
[79] H. Yeang and T. Jaakkola. Modeling the combinatorial functions of multiple transcription
factors. Journal of Computational Biology, 13:463–480, 2006.
[80] H. Yu, K. Tu, Y. Wang, J. Mao, L. Xie, Y. Li, and Y. Li. Combinatorial network of transcriptional regulation and microrna regulation in human cancer. BMC Systems Biology, 6,
2012.
[81] B. Zacher, K. Abnaof, S. Gade, E. Younesi, A. Treschand, and H. Frhlich. Joint bayesian
inference of condition speciﬁc mirna and transcription factor activities from combined gene
and microrna expression data. Bioinformatics, 28:17141720, 2012.

90

[82] Q. Zhang and M. andersen. Dose response relationship in anti-stress gene regulatory networks. PLoS Comput Biol, 3:e24, 2007.
[83] Y. Zheng and W. Zhang. Animal microRNA target prediction using diverse sequence-speciﬁc
determinants. Journal of Bioinformatics and Computational Biology, 8(4):763–788, 2010.
[84] Y. Zhou, J. Ferguson, J. Chang, and Y. Kluger. Inter and intra combinatorial regulation by
transcription factors and micrornas. BMC Genomics, 8, 2007.
[85] C. Zou and J. Feng. Granger causality vs. dynamic bayesian network inference: a comparative
study. BMC Bioinformatics, 10, 2009.

91