STRUCTURES AND BOOLEAN DYNAMICS IN
GENE REGULATORY NETWORKS
By
Anthony Szedlak

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Physics – Doctor of Philosophy
2017

ABSTRACT
STRUCTURES AND BOOLEAN DYNAMICS IN
GENE REGULATORY NETWORKS
By
Anthony Szedlak
Cancer is a complex family of diseases primarily characterized by the accumulation of mutations in cells within multicellular organisms, leading to accelerated proliferation and aggressive
competition over limited space and resources. This is coupled with a loss of DNA replication quality control mechanisms, creating genetically heterogeneous clonal lines within patients and even
within individual tumors. Each of these clonal lines may respond differently to the same perturbations, making the system naturally resistant to drug therapies. Great progress has been made in
identifying and deciphering the roles of driver mutations in individual genes, but the mechanisms
of oncogenesis can only be truly understood in the context of the misregulation of dynamical processes in cells’ underlying gene regulatory networks (GRNs).
This dissertation discusses the topological and dynamical properties of GRNs in cancer, and
is divided into four main chapters. First, the basic tools of modern complex network theory are
introduced. These traditional tools as well as those developed by myself (set efficiency, interset
efficiency, and nested communities) are crucial for understanding the intricate topological properties of GRNs, and later chapters recall these concepts. Second, the biology of gene regulation is
discussed, and a method for disease-specific GRN reconstruction developed by our collaboration is
presented. This complements the traditional exhaustive experimental approach of building GRNs
edge-by-edge by quickly inferring the existence of as of yet undiscovered edges using correlations
across sets of gene expression data. This method also provides insight into the distribution of common mutations across GRNs. Third, I demonstrate that the structures present in these reconstructed
networks are strongly related to the evolutionary histories of their constituent genes. Investigation
of how the forces of evolution shaped the topology of GRNs in multicellular organisms by growing outward from a core of ancient, conserved genes can shed light upon the “reverse evolution” of

normal cells into unicellular-like cancer states. Next, I simulate the dynamics of the GRNs of cancer cells using the Hopfield model, an infinite range spin-glass model designed with the ability to
encode Boolean data as attractor states. This attractor-driven approach facilitates the integration of
gene expression data into predictive mathematical models. Perturbations representing therapeutic
interventions are applied to sets of genes, and the resulting deviations from their attractor states are
recorded, suggesting new potential drug targets for experimentation. Finally, I extend the Hopfield
model to modular networks, cyclic attractors, and complex attractors, and apply these concepts
to simulations of the cell cycle process. Futher development of these and other theoretical and
computational tools is necessary to analyze the deluge of experimental data produced by modern
and future biological high throughput methods.

This dissertation is dedicated to my family and my teachers

iv

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my adviser, Carlo Piermarocchi, supporting me throughout my dissertation work, and for always encouraging me to tilt at my own ideas, be they giants or
windmills.
I would also like to thank our past and present collaborators in San Diego: Giovanni Paternostro, Nicholas Smith, Yunyi Kang, Andrew Hodges, and Edison Ong. The interdisciplinary work
presented herein is the fruit of their patience.
Importantly, I would like to thank the members of my dissertation committee: David Arnosti,
Norman Birge, Lisa Lapidus, and George Mias. Years ago I leveraged their agreeable natures
against them and convinced them to join my committee. This dissertation and my oral defense are
the last vestiges of their momentary lapses of reason, and the only remaining impediments to their
freedom.
Lastly, I would like to thank the National Science Foundation, National Institutes of Health, and
the Department of Defense for sponsoring this quixotic misadventure called a doctoral dissertation.

v

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

x

INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1

CHAPTER 1

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

3
4
4
7
10
12
16
19
20

CHAPTER 3 GENE REGULATORY NETWORKS
3.1 Basic biology . . . . . . . . . . . . . . . .
3.2 Gene regulation . . . . . . . . . . . . . . .
3.3 Cancer . . . . . . . . . . . . . . . . . . . .
3.4 GRN reconstruction . . . . . . . . . . . . .

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

26
26
28
30
34

CHAPTER 4 EVOLUTION OF GENE REGULATORY NETWORKS
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Single-gene evolutionary measures . . . . . . . . . . . . . . .
4.3 Evolution and GRN topology . . . . . . . . . . . . . . . . . .
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

44
44
46
47
53

CHAPTER 5 THE HOPFIELD MODEL . . . . . . . .
5.1 Why Boolean? . . . . . . . . . . . . . . . . . .
5.2 The Hopfield model . . . . . . . . . . . . . . .
5.3 Cancer signalling and the Hopfield model . . .
5.3.1 Mathematical details . . . . . . . . . .
5.3.2 Control Strategies . . . . . . . . . . . .
5.3.2.1 Directed acyclic networks . .
5.3.2.2 Directed cycle-rich networks .
5.3.3 Cancer signaling . . . . . . . . . . . .
5.3.3.1 Lung Cell Network . . . . . .
5.3.3.2 B Cell Network . . . . . . . .
5.4 Conclusion . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

60
60
65
70
73
75
76
77
84
84
90
93

CHAPTER 2 COMPLEX NETWORKS
2.1 The Königsberg bridge problem
2.2 Basic terminology . . . . . . . .
2.3 The problem with pictures . . .
2.4 Basic network measures . . . . .
2.5 Advanced network measures . .
2.6 Set and interset efficiency . . . .
2.7 Communities . . . . . . . . . .
2.8 Nested communities . . . . . . .

CHAPTER 6

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

GENERALIZED HOPFIELD MODELS . . . . . . . . . . . . . . . . . . . 98
vi

6.1
6.2
6.3

Stability in Hopfield GRNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Programmable nonequilibrium Hopfield systems . . . . . . . . . . . . . . . . . . . 100
Cell cycle and the Hopfield model . . . . . . . . . . . . . . . . . . . . . . . . . . 104

CHAPTER 7

CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
APPENDIX A
EULER’S SOLUTION TO THE KöNIGSBERG BRIDGE PROBLEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
APPENDIX B
INTERSET EFFICIENCY NORMALIZATION . . . . . . . . . .
APPENDIX C
NESTED COMMUNITY PLOTS . . . . . . . . . . . . . . . . . .
APPENDIX D
MEASURING EXPRESSION . . . . . . . . . . . . . . . . . . .
APPENDIX E
PROOF OF THEOREM FROM SECTION 5.3.2.2 . . . . . . . . .
APPENDIX F
HOPFIELD MEAN FIELD SOLUTION . . . . . . . . . . . . . .
APPENDIX G
CORRELATED ATTRACTORS . . . . . . . . . . . . . . . . . .

. 110
.
.
.
.
.
.
.

111
112
114
120
123
126
133

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

vii

LIST OF TABLES

Table 3.1

Pathway enrichment of 21 common AML mutations plus their first neighbors
in AML 2.1. Nearly all pathways are related to the hallmarks of cancer from
Fig. 3.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Table 4.1

Correlations (Pearson’s R and Spearman’s ρ) between evolutionary measures
and network measures in AML 2.3. The most significant correlation is between PageRank and ER for individual genes, and between PageRank and the
mean age of DAVID groups (see Fig. 4.4). . . . . . . . . . . . . . . . . . . . . . 51

Table 4.2

Evolutionary properties of communities and DAVID groups in AML 2.3. Gene
evolutionary rates (ERs) take real values from 0 (most conserved) to approximately 6.9 (most variable), and ages take integer values from 0 (oldest) to 12
(youngest). The table is organized as follows. "Comm. Index" is the index
of the ten largest communities. "Num. genes" is the number of genes in the
community. "Comm. ER" indicates whether the community is significantly
hotter (i.e. has a higher ER) or colder (i.e. has a lower ER) than the mean of
300 equally-sized sets of genes randomly selected from the network, with a
significance threshold of p = 10−3 . "Diff. in mean" is the difference between
the mean ER of the community and the mean ER of the 300 randomly selected
sets. "p-value" is the significance of the difference. "Comm. age", "Diff. in
mean", and "p-value" are the same as previously stated, but for age rather than
ER. "DAVID group name" is the name of the DAVID group that DAVID identified as enriched in each community. “Group type” states whether the DAVID
group is a protein type (P), location of final gene product (L), biological process (B), or cellular component (C). "Num. genes" is the number of genes in
the DAVID group. "DAVID Benjamini" is the significance of the enrichment
of the DAVID group, as reported by DAVID. The remaining DAVID group
columns are computed in the same manner as the community columns. . . . . . 58

Table 4.3

Evolutionary properties of communities and DAVID groups in HumanNet.
See Table 4.2 for explanation of column headers. . . . . . . . . . . . . . . . . . 59

Table 5.1

General properties of the full networks. The network used for the analysis of
lung cancer is a generic one obtained by combining the data sets in [100] and
[161]. The B cell network is a curated version of the B cell interactome obtained from [89] using a network reconstruction method and gene expression
data from B cells. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

viii

Table 5.2

Properties of the largest weakly connected differential subnetworks for all cell
types. I = IMR-90 (normal), A = A549 (cancer), H = NCI-H358 (cancer), N =
Naïve (normal), M = Memory (normal), D = DLBCL (cancer), F = Follicular
lymphoma (cancer), L = EBV-immortalized lymphoblastoma (cancer). . . . . . . 88

Table 5.3

Best single genes and their impacts for the p=1 and p=2 models. The unconstrained (UNC) and constrained (CON) case are shown. The constrained case
refer to target that are kinases and are expressed in the cancer case. I = IMR-90
(normal), A = A549 (cancer), H = NCI-H358 (cancer), N = Naïve (normal), M
= Memory (normal), D = DLBCL (cancer), F = Follicular lymphoma (cancer),
L = EBV-immortalized lymphoblastoma (cancer). . . . . . . . . . . . . . . . . . 92

Table 5.4

Reference table of all symbols used in this chapter. . . . . . . . . . . . . . . . . 97

ix

LIST OF FIGURES

Figure 2.1

Detailed map of Königsberg. For clarity, the bridges have been circled in red. .

5

Figure 2.2

Bridges of Königsberg, overlayed with Euler’s graphical representation. . . . .

5

Figure 2.3

[A] A weighted network with undirected edges, and [B] a weighted network
with two directed edges and one undirected edge. Node labels are shown in
black, and edge weights are shown in red. . . . . . . . . . . . . . . . . . . . .

6

Some of the countless available map projections, including both common and
exotic projections. Image adapted from here. . . . . . . . . . . . . . . . . . . .

8

A random layout of a randomly generated scale-free network. Nodes are
colored from white (low degree) to dark blue (high degree). . . . . . . . . . . .

9

Figure 2.6

A circular layout of the same network shown in Fig. 2.5. . . . . . . . . . . . . .

9

Figure 2.7

A force-directed layout of the same network shown in Fig. 2.5. . . . . . . . . .

9

Figure 2.8

A spectral layout of the same network shown in Fig. 2.5. . . . . . . . . . . . .

9

Figure 2.9

Indegree distribution for an Erd˝os-Rényi network with kin ≈ 3.69. . . . . . . 12

Figure 2.4

Figure 2.5

Figure 2.10 Indegree distribution for Enron’s email network with kin ≈ 3.69 and γ ≈
1.6. Note that both axes are logarithmic. . . . . . . . . . . . . . . . . . . . . . 12
Figure 2.11 (A) A network with one component, and (B) a network with two components.
Both networks have identical degree distributions since all nodes have degree 2.

13

Figure 2.12 An example network. The red node has high betweenness centrality, since
all paths between the left and right sides must pass through the red node. . . . . 13
Figure 2.13 An undirected, unweighted, disconnected network with two components. . . . . 15
Figure 2.14 The set efficiency of the set of red nodes is relatively low. One geodesic for
one (i, j) pair is shown with thick edges. . . . . . . . . . . . . . . . . . . . . . 18
Figure 2.15 The set efficiency of the set of red nodes is relatively high. One geodesic for
one (i, j) pair is shown with thick edges. . . . . . . . . . . . . . . . . . . . . . 18
Figure 2.16 The interset efficiency from the set of red nodes to the set of blue nodes is
relatively low. One geodesic for one (i, j) pair is shown with thick edges. . . . . 18

x

Figure 2.17 The interset efficiency from the set of red nodes to the set of blue nodes
is relatively high. The half-blue-half-red node belongs to both sets. One
geodesic for one (i, j) pair is shown with thick edges. . . . . . . . . . . . . . . 18
Figure 2.18 An example phylogenetic tree of human ancestry derived from mitochondrial
DNA. Original image from [80]. . . . . . . . . . . . . . . . . . . . . . . . . . 21
Figure 2.19 Inferred dendrogram of 103 Indo-European languages. Line thicknesses correspond to languages’ rates of “diffusion” across land masses. 95% confidence intervals are marked with light blue lines and are included at all forking points. The gray distribution on the left corresponds to the computed
probability density of the location of the root node (the most recent common
ancestor language of all 103 languages shown). Original image from [19]. . . . 22
Figure 2.20 An example network with two levels of community structure. The largest
oval encloses all nodes, C; the two midsized ovals enclose nodes in level-1
communities, {Ci }; and the six small ovals enclose nodes in level-2 communities, Ci j . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Figure 3.1

Schematic of some known apoptosis components and signaling pathways in
humans. Figure adapted from the interactive version available from KEGG [72]. 30

Figure 3.2

Ten of the “hallmarks of cancer.” Acquiring each of these features drives
cells into more cancer-like phenotypes. Figure adapted from [61]. . . . . . . . . 32

Figure 3.3

Schematic of clonal evolution within a single patient. As the founder clone
(marked with N’s) becomes cancerous, it mutates and divides at a higher
rate than normal cells. Each successive generation follows its own unique
evolutionary path, driven by random mutations and natural selection. These
mutations accumulate and, just as with the phylogenetic tree in Fig. 2.18,
their genomes begin to diverge. Cells from the same or different lines may
compete (because space and resources are limited) and/or cooperate (by releasing growth factors to further up-regulate growth and proliferation in other
cells), increasing each successive generation’s fitness compared with normal
cells. Figure taken from [79]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

Figure 3.4

Binned Pearson correlation of inferred TF/gene interactions in one of five
data sets using Ong’s method, with maximum correlation (0.98) on the left
and minimum correlation (0.15) on the right. The vertical red line identifies
the computed Pearson correlation threshold, meaning all edges with Pearson
correlation above the threshold are included in the final network. Figure
taken from [116]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

xi

Figure 3.5

Force-directed layout of the largest connected component of AML 2.1. Because of the large number of nodes in the network, only the edges (gray lines)
are rendered. Some of the most highly enriched GO pathways within clusters
identified by MCODE are colored. Figure adapted from [116]. . . . . . . . . . 40

Figure 3.6

Force-directed layout of the largest connected component of AML 2.1. Mutations and neighbors of mutations are identified. Figure adapted from [116]. . . 41

Figure 3.7

Set efficiency of 21 common AML mutations in AML 2.1, and the distribution of the set efficiency of 10 million randomly selected sets of 21 genes
from the 5,667 genes in AML 2.1. The set efficiency of the mutated genes is
far greater than expected at random. Figure adapted from [116]. . . . . . . . . . 42

Figure 4.1

Phylogenetic tree used to compute gene ages. Image taken from [25]. . . . . . . 47

Figure 4.2

Difference in evolutionary rates (ER) between pairs of genes, (ER j − ERi ),
for pairs connected by an edge j → i in AML 2.3 (green) and for pairs connected by an edge in one degree-preserving randomization of AML 2.3 (purple). Note that the distribution is asymmetric because AML 2.3 is a directed
network. The width of the true distribution is significantly smaller than the
mean of the standard deviation of 20,000 degree-preserving randomizations,
resulting in an approximate Z-score of −96.8. . . . . . . . . . . . . . . . . . . 49

Figure 4.3

(A) Distribution of evolutionary rates (ERs), measured in units of the number of nonsynonymous substitutions per amino acid site per billion years,
for all genes (purple) and for genes in the translational elongation DAVID
group (green). This DAVID group has a very low ER compared to the background distribution. (B) Distribution of ages for all genes (purple) and genes
in the transmembrane DAVID group (green), where age=0 is the oldest and
age=12 is the youngest. Transmembrane genes are much younger than average. (C) Summary of mean ER and mean age for DAVID groups in Table 4.2. The relative ERs on the x-axis are computed from ERrelative =
ERDAVID group mean − ERnetwork mean , and likewise for relative age on the
y-axis. The DAVID groups from (A) and (B) have bold labels in (C). Each
marker type corresponds to one of communities 0 through 9. As expected,
old DAVID groups tend to have a low average ER (i.e. are “cold”), and
young DAVID groups tend to evolve frequently (i.e. are “hot”). Unabbreviated DAVID group names are listed in Table 4.2. . . . . . . . . . . . . . . . . 50

Figure 4.4

Mean PageRank versus mean age of each DAVID group from Table 4.2
(age=0 is the oldest and age=12 is the youngest). Old DAVID groups tend to
have high PageRank. Unabbreviated DAVID group names are listed in Table 4.2. 52

xii

Figure 4.5

Set efficiency and evolutionary rate for AML 2.3. The cumulative set efficiency (SE) of all genes below a given evolutionary rate (ER) rank (lowest
to highest ER, i.e. “coldest” to “hottest”). The SE of the 500 coldest genes
is significantly higher than the control, and including hotter genes monotonically decreases the SE. This indicates that the coldest genes are located near
each other, while the hottest genes are more dispersed. . . . . . . . . . . . . . . 53

Figure 4.6

Set efficiency and age for AML 2.3. The cumulative set efficiency (SE) of all
genes below a given age rank (oldest to youngest). The SE of the 500 oldest
genes is significantly higher than the control, and including younger genes
monotonically decreases the SE (after a transient period due to the discrete
nature of the age data). This indicates that the oldest genes are located near
each other, while the youngest genes are more dispersed. . . . . . . . . . . . . 54

Figure 4.7

Interset efficiency and age for AML 2.3. Interset efficiency from DAVID
group in column j to DAVID group in row i. The list of DAVID groups was
sorted by average age from oldest (transcriptional elongation) to youngest
(hemoglobin complex). Old DAVID groups exchange information efficiently,
as indicated by the high interset efficiency values in the lower-left corner.
Younger DAVID groups, particularly the blood cell-specific DAVID groups
of lymphocyte activation and hemoglobin complex, are remote from most
other DAVID groups. Note that the above matrix is asymmetric because the
network is directed, and that the colors are log-scaled. . . . . . . . . . . . . . . 55

Figure 4.8

Interset efficiency and age for HumanNet. See Fig. 4.7 for explanation. . . . . . 56

Figure 5.1

[A] Schematic of a protein energy landscape. The high dimensional configuration space has been collapsed to the horizontal axis, with isolated configurations drawn on the left and highly interacting configurations on the right.
Local energetic minima are stable configurations, and transitions between
configurations can occur due to thermal fluctuations, protein-protein interactions, and changes in the microenvironment. [B] Diagram of some of the
possible transitions in the configuration of a protein. Image taken from [13]. . . 61

Figure 5.2

Schematic of Eq. 5.1. Genes produce mRNAs, mRNAs produce proteins,
proteins regulate mRNA production rates, and mRNAs and proteins decay,
all at varying rates. Image taken from [26]. . . . . . . . . . . . . . . . . . . . . 62

Figure 5.3

Family of Hill functions from Eq. 5.3 for a = 1 and k = 5. . . . . . . . . . . . . 64

Figure 5.4

Circuit and logic table for Eq. 5.5. . . . . . . . . . . . . . . . . . . . . . . . . 65

Figure 5.5

Trajectory of a random Boolean NK network with N = 100, K = 2, and a
random initial state. Black and white pixels represent 0’s and 1’s. The system
quickly settles into a cyclic attractor with period 4. . . . . . . . . . . . . . . . . 66

xiii

Figure 5.6

Family of functions from Eq. 5.9. . . . . . . . . . . . . . . . . . . . . . . . . . 68

Figure 5.7

Network segregation for two attractor states (p = 2). Every edge that connects a similarity node to a differential node or a differential node to a similarity node transmits no signal. This means that the signaling in the right
network shown above is identical to that of the left network. Because the
goal is to leave normal cells unaltered while damaging cancer cells as much
as possible, all similarity nodes can be safely ignored, and searches and simulations only need to be done on the differential subnetwork. . . . . . . . . . . 75

Figure 5.8

A directed acyclic network. Controlling all three source nodes (nodes 1,
2 and 3) guarantees full control of the network, but are ineffective when
targeted individually. The best single node to control in this network is node
6 because it directly controls all downstream nodes. . . . . . . . . . . . . . . . 78

Figure 5.9

A network in which nodes 4, 5, 6 and 7 compose a single cycle cluster. The
high connectivity of node 4 prevents any changes made to the spin of nodes
1-3 from propagating downstream. The only way to indirectly control nodes
8-10 is to target nodes inside of the cycle cluster. Targeting node 4, 6 or 7 will
cause the entire cycle cluster to flip away from its initial state, guaranteeing
control of nodes 4-10 (see Fig. 5.10). . . . . . . . . . . . . . . . . . . . . . . . 79

Figure 5.10 Cancer magnetization from targeting various nodes in the network shown in
Fig. 5.9, averaged over 10,000 runs. The averaging removes fluctuations due
to the random flipping of nodes with hi = 0. Targeting node 7 results in the
quickest stabilization, but targeting any one of nodes 4, 6 or 7 results in the
same final magnetization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Figure 5.11 A network with a cycle cluster C, composed of nodes 2-10, that cannot be
controlled at T = 0 by controlling any single node. Here, the set of externally influenced nodes is R(C, G) = {2, 9}, the set of intruder connections is
W (C, G) = {(1, 2), (1, 9)}, the reduced set of critical nodes is Zred (C, G) =
{9, 10}, the minimum indegree is µ = 1 and the number of nodes in the cycle
cluster is N = 9. By Eq. 5.28, this gives the bounds of the critical number of
nodes to be 1 ≤ ncrit ≤ 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Figure 5.12 Magnetization for network from Fig. 5.11, averaged over 10,000 runs. There
is no single node to target that will control the cycle cluster, but fixing nodes
9 and 10 results in full control of the cycle cluster, leaving only node 1 in the
cancer state. This means Z(C, G) = {9, 10} and ncrit = 2. . . . . . . . . . . . . 82

xiv

Figure 5.13 Final cancer magnetizations for an unconstrained search on the lung cell network using p = 1. The efficiency-ranked strategy outperforms the relatively
expensive Monte Carlo strategy. The best+1 strategy works best, although
it requires the largest computational time. Note that the mixed efficiencyranked curve is not shown because it is identical to the pure efficiency-ranked
curve. Key for magnetization curves: MC = Monte Carlo, B+1 = best+1,
ERP = pure efficiency-ranked. . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Figure 5.14 Final cancer magnetizations for an unconstrained search on the lung cell network using p = 2. As in the p = 1 case, the efficiency-ranked strategy outperforms the expensive Monte Carlo search. The plateaus in the efficiencyranked strategy when fixing 9-10, 12-15, 20-21, etc. nodes are a result of
targeting bottlenecks that are already indirectly controlled. . . . . . . . . . . . 88
Figure 5.15 Largest weakly connected differential subnetwork for IMR-90/A549 and p =
2. Out of the 506 pictured nodes, 450 are sinks and therefore have an impact
equal to one. The top five bottlenecks are labeled with their gene names and
colored orange. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Figure 5.16 Final cancer magnetizations for a constrained search on the lung cell network
using p = 2. This is the only case in which a limited exhaustive search is
possible. Interestingly, the exhaustive search locates the same nodes as the
best+1 strategy for fixing up to eight nodes. The efficiency-ranked strategy
performs poorly compared to the Monte Carlo strategy because the search
space is small and a large portion of the available space is sampled by the
Monte Carlo search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Figure 5.17 Final cancer magnetizations for an unconstrained search on the B cell network using p = 1. The Monte Carlo strategy is ineffective for fixing any
number of nodes. The efficiency-ranked and best+1 curves slowly separate
because synergistic effects accumulate faster for best+1. . . . . . . . . . . . . . 93
Figure 5.18 Final cancer magnetizations for an unconstrained search on the B cell network using p = 2. The rather sudden drop in the magnetization between
controlling 5 and 10 nodes in the efficiency-ranked strategies comes from
flipping a significant portion of a cycle cluster. This is the only network examined in which the mixed efficiency-ranked strategy produces results different from the pure efficiency-ranked strategy. . . . . . . . . . . . . . . . . . . 94
Figure 6.1

Ascending and Descending by MC Escher. . . . . . . . . . . . . . . . . . . . . 98

Figure 6.2

Relativity by MC Escher. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

Figure 6.3

Schematic of the magnetization phase transition for Hopfield systems on
Erd˝os-Rényi networks. Note that the symmetric (mirror) solution is also
shown below the T axis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

xv

Figure 6.4

Mean field (dashed) and explicit simulations (circles with error bars) of the
magnetization of communities in SBMNs over a range of temperatures. There
is a phase change in the mean field solution at Tc ≈ 20, and the explicit simulation is in good agreement. . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

Figure 6.5

Mean field SBMN (dashed) and explicit AML 2.3 simulations (circles with
error bars) of the magnetization of communities over a range of temperatures. Unlike the explicit simulations from Fig. 6.4, the real network shows
no apparent phase change, instead maintaining finite magnetization over all
temperatures examined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

Figure 6.6

An example deterministic random map that can be stored in a Hopfield coupling matrix using an appropriate mapping matrix M. Each node µ = 0, 1, . . . , 9
represents a stored pattern ξ µ and each edge represents a transition. . . . . . . 104

Figure 6.7

Simulated trajectory of a single HeLa cell. The bottom plot shows the overlap between the system’s configuration and each of the p = 8 patterns as a
function of time. The top plot shows which pattern has maximum overlap at
any point in time, and represents the cell’s phenotype. . . . . . . . . . . . . . . 106

Figure 6.8

Experimental bulk mean expression for the gene RNFT1 as a function of
time, taken from [44]. Although all cells were initially synchronized to the
G1 phase, progression through cell cycle takes slightly different amounts of
time for each cell and their gene expression profiles becomes less and less
correlated, leading to a decaying sinusoid. . . . . . . . . . . . . . . . . . . . . 107

Figure 6.9

Mean of simulated Boolean expression for the gene RNFT1 across 50 isolated cells as a function of time. All cells began in identical states, but because transitions occur stochastically, the 50 cells decohere over time, leading a decaying sinusoid similar to Fig. 6.8. . . . . . . . . . . . . . . . . . . . . 107

Figure 6.10 Cell cycle coupling matrix, separated into symmetric (point attractor) and
antisymmetric (transition) parts on the left and right, respectively, and sorted
into nested communities applied to the symmetric part. Examining the nested
community structres may aid in the search for sets of nodes which, when
targeted, could control cell cycle. . . . . . . . . . . . . . . . . . . . . . . . . . 107
Figure B.1 Two overlapping sets, I and J. . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Figure C.1 Spy plot of Google’s internal web page network [119]. Each node (row and
column indicies) is a web page owned by Google, and each edge (a black dot
located at (i, j)) is a hyperlink from page j to page i. Nodes were assigned a
random index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

xvi

Figure C.2 Same network as Fig. C.1, but the nodes indices have been sorted into level-1
communities (boxed in red). This is the standard result from normal community detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Figure C.3 Same network as Fig. C.1, but the nodes indices have been sorted into level-2
communities. Note that the new level-2 communities are inset in the level-1
communities from Fig. C.2 (boxes within boxes). . . . . . . . . . . . . . . . . 115
Figure C.4 Same network as Fig. C.1, but the nodes indices have been sorted into level-3
communities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Figure C.5 Same network as shown in Fig. C.1, zoomed in to focus on the first 5,000 nodes. 116
Figure C.6 Same network as shown in Fig. C.2, zoomed in to focus on the first 5,000 nodes. 116
Figure C.7 Same network as shown in Fig. C.3, zoomed in to focus on the first 5,000 nodes. 116
Figure C.8 Same network as shown in Fig. C.4, zoomed in to focus on the first 5,000 nodes. 116
Figure C.9 Spy plot of an arXiv theoretical high energy physics citation network [90].
Each node (row and column indicies) is an author of at least one high energy
physics publication, and an edge (a black dot located at (i, j)) from author
j to author i means that one of j’s articles on arXiv cited at least one of i’s
articles. Nodes were assigned a random index. . . . . . . . . . . . . . . . . . . 117
Figure C.10 Same network as Fig. C.9, but the nodes indices have been sorted into level-1
communities. This is the standard result from normal community detection. . . 117
Figure C.11 Same network as Fig. C.9, but the nodes indices have been sorted into level-2
communities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Figure C.12 Same network as Fig. C.9, but the nodes indices have been sorted into level-3
communities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Figure C.13 Same network as shown in Fig. C.9, zoomed in to focus on the first 5,000 nodes. 118
Figure C.14 Same network as shown in Fig. C.10, zoomed in to focus on the first 5,000 nodes.118
Figure C.15 Same network as shown in Fig. C.11, zoomed in to focus on the first 5,000 nodes.118
Figure C.16 Same network as shown in Fig. C.12, zoomed in to focus on the first 5,000 nodes.118
Figure C.17 Spy plot of AML 2.3, a gene regulatory network, sorted into nested communities. Each node (row and column indicies) is gene, and an edge (a black dot
located at (i, j)) from gene j to gene i means that j regulates the expression
of i. The maximum depth of this network is 7. . . . . . . . . . . . . . . . . . . 119
Figure C.18 Same network as shown in Fig. C.17, zoomed in to focus on the first 5,000 nodes.119
xvii

Figure C.19 Spy plot of HumanNet, a gene regulatory network, sorted into nested communities. Each node (row and column indicies) is gene, and an edge (a black
dot located at (i, j)) from gene j to gene i means that j regulates the expression of i. The maximum depth of this network is 8. . . . . . . . . . . . . . . . . 119
Figure C.20 Same network as shown in Fig. C.19, zoomed in to focus on the first 5,000 nodes.119

xviii

CHAPTER 1
INTRODUCTION

I must warn the reader that this chapter should be read with care,
for I have not the skill to make myself clear to those who do not
wish to concentrate their attention.
—Jean-Jacques Rousseau, The Social Contract
This dissertation covers a broad range of topics arranged in a bottom-up manner. I begin
by discussing the fundamentals of network theory and complexity from a general, mathematical
point of view. I then focus on the particular case of the gene regulatory networks (GRNs) that
regulate the processes in cancer cells, examining both how these networks are inferred from a
pool of disease-specific experimental data and the resulting topological traits that emerge. These
topological traits are also shown to be strongly related to the evolutionary properties of individual
genes and clusters of genes. I then apply a dynamical Boolean model, the Hopfield model, to
simulate the effects of specific perturbations to cancer GRNs, suggesting potential therapeutic
targets for future experiments. Finally, I introduce a new generalized version of the Hopfield
model, and I apply a specific version to simulate perturbations to genes involved in cell cycle.
Because it is likely that most readers will be knowledgeable of only some of the topics discussed herein, this dissertation was written under the assumption that all readers are both highly
intelligent and oblivious to all of these things. Unlike Rousseau, I introduce the material in what I
believe to be an approachable, edifying, and hopefully entertaining manner. Because of the diversity of topics, it was deemed more natural to include introductions at the opening of each chapter
rather than kludging together a single long and meandering introduction. This has the added benefit
of creating compartmentalized chapters, allowing readers who are familiar with any introductory
material to easily skip to sections containing unfamiliar material. Some of the gory details have
been pushed to the appendices, including a mathematically dense proof and two lengthy deriva-

1

tions. Any reader interested in further details can find them in the original publications referenced
in the text. The remaining chapters are organized as follows.
• Chapter 2, Complex Networks: an introduction to network theory, including many terms and
concepts used in later chapters
• Chapter 3, Gene Regulatory Networks: an overview of proteins, genes, and gene regulatory
networks, as well as network reconstruction algorithms used in modern bioinformatics
• Chapter 4, Evolution of Gene Regulatory Networks: a description of the relationship between
the topology of gene regulatory networks and the evolutionary histories of their constituent
genes
• Chapter 5, The Hopfield Model: a synopsis of the basic properties of one kind of neural
network and its application to simulate the dynamics of cancerous gene regulatory networks
• Chapter 6, Generalized Hopfield Models: an introduction to three generalizations of the Hopfield model (modular networks, cyclic attractors, and complex attractors) and applications to
cell cycle
• Chapter 7, Conclusions: a summary of the material covered in this dissertation and a list of
work to be done in the future

2

CHAPTER 2
COMPLEX NETWORKS

A culture’s icons are a window onto its soul. Few would disagree
that, in the culture of molecular biology that dominated much of
the life sciences for the last third of the 20th century, the dominant
icon was the double helix. In the present, post-modern, ‘systems
biology’ era, however, it is, arguably, the hairball.
—Arthur Lander, The edges of understanding
The following chapter reviews the basics of complex network theory as well as my own contributions, and is organized as follows.
• Section 2.1, The Königsberg bridge problem: a synopsis of the founding problem and solution of network theory
• Section 2.2, Basic terminology: a review of the basic terminology and notation used in
modern network theory
• Section 2.3, The problem with pictures: explanation of network layouts and motivation for
development of non-Euclidean network measures
• Section 2.4, Basic network measures: a review of basic network measures such as degree
• Section 2.5, Advanced network measures: a review of advanced network measures such as
betweenness centrality and global efficiency
• Section 2.6, Set and interset efficiency: definitions of two related measures I developed, the
set efficiency and interset efficiency
• Section 2.7, Communities: definition of network communities and modularity
• Section 2.8, Nested communities: an unpublished generalization of communities which I
developed, nested communities
Sections 2.1-2.5 and 2.7 may be safely ignored by readers with basic knowledge of the listed topics.

3

For detailed definitions and explanations of concepts in canonical network theory, see Networks:
An Introduction by Mark Newman [110].

2.1

The Königsberg bridge problem

Like much of mathematics, the modern field of graph theory was borne from a simple puzzle.
Fig. 2.1 shows a detailed 18th century map from the Prussian city of Königsberg, now named
Kaliningrad. The goal was to find a route through the city that crossed each one of its seven bridges
exactly once. In 1736, Leonhard Euler issued the first proof in graph theory, which demonstrated
that no such route exists.
Euler recognized that most of the information shown in such maps was extraneous. Neither
the details of the city (buildings, streets, shape of the shoreline) nor the details of the bridges
(location, length, construction materials) were relevant to solving the problem; all that mattered
was the manner in which land masses were connected to one another. By reducing each landmass
to a point (a node or a vertex) and each bridge to a line (an edge or a link) connecting points, he
was left with the simplest possible model that could still address the problem. The resulting graph
is shown in Fig. 2.2 with the nodes labeled A, B, C, and D. He concluded that if such a path (a so
called Eulerian path) were to exist, there must be exactly 0 or exactly 2 nodes with an odd number
of edges (or degree). Since all four nodes have odd degree, no Eulerian path exists. See Appendix
A for an informal proof.

2.2

Basic terminology

Networks, or graphs,1 provide a basic mathematical architecture for organizing systems with interacting components, and are especially useful for representing large, heterogeneous systems. Nodes
typically represent entities in a system, e.g. movie actors, and edges typically represent interactions
or relationships between entities, e.g. two actors performing in the same movie.
1 Some literature draws a distiction between the terms network

interchangeably here.
4

and graph, but they will be used

Figure 2.1 Detailed map of Königsberg. For
clarity, the bridges have been circled in red.

Figure 2.2 Bridges of Königsberg, overlayed
with Euler’s graphical representation.

There are many ways to represent networks mathematically. Euler used a diagram to solve the
Königsberg bridge problem, but this was practical only because the network was relatively small
with simple topology. Analyzing large complex networks requires the assistance of a computer,
which requires a symbolic rather than geometric representation. The most compact and convenient
notation for the purposes of this dissertation is using an adjacency matrix A, where


 1 if j → i
Ai j =

 0 otherwise

(2.1)

and j → i denotes a directed edge from node j to node i.2 Note that for undirected networks,
Ai j = A ji . In addition, many real-world networks have weighted edges, signifying the strength of
the connection. Weighted adjacency matrices are sometimes denoted by W , where Wi j ≥ 0 is the
weight of the connection from j to i. For example, the weighted, undirected network shown in
Fig. 2.3A has




5 
 0 2


,
W =
2
0
1.5




5 1.5 0
2 Some

(2.2)

literature and software packages use the convention i → j, but all equations can be
translated between the two conventions using appropriate matrix transposes.
5

Figure 2.3 [A] A weighted network with undirected edges, and [B] a weighted network with two
directed edges and one undirected edge. Node labels are shown in black, and edge weights are
shown in red.

and the weighted, directed network shown in Fig. 2.3B has


 0 0 5 


.
W =
2
0
0




5 1.5 0

(2.3)

Note that in all network diagrams in this dissertation, edges without arrowheads are assumed to
be undirected edges, and networks without edge weight labels are assumed to be unweighted networks.
Although a network can be simplistically described as a set of pairwise relationships, information may need to flow between nodes lacking a direct connection, e.g. airline passengers traveling
long distances or between small cities may need to make connecting stops, and a request from
a personal computer for information from a remote server may need to connect through several
intermediaries. This sequence of transfers through a network is called a path. Traversing a path
usually comes at a cost such as time or money, and intelligent companies and consumers attempt
to find paths of minimum cost. In both Fig. 2.3A and B, the shortest path, or geodesic, from node
1 to node 3 is the sequence 1 → 2 → 3, and the total distance is d31 = W21 +W32 = 3.5. However,
the geodesic from 3 to 1 in Fig. 2.3A is 3 → 2 → 1 with a distance of d13 = W23 + W12 = 3.5,
whereas in Fig. 2.3B the geodesic is 3 → 1 with a distance of d13 = W13 = 5.

6

2.3

The problem with pictures

It can be useful to examine networks diagrammatically, especially for Euclidean networks such
as the Königsberg bridge network or a network of airports linked by connecting flights, since the
network can be drawn to scale such that pairwise distances between nodes are preserved. As
Euler realized, this simplifies analysis by removing unnecessary details contained in a normal
map. However, many complex networks such as movie costar networks do not represent spatial
relationships and therefore do not have a single “best” spatial layout.
Projections provide a way to reduce high dimensional data to a low dimensional form, and are
typically used to help humans visualize the same complex data from multiple simplified perspectives. Perhaps the most familiar examples are projections of the Earth’s surface onto a 2D plane,
some of which are shown in Fig. 2.4. Maps like these allow the Earth to be comfortably stored on
the pages of a book or on the walls of the Pentagon, but while a good projection preserves some
properties of the original data (like the surface area of countries or the distance between parallels),
all projections necessarily introduce distortions and so should be used with caution [1].3 Only a
globe can capture the true topology of Earth’s surface, but certain properties can be gleaned from
consideration of multiple kinds of projections.
Similarly, network projections, also known as network layouts, aim to arrange nodes in two
or three dimensions such that the spatial distances between nodes is related to a property of the
network, often but not always the network distance between nodes. For example, a force-directed
layout treats all nodes as positively charged masses confined to a plane and treats all edges as
springs, and then attempts to find a minimum energy configuration as defined by Coulomb’s law
and Hooke’s law. The Coulomb force pushes all particles apart, whereas the spring forces keep
densely connected groups of nodes close. Because there are many energetic minima in such systems, the layout determined by the algorithm is not unique. Other examples include spectral,
orthogonal, tree, and dominance layouts; and the choice of layout depends on the specific network
3 thetruesize.com is an interactive tool that allows the user to move silhouettes of countries on a

Mercator map, demonstrating how country shapes and sizes morph as they are translated northward
and southward.
7

Figure 2.4 Some of the countless available map projections, including both common and exotic
projections. Image adapted from here.

and the specific application. Some example layouts are shown in Figs. 2.5–2.8.
By virtue of most networks being both high dimensional and non-Euclidean, however, network
layouts tend to suffer greater distortion than maps of the Earth, which can produce misleading results. Even the normal notion of metrics breaks down. In Euclidean space, distances are symmetric,
i.e. d(x, y) = d(y, x); but this is not necessarily true for distances between nodes in directed networks. It is important to remember that network layouts provide primarily qualitative information.
In order to understand the structure of complex networks quantiatively, new sorts of non-Euclidean
measures must be defined.

8

Figure 2.5 A random layout of a randomly
generated scale-free network. Nodes are
colored from white (low degree) to dark
blue (high degree).

Figure 2.6 A circular layout of the same
network shown in Fig. 2.5.

Figure 2.7 A force-directed layout of the
same network shown in Fig. 2.5.

Figure 2.8 A spectral layout of the same
network shown in Fig. 2.5.

9

2.4

Basic network measures

A natural question to ask is, how does a particular node compare with another node, or how does
one node compare with all other nodes? What makes a given node unique, if anything?
One of the most basic measures is the degree of a node i, which is simply the number of edges
connected to i,
ki = ∑ Ai j

(2.4)

j

Because nodes in directed networks have both incoming and outgoing edges, the measure splits
into indegree
kiin = ∑ Ai j

(2.5)

j

and outdegree
kiout = ∑ A ji = ∑ AT
j

j

ij

(2.6)

for the number of upstream and downstream neighbors, respectively. Note that nodes with zero
indegree are termed sources, and nodes with zero outdegree are termed sinks. For weighted networks, the role of the degree is played by the strength, which is the sum of the weights of the edges
connected to a node,
si = ∑ Wi j

(2.7)

j

For weighted, directed networks, the instrength and outstrength are given by
sin
i = ∑ Wi j

(2.8)

j

and
T
sout
i = ∑ W ji = ∑ W
j

j

ij

(2.9)

respectively. The degree or strength of one node relative to another gives some indiciation of the
nodes’ relative importance. For example, an airport-to-airport network could be constructed by
Wi j = number of people in one year who departed from airport j and arrived at airport i

10

This construction means the total number of people who left airport i in one year is given by the
outstrength sout
i . By most reasonable definitions of “importance,” the Harsfield-Jackson Atlanta International Airport, which served over 49 million departing passengers in 2015, is more important
than the Houghton County Memorial Airport, which served less than 26,000 [45].
Similarly, one may be interested not in comparing the properties of two nodes within the same
network, but one network with another. A quantitative comparison between the topologies of two
complex networks, particularly large networks, can be quite difficult. Even discerning whether two
networks are isomorphic (i.e. whether two networks are identical under a relabeling of one of the
network’s nodes), the so-called graph isomorphism problem, is an active area of research.
Although each network has its own unique topology, distilling its complex microscopic details
into a handful of coarse-grained measures that capture its basic properties can be enlightening. The
simplest way to do this is to examine the network’s degree distribution. A convenient model network is the Erd˝os-Rényi network, which is constructed by including an edge j → i with probability
p for each node pair (i, j). This is a simple type of random network model that is both interesting
in its own right and serves as a control for studying the properties of real networks. Because the
presence or absence of each edge is an independent trial with a fixed probability of success p,
Erd˝os-Rényi networks have binomial degree distributions,
Pk =

n k
p (1 − p)n−k ,
k

(2.10)

as shown in Fig. 2.9. This means that hubs, nodes with much higher degree than the mean degree,
are rare because Pk decreases exponentially for k

k = np.

However, real networks rarely have binomial degree distributions. One of the most prevalent
distributions is the power law degree distribution which has Pk ∼ k−γ for some γ > 1, and typically
2

γ

3 for real networks. Although power law distributions also fall to zero as k → ∞, they do so

much more slowly than binomial distributions, resulting in many more hubs than in Erd˝os-Rényi
networks. Fig. 2.10 shows the degree distribution for the Enron email network in which an edge
j → i means that employee j sent at least one email to employee i. Plotting the histogram in log-log
scale reveals that the Enron email network has a power law degree distribution with γ ≈ 1.6.
11

Figure 2.9 Indegree distribution for an
Erd˝os-Rényi network with kin ≈ 3.69.

Figure 2.10 Indegree distribution for Enron’s
email network with kin ≈ 3.69 and γ ≈ 1.6.
Note that both axes are logarithmic.

There are many other local network measures besides degree. The assortativity coefficient [111]
is the Pearson correlation between the degree of neighboring nodes across the network, the value
of which determines whether the network is assortative (hubs tend to connect to other hubs), disassortative (hubs tend to connect to non-hubs), or mixed (no significant relationship between the
degree of neighboring nodes). The clustering coefficient [157] counts the number of “triangles”
(fully connected node triplets) in the network out of the total number of possible triangles.

2.5

Advanced network measures

All of the previously discussed measures are entirely local. They provide important information
about microscopic properties, but typically cannot identify high level structural properties. For
example, the degree distribution, assortativity coefficient, and clustering coefficient all fail to differentiate between the networks in Fig. 2.11A and 2.11B, even though there is an obvious, critical
difference: while there is a path from every node to every other node in the network in 2.11A, the
network in Fig. 2.11B is composed of two components.
Many intermediate- and long-range measures have been developed to understand important
structures in networks. One common class of measures, centrality measures, assigns a score to
each node in the network and identifies nodes that are in some sense “important” or in the “middle”

12

Figure 2.11 (A) A network with one component, and (B) a network with two components. Both
networks have identical degree distributions since all nodes have degree 2.

Figure 2.12 An example network. The red node has high betweenness centrality, since all paths
between the left and right sides must pass through the red node.

of the network. The betweenness centrality [54], for example, counts the fraction of geodesics that
pass through a given node. It is given by
bi =

1
(N − 1)(N − 2)

∑

j=1..N
h=1..N
i= j,i=h, j=h

ρh j (i)
ρh j

(2.11)

where ρh j is the number of geodesics from node j to node h, and ρh j (i) is the number of geodesics
from j to h that pass through node i. The motivation behind the definition of betweenness centrality
is that removal of a node with high betweenness centrality disrupts important avenues of communication between remote nodes. The red node in Fig. 2.12, for example, has high betweenness
centrality because all paths between the left and right sides must pass through it.

13

PageRank [118] is one of the most famous and widely used centrality measures. It was created
by Larry Page in 1999 in an effort to “bring order to the web,” and served as the foundation of
Google’s search engine. Like other search engines, Google identifies a large set of pages containing text that matches all or part of the user’s query. However, Google’s power comes from
integrating text matching information with network topology to intelligently sort the search results. The World Wide Web is composed of pages (nodes) and hyperlinks that point from one page
to another (edges), and so can be represented as a directed network. The basic assumption behind
PageRank is that a page’s reputation is a function of its upstream neighbors’ reputations, a sort
of “reputation by consensus.” In other words, if many trustworthy sources of information point
to a page i, then i is likely trustworthy as well. This effect should be tempered by the outdegree
of those pages, however. Wikipedia’s domain hosts a large number of links to other domains, so
while Wikipedia may have high PageRank, the algorithm should be designed such that the score
it passes downstream is inversely proportional to the number of pages to which it links. Formally,
node i’s PageRank xi is obtained by self-consistently solving
xi = α

xj
Ai j out + β
kj
j=1..N

∑

(2.12)

for x given constants α and β . Under this definition, the PageRank algorithm assigns a score to
each node based on the scores of its upstream neighbors divided equally amongst all outgoing
links, plus a small free score so that source nodes contribute a nonzero score to their targets.
Since the primary goal of centrality measures is to rank nodes from most to least central, overall
multiplicative factors are irrelevant, so dividing both sides of Eq. 2.12 by β and renaming xi /β → xi
gives a form of PageRank with only one free parameter α,
xi = 1 + α

xj
Ai j out
kj
j=1..N

∑

(2.13)

Conventionally, α is set equal to 0.85, and was likely empirically chosen by Google simply because
it produces the best results [110]. There is no guarantee that α = 0.85 is the best choice for all
applications, however. Google no doubt has improved the performance of its search engine since
the company was founded, but PageRank remains an integral part of the algorithm. There are
14

Figure 2.13 An undirected, unweighted, disconnected network with two components.

many variations of PageRank [149], but a simple weighted version of Eq. 2.13 can be obtained by
replacing Ai j with Wi j and the outdegree kiout with the outstrength sout
i ,
xi = 1 + α

xj
Wi j out
sj
j=1..N

∑

(2.14)

Since there can only be a nonnegative integer number of hyperlinks from a page j to a page i, the
World Wide Web has Wi j ∈ {0, 1, 2, . . . }; but in general, Eq. 2.14 can be computed and simply
interpreted as long as Wi j is nonnegative and real. As with all network measures, the “correct”
form of PageRank depends on the particular application.
One relatively simple yet potentially informative measure is the mean distance between all
pairs of nodes in a network,
d¯ =

1
∑ di j
N(N − 1) i=1..N

(2.15)

j=1..N
i= j

where di j is the geodesic distance from node j to node i, and di j ≡ ∞ if no path exists from j to
i. The limiting cases produce sensible results: unweighted complete networks have d¯ = 1, since
all nodes are adjacent to all other nodes, and edgeless networks have d¯ = ∞. But real networks
are rarely this regular. Consider the simple network in Fig. 2.13. For the subnetwork composed
of nodes 1, 2, and 3, d¯ = 1; but including node 4 means d¯ = ∞. This network is clearly more
connected than an edgeless network, however, and a different measure is required to capture this
fact.

15

The global efficiency of a network is defined as
Eglobal =

1
1
.
∑
N(N − 1) i=1..N di j

(2.16)

j=1..N
i= j

This is the mean inverse distance between all distinct ordered pairs of nodes in the network. The
inverse of the global efficiency gives a measure similar to the mean distance, and it is identical in
−1 = d¯ = 1, and for edgeless networks,
the extreme cases: for unweighted complete networks, Eglobal
−1
Eglobal
= d¯ = ∞. However, the global efficiency is less sensitive to disconnected and weakly
−1 = 2,
connected pairs of nodes than the mean distance. For the network shown in Fig. 2.13, Eglobal

whereas d¯ = ∞. Any network with at least one (directed or undirected) edge with nonzero weight
−1 < ∞.
has Eglobal

2.6

Set and interset efficiency

Most commonly used network measures concern either global properties (like the assortativity
coefficient, clustering coefficient, and global efficiency) or single-node properties (like degree,
betweenness centrality, and PageRank). However, one may be interested in the properties of a set
of nodes, for example how a set of nodes is distributed across a network. In [116], I defined the set
efficiency4 of a set of nodes I as
EI =

1
1
∑
|I|(|I| − 1) i, j∈I di j

(2.17)

i= j

where |I| is the number of nodes in I. Note that if I is the set of all nodes in the network, EI =
Eglobal . While the summation over i and j is restricted to a subset of the nodes in the network,
the geodesic connecting j to i may pass through nodes not in I. The set efficiency measures
the proximity of a set of nodes embedded in a networks, where large EI implies short distances
between nodes in set I. The red nodes in Fig. 2.14 are scattered across the network and so have
4 The

set efficiency was originally termed the intraset efficiency in [116], but its name was
changed to the set efficiency in [142] to avoid confusion with the interset efficiency.
16

low set efficiency, whereas the red nodes in Fig. 2.15 are separated by only one or two edges and
so have high set efficiency.
In [142], I defined the interset efficiency from a set of nodes J to a set of nodes I as
EIJ =

1
1
∑
|I||J| − |I ∩ J| i∈I di j

(2.18)

j∈J
i= j

where I ∩ J is the intersection between I and J. As with the set efficiency, geodesics may pass
through any nodes in the network; the only constraint is that the geodesic begins in J and ends in
I. For example, the interset efficiency from the red nodes to the blue nodes is low in Fig. 2.16 and
high in Fig. 2.17. The interset efficiency is a generalization of the set efficiency, since EII = EI .
See Appendix B for the derivation of the interset efficiency’s normalization. Applications of the
set and interset efficiency will be discussed in Sections 3.4 and 4.3.
It is worth emphasizing that the set efficiency is a property of a set of nodes, and the interset
efficiency is a property of a set of two sets of nodes. In other words, they are both collective
properties of the elements of a set, not merely an average of single-node properties. One possible
set-to-set analogue of the betweenness centrality could be the interset betweenness centrality,5
bHJ
i =

ρh j (i)
1
∑
Normalization j∈J,h∈H, ρh j

(2.19)

i= j,i=h, j=h

This is the fraction of all geodesics that start in node set J and end in node set H that pass through
node i. As with the set and interset efficiency, the only difference between Eq. 2.11 and Eq. 2.19
is the limits of the summation. Indeed, many single-node and global network measures can be
adapted to answer questions about sets. This set-oriented approach may be more appropriate
when asking questions about “robust” networks, where redundant connections emerge (as with
metabolic, gene regulatory, and ecological networks) or are intelligently designed (as with transportation, power, and information networks) such that the failure of a single node does not cause
5 The author leaves the derivation of the normalization and the task of thinking of a better name

for Eq. 2.19 as an exercise for Carlo’s next graduate student.

17

Figure 2.14 The set efficiency of the set of red
nodes is relatively low. One geodesic for one
(i, j) pair is shown with thick edges.

Figure 2.15 The set efficiency of the set of red
nodes is relatively high. One geodesic for one
(i, j) pair is shown with thick edges.

Figure 2.16 The interset efficiency from the set
of red nodes to the set of blue nodes is relatively
low. One geodesic for one (i, j) pair is shown
with thick edges.

Figure 2.17 The interset efficiency from the set
of red nodes to the set of blue nodes is relatively
high. The half-blue-half-red node belongs to
both sets. One geodesic for one (i, j) pair is
shown with thick edges.

18

dramatic changes to the system (failure/removal of the red node in Fig. 2.12, for example, engenders such a dramatic change, since the network separates into two components). It may be more
useful to inquire about the behavior of “regions” of the network than individual nodes.

2.7

Communities

The previously discussed set measures are motivated by questions of the form, “what properties
does a given set of nodes have?” Clustering in networks, by contrast, is motivated by questions of
the form, “can a set of nodes which satisfies a given property be identified?” Answers to the second
question are usually far more difficult to compute because the combinatorics of the problem makes
exhaustive searches exponentially harder for every node added to the network. Most clustering
algorithms employ efficient heuristic techniques to find good solutions, but truly optimal solutions are rarely obtained. Like maps of Earth and network layouts, clustering is a dimensionalityreduction method that gives researchers a bird’s eye view of how a network is structured without
the need to understand the properties of every individual node and edge.
One common feature of real world networks is the organization of nodes into communities,
also known as modules. The quantitative definition of what constitutes a community varies, but
most are designed to segregate the network into relatively densely connected subnetworks. The
most common method for community detection is maximizing an objective function Q(c) called
the modularity [112],
Q = ∑ Ai j −

kiin kout
j

ij

m

δci ,c j

(2.20)

where m = ∑i kiin = ∑i kiout is the total number of edges in the network, ci is the index of the
community to which node i belongs, and δx,y is the Kronecker delta function. The first term in
the sum is the adjacency matrix, Ai j , which is the actual number of edges from j to i (either 0 or
1). The second term, kiin kout
j /m, is the expected number of edges from j to i given their out- and
indegrees, respectively, under a degree-preserving randomization. Finally, the Kronecker delta
function ensures that a given (i, j) term contributes to the summation if and only if i and j are

19

assigned to the same community. Maximizing Q means searching for a c that includes as many
positive terms in the summation as possible, i.e. searching for groups of nodes that share more
edges within each group than expected at random. A weighted version of Eq. 2.20 can be obtained
by replacing Ai j with Wi j and the in/outdegree with the in/outstrength. Applications of modularity
will be discussed in Chapter 4.

2.8

Nested communities

The following section contains unpublished findings.
Hierarchical organization is a ubiquitous feature of large complex systems. Phylogenetic trees,
for example, represent the difference between the genomes of organisms that share a common ancestor, and are usually represented with dendrograms as shown on the left side of Fig. 2.18. The
difference between genomes is computed using a metric such as the total number of mismatched
base pairs (i.e. the number of single nucleotide polymorphisms, or SNPs) when the DNA sequences
of two organisms are compared. In the phylogenetic tree in Fig. 2.18, this metric has been mapped
to the estimated point in time at which two distinct genomes emerged from a common ancestor,
which is reflected by the x-coordinate of the forking point between any two clades (subnetworks
of a phylogenetic tree formed from a chosen forking point plus all its decendants), with more distant forks (very different genomes) placed toward the left. The causes of the divergence in human
(and Neanderthal) genomes are many and varied. Geographical divides between prehistoric populations caused humans to evolve in a compartmentalized fashion, with great distances, bodies of
water, mountains, and deserts separating major clades, e.g. Sub-Saharan Africans and Aboriginal
Australians, and more modest barriers between genetically similar subgroups, e.g. English and
French. The evolution of languages is similarly hierarchical. Fig. 2.19 shows an inferred dendrogram of 103 Indo-European languages. In addition to the best estimates of the fork locations, this
figure also includes 95% confidence intervals (marked with light blue lines), and the gray distribution on the left corresponds to the computed probability density of the location of the root node
(the most recent common ancestor language of all 103 languages shown). Social systems also
20

Figure 2.18 An example phylogenetic tree of human ancestry derived from mitochondrial DNA.
Original image from [80].

tend to be organized hierarchically into city, county, state, and federal governments, as well as a
multitude of international organizations like the United Nations and the World Trade Organization.
Although there is great complexity across all scales of such hierarchies, their clades are not necessarily self-similar, as is the case with fractals. Each level of the hierarchy may have its own distinct
topology.
Likewise, numerous real world networks exhibit hierarchical structures [91]. Many of the multitude of existing network measures (geodesic distances, for example) may be directly used or
adapted to build dendrograms, and for some networks, this may prove illuminating. However,
dendrograms are typically constructed using continuous variables for the locations (“heights”) of
the forking points, and for large networks, the resulting dendrograms are highly complex. Futhermore, assigning nodes to discrete clusters requires a user-defined threshold height that collapses
all subsets below this threshold (i.e. clades which are similar enough) to a single point, making this

21

Figure 2.19 Inferred dendrogram of 103 Indo-European languages. Line thicknesses correspond
to languages’ rates of “diffusion” across land masses. 95% confidence intervals are marked with
light blue lines and are included at all forking points. The gray distribution on the left corresponds
to the computed probability density of the location of the root node (the most recent common
ancestor language of all 103 languages shown). Original image from [19].

22

a parametric clustering method.
My simple but powerful nonparametric generalization of the concept of communities is nested
communities, which as its name suggests involves searching for communities inside of communities in a recursive manner. The first step of the algorithm applies the traditional community
finding algorithm to the full network, breaking it into standard communities. The nested community algorithm then independently applies the same traditional community finding algorithm to the
subnetworks composed of the nodes from each community from the previous step. This method is
iterated over each output set of communities until some condition is met, terminating that particular branch of the tree. The algorithm must terminate at the single-node level, but depending on
the application, additional criteria can be used to decide when to break from the algorithm. These
could include reaching a depth where the optimal modulary becomes nonpositive; reaching a depth
where the optimal modularity is no longer statistically signficant compared to a distribution of random subdivisions using a chosen p-value threshold; or using external, non-topological information
(such as gene enrichment, discussed in Section 3.2).
In more mathematical notation, applying the traditional community search algorithm f to C
(the set of all nodes in the network) results in n communities, f (C) = {C1 ,C2 , . . . ,Cn }. Note that
i Ci

= C and Ci ∩ Ci = {∅} for all i = i . The next level of depth is obtained by applying f to

each of the n communities detected. For community i, for example, its n subcommunities Ci j
are found via f (Ci ) = Ci1 ,Ci2 , . . . ,Cin , and the same union and intersection properties from the
first output set apply to these output sets. This is applied iteratively until no further communities
are detected. Note that a community’s depth is equal to its number of indicies. Fig. 2.20 shows an
example network with two labelled levels of community structure.
Figs. C.1-C.4 show spy plots of the adjacency matrix for Google’s internal web page network [119]. Each node (row and column indicies) is a web page owned by Google, and each edge
(a black dot located at (i, j)) is a hyperlink from page j to page i, both of which are managed by
Google (google.com, mail.google.com, calendar.google.com, etc.). Nodes were assigned a random index in Fig. C.1, so no clear patterns emerge. Fig. C.2 shows the same network, but with

23

Figure 2.20 An example network with two levels of community structure. The largest oval
encloses all nodes, C; the two midsized ovals enclose nodes in level-1 communities, {Ci }; and the
six small ovals enclose nodes in level-2 communities, Ci j .

its nodes sorted into standard level-1 communities (boxed in red). Note that this ordering tends to
drive edges toward the diagonal, leading to clear block stucture. Figs. C.3 and C.4 show level-2
and level-3 communities respectively, the result of further iterations of the nested community algorithm. This reveals that a significant amount of substructure is overlooked by standard community
searches. Figs. C.5-C.8 are the same as Figs. C.1-C.4, but zoomed to show only the first 5, 000
nodes to emphasize the patterns detected by the nested community algorithm.
To further demonstrate the power of the nested community algorithm, the same analysis was
performed for an arXiv theoretical high energy physics citation network [90], shown in Figs. C.9C.16. In this network, each node is an author of at least one high energy physics publication posted
on arXiv, and an edge from author j to author i means that at least one of j’s articles on arXiv cited
at least one of i’s articles on arXiv. These plots show more intercommunal edges than the Google
network, with dense off-diagonal blocks indicating strongly coupled but still distinct subfields.
Finally, gene regulatory networks such as AML 2.3 and HumanNet (discussed in detail in
Chapters 3 and 4) show 7 and 8 layers of nested community structure, respectively, as shown
in Figs. C.17-C.20. The communities in these networks – especially HumanNet – show greater

24

heterogeneity than the Google and citation networks, perhaps because these networks are inferred
rather than directly and precisely measured. However, a distribution of communities with high and
low connectivities interconnected in more diverse ways may be an important property for cells.
Some of the more fundamental cellular functions may need to be tightly regulated to keep the cell
alive, whereas others may need to be more flexible to quickly respond to changing stresses. This
could be addressed in the future using enrichment tools such as DAVID (discussed in Section 3.2)
to determine if there is any relationship between the connectivity of communities in these networks
and the kinds of pathways they host.
It is perhaps unsurprising that the Google, citation, and gene regulatory networks show heterogeneous community structures: the rules that govern how these networks are designed (in the case
of Google’s pages) or emerge (in the case of citations and gene regulation) are very different from
one another. However, the nested community algorithm, a rather simple extension of the standard
algorithm, successfully identifies deeper underlying trends with little additional effort.

25

CHAPTER 3
GENE REGULATORY NETWORKS

Quis custodiet ipsos custodes?
(Who will guard the guards themselves?)
—Juvenal, Satire VI
The following chapter briefly reviews the fundamental concepts of biology underpinning my research as well as my own contributions, and is organized as follows.
• Section 3.1, Basic biology: a basic description of proteins and genes
• Section 3.2, Gene regulation: a discussion of how gene expression is regulated
• Section 3.3, Cancer: brief overview of gene regulatory networks and oncogenesis
• Section 3.4, GRN reconstruction: a network reconstruction algorithm that our collaboration
developed and published, and my primary contribution to the publication concerning an analysis of the topology and distribution of cancer mutations in a reconstructed acute myeloid
leukemia network
Sections 3.1-3.3 may be safely ignored by readers with basic knowledge of the listed topics. For
further details concerning mechanisms in cellular biology and genetics, see introductory texts such
as [24] and [77].

3.1

Basic biology

Many cellular functions involve proteins, macromolecules dubbed the “workhorses of life.” Proteins are produced by the cell in a wide array of shapes and sizes (one of the smallest known
natural proteins, spoVM, has a mass less than 3 kDa [33], and the largest, titin, has a mass over
3 MDa [117]) and perform a huge variety of vital tasks. Some of the broad categories of proteins
include [24]

26

• structural proteins, which provide the materials for structures like histones (the protein complexes which package DNA), cytoskeletons (the scaffolding within cells that maintains their
shape and provides “highways” for the movement of material), and large-scale structures
such as hair, spider silk, and connective tissues;
• transport proteins, which transport material within a cell, between the cell and its environment, or from one part of the body to another (such as hemoglobin, which moves oxygen
throughout the body);
• hormonal proteins, which transmit information throughout the body and regulate many processes;
• receptor proteins, which detect chemicals in the environment surrounding cells;
• contractile proteins, which are used in both the motion of single-celled organisms (as in the
case of cilia and flagella) as well as the motion of multicellular organisms (such as actin and
myosin in muscles);
• defensive proteins, which combat bacteria and viruses;
• and enzymatic proteins, which alter the rates of chemical reactions within cells.
The information for the structure of each protein is encoded in one or multiple genes. New proteins
are created within the cell via protein biosynthesis, which follows these basic steps in eukaryotes
(e.g. human cells):
1. The gene’s contents are read by the protein complex RNA polymerase II (RNAP II) and the
information is copied to pre-messenger RNA (pre-mRNA) in a process called transcription.
2. The pre-mRNA is processed into messenger RNA (mRNA), which is transferred from the
cell nucleus to the cytoplasm.
3. The mRNA is processed by a complex of proteins and RNA called a ribosome, which builds
a protein in a process called translation.
4. The protein may undergo further post-translational processing to produce the mature, active
protein.
There may be processes before, after, and between these steps as well, depending on the organism,
27

the particular gene, and the current state of the cell.
In this dissertation, gene expression refers to the concentration of a given gene’s mRNA in a
given cell, i.e. highly expressed genes have high mRNA concentrations. It would be preferable
to use the more traditional definition of expression, meaning the concentration of the final gene
product (either a protein or functional RNA), since there are regulatory steps that occur after mRNA
synthesis. However, it is very difficult to simultaneously measure the concentrations of all final
gene products in a single cell or cell population. Paired with incomplete knowledge about the
existence, functions, chemical states, and interactions between the numerous genes, RNAs, and
proteins in any given organism, compiling a comprehensive snapshot of the internal state of a cell
requires more advanced technologies than are available today. Consequently, many of the findings
contained in this dissertation are based on the simplifying assumption that mRNA concentration is
a good proxy for final gene product concentration. The most plentiful and comprehensive sources
of gene expression data available today come from microarray and mRNA sequencing (RNA-seq)
measurements. See Appendix D for a brief overview of microarray and RNA-seq technologies.

3.2

Gene regulation

The demand for cellular processes varies over time. Cells such as red blood cells die naturally in
the human body and are regularly replenished by stem cells, but unrestricted cell division is one
of the primary hallmarks of cancer. Humans of course require an active immune system to guard
against pathogens, but a hyperactive immune system can lead to a plethora of diseases in which
the immune system attacks the host. The nonlinear, nonequilibrium nature of biological systems
requires each cell to carefully regulate its rate of protein production in response to its immediate
needs and its environment.
Regulation of the amount of final gene product can happen at any point in the biosynthetic
process. Histones and DNA can be chemically altered to change whether a gene is exposed for
transcription; mRNA and proteins can be degraded, or disassembled into their basic components,
by other proteins; but one of the most important and well studied forms of regulation comes from
28

the action of transcription factors (TFs). Transcription factors are proteins that bind to specific
segments of DNA, increasing (activators) or decreasing (repressors) the likelihood that RNAP II
successfully binds to and transcribes a given gene. In many cases, transcription of a given gene
cannot begin until its specific transcription factors bind to their respective sites on the DNA.
Of course, gene regulation is useless unless the regulators themselves are regulated. Individual
transcription factors do not make autonomous decisions about whether to up- or down-regulate
their target genes; rather they respond to their own regulatory inputs from upstream genes, which in
turn depend on their upstream regulators, ad infinitum. This complex web of regulatory interactions
and feedback loops, called a gene regulatory network (GRN), provides each cell with a sort of
“brain” that allows it to make decentralized decisions [20] about how to modulate the production
of the cell’s various proteins at any given time.1
As shown in Section 2.8, GRNs are typically modular, where genes share many interactions
within communities and fewer interactions between communities. These communities host pathways, or subnetworks of a cell’s GRN responsible for constructing particular cellular components
or executing particular cellular functions. This modular structure allows pathways within the same
community to synchronize and cooperate, while operating more or less independently of pathways
in other communities. Tools such as DAVID [38] have been created to detect which pathways are
enriched, or present to a statistically significantly degree, in a given set of genes drawn from a given
population of genes. The p-value of the enrichment of a given pathway in a given set of genes is
usually estimated using the cumulative distribution function of a hypergeometric distribution,
K

p(k0 ) =

∑
k=k0

K
k

N−K
n−k
N
K

(3.1)

where N is the number of genes in the population, n is the number of genes in the input set, K is the
number of genes in the pathway, and k0 is the number of genes that are present in both the input
set and the pathway. Applications of gene enrichment in the context of evolution will be discussed
in Chapter 4.
1 This

emergent decision-making ability motivated us to simulate the dynamics of gene regulatory networks using a model developed in neuroscience, the Hopfield model, and will be described
Chapter 5.
29

Figure 3.1 Schematic of some known apoptosis components and signaling pathways in humans.
Figure adapted from the interactive version available from KEGG [72].

3.3

Cancer

GRNs make many decisions in a collective rather than centralized manner, which stabilizes against
both external and internal perturbations [98] (however, some functions such as developmental processes may be governed by the action of only a handful of genes). These decisions are executed by
proteins, which must fold into very precise shapes largely determined by the their encoding genes’
sequences to function correctly, although the environment and post-transcriptional modifications
play a role in folding, localization, and interactions as well. Mutations, or permanent changes to a
gene’s nucleotide sequence, are sometimes harmless, but certain mutations or collections of mutations can cause encoded proteins to misfold or exhibit altered function. Furthermore, cross linking
and double-stranded breaks in DNA can also cause misregulation of important processes. Cells
have machinery to attempt to repair damaged DNA, but if the damage is too extensive, the cell

30

may initiate apoptosis, or the process of “programmed cell death” in which the cell destroys itself
for the sake of the multicellular organism. Apoptosis is one of the most important signaling pathways being investigated today. Fig. 3.1 shows a high level summary of some of the experimentally
verified stimuli, genes, and reactions involved in the apoptosis signaling pathway in humans.
Elements of the apoptosis signaling pathway are also susceptible to mutation. One of the
most commonly mutated genes in cancer, TP53 (which encodes the protein p53), is integral to
DNA maintenance and apoptosis initiation. Mutations in TP53 result in not only the loss of p53’s
original functionality, but can give rise to new, undesired interactions [107]. TP53 is also a kind of
tumor suppressor gene because p53 is responsible for pausing cell cycle to allow the cell to check
that the DNA can be replicated without errors. Without properly functioning p53s, cells undergo
division regardless of the fidelity of the daughter cells’ DNA [63], allowing for the introduction
of yet more errors in the next generation’s DNA. Another related category of genes, oncogenes, is
responsible for up-regulating processes like cell cycle. The acquisition of mutations that disrupt
the function of apoptosis, cause tumor suppressor genes to malfunction, and cause oncogenes to
become hyperactive are necessary steps in carcinogensis, or the evolution of normal cells to cancer
cells.2 Recently, more of these hallmarks of cancer have been identified [61], and are shown in
Fig. 3.2. For a review of different cancer types and their corresponding driver mutations, see [153].
Any viable theraputic intervention must kill or control cancer cells while doing little or no
damage to normal cells, and because cancer cells are physically very similar to their normal counterparts, treating cancer clinically is quite difficult. Although broad classes of cancers have many
distinguishing mutations and characteristics, each patient has their own sets of mutations that make
their cancer unique. With the advent of single-cell RNA-seq technology, it is rapidly becoming apparent that the mutation and gene expression profiles of cells drawn from the same tumors are
far from homogeneous [22, 120], a direct consequence of the misregulation of cellular processes.
2 In fact, many of the mutations that commonly drive carcinogensis tend to be located in the parts

of the genome responsible for multicellular cooperation (for example, regulation of proliferation).
These kinds of mutations could be considered a form of “reverse evolution” [25] in which the
mutated cell behaves like a unicellular organism, acting with no regard for the health of the host
organism.

31

Figure 3.2 Ten of the “hallmarks of cancer.” Acquiring each of these features drives cells into
more cancer-like phenotypes. Figure adapted from [61].

As the founder clone (marked with N’s in Fig. 3.3) becomes cancerous, it mutates and divides at
a higher rate than normal cells. Each successive generation follows its own unique evolutionary
path, driven by random mutations and natural selection. These mutations accumulate and, just as
with the phylogenetic tree in Fig. 2.18, their genomes begin to diverge. Cells from the same or
different lines may compete (because space and resources are limited) and/or cooperate (by releasing growth factors to further up-regulate growth and proliferation in other cells), increasing
each successive generation’s fitness compared with normal cells. Designing a drug regimen that
effectively kills one of the clonal lines may simply allow more aggressive, drug-resistant lines to
fill the vacuum.
The apoptosis diagram in Fig. 3.1 represents how at least part of a typical functioning signaling
pathway works, but there are a large number of ways for that system to malfunction (removing
or adding edges, changing binding strengths, etc.). Before any progress can be made in treating a

32

Figure 3.3 Schematic of clonal evolution within a single patient. As the founder clone (marked
with N’s) becomes cancerous, it mutates and divides at a higher rate than normal cells. Each
successive generation follows its own unique evolutionary path, driven by random mutations and
natural selection. These mutations accumulate and, just as with the phylogenetic tree in Fig. 2.18,
their genomes begin to diverge. Cells from the same or different lines may compete (because
space and resources are limited) and/or cooperate (by releasing growth factors to further
up-regulate growth and proliferation in other cells), increasing each successive generation’s
fitness compared with normal cells. Figure taken from [79].

33

malfunctioning GRN, the structure of the GRN itself must be deduced.

3.4

GRN reconstruction

The human genome contains an estimated 23,000 genes [122] that together encode for an estimated 29,000 proteins [106] (some genes encode multiple proteins thanks to alternative splicing).
Although nowhere near the complexity of the network of neurons in the human brain, these numbers pose a problem for understanding gene regulation: explicitly determining whether one gene
regulates another or one protein interacts with another across all possible pairs is an enormous
task. This is further complicated by the fact that many genes require the presence of multiple
transcription factors before transcription initiates, effectively giving rise to 3+ body interactions.3
Nevertheless, much effort has been devoted to assembling manually-curated databases of regulatory interactions. Some databases like KEGG [72] focus on organizing information about individual pathways. Others databases like TRANSFAC [101] gather data about transcription factor/gene
interactions across many organisms and cell types. Curated databases such as these rely on manual
or computer-assisted searches through abstracts and publications to identify and categorize known
interactions.
However, assembling a list of known interactions alone cannot predict the existence of unknown interactions. The true underlying GRN in a given cell type likely has more edges than have
been directly measured. Additionally, it may not be necessary to understand the precise details of
every regulatory interaction to understand some of the statistical properties of GRNs. A practical alternative to fully combinatorial experimental detection is to employ network reconstruction
algorithms, which infer the structure of the GRN based on some sort of correlation between the
expression of pairs of genes across multiple gene expression profiles. A number of reconstruction algorithms have been developed recently, driven in large part by the vast quantity of gene
3 Hypergraphs are graphs (networks) that are composed of nodes and hyperedges.

A hyperedge
is the n-body generalization of the pairwise relationship that an edge in a normal network represents. Hypergraphs are interesting mathematical constructs and are of potential utility in the study
of gene regulation, but will not be covered here.

34

expression and other data produced in recent years. ARACNE [16] is a well known network reconstruction algorithm based on mutual information; TIGRESS [65] uses least angle regression
to detect edges; and GENIE3 [71] ranks the significance of edges using a tree-based ensemble
method.
In 2015 I coauthored A scalable method for molecular network reconstruction identifies properties of targets and mutations in acute myeloid leukemia [116], which reported on an efficient
network reconstruction algorithm developed by our collaboration. Curiously, we neglected to
name the algorithm in the original publication, so for the sake of this dissertation the algorithm
will be called Ong’s method, named for the primary author. At the most basic level, Ong’s method
builds a GRN from the Pearson correlation between the expression of pairs of genes across sets of
microarray and/or RNA-seq data. The reconstructed network discussed in the original publication
used five carefully selected, high quality gene expression experiments (two RNA-seq and three
microarray) of acute myeloid leukemia (AML), a form of blood cancer from the myeloid line of
cells.4 Each sample in each data set represented a bulk gene expression profile of untreated AML
cells from a unique patient, so that the full collection of data covered many independent, hetergeneous instances of AML. This disease-specific network was named AML 2.1. Because my chief
contribution to the paper was an analysis of the resulting networks and designing the set efficiency
measure covered in Section 2.6, some of the more esoteric details have been omitted. Full details
can be found in the original publication.
The Data Processing step runs each of the five gene expression data sets z through common
software packages5 (which convert the raw data to a usable form and conduct statistical tests to
control for experimental errors) to produce a “normalized” gene expression matrix X z , where 0 ≤
z < ∞ is the expression of gene i in sample µ from data set z. Although the gene expression from
Xi,µ

two distinct samples µ and µ taken from two distinct data sets z and z cannot be directly compared
(since the reported absolute expression is sensitive to experimental conditions and platforms), the
4 This

video provides an excellent summary of acute myeloid leukemia (AML) as well as a
related but distinct form of cancer, acute lymphoblastic leukemia (ALL).
5 Expression data was processed using the R package Bioconductor for microarrays and the
“Tuxedo” suite for RNA-seq.
35

z and X z
normalization procedure enables comparisons between Xi,µ

i,µ

for µ = µ . This means the

correlations can be sensibly computed for any two genes i and j across all samples µ within a
single data set z.
Some high but coincidental correlations may be detected in which one or both of the genes
exhibit only minor random fluctuations around a high mean value, such as housekeeping genes
(genes which are consistently active across many or all cell types). The second step in Ong’s
method, Optimization and Method Selection, was designed as both a filter for such genes and a way
to compare the four algorithms (ARACNE, TIGRESS, GENIE3, and Ong’s method). A minimum
threshold for the coefficient of variation (CV, the standard deviation divided by the mean) was
introduced so that genes are included in data set z only if they show high variation across samples
relative to their means. To determine the best value of the CV threshold, 90 TFs which showed
high expression in the AML samples were randomly selected from the TRANSFAC database.
These 90 TFs had 2486 known regulatory interactions (hereafter true interactions or TIs) spread
across 1273 unique target genes. These 90+1273=1363 genes were selected from the full data set,
and each of the four algorithms attempted to reconstruct this subnetwork using the five data sets.
The CV threshold was varied until the number of TIs identified in the top 100 most significant
interactions was maximized for each of the data sets and each algorithm, producing a total of
(5 data sets) × (4 algorithms) = 20 optimal CV cutoffs.
All four algorithms aim to reconstruct the true underlying GRN, but because each was designed
using different assumptions, they produce somewhat different networks for the same set of input
data. In order to judge which method produces the best results, the list of inferred interactions
was compared with the TIs. Of course, TIs only represent a fraction of all existing interactions,
and the coverage of various regulatory interactions in scientific literature is uneven: famous genes
such as TP53 have been the focus of a multitude of experiments because of their importance in
diseases or development, while other genes like C9orf37 and C16orf59 are currently predicted
genes (predicted from sequence motifs) with no known functions. These may not be genes at all,
or may play subtle but crucial roles in studied or novel biological functions. Even considering

36

Figure 3.4 Binned Pearson correlation of inferred TF/gene interactions in one of five data sets
using Ong’s method, with maximum correlation (0.98) on the left and minimum correlation (0.15)
on the right. The vertical red line identifies the computed Pearson correlation threshold, meaning
all edges with Pearson correlation above the threshold are included in the final network. Figure
taken from [116].

this potential bias, however, the ability to recreate a large number of known interactions is likely
indicative of a meaningfully reconstructed network. The top 100 most strongly correlated pairs of
genes identified by Ong’s method contained as many or more TIs than the top 100 using ARACNE,
TIGRESS, and GENIE3, both before and after CV optimization. Coupled with its relative speed
and algorithmic simplicity, Pearson correlation was selected for Ong’s method.
The TFG (transcription factor/gene) Subnetwork Reconstruction step proceeds with the five
full data sets, excluding genes whose expression profiles failed the above CV threshold test. The
Pearson correlation coefficient was computed between all pairs of genes (i, j) where at least one
of i and j were known to be TFs according to the databases AnimalTFDB [162] and KEGG.
The resulting list of interactions was ranked from largest to smallest Pearson correlation. Under
the assumption that genes linked by true interactions should show higher correlation than two
randomly selected genes, a Pearson correlation threshold was computed to identify which edges
were likely real and which should be discarded.
To determine this threshold, the sorted Pearson correlations were divided into 50 bins, and the

37

number of TIs identified in each bin was tabulated, as shown in Fig. 3.4. (Note that, as expected,
the bin containing edges with the highest correlations were most enriched for TIs, labelled as “TI
hits” in the figure.) For a single randomly chosen interaction from this list, the probability that it
is a TI is q = NT /N, where NT is the number of TIs and N is the number of inferred interactions.
Thus, given NB randomly chosen interactions in a given bin, the probability that at least k0 of these
interactions are TIs (i.e. the p-value of the number of TIs) follows a binomial CDF,
NB

p(k0 ) =

∑
k=k0

which, given NB

1 and q

NB k
q (1 − q)NB −k
k

(3.2)

1, can be approximated as a Poisson distribution,
λ k e−λ
∑ k!
k=k0
∞

p(k0 ) =

(3.3)

and λ ≈ 2. The Pearson correlation threshold was chosen by moving through the bins in Fig. 3.4
from left to right and locating the first bin with p(k0 ) > 0.15, or in other words, the first bin with
k0 ≤ 3 (indicated with a red line). All edges in bins to the left of the cutoff were included in that
data set’s final list of edges.
Finally, the resulting TFG subnetwork was built by combining the edges inferred from each
of the five data sets according to the above rules. An edge’s overlap category, or the number of
times it was detected across these five networks, is a measure of its reproducibility. To further
avoid reporting spurious edges, only edges in overlap ≥2 (i.e. edges which were detected in at
least two data sets) were included in the final network. Because Pearson correlation is a symmetric
measure, the direction of inferred TFG edges were taken from TRANSFAC when available, and
otherwise were left undirected. Although not directly addressed in the original publication, additional external information could aid in assigning directionality. Since only regulators can have
nonzero outdegree, a list of which genes are regulators could define the direction of edges linking
regulators and non-regulators.
The PPI (protein-protein interaction) Subnetwork Reconstruction step is a parallel and nearly
identical step to the TFG step, except that it used the protein-protein interaction database HIPPIE [131] as its TI database. Because HIPPIE does not report any directionality, these inferred
38

interactions were left undirected. Combining the TFG and PPI subnetworks resulted in the completed (unweighted) network AML 2.1.
The network clustering algorithm MCODE6 was used to identify sets of highly interacting
genes, which were in turn provided to DAVID for enrichment analyses. Thirteen of these clusters
are shown in Fig. 3.5, with some of the most significantly enriched pathways listed. The multicolored, dense core hosts pathways crucial for any cell from nearly any organism, such as DNA
and RNA processes and cell cycle. In contrast, the more remote, homogeneously colored clusters
host cell type-specific functions like leukocyte/lymphocyte activation, immune response, and heme
biosynthetic process. The presence of these blood-specific pathways is to be expected given that
the input data sets were from AML studies. The relationship between the topological structures of
GRNs and the functions of their genes will be discussed more quantitatively in Chapter 4.
The process of building this AML-specific network reduced the number of nodes from over
20,000 genes in the raw microarray and RNA-seq data to 5,667 genes. Out of the 26 most common mutations found in AML cells [86], 21 were present in the 5,667 genes in AML 2.1. By
Eq. 3.1, the p-value of this enrichment was 2.3 × 10−8 . To determine the functional context of
these mutations, the set of the 21 mutations plus each mutation’s immediate upstream and downstream neighbors (a total of 257 genes) was provided to DAVID to test for enrichment. Table 3.1
shows the resulting enrichment profile for all pathways with FDR-corrected (Benjamini) p-values
below 0.01. As expected, nearly all identified pathways are related to one of the hallmarks of cancer, particularly processes involving DNA metabolism and maintenance (DNA metabolic process;
chromosome organization; response to DNA damage stimulus; DNA repair; chromatin modification; DNA replication; histone modification; and negative regulation of nucleobase, nucleoside,
nucleotide, and nucleic acid metabolic process), proliferation (cell cycle), and resisting apoptosis
(regulation of cell death).
6 MCODE

[10] is an older, parametric network clustering algorithm designed specifically for
biological/molecular networks like GRNs, and can be easily implemented using network analysis
software like Cytoscape [133]. However, modularity-based algorithms (which are typically nonparametric) are the most common network clustering algorithms used today. See this video for an
animated description of the MCODE algorithm.

39

Figure 3.5 Force-directed layout of the largest connected component of AML 2.1. Because of the
large number of nodes in the network, only the edges (gray lines) are rendered. Some of the most
highly enriched GO pathways within clusters identified by MCODE are colored. Figure adapted
from [116].

Pathway name
Number of genes detected
DNA metabolic process
43
Chromosome organization
42
RNA processing
42
Response to DNA damage stimulus
33
Cell cycle
47
Cellular response to stress
39
DNA repair
27
Chromatin modification
23
DNA replication
18
Histone modification
14
Regulation of cell death
37
Cellular macromolecular complex
22
subunit organization
Negative regulation of nucleobase,
nucleoside, nucleotide, and nucleic
27
acid metabolic process

p-value Benjamini value
4.28E-16
7.55E-13
5.06E-16
9.44E-13
3.00E-14
5.08E-11
7.17E-13
1.22E-9
3.29E-12
5.58E-9
8.55E-12
1.45E-8
2.79E-11
4.73E-8
1.16E-8
1.97E-5
1.20E-7
2.03E-4
5.17E-7
8.77E-4
1.69E-6
2.90E-3
4.39E-6

7.40E-3

4.67E-6

7.9E-3

Table 3.1 Pathway enrichment of 21 common AML mutations plus their first neighbors in AML
2.1. Nearly all pathways are related to the hallmarks of cancer from Fig. 3.2.

40

Figure 3.6 Force-directed layout of the largest connected component of AML 2.1. Mutations and
neighbors of mutations are identified. Figure adapted from [116].

The close spatial proximity of the mutated nodes highlighted in red in Fig. 3.6 hints that the
mutations are not randomly distributed across the network, but are concentrated in certain regions.
As discussed in Section 2.3, however, a measure needed to be designed to assess the significance
of the proximity of mutations in AML 2.1. This problem inspired the creation of the set efficiency
from Section 2.6. To repeat, the set efficiency of a set of genes I is given by
EI =

1
1
∑
|I|(|I| − 1) i, j∈I di j

(3.4)

i= j

where di j is the distance from node j to node i. Defining I to be the set of 21 genes commonly mutated in AML, the set efficiency was computed for I as well as for a control composed of 10 million
randomly generated sets of 21 genes. EI was found to be 0.2979, 1.8% larger than the largest set
efficiency of the random sets observed and 145% larger than the mean. A skew normal probability
distribution was fitted to a histogram of the randomized sets with R2 = 0.999982, and an approximate right-tailed p-value of 7.3 × 10−8 was obtained for EI (see Fig. 3.7). A more conservative
41

Figure 3.7 Set efficiency of 21 common AML mutations in AML 2.1, and the distribution of the
set efficiency of 10 million randomly selected sets of 21 genes from the 5,667 genes in AML 2.1.
The set efficiency of the mutated genes is far greater than expected at random. Figure adapted
from [116].

estimate of the p-value was obtained by assigning a random direction to each protein-protein interaction whose true direction is unknown (rather than using an undirected edge). Adding edges
to a network guarantees an increase in the global efficiency as well as the average set efficiency.
The genes in I communicate with their neighbors predominantly through protein-protein interactions, and some of the PPI edges are listed as undirected in the AML 2.1 network because their
true directions are unknown. To ensure that these undirected edges were not the sole cause of the
statistical significance of EI , a new network was constructed in which each undirected PPI edge
was assigned a random direction. The significance of EI was then calculated from the new network
using 10 million random sets of 21 nodes as a control. Only 63 of these sets had set efficiencies
greater than EI , which gives an estimated p-value of 63/107 = 6.3 × 10−6 . Both estimates of the
p-value clearly demonstrate the significance of the proximity of the mutations in AML 2.1.
It is well known in data science that even simple models based on a large amount of data

42

can outperform complex models based on a small amount of data [59, 66]. Efficient network
reconstruction algorithms that can leverage the large and ever increasing amount of freely available
gene expression (and perhaps other) data across many organisms and cell types could provide
researchers with insights into the mechanisms governing properly functioning cells and suggest
therapies for malfunctioning cells. Given its ability to identify expected characteristics such as the
compartmentalization of pathways in communities and the functional location and close proximity
of mutations, Ong’s method appears to successfully reconstruct a reasonable AML GRN.
Our collaboration recently developed a weighted AML network, AML 2.3, built using a tweaked
version of Ong’s method with a total of twelve AML data sets. Ten of these data sets were used
to build a network in precisely the same way as AML 2.1. The weight assigned to each of the xn
edges in overlap category n is given by the fraction of those xn edges detected by performing Ong’s
method on the two remaining data sets. For example, if xn = 1,000 edges were detected in overlap
3, and 600 of those same edges were detected in either of the two remaining data sets, then all
edges in overlap 3 would be assigned a weight of 600/1000 = 0.6. By construction, 0 ≤ Wi j ≤ 1.
This weight can be interpreted as the probability that an edge from a given overlap category is
actually representative of a real edge based on its expected reproducibility. Chapter 4 analyzes the
evolutionary properties of AML 2.3, further validating the reliability of Ong’s method.

43

CHAPTER 4
EVOLUTION OF GENE REGULATORY NETWORKS

Evolution is a tinkerer.
—François Jacob, Evolution and tinkering
The following chapter is adapted from my 2016 publication, Evolutionary and topological
properties of genes and community structures in human gene regulatory networks [142], and is
organized as follows.
• Section 4.1, Background: an overview of previous work concerning the evolution of individual genes and GRNs
• Section 4.2, Single-gene evolutionary measures: a discussion of the connection between the
two evolutionary measures of interest in this chapter, gene age and gene evolutionary rate
• Section 4.3, Evolution and GRN topology: an analysis of how the evolution of individual
genes is linked to the topology of two GRNs, AML 2.3 and HumanNet
• Section 4.4, Conclusion: a summary of the chapter’s findings
Supplementary information including extra plots and tables deemed too unwieldy for this dissertation can be found in the original publication.

4.1

Background

The evolution of a gene can be mapped in various ways. The absolute evolutionary rate (ER) of
a gene, for example, can be computed from observed differences in orthologs1 across species in
the context of their phylogenetic relationships [82], whereas the age of a gene can be measured by
1 Orthologs, also known as orthologous genes, are genes from separate species that can be traced

back to a common ancestral gene, but whose sequences may have diverged due to accumulated
mutations.

44

tracing when the gene first appeared in the organism’s phylogenetic tree [25]. Quantities such as
these allow researchers to chronicle the journey of individual genes across evolutionary history.
But genes do not exist, and therefore do not evolve, in isolation. Mutations in a transcription
factor may affect the expression of the genes it regulates, since changes in a protein’s amino acid
sequence can cause it to lose compatibility with former binding partners, and gain compatibility
with new partners. Accumulation of these alterations can lead to changes in fitness and, eventually,
speciation. The evolution of individual genes is thus coupled with the evolution of the structure of
the organism’s GRN, and network properties should be related to the evolutionary properties of its
constituent nodes and edges.
It has been proposed that GRNs grow and evolve incrementally via gene duplication followed
by mutation and functional divergence [15, 115, 130, 146, 152], although changes may have occasionally arrived in bursts, as in whole-genome duplication [37]. This time-dependent network
formation suggests that GRNs are composed of a core of ancient, conserved genes with fundamental functions, and younger, peripheral genes with species- or cell type-specific function, which
mutate frequently until the functions of the newly created pathways are optimized. These mutations can alter GRNs by creating, removing, reassigning, or changing other properties of nodes and
edges.
Fraser et al. demonstrated that interacting pairs of proteins have similar ERs [53]. This constraint is likely driven by the necessity of coevolution, since a change in one protein’s sequence
may require a corresponding change in its partner’s sequence in order for the pair to remain compatible. Daub et al. showed that genes which are part of many biological pathways have lower ERs
than genes which belong to few or no known pathways, further supporting the idea that related
genes share similar evolutionary properties [34]. It has also been shown that ERs are weakly, but
significantly, negatively correlated with degree, closeness centrality, and betweenness centrality,
and that essential genes have high centrality and low ERs [58].
Originally in [142] and reproduced below, I establish quantitative relationships between the
evolutionary history of genes and their topological properties in AML 2.3 [116] as well as for a

45

general human GRN, HumanNet2 [88]. In contrast to the earlier studies mentioned above, this
analysis moves beyond single-node centrality and pairwise measures by studying the connection
between network topology and evolution, particularly from the point of view of network community structures. It will be shown that the ERs and ages of genes are not randomly distributed across
the networks, but are naturally organized in communities with well-defined evolutionary characteristics: old genes cluster with old genes, and young cluster with young. Likewise, “cold genes”
(genes with low ERs) cluster with cold genes, and “hot genes” (genes with high ERs) cluster with
hot genes. This segregation also exists for enriched groups of genes identified by DAVID within
the communities. In terms of network topology, genes and DAVID groups which are old and cold
tend to be central, and those which are young and hot tend to be peripheral. This is demonstrated
with traditional single-node centrality measures as well as with new network measures, the set
efficiency and the interset efficiency (defined and discussed in Section 2.6). PageRank, a finiterange centrality measure, shows stronger biological significance than degree (a local measure) and
betweenness centrality (a global measure), and the set efficiency and interset efficiency correlate
strongly with the evolutionary histories of individual genes and DAVID groups.

4.2

Single-gene evolutionary measures

To compute the ER of a gene, the absolute ER for each amino acid position of the protein it
encodes was computed using the method from Kumar et al. [82]. Given the multiple alignment at
an amino acid position in 46 species [55], the ER of a given amino acid position equals the number
of different residues divided by the total evolutionary time span, based on a known phylogenetic
tree [82]. The ER of a gene is the average ER over all amino acid positions, in units of the number
of substitutions per amino acid site per one billion years. The ER value ranges from 0.011 (most
conserved) for LSM2 to 6.928 (least conserved) for CDRT15.
Ages, taken from Chen et al. [25], were estimated by comparing the human genome to the
2 HumanNet is a GRN constructed from 21 different methods using diverse data types, including

microarray co-expression, databases and mass spectrometry proteomics.

46

Figure 4.1 Phylogenetic tree used to compute gene ages. Image taken from [25].

genomes of species from 13 major clades which branched from the evolutionary path that resulted
in humans, indexed 0 (oldest) through 12 (youngest). A gene’s age (more technically, a lower
bound on its age) was determined by searching for the earliest time at which an orthologous gene
appears in an organism from one of these 13 clades (see Fig. 4.1).
While computed differently, a gene’s ER and its age are related. It could be that young genes
with novel functions need time to fine-tune their properties in order to optimize the fitness of
the host organism, so young genes tend to be hot. Likewise, old genes with fundamental roles,
such as protein translation, may have had enough time to sufficiently optimize their functions, and
so should change very slowly. As expected, the ERs and ages of genes are strongly correlated
(R = 0.504, p < 10−300 ).

4.3

Evolution and GRN topology

Consistent with previous results [53], interacting genes tend to have similar ERs and ages. The
distributions of differences in ERs and ages between genes linked by an edge in AML 2.3 are closer
to zero than those of degree-preserving randomizations of the same network. The distribution of
ER differences is shown in Fig. 4.2. To estimate the significance of this difference in widths,
20,000 degree-preserving randomization of AML 2.3 were conducted and the standard deviation

47

of each distribution was recorded. Comparing the true standard deviation to the randomized trials
resulted in an approximate Z-score3 of −96.8 for differences in ER and −72.0 for differences
in age. This tendency for connected genes to have similar ERs and ages hints that there may be
large-scale segregation between clusters of old, cold genes and young, hot genes. Indeed, this is
reflected in the natural community structure present in AML 2.3, as well as in the DAVID groups
present within these communities.
The main results of the community analysis are in Table 4.2 for AML 2.3 and Table 4.3 for HumanNet. These tables list the ER and age properties for the ten largest network communities, and
for the three most significantly enriched DAVID groups found within each community. Communities and DAVID groups in Tables 4.2 and 4.3 labeled “cold” and “hot” have significantly lower and
higher ERs than the network average, respectively. Likewise, groups of genes labeled “old” and
“young” are significantly older and younger than the network’s average age, respectively. A relatively strict one-tailed significance level of p < 10−3 in the difference from the mean was chosen
for both ER and age.
The ERs and ages for many of these DAVID groups reflect their biological functions. Zinc finger proteins, which are enriched in both AML 2.3 and HumanNet, are involved in a large number
of heterogeneous cellular processes [28], so it may be that their genes need to adapt more often
than genes with very specific singular functions. They also have a particularly high rate of duplication and loss, so while the family itself is old (found in animals, plants [28], and fungi [14]),
individual genes in this family are young [95]. Genes involved in transcriptional regulation may
also need to be flexible enough to tune the expression of target genes in response to environmental
changes over time [145, 160]. The olfactory group is enriched in HumanNet, and it is significantly
younger than average. A small number of olfactory genes were present in early chordates, but
olfactory systems became far more complex and diverse in land-dwelling animals, particularly in
mammals [113]. Conversely, the most fundamental DAVID groups have experienced few changes
since early single-celled lifeforms. DAVID groups such as mRNA metabolic process [8] and trans3Z

= (x − µ)/σ where x is the measurement, and µ and σ are the distribution’s mean and
standard deviation.
48

Figure 4.2 Difference in evolutionary rates (ER) between pairs of genes, (ER j − ERi ), for pairs
connected by an edge j → i in AML 2.3 (green) and for pairs connected by an edge in one
degree-preserving randomization of AML 2.3 (purple). Note that the distribution is asymmetric
because AML 2.3 is a directed network. The width of the true distribution is significantly smaller
than the mean of the standard deviation of 20,000 degree-preserving randomizations, resulting in
an approximate Z-score of −96.8.

lational elongation [78] in AML 2.3 as well as ribosome [138] and protein kinase core [99] in
HumanNet are old and stable, having nearly optimized their functions long ago.
As a control for the enriched DAVID groups, ten new communities were built by randomly
shuffling the genes between communities from the network, while maintaining the size of each
community. The resulting random communities were then analyzed using DAVID. This randomization procedure was followed for both AML 2.3 and HumanNet, and in both cases, the enrichment was far less significant than for the real communities. The enriched DAVID groups in
Tables 4.2 and 4.3 are thus biologically meaningful, not merely coincidental.
Fig. 4.3 analyzes DAVID groups in AML 2.3 and their relationship with network communities
and evolutionary properties. Fig. 4.3A shows that the ER distribution of translational elonga-

49

Figure 4.3 (A) Distribution of evolutionary rates (ERs), measured in units of the number of
nonsynonymous substitutions per amino acid site per billion years, for all genes (purple) and for
genes in the translational elongation DAVID group (green). This DAVID group has a very low ER
compared to the background distribution. (B) Distribution of ages for all genes (purple) and genes
in the transmembrane DAVID group (green), where age=0 is the oldest and age=12 is the
youngest. Transmembrane genes are much younger than average. (C) Summary of mean ER and
mean age for DAVID groups in Table 4.2. The relative ERs on the x-axis are computed from
ERrelative = ERDAVID group mean − ERnetwork mean , and likewise for relative age on the y-axis.
The DAVID groups from (A) and (B) have bold labels in (C). Each marker type corresponds to
one of communities 0 through 9. As expected, old DAVID groups tend to have a low average ER
(i.e. are “cold”), and young DAVID groups tend to evolve frequently (i.e. are “hot”).
Unabbreviated DAVID group names are listed in Table 4.2.

tion genes is noticeably left-shifted relative to the ERs of all genes, indicating that it hosts relatively slowly evolving genes. Transmembrane genes are much younger than average, as shown
in Fig. 4.3B. Fig. 4.3C provides a comprehensive picture of the evolutionary properties of the ten
largest network communities (symbols) with their main DAVID groups (as labeled).

50

Single-gene ER
Single-gene age
DAVID group ER
DAVID group age

Degree centrality
R
p-value
ρ
p-value
-0.06 8.9E-10 -0.15 7.4E-51
-0.06 7.3E-09 -0.18 2.5E-73
-0.13 5.3E-01 -0.04 8.4E-01
-0.1 6.4E-01 -0.21 3.1E-01

R
-0.14
-0.12
-0.58
-0.75

PageRank
p-value
ρ
p-value
1.5E-43 -0.25 5.6E-141
4.1E-33 -0.22 1.7E-111
2.3E-03 -0.57 2.8E-03
1.4E-05 -0.86 5.1E-08

R
-0.07
-0.04
-0.25
-0.26

Betweenness centrality
p-value
ρ
p-value
1.2E-11 -0.21 3.0E-103
5.5E-05 -0.14 2.0E-47
2.3E-01 -0.18 3.9E-01
2.1E-01 -0.23 2.7E-01

Table 4.1 Correlations (Pearson’s R and Spearman’s ρ) between evolutionary measures and
network measures in AML 2.3. The most significant correlation is between PageRank and ER for
individual genes, and between PageRank and the mean age of DAVID groups (see Fig. 4.4).
Dividing genes into DAVID groups causes stronger relationships between network topology
and evolutionary properties to emerge. Traditional single-node centrality measures such as degree,
betweenness centrality, and PageRank show small but significant correlation with ERs and ages,
with the oldest, coldest genes being the most central (see Table 4.1). Grouping genes by DAVID
group leads to stronger correlations, the clearest of which is between the mean PageRank and
mean age, shown in Fig 4.4 (Pearson’s R = −0.75, p = 1 × 10−5 ; Spearman’s ρ = −0.86, p =
5 × 10−8 ). These three centrality measures are related, but differ in their global reach. Degree is
completely local, only dependent on the number of neighbors of a gene; betweenness centrality is
global, requiring information from the entire network; but PageRank is between these extremes,
influenced by all genes but with more weight granted to those genes which are nearby. The strong
correlation between PageRank and evolutionary measures thus may be explained by the presence
of communities in the GRN, since community structure itself is strongly correlated with ER and
age, as shown in Tables 4.2 and 4.3.
Because of the strong correlation between a gene’s history and that of its neighbors, genes are
expected to evolve in groups rather than as individuals, which should be evident in the structure of
the network. The set efficiency is shown in Fig. 4.5 for genes in AML 2.3 ranked from coldest to
hottest, computed for the first 500 genes in the list, then the first 510 genes, etc. in steps of 10. As a
control, the same “cumulative set efficiency” was computed for 100 sets of randomly ranked genes
(but leaving the underlying network, AML 2.3, unchanged). Fig. 4.5 shows the mean of these
100 controls (solid green line) plus/minus one standard deviation (dashed green lines). Fig. 4.6
shows the same plot for genes ranked from oldest to youngest, with similar results. These plots
indicates that the oldest, coldest genes tend to be close, separated by approximately four directed
51

Figure 4.4 Mean PageRank versus mean age of each DAVID group from Table 4.2 (age=0 is the
oldest and age=12 is the youngest). Old DAVID groups tend to have high PageRank.
Unabbreviated DAVID group names are listed in Table 4.2.

edges, significantly smaller than the network average of approximately six. The set efficiency
monotonically declines as hotter, younger genes are included. Note that the plateau below rank
2000 in Fig. 4.6 is a result of the discrete nature of the age data.
Furthermore, the oldest DAVID groups efficiently exchange information with each other, and
the youngest DAVID groups are distant from the oldest DAVID groups as well as from each other.
Fig. 4.7 shows the interset efficiency between all pairs of DAVID groups in AML 2.3, where the
DAVID groups are sorted from oldest to youngest. Note that each diagonal term of the interset
efficiency matrix is the set efficiency of that DAVID group. Similarly, Fig. 4.8 shows the interset
efficiency between DAVID groups in HumanNet.
Purely locally, AML 2.3 and HumanNet look quite different from one another. AML 2.3 is
composed of roughly 10,000 genes and 338,000 edges, and HumanNet is composed of 14,000

52

Figure 4.5 Set efficiency and evolutionary rate for AML 2.3. The cumulative set efficiency (SE)
of all genes below a given evolutionary rate (ER) rank (lowest to highest ER, i.e. “coldest” to
“hottest”). The SE of the 500 coldest genes is significantly higher than the control, and including
hotter genes monotonically decreases the SE. This indicates that the coldest genes are located
near each other, while the hottest genes are more dispersed.

genes and 876,000 edges. While they share roughly 9,000 genes, they share only 26,000 edges.
However, modularity and interset efficiency, which are coarse-grained network measures that reveal the properties of sets of nodes rather than individual nodes or pairs of nodes, demonstrate that
the same evolutionary signatures are present in both networks.

4.4

Conclusion

It was demonstrated that slowly evolving, old genes tend to interact with each other, and frequently
evolving, young genes tend to interact with each other, whereas edges between those groups are
less common. This naturally creates communities of genes with relatively homogeneous evolutionary attributes. Analyzing the networks in terms of communities and DAVID groups rather than
single genes provided a new perspective which established clear relationships between network

53

Figure 4.6 Set efficiency and age for AML 2.3. The cumulative set efficiency (SE) of all genes
below a given age rank (oldest to youngest). The SE of the 500 oldest genes is significantly
higher than the control, and including younger genes monotonically decreases the SE (after a
transient period due to the discrete nature of the age data). This indicates that the oldest genes are
located near each other, while the youngest genes are more dispersed.

topology and evolution. The abundance of connections between old DAVID groups and the relative scarcity between old-and-young and young-and-young DAVID groups suggests that during
the course of human evolution, the primitive gene regulatory network present in early metazoans
began as a core of fundamental genes and pathways. As genes duplicated and mutated, novel functions arose and eventually, through selective duplications, deletions, mutations, translocations, and
rewirings, novel regulatory pathways emerged, growing outward from these ancient genes. This
placed the oldest genes near the middle of the network and the youngest genes toward the periphery. These findings were mainly derived from an AML network and a general human network, and
are consistent with previous reports [58].
No gene is an island. A real understanding of the evolution of a genome only comes from
studying its constituent genes in the context of the underlying complex network of interactions

54

Figure 4.7 Interset efficiency and age for AML 2.3. Interset efficiency from DAVID group in
column j to DAVID group in row i. The list of DAVID groups was sorted by average age from
oldest (transcriptional elongation) to youngest (hemoglobin complex). Old DAVID groups
exchange information efficiently, as indicated by the high interset efficiency values in the
lower-left corner. Younger DAVID groups, particularly the blood cell-specific DAVID groups of
lymphocyte activation and hemoglobin complex, are remote from most other DAVID groups.
Note that the above matrix is asymmetric because the network is directed, and that the colors are
log-scaled.

55

Figure 4.8 Interset efficiency and age for HumanNet. See Fig. 4.7 for explanation.

56

rather than as independent units. As network reconstruction methods continue to improve and
more high quality networks become available, researchers will find more evidence of how evolution
shaped, and continues to shape, the topology of gene regulatory networks.

57

Comm. Num.
index genes
0
1760
1579

2

1208

3

1055

4

867

5

780

6

748

7

417

8
9

296
270

58

1

Comm.
ER
hot

Diff. in
pComm. Diff. in
pGroup
DAVID group name
mean
value
age
mean
value
type
0.24
2.5E-81 young
1.01
6.7E-28 Zinc finger
P
Transcription regulation
B
average -0.03 2.3E-02
old
-0.39 6.8E-07 Transcription
B
Intracellular organelle lumen
L
Protein localization
B
cold
-0.10 2.6E-10
old
-1.30 1.2E-58 Nuclear lumen
L
Cell cycle phase
B
Response to DNA damage stimulus
B
average
0.01
2.4E-01 young
1.41
5.4E-53 Cell-cell signaling
B
Plasma membrane part
C
cold
-0.27 3.7E-55
old
-0.94 1.7E-16 mRNA metabolic process
B
Nuclear lumen
L
mRNA transport
B
cold
-0.11 1.1E-05
old
-0.91 5.1E-10 Establishment of protein localization
B
Actin filament-based process
B
Regulation of programmed cell death
B
hot
0.18
3.5E-22 young
1.34
2.8E-29 Lymphocyte activation
B
Hemoglobin complex
P
Transmembrane
P
average -0.08 1.2E-02
old
-1.54 8.3E-22 Mitochondrial envelope
C
Respiratory chain
C
Ribosomal protein
P
hot
0.16
1.5E-05 young
1.49
4.8E-14 Homeobox
P
cold
-0.28 5.7E-14
old
-1.96 2.9E-20 Translational elongation
B
rRNA processing
B

Num.
DAVID
Group Diff. in
genes Benjamini
ER
mean
275
1.2E-19
hot
0.32
298
1.9E-16
hot
0.28
304
4.8E-24
average -0.05
239
9.7E-19
average -0.07
145
2.9E-10
cold
-0.27
332
8.4E-107
cold
-0.21
169
3.6E-98
average
0.06
135
3.5E-58
average
0.03
146
3.6E-39
cold
-0.16
359
1.1E-67
cold
-0.11
124
9.5E-66
cold
-0.46
212
9.3E-60
cold
-0.34
27
1.2E-11
average -0.36
105
1.1E-19
cold
-0.33
51
1.9E-16
cold
-0.41
79
2.6E-07
average -0.03
40
1.2E-11
hot
0.68
13
1.3E-12
average
0.54
252
1.8E-05
hot
0.51
103
1.0E-68
average -0.14
44
4.9E-47
average
0.03
62
2.4E-53
average
0.18
27
2.2E-12
average -0.22
77
1.7E-114
cold
-0.60
27
1.1E-22
average -0.24

pGroup Diff. in
value
age
mean
6.0E-19 young
0.75
4.8E-14 young
1.17
4.6E-02 young
0.69
7.1E-02
old
-1.18
7.3E-05
old
-1.50
2.6E-07
old
-1.94
1.0E-01
old
-1.67
3.1E-01
old
-1.97
7.0E-04 young
1.12
2.5E-04 young
0.82
2.7E-25
old
-1.84
5.3E-24
old
-1.45
5.8E-03 average -2.17
5.2E-09
old
-2.03
9.0E-06
old
-1.96
4.2E-01 average -0.63
1.2E-10 young
3.27
5.1E-03 young
4.43
2.0E-34 young
2.55
2.0E-02
old
-1.98
3.4E-01
old
-1.57
2.3E-02
old
-1.73
7.3E-02 young
3.01
6.0E-18
old
-3.04
3.2E-02
old
-2.93

pvalue
4.4E-05
9.3E-11
5.6E-05
1.7E-08
5.2E-08
1.6E-28
5.4E-07
3.2E-07
1.3E-05
4.1E-06
2.7E-06
1.2E-09
7.2E-03
1.6E-12
9.0E-06
3.0E-02
6.2E-11
1.1E-04
4.5E-28
4.5E-10
9.3E-05
3.2E-05
3.7E-07
1.7E-13
1.7E-05

Table 4.2 Evolutionary properties of communities and DAVID groups in AML 2.3. Gene evolutionary rates (ERs) take real values from
0 (most conserved) to approximately 6.9 (most variable), and ages take integer values from 0 (oldest) to 12 (youngest). The table is
organized as follows. "Comm. Index" is the index of the ten largest communities. "Num. genes" is the number of genes in the
community. "Comm. ER" indicates whether the community is significantly hotter (i.e. has a higher ER) or colder (i.e. has a lower ER)
than the mean of 300 equally-sized sets of genes randomly selected from the network, with a significance threshold of p = 10−3 . "Diff.
in mean" is the difference between the mean ER of the community and the mean ER of the 300 randomly selected sets. "p-value" is the
significance of the difference. "Comm. age", "Diff. in mean", and "p-value" are the same as previously stated, but for age rather than
ER. "DAVID group name" is the name of the DAVID group that DAVID identified as enriched in each community. “Group type” states
whether the DAVID group is a protein type (P), location of final gene product (L), biological process (B), or cellular component (C).
"Num. genes" is the number of genes in the DAVID group. "DAVID Benjamini" is the significance of the enrichment of the DAVID
group, as reported by DAVID. The remaining DAVID group columns are computed in the same manner as the community columns.

58

Comm. Num.
index genes
0
2961

Comm.
ER
hot

Diff. in
pmean
value
0.38
1.0E-185

Comm.
age
young

Diff. in
pmean
value
2.05
7.1E-198

59

1

2849

cold

-0.14

8.7E-30

average

0.10

8.2E-03

2

2665

cold

-0.12

9.1E-21

old

-1.50

5.3E-97

3

1402

cold

-0.07

1.7E-05

old

-1.05

2.7E-28

4

1115

cold

-0.24

7.1E-30

old

-2.20

8.3E-89

5

781

cold

-0.26

1.6E-24

old

-1.08

8.1E-16

6

736

cold

-0.09

6.1E-04

young

0.75

1.3E-07

7

731

hot

0.20

9.7E-16

average

0.17

1.1E-01

8

320

average

0.05

1.2E-01

young

0.70

4.6E-04

Group Num.
DAVID
Group Diff. in
type
genes Benjamini
ER
mean
Immunoglobulin domain
P
201
7.6E-66
hot
0.92
EGF-like domain
P
122
6.8E-55
hot
0.25
Chemotaxis
B
93
1.5E-40
hot
0.86
Protein kinase core
P
344
3.2E-226
cold
-0.30
Pos. reg. of transcr. from RNAP II promoter
B
181
4.0E-57
cold
-0.30
Transcription regulation
B
503
4.2E-59
cold
-0.22
Mitochondrial envelope
C
211
2.7E-81
cold
-0.16
Protein transport
B
243
1.2E-51
cold
-0.35
Peroxisome
C
54
2.6E-30
average -0.11
Response to DNA damage stimulus
B
145
1.9E-71
average
0.03
Cellular macromolecule catabolic process
B
195
1.6E-70
cold
-0.31
Mitosis
B
82
1.2E-58
average
0.01
Ribosome
C
68
7.2E-89
cold
-0.64
Nuclear lumen
L
241
3.6E-59
cold
-0.29
Nucleosome
C
61
2.5E-29
cold
-0.50
mRNA processing
B
151
1.4E-144
cold
-0.51
Nuclear lumen
L
185
1.1E-75
cold
-0.39
WD40 repeat region
P
83
5.1E-60
cold
-0.30
Sulfotransferase
P
26
1.6E-33
average
0.06
Vision
B
32
1.2E-16
average
0.03
C2 calcium-dependent membrane targeting
P
25
6.7E-11
average -0.14
KRAB
P
168
9.5E-208
hot
0.67
Zinc finger
P
205
3.4E-245
hot
0.62
Domain: SCAN box
P
22
1.4E-20
hot
0.54
ANK repeat
P
58
2.4E-72
average
0.07
DAVID group name

pvalue
4.0E-82
1.5E-04
4.7E-31
1.5E-15
1.3E-08
2.0E-11
5.0E-04
2.6E-15
1.2E-01
3.3E-01
5.4E-10
3.8E-01
9.6E-16
5.6E-13
9.6E-06
9.8E-25
1.9E-15
5.3E-05
3.4E-01
3.9E-01
1.7E-01
4.9E-36
1.4E-32
3.2E-04
2.4E-01

Group Diff. in
age
mean
young
4.16
young
1.02
young
4.09
old
-1.69
young
0.69
young
1.04
old
-2.24
old
-2.20
old
-2.39
old
-2.35
old
-2.10
old
-1.83
old
-3.50
old
-2.38
average -0.09
old
-2.32
old
-2.15
old
-1.95
average
0.23
average
0.84
average
0.14
average
0.66
average
0.84
young
4.82
average
0.15

pvalue
2.2E-58
3.3E-04
1.9E-23
2.0E-18
4.6E-03
1.8E-12
1.0E-23
1.9E-21
2.0E-06
2.0E-16
7.2E-16
1.9E-06
1.8E-15
5.6E-27
4.0E-01
3.5E-15
1.1E-15
9.3E-07
3.6E-01
1.1E-01
4.5E-01
1.1E-02
1.3E-03
8.1E-10
3.9E-01

Table 4.3 Evolutionary properties of communities and DAVID groups in HumanNet. See Table 4.2 for explanation of column headers.

59

CHAPTER 5
THE HOPFIELD MODEL

Children and lunatics cut the Gordian knot which the
poet spends his life patiently trying to untie.
—Jean Cocteau
The following chapter introduces a dynamical mathematical system, the Hopfield model, and
is adapted from my 2014 publication, Control of asymmetric Hopfield networks and application to
cancer attractors [141]. It is organized as follows.
• Section 5.1, Why Boolean?: the motivation behind using Boolean models to understand gene
regulatory networks
• Section 5.2, The Hopfield model: a description of the canonical Hopfield model and some of
its basic properties
• Section 5.3, Cancer signalling and the Hopfield model: an application of the Hopfield model
to signalling in cancerous gene regulatory networks
• Section 5.4, Conclusion: a summary of the chapter’s findings
For further details concerning the Hopfield model and its applications, see Introduction to the
Theory of Neural Computation by John Hertz, Anders Krogh, and Richard Palmer [67].

5.1

Why Boolean?

The boom in the volume of biological data over the past decade has made mathematical modeling in biology both more feasible and more important. Analyzing genome-wide expression data
increasingly requires the use of advanced clustering techniques and network models (and even
network-based clustering techinques [97, 148]). However, standard static networks provide only a
sort of time-averaged representation of gene regulation. In reality, GRNs are dynamical systems

60

Figure 5.1 [A] Schematic of a protein energy landscape. The high dimensional configuration
space has been collapsed to the horizontal axis, with isolated configurations drawn on the left and
highly interacting configurations on the right. Local energetic minima are stable configurations,
and transitions between configurations can occur due to thermal fluctuations, protein-protein
interactions, and changes in the microenvironment. [B] Diagram of some of the possible
transitions in the configuration of a protein. Image taken from [13].

that transmit information via molecular signals across networks. Predicting the effects of perturbations such as gene knockouts (disabling a gene or multiple genes in an organism to observe
changes in expression and/or phenotype) or the application of kinase inhibitors (a class of drugs
which blocks certain protein enzymes called kinases) requires a dynamical model.
It is perhaps natural for physical scientists, and particularly physicists, to try to design a firstprinciples mathematical model built from a handful of microscopic details to predict and explain
macroscopic phenomena. The power of this approach is clear [158]: knowing the lattice structure
of a crystal can lead to highly accurate predictions of X-ray diffraction patterns, and the critical
temperature of a noninteracting Bose-Einstein condensate can be computed to great precision from
just two parameters, the mass and the density of particles.
A similar ab initio approach is taken in the field of protein folding, the primary goal of which
is predicting the folded shape of a protein from its sequence and local environment (including
both small molecules and other macromolecules) alone [13]. Much effort has been devoted to
understanding the folding landscape, shown schematically in Fig. 5.1A, where proteins transi-

61

Figure 5.2 Schematic of Eq. 5.1. Genes produce mRNAs, mRNAs produce proteins, proteins
regulate mRNA production rates, and mRNAs and proteins decay, all at varying rates. Image
taken from [26].

tion between states on an energetic landscape and settle into local energetic minima, occasionally
“hopping” to neighboring wells with the assistance of thermal fluctuations and interactions with
chaperone proteins as shown in in Fig. 5.1B.
While the configuration of a protein is defined by many degrees of freedom, the vast majority
of states are transient states. However, a great number of local minima pocket the landscape, some
of which are energetically highly favorable but biologically damaging, such as the protein aggregates formed in mad cow disease. Furthermore, mutations which alter the amino acid sequence
of encoded proteins can alter this landscape, making the normally functional configurations of the
protein inaccessible, and creating new nonfunctional energetic minima. Given that a host of afflictions such as Huntington’s disease and sickle cell disease are caused by the misfolding of a single
protein (the proteins huntingtin and HBB, respectively), understanding the mechanics of protein
folding is a crucial task for advancing medicine.
However, cells are highly complex, diverse, crowded, stochastic, non-equilibrium microenvironments whose processes and components span a broad spectrum of length and time scales, and
the combinatorics of multi-gene diseases such as cancer make the problem far more difficult to
analyze and understand from a molecular point of view. Even if all of the innumerable parameters that govern the dynamical processes of a particular cell were known, modelling the dynamics

62

of genome-wide GRNs atom-by-atom to predict the effects of mutations is computationally intractable. Accordingly, some approximations must be made.
An alternative to the molecular dynamics approach is modelling gene regulation with coupled
differential equations [150]. One possible system of differential equations, shown schematically in
Fig. 5.2, is [26]
dr
= f (p) −V r
dt
dp
= Lr −U p
dt

(5.1)

where
N = the number of genes in the genome
r = mRNA concentrations, functions of t
p = protein concentrations, functions of t
f (p) = transcription functions

(5.2)

L = translational constants, N × N diagonal matrix
V = mRNA degradation rates, N × N diagonal matrix
U = protein degradation rates, N × N diagonal matrix
While much simpler than large-scale molecular dynamics simulations, modelling the behavior
of a signficant fraction of the genome using this approach requires knowledge of a great deal
of parameters, and the model’s nonlinear nature (buried in the f (p) term) makes solving such
systems computationally expensive. Hill functions are commonly used [75] to model protein-gene
interactions. Activating Hill functions with a single regulatory protein, for example, take the form
apn
f (p) = n
k + pn

(5.3)

for amplitude a, Hill coefficient n (a steepness parameter), and apparent dissociation constant k
(the ratio of unbinding to binding rates for the protein/gene pair). A family of curves for a = 1 and
k = 5 is shown in Fig. 5.3. For large n, the rate of transcription is sensitive to protein concentration
only over a short range near k,1 resulting in switch-like [47] gene regulation.
1 For

large n and given k, a Taylor expansion at p = k shows that f (p) has nearly zero slope
everywhere except in the approximate range k 1 − n2
p k 1 + 2n .
63

Figure 5.3 Family of Hill functions from Eq. 5.3 for a = 1 and k = 5.

As argued in Section 3.4, it may not be necessary to understand the precise details of every
interaction in a GRN to understand some of its high level properties. Inspired by the approximate
switch-like nature of some genes, Boolean models provide a simple way to interpret both the
current state of a gene (either inactive or saturated) and the nature of interactions between multiple
genes. Boolean models tend to be defined via discrete-time (t ∈ Z) mappings of the form
σ (t + 1) = f (σ (t))

(5.4)

where N is the number of nodes, σ (t) ∈ {0, 1}N , and the particular model is defined by the form
of the mapping function f : {0, 1}N → {0, 1}N (mapping a Boolean vector to another Boolean
vector).
One of the most famous Boolean models, the Kauffman model [74], also known as a random
Boolean network (RBN), treats interactions between genes as random sets of digital logic gates
embedded in a random network, and was created to demonstrate the complex dynamics that can
emerge from a collection of many elements interacting via a simple set of rules. For example, a
system in which the production of a protein D is activated by transcription factors A and B and is
64

Figure 5.4 Circuit and logic table for Eq. 5.5.

repressed by C could be modelled as
D(t + 1) = [A(t) AND B(t)] AND [NOT C(t)]

(5.5)

as shown diagrammatically in Fig. 5.4.2 Because the trajectory of an RBN is deterministic and
confined to a finite number of configurations 2N , all states are guaranteed to eventually converge
to point attractors (stable states) or cyclic attractors (periodic states).3 Fig. 5.5 shows the dynamics
of one such random Boolean NK network for N = 100, K = 2, and a random initial state. Although
there are 2100 ≈ 1030 unique configurations, the system settles into a cyclic attractor with period
4 in less than 20 time steps. The simple rules that govern RBNs generate rich, complex dynamical
properties and have been studied extensively [41, 42, 135], including phase changes, stability (sensitivity to single bit flips), and capacity (average number of attractors per node) on various network
topologies.

5.2

The Hopfield model

The Hopfield model is a kind of associative memory model that was introduced by the physicist
JJ Hopfield in 1982 [68], and was designed to mimic the way a brain stores memories (attractors)
and recalls them from particular stimuli, and has been applied to areas such as neural computation
2 In Eq. 5.5, C is a canalyzing variable, since C(t) = 1 guarantees D(t + 1) = 0 regardless of the

state of the other inputs. Canalyzing variables have been shown to greatly increase the stability of
the Kauffman model [73, 87, 121].
3 A point attractor is a special case of a cyclic attractor with period 1.
65

Figure 5.5 Trajectory of a random Boolean NK network with N = 100, K = 2, and a random
initial state. Black and white pixels represent 0’s and 1’s. The system quickly settles into a cyclic
attractor with period 4.

and image recognition. It belongs to the class of infinite-range spin glass models such as the
the Sherrington-Kirkpatrick model [76]. Similarities and differences between the Kauffman and
Hopfield models have been studied for many years [5, 83, 84, 126]. The most important distinction
for the purposes of this dissertation is their differing abilities to encode data: RBNs define random
interactions that give rise to emergent attractors, whereas the Hopfield model begins with a set of
desired attractor states and constructs a set of node-node couplings which guarantees them. As
with the protein folding energy landscape in Fig. 5.1A, the (symmetric) Hopfield model defines
a high dimensional energy landscape with local energetic minima corresponding to programmed
attractor configurations.4
Formally, the canonical Hopfield model is an Ising model whose configuration is defined by N
spins σi (t) at time t ∈ Z. The state of each node takes one of two values, σi (t) = ±1 (on/off). The
coupling matrix, which defines both the strength and the sign of the signal from node j to node i
4 Only

Hopfield systems with symmetric (i.e. undirected) connections can be visualized as energetic landscapes. Hopfield systems with directed edges still obey the same signaling rules, but
may produce cyclic attractors.

66

(+ for ferromagnetic and − for antiferromagnetic), is constructed according to
Ji j =

1 p µ µ
∑ ξi ξ j
N µ=1

(5.6)

µ

where p is the number of attractors, ξi = ±1 is the state of the ith spin in the µ th attractor, and the
leading factor of 1/N is conventional. (Note that the superscript µ is an index, not an exponent.)
A certain subset of interactions can be selected by introducing an adjacency matrix Ai j so that
Ai j p µ µ
Ji j =
∑ ξi ξ j
N µ=1

(5.7)

The total field at node i at time t is given by
hi (t) = ∑ Ji j σ j (t) + hext
i (t)

(5.8)

j

where ∑ j Ji j σ j (t) is the internal field at node i due to its coupling with all nodes j, and hext
i (t) is
an optional external field applied to node i. The dynamical update rule is defined by



+1 with probability (1 + e−2hi (t)/T )−1
σi (t + 1) =


−1 otherwise

(5.9)

(known as Glauber dynamics) where the factor of 2 in the exponent is conventional and T is the
“temperature,” i.e. the level of noise. Biologically, this noise represents the effects of all kinds of
fluctuations present in cells. Note that for hi (t) → ±∞, σi (t + 1) = ±1; and for T → ∞, σi (t + 1) =
±1 with equal probability. For T → 0 (i.e. fully deterministic dynamics),
σi (t + 1) = sign (hi (t))

(5.10)

A family of curves for Eq. 5.9 for various temperatures is shown in Fig. 5.6.
There are two main ways to apply the Hopfield update rule: the synchronous and asynchronous
schemes. The synchronous scheme assumes that there is some sort of centralized “clock” that
updates all nodes simultaneously at every time step, analogous to the clock signal used in synchronous digital circuits. In contrast, the asynchronous scheme selects one node i at random and

67

Figure 5.6 Family of functions from Eq. 5.9.

updates its state, then selects another (or possibly the same) node and updates its state, etc. Alternatively, a “quasisynchronous” scheme can be constructed by assigning a fixed probability q
that each node is updated at each time step, so that the number of updates follows a binomial
probability distribution and the expected number of updates at each time step is Nq. Note that for
q = 1, the quasisynchronous scheme is the same as the synchronous scheme, whereas if Nq

1,

the asynchronous and quasisynchronous schemes give nearly identical results aside from statistical fluctuations around Nq. Because the Hopfield model was selected to simulate the dynamics of
gene regulation, a stochastic system which lacks a central clock, the remainder of this chapter uses
the asynchronous scheme.
For the sake of the following analysis, assume that all attractors are randomly and indepenµ

dently drawn from the space {−1, +1}N (so that ξi = ±1 with equal probability for any given i

68

and µ). Consider the special case of p = 1 and T = 0. If σi (t) = ξi1 , then
σi (t + 1) = sign

∑ Ji j σ j (t)
j

1 1
ξ
ξ j1 ξ j1
N i ∑
j

= sign

(5.11)

= sign ξi1
= ξi1
which utilizes
N

∑

2

ξ j1

N

=

j=1

∑ 1=N

(5.12)

j=1

This means σi = ξi1 is a stationary point in state space, but in order to be an attractor, it has to be a
stable point as well (i.e. the system must relax to ξi1 if perturbed). This is indeed the case: as long
as more than half of the spins are in the state ξi1 , the field on all nodes will be in the direction of
ξi1 , guaranteeing that σi (t + 1) = σi (t + 2) = . . . = ξi1 . Note that if +ξi1 is an attractor, −ξi1 (its
mirror state) is also an attractor. This is a basic symmetry of the system and is true for all attractors
µ regardless of the number of attractors p. Only the application of a nonzero external field hext
i
can break this symmetry.
µ

More generally, the stability of an attractor ξi in a system with an arbitrary number of attractors p can also be computed. It is helpful to break the coupling matrix into two terms, Ji j =
N −1 ξi ξ j + ∑ν=µ ξiν ξ jν . If σi (t) = ξi , then the internal field at node i is
µ µ

µ

hi (t) =

1
1 µ
µ µ
µ
ξi ∑ ξ j ξ j +
ξiν ξ jν ξ j
∑
∑
N
N ν=µ j
j
µ

= ξi +

1
µ
∑ ∑ ξiν ξ jν ξ j
N ν=µ
j
1 µ
µ
ξi ∑ ∑ ξiν ξ jν ξ j
N
ν=µ j

µ

1+

µ

1 +Ci

= ξi
≡ ξi

(5.13)

µ
µ µ

This is the same as the p = 1 case, plus an additional crosstalk term ξi Ci which depends on
µ

the patterns’ particular values. As long as the term Ci is greater than −1, it cannot change the
69

µ

µ

sign of the field at node i, making ξi a stationary point. Furthermore, the pattern ξi is stable to
small perturbations as long as the self-term is not overwhelmed by the crosstalk term. Because the
crosstalk term is a sum of many equally probable ±1’s for large N and p, the central limit theorem
implies that its sum follows a normal distribution with mean zero and standard deviation

p/N.

µ

The probability that the crosstalk term flips a given bit i away from ξi after one time step is thus
µ

the probability that Ci < −1,
Psingle flip =

1
1 − erf
2

N
2p

(5.14)

This means that for N → ∞, p → ∞, and a single-node error tolerance of Psingle flip = 0.001, the
maximum number of attractors per node is p/N = 0.105. However, NPsingle flip only represents
the expected number of flips after one time step. Each of these subsequent errors can also produce
more errors in the next time step, and even more in the next time step, etc., leading to a cascade
µ

of spin flips that drives the system away from the desired attractor state ξi . It can be shown
via a laborious mean field solution [6, 67] that the maximum capacity (also known as the load
parameter) is αc ≡ p/N = 0.138, beyond which errors compound and destabilize states intended
to be attractors. The asymmetric case (i.e. using a directed network Ai j ) can also be exacty solved
in some limits [40]. See Appendix F for a derivation of the Hopfield mean field equation on
directed Erd˝os-Rényi networks, as well as my unpublished derivation of the Hopfield mean field
equation on modular directed Erd˝os-Rényi networks.

5.3

Cancer signalling and the Hopfield model

The Hopfield model has been used to model biological processes of current interest such as the
reprogramming of pluripotent stem cells [85]. Moreover, it has been suggested that a biological
system in a chronic or therapy-resistant disease state can be seen as a network that has become
trapped in a pathological Hopfield attractor [7]. To a major extent, the final determinant of a cell’s
phenotype is its gene expression profile, which is relatively stable, reproducible, unique to cell
types, and can differentiate cancer cells from normal cells, as well as differentiate between differ70

ent types of cancer [21, 46]. In fact, there is evidence that attractors exist in cells’ gene expression,
and as with protein folding, these attractors can be reached by different trajectories rather than only
by a single transcriptional program [69]. While the dynamical attractors paradigm was originally
proposed in the context of cellular development, the similarity between cellular ontogenesis (the
development of different cell types) and oncogenesis (the process under which normal cells are
transformed into cancer cells) has been recently emphasized [140]. The main hypothesis of this
chapter is that cancer robustness is rooted in the dynamical robustness of signaling in an underlying cellular network. If the cancerous state of rapid, uncontrolled growth is an attractor state
of the system [9], a goal of modelling therapeutic control could be to design complex therapeutic interventions based on drug combinations [51] that push the cell out of the cancer attractor
basin [31].
Many authors have discussed the control of biological signaling networks using complex external perturbations. Calzolari and coworkers considered the effect of complex external signals
on apoptosis signaling [23]. Agoston and coworkers [2] suggested that perturbing a complex biological network with partial inhibition of many targets could be more effective than the complete
inhibition of a single target, and explicitly discussed the implications for multi-drug therapies [32].
In the traditional approach to control theory [137], the control of a dynamical system consists in
finding the specific input temporal sequence required to drive the system to a desired output. This
approach has been discussed in the context of RBNs [4] and their attractor states [27]. Several
studies have focused on the intrinsic global properties of control and hierarchical organization in
biological networks [52, 18]. A recent study has focused on the minimum number of nodes that
needs to be addressed to achieve the complete control of a network [94]. This study used a linear
control framework, a matching algorithm [124] to find the minimum number of controllers, and
a replica method to provide an analytic formulation consistent with the numerical study. Finally,
Cornelius et al. [30] discussed how nonlinearity in network signaling allows reprogramming a system to a desired attractor state even in the presence of constraints in the nodes that can be accessed
by external control. This novel concept was explicitly applied to a T-cell survival signaling net-

71

work to identify potential drug targets in T-cell large granular lymphocyte (T-LGL) leukemia. The
approach in the present paper is based on nonlinear signaling rules and takes advantage of some
useful properties of the Hopfield formulation. In particular, by considering two attractor states it
will be shown that the network separates into two types of domains which do not interact with each
other. Moreover, the Hopfield framework allows for a direct mapping of a gene expression pattern
into an attractor state of the signaling dynamics, facilitating the integration of genomic data in the
modeling.
Originally in [141] and reproduced here, I consider an asymmetric Hopfield model using two
GRNs, mapping gene expression data from normal and cancer cells to attractor states. Focus was
placed on the question of controlling of a network’s final state (after a transient period) using external local fields representing therapeutic interventions, e.g. gene knockouts or kinase inhibitors.
General strategies aiming at selectively disrupting the signaling only in cells that are near a cancer
attractor state are investigated. The strategies use the concept of bottlenecks, which identify single
nodes or strongly connected clusters of nodes that have a large impact on the signaling. A theorem
that places bounds on the minimum number of nodes that guarantee control of a bottleneck consisting of a strongly connected component is also provided. This theorem is useful for practical
applications since it helps to establish whether an exhaustive search for such minimal set of nodes
is practical. These strategies are then applied to lung and B cell cancers. Two different networks
are used for this analysis. The first is an experimentally validated and non-specific network (that is,
the observed interactions are compiled from many experiments conducted on heterogeneous cell
types) obtained from a kinase interactome and phospho-protein database [161] combined with a
database of interactions between transcription factors and their target genes [100]. The second network is cell-specific and was obtained using network reconstruction algorithms and transcriptional
and post-translational data from mature human B cells [89]. The algorithmically reconstructed
network is significantly more dense than the experimental one, and the same control strategies
produce different results in the two cases.

72

5.3.1

Mathematical details

See Table 5.4 for a list and description of all mathematical symbols introduced in this chapter. In
order to derive analytical results, the remainder of this chapter makes the simplifying assumption
that the signaling in GRNs is fully deterministic, i.e. T = 0. Real biological systems are stochastic,
however, and the importance of this stochasticity is covered in Chapter 6. It is convenient to
introduce a new kind of notation for networks in addition to the adjacency matrix Ai j . Let G =
{V (G), E(G)} be a network composed of the set of vertices (nodes) V (G) and the set of edges
E(G) = {( j, i) : j → i}. Note that because the networks of interest in this chapter are directed, care
must be taken concerning source nodes (nodes with zero indegree). Source nodes are fixed to their
initial states by a small external field so that σq (t) = σq (0) for all q ∈ Q, where Q is the set of
source nodes. However, the source nodes flip if directly targeted by an external field. Biologically,
genes at the “top” of a network are assumed to be controlled by elements outside of the network.
In this application, two attractors are needed. Define these states as ξ n and ξ c , the normal state
and cancer state, respectively. The magnetization (also called the alignment or overlap) along
attractor state a is given by
ma (t) =

1 N
∑ σi(t)ξia .
N i=1

(5.15)

Note that if ma (t) = ±1, σ (t) = ±ξ a . Also define the steady state magnetization along state a as
1 τ a
m (t) .
τ→∞ τ ∑
t=1

ma∞ = lim

(5.16)

There are two ways to model normal and cancer cells. One way is to simply define a different
coupling matrix for each attractor state a,
Jiaj =

Ai j a a
ξ ξ .
N i j

(5.17)

Alternatively, both attractor states can be encoded in the same coupling matrix,
Ji j =

Ai j n n
ξi ξ j + ξic ξ jc .
N

(5.18)

Systems using Eqs. 5.17 and 5.18 will be referred to as the one attractor state (p = 1) and two
attractor state (p = 2) systems, respectively. An interesting property emerges when p = 2. Consider
73

a simple network composed of two nodes, with only one edge 1 → 2 with attractor states ξ n and
ξ c , and T = 0. The only nonzero entry of the matrix Ji j is
J21 =

1 n n
ξ2 ξ1 + ξ2c ξ1c .
N

Note that if ξ n = ±ξ c , J21 = 2ξ2n ξ1n . In either case, by Eq. 5.9,


 +ξ n if σ (t) = +ξ n
1
2
1
,
σ2 (t + 1) =

 −ξ n if σ1 (t) = −ξ n
2
1

(5.19)

(5.20)

that is, the spin of node 2 at a given time step will be driven to match the attractor state of node 1
at the previous time step. However, if ξ1n = ±ξ1c and ξ2n = ∓ξ2c , J21 = 0. This gives


 +1 with probability 1/2
σ2 (t) =

 −1 with probability 1/2

(5.21)

In this case, node 2 receives no input from node 1. Nodes 1 and 2 have become effectively disconnected.
This motivates new designations for node types. Define similarity nodes as nodes with ξin = ξic ,
and differential nodes as nodes with ξin = −ξic . Additionally, define the set of similarity nodes
S = i : ξin = ξic and the set of differential nodes D = i : ξin = −ξic . Connections between two
similarity nodes or two differential nodes remain in the network, whereas connections that link
nodes of different types transmit no signals. The effective deletion of edges between nodes means
that the original network fully separates into two subnetworks: one composed entirely of similarity
nodes (the similarity network) and another composed entirely of differential nodes (the differential
network), each of which can be composed of one or more separate weakly connected components
(see Fig. 5.7). With this separation, new source nodes (effective sources) can be exposed in both
the similarity and differential networks. For the remainder of this article, Q is the set of both source
and effective source nodes in a given network.

74

Figure 5.7 Network segregation for two attractor states (p = 2). Every edge that connects a
similarity node to a differential node or a differential node to a similarity node transmits no signal.
This means that the signaling in the right network shown above is identical to that of the left
network. Because the goal is to leave normal cells unaltered while damaging cancer cells as much
as possible, all similarity nodes can be safely ignored, and searches and simulations only need to
be done on the differential subnetwork.
5.3.2

Control Strategies

The strategies presented below focus on selecting the best single nodes or small clusters of nodes
to control, ranked by how much they individually change ma∞ . In application, however, controlling
many nodes is necessary to achieve a sufficiently changed ma∞ . The effects of controlling a set of
nodes can be more than the sum of the effects of controlling individual nodes, and predicting the
truly optimal set of nodes to target is computationally difficult. Heuristic strategies are discussed
for controlling large networks where the combinatorial approach is impractical.
For both p = 1 and p = 2, simulating a cancer cell means that σ (0) = +ξ c (i.e. starting
in the cancer state), and likewise for normal cells. Although the normal and cancer states are
mathematically interchangeable, biologically useful results are obtained by decreasing mc∞ as much
as possible while leaving mn∞ ≈ +1. “Network control” thus means driving the system away from
its initial state of σ (0) = ξ c with hext . Controlling individual nodes is achieved by applying a
strong field (stronger than the magnitude of the field due to the node’s upstream neighbors) to a set

75

of targeted nodes T so that


c
 lim
(u→∞) −uξτ if τ ∈ T
ext
hτ =
.

 0
otherwise

(5.22)

This ensures that the drug field can always overcome the field from neighboring nodes.
In application, similarity nodes are never deliberately directly targeted, since changing their
state would adversely affect both normal and cancer cells. Roughly 70% of the nodes in the
networks surveyed are similarity nodes, so the search space is reduced. For p = 2, the effective
edge deletion means that only the differential network in cancer cells needs to be simulated to
determine the effectiveness of hext . For p = 1, however, there may be some similarity nodes
that receive signals from upstream differential nodes. In this case, the full effect of hext can be
determined only by simulating all differential nodes as well as any similarity nodes downstream of
differential nodes. All following discussion assumes that all nodes examined are differential, and
therefore targetable, for both p = 1 and p = 2. The existence of similarity nodes for p = 1 only
limits the set of targetable nodes.

5.3.2.1

Directed acyclic networks

Full control of a directed acyclic network is achieved by forcing σq = −ξqc for all q ∈ Q. This
guarantees mc∞ = −1. Suppose that nodes q ∈ Q in an acyclic network have always been fixed
away from the cancer state, that is, σq (t → −∞) = −ξqc . For any node i to have σi (t) = ξin , it is
sufficient to have either i ∈ Q or σ j (t − 1) = ξ jn for all j → i, i ∈
/ Q. Because there are no cycles
present, all upstream paths of sufficent length terminate at a source. Because the spin of all nodes
q ∈ Q point away from the cancer attractor state, all nodes downstream must also point away from
the cancer attractor state. Thus, for acyclic networks, forcing σq = −ξqc guarantees mc∞ = −1. The
complications that arise from cycles are discussed in the next section. However, controlling nodes
in Q may not be the most efficient way to push the system away from the cancer basin of attraction
and, depending on the control limitations, it may not be possible. If minimizing the number of
controllers is required, searching for the most important bottlenecks is a better strategy.
76

Consider a directed network G and an initially identical copy, G = G. If removing node i (and
all connections to and from i) from G decreases the indegree of at least one node j ∈ V (G ), j = i,
to less than half of its indegree in network G, {i} is a size 1 bottleneck. The bottleneck control set
of bottleneck {i}, L(i), is defined algorithmically as follows: (1) Begin a set L(i) with the current
bottleneck i so that L = {i}; (2) Remove bottleneck {i} from network G ; (3) Append L(i) with all
nodes j with current indegree that is less than half of that from the original network G; (4) Remove
all nodes j from the network G . If additional nodes in G have their indegree reduced to below
half of their indegree in G, go to step 3. Otherwise, stop. The impact of the bottleneck i, I(i), is
defined as
I(i) = |L(i)| ,

(5.23)

where |X| is the cardinality of the set X. The impact of a bottleneck is the minimum number of
nodes that are guaranteed to switch away from the cancer state when the bottleneck is forced away
from the cancer state.
The impact is used to rank the size 1 bottlenecks by importance, with the most important as
those with the largest impact. In application, when searching for nodes to control, any size 1
bottleneck {i} that appears in the bottleneck control set of a different size 1 bottleneck { j} can be
ignored, since fixing j to the normal state fixes i to the normal state as well.
The network in Fig. 5.8, for example, has three sources (nodes 1, 2 and 3), but one important
bottleneck (node 6). If maximum damage, i.e. mc∞ = −1, is required, then control of all source
nodes is necessary. If minimizing the number of directly targeted nodes is important and mc∞ > −1
can be tolerated, then control of the bottleneck node 6 is a better choice.

5.3.2.2

Directed cycle-rich networks

Not all networks can be fully controlled at T = 0 by controlling the source nodes, however. If there
is a cycle present, paths of infinite length exist and the final state of the system may depend on the
initial state, causing parts of the network to be hysteretic. Controlling only sources in a general
directed network thus does not guarantee mc∞ = −1 unless the system begins with σi = −ξic .
77

Figure 5.8 A directed acyclic network. Controlling all three source nodes (nodes 1, 2 and 3)
guarantees full control of the network, but are ineffective when targeted individually. The best
single node to control in this network is node 6 because it directly controls all downstream nodes.
Define a cycle cluster, C, as a strongly connected subnetwork of a network G. The network in
Fig. 5.9, for example, has one cycle cluster with nodes V (C) = {4, 5, 6, 7}. If the network begins
with σ (0) = ξ c , forcing both source nodes away from the cancer state does nothing to the nodes
downsteam of node 3 (see Fig. 5.10). This is because the indegree deg− (4) = 4, and a majority of
the nodes connecting to node 4 are in the cancer attractor state. At T = 0, cycle clusters with high
connectivity tend to block incoming signals from outside of the cluster.
The most effective single node to control in this network is any one of nodes 4, 6 or 7. Forcing
any of these away from the cancer attractor state will eventually cause the entire cycle cluster to
flip away from the cancer state, and all nodes downstream will flip as well, as shown in Fig. 5.10.
This cycle cluster act as a large, hysteretic bottleneck that motivates a generalization of size 1
bottlenecks.
Define a size k bottleneck in a network G to be a cycle cluster B with |V (B)| = k which, when
removed from G, reduces the indegree of at least one node j ∈ V (G), j ∈ V (B) to less than half of
its original indegree. Other than now using the set of nodes V (B) rather than a single node set, the
above algorithm for finding the bottleneck control set remains unchanged. In Fig. 5.9, for instance,

78

Figure 5.9 A network in which nodes 4, 5, 6 and 7 compose a single cycle cluster. The high
connectivity of node 4 prevents any changes made to the spin of nodes 1-3 from propagating
downstream. The only way to indirectly control nodes 8-10 is to target nodes inside of the cycle
cluster. Targeting node 4, 6 or 7 will cause the entire cycle cluster to flip away from its initial
state, guaranteeing control of nodes 4-10 (see Fig. 5.10).

Figure 5.10 Cancer magnetization from targeting various nodes in the network shown in Fig. 5.9,
averaged over 10,000 runs. The averaging removes fluctuations due to the random flipping of
nodes with hi = 0. Targeting node 7 results in the quickest stabilization, but targeting any one of
nodes 4, 6 or 7 results in the same final magnetization.

79

V (B) = {4, 5, 6, 7}, k = 4, L(B) = {4, 5, 6, 7, 8, 9, 10}, and I(B) = 7. Note that controlling any size
k bottleneck B guarantees control of all size 1 bottlenecks B in the control set of B for all k ≥ 1.
For any bottleneck B of size k ≥ 1 in a network G, define the set of critical nodes, Z(B, G),
as the set of nodes Z(B, G) ⊆ V (B) of minimum cardinality that, when controlled, guarantees full
control of all nodes i ∈ V (B) after a transient period. Also define the critical number of nodes
as ncrit (B, G) = |Z(B, G)|. Thus, for the network in Fig. 5.9, Z(B, G) = {4}, {6}, or {7}, and
ncrit (B, G) = 1.
In general, however, more than one node in a cycle cluster may need to be targeted to control
the entire cycle cluster. Fig. 5.11 shows a cycle cluster (composed of nodes 2-10) that cannot
be controlled by targeting any single node. The precise value of ncrit for a given cycle cluster C
depends on its topology as well as the edges connecting nodes from outside of C to the nodes inside
of C, and finding Z(C, G) can be difficult. The following theorem places bounds on ncrit to help
determine whether a search for Z(C, G) is practical.
Theorem:5 Suppose a network G contains a cycle cluster C. Define the set of externally influenced nodes
R(C, G) = {i ∈ V (C) : j ∈ V (G \C), ( j, i) ∈ E(G)} ,

(5.24)

the set of intruder connections
W (C, G) = {( j, i) ∈ E(G) : i ∈ V (C), j ∈ V (G \C)} ,

(5.25)

and the reduced set of critical nodes
Zred (C, G) = Z(C, G \W ) .

(5.26)

µ ≡ min deg− (i) ,

(5.27)

If N = |V (C)| and
i∈V (C)

where deg− (i) is computed ignoring intruder connections, then
µ
≤ ncrit (C, G) ≤ ζ ,
2
5 This

(5.28)

is admittedly a rather technical theorem. It won’t hurt my feelings if you don’t want to
parse it out.
80

Figure 5.11 A network with a cycle cluster C, composed of nodes 2-10, that cannot be controlled
at T = 0 by controlling any single node. Here, the set of externally influenced nodes is
R(C, G) = {2, 9}, the set of intruder connections is W (C, G) = {(1, 2), (1, 9)}, the reduced set of
critical nodes is Zred (C, G) = {9, 10}, the minimum indegree is µ = 1 and the number of nodes in
the cycle cluster is N = 9. By Eq. 5.28, this gives the bounds of the critical number of nodes to be
1 ≤ ncrit ≤ 6.
where
ζ ≡ min

N
+ |R(C, G) \ Zred (C, G)|, N
2

.

(5.29)

See Appendix E for the proof.
There can be more than one Zred for a given cycle cluster. Note that the tightest constraints on
ncrit in Eq. 5.28 come from using the Zred with the largest overlap with R. If finding Zred is too
difficult, an overestimate for the upper limit of ncrit can be made by assuming that R ∩ Zred = {∅}
so that
µ
≤ ncrit (C, G) ≤ min
2

N
+ |R(C, G)| , N
2

.

(5.30)

The cycle cluster in Fig. 5.11 has N = 9, R = {2, 9}, µ = 1, and one of the reduced sets of
critical nodes is Zred = {9, 10}, so 1 ≤ ncrit ≤ 6. It can be shown through an exhaustive search that
for this network ncrit = 2, and the set of critical nodes is Z = {9, 10} (see Fig. 5.12). Here, Z = Zred ,
although this is not always the case. Because the cycle cluster has 9 nodes and 1 ≤ ncrit ≤ 6, at
81

Figure 5.12 Magnetization for network from Fig. 5.11, averaged over 10,000 runs. There is no
single node to target that will control the cycle cluster, but fixing nodes 9 and 10 results in full
control of the cycle cluster, leaving only node 1 in the cancer state. This means Z(C, G) = {9, 10}
and ncrit = 2.
most ∑6n=1 9n = 465 simulations are needed to find at least one solution for Z(C, G). However, the
maximum number of simulations required to find Z(C, G) increases exponentially and for larger
networks the problem quickly becomes intractable.
One heuristic strategy for controlling cycle clusters is to look for size k < |V (C)| bottlenecks
inside of C. Bottlenecks of size k

1 and average indegree deg− (B)

k can contain high

impact size k bottlenecks, where k < k. Size k ≥ 1 bottlenecks need to be compared to find the
best set of nodes to target to reduce mc∞ . Simply comparing the impact is insufficent because a
cycle cluster with a large impact could also have a large ncrit , requiring much more effort than its
impact merits. Define the critical efficiency of a bottleneck B as
ecrit (B) =

I(B)
.
ncrit (B, G)

(5.31)

If the critical efficiency of a cycle cluster is much smaller than the impacts of size 1 bottlenecks
from outside of the cycle cluster, the the cycle cluster can be safely ignored.
82

For some cycle clusters, however, not all of the nodes need to be controlled in order for a large
portion of the nodes in the cycle cluster’s control set to flip. Define the optimal efficiency of a
bottleneck B as
eopt (B) = max

n=1,2,...

I

n B
i=1 i

n

(5.32)

where Bi ⊆ V (B) are size 1 bottlenecks and I(Bi ) > I(Bi+1 ) for all i. Note that for any size 1
bottleneck B, eopt (B) = ecrit (B) = I(B). This quantity thus allows bottlenecks with very different
properties (I(B), ncrit (B, G), or |V (B)|) to be ranked against each other.
All strategies presented above are designed to select the best individual or small group of nodes
to target. Significant changes in the biological networks’ magnetization require targeting many
nodes, however. Brute force searches on the effect of larger combinations of nodes are typically
impossible because the required number of simulations scales exponentially with the number of
nodes. A crude Monte Carlo search is also numerically expensive, since it is difficult to sample an
appreciable portion of the available space. One alternative is to take advantage of the bottlenecks
that can be easily found, and rank all size k ≥ 1 bottlenecks Bi in an ordered list U such that
U = (B1 , B2 , B3 , . . .)

(5.33)

eopt (Bi ) ≥ eopt (Bi+1 ) , Bi ⊂ L(B j )

(5.34)

where

for all Bi , B j ∈ U and fix the bottlenecks in the list in order. This is called the efficiency-ranked
strategy. If all size k > 1 bottlenecks are ignored, it is called the pure efficiency-ranked strategy,
and if size k > 1 bottlenecks are included it is called the mixed efficiency-ranked strategy.
An effective polynomial-time algorithm for finding the top z nodes to fix, which called the
best+1 strategy (a kind of greedy algorithm), works as follows: (1) Begin with a seed set of nodes
to fix, F; (2) Test the effect of fixing F ∪ i for all allowed nodes i ∈
/ F; (3) F ← F ∪ ibest , where
ibest is the best node from all i sampled; (4) If |F| < z, go to step (2). Otherwise, stop. The seed
set of nodes could be the single highest impact size 1 bottleneck in the network, or it could be the
best set of n nodes (where n < z) found from a brute force search.

83

5.3.3

Cancer signaling

In application to biological systems, the magnetization of cell type a is assumed to be related to the
viability of cell type a, va ; that is, the fraction of cells of type a that survives a drug treatment. It is
reasonable to assume that va is a monotonically increasing function of ma∞ . As few controllers as
possible should be used to sufficiently reduce mc∞ while leaving mn∞ ≈ +1. In practical applications,
however, the set of druggable targets is limited. All classes of drugs are constrained to act only on
a specific set of biological components. For example, kinase inhibitors have two constraints: the
only nodes that can be targeted are those that correspond to kinases, and they can only be inhibited,
i.e. turned off. The example of kinase inhibitors will be used to show how control is affected by
such constraints. In the real systems studied, many differential nodes have only similarity nodes
upstream and downstream of them, while the remaining differential nodes form one large cluster.
For p = 2, focus is placed on controlling only the largest weakly connected differential subnetwork,
ignoring all islets (isolated nodes). All final magnetizations are normalized by the total number of
nodes in the full network, even if the simulations are only conducted on small portion of the
network.

5.3.3.1

Lung Cell Network

The network used to simulate lung cells was built by combining the kinase interactome from PhosphoPOINT [161] with the transcription factor interactome from TRANSFAC [100]. Both of these
are general networks constructed from experimentally observed interactions. This bottom-up approach means that some edges may be missing, but those present are reliable. Because of this, the
network is sparse (∼ 0.057% complete, see Table 5.1), resulting in the formation of many islets
for p = 2. Note also that this network presents clear hierarchical structure characteristic of biological networks [56, 125], with many sink nodes [134] that are targets of transcription factors and a
relatively large cycle cluster originating from the kinase interactome.
It is important to note that this is a non-specific network, whereas real gene regulatory networks
can experience a sort of “rewiring” for a single cell type under various internal conditions [96]. In

84

Properties
Nodes
Edges
Sources
Sinks
Av. outdegree
Max outdegree
Max indegree
Self-loops
Undirected edges
Diameter
Max cycle cluster
Av. clustering coeff. [50]

Lung
9073
45635
129
8443
5.03
240
68
238
350
11
401
0.0544

B cell
4364
55144
8
1418
12.64
2372
196
0
23386
11
2886
0.2315

Table 5.1 General properties of the full networks. The network used for the analysis of lung
cancer is a generic one obtained by combining the data sets in [100] and [161]. The B cell
network is a curated version of the B cell interactome obtained from [89] using a network
reconstruction method and gene expression data from B cells.
this analysis, the difference in topology between a normal and a cancer cell’s regulatory network is
assumed to be negligible. The methods described here can be applied to more specialized networks
for specific cell types and cancer types as these networks become more widely available.6
The IMR-90 cell line [105, 108] was used for the normal attractor state, and the two cancer
attractor states examined were from the A549 (adenocarcinoma) [70, 109, 129, 139, 154] and NCIH358 (bronchioalveolar carcinoma) [154, 139] cell lines. Gene expression measurements from all
referenced studies for a given cell line were averaged together to create a single attractor. The
resulting magnetization curves for A549 and NCI-H358 are very similar, so the following analysis
addresses only A549. The full network contains 9073 nodes, but only 1175 of them are differential
nodes in the IMR-90/A549 model. In the unconstrained p = 1 case, all 1175 differential nodes
are candidates for targeting. Exhaustively searching for the best pair of nodes to control requires
investigating 689725 combinations simulated on the full network of 9073 nodes. However, 1094
of the 1175 nodes are sinks (i.e. nodes i with outdegree deg+ (i) = 0, ignoring self loops) and
therefore have I(i) = eopt (i) = 1, which can be safely ignored. The search space is thus reduced to
6 The

original article containing these results was published before Ong’s method was devel-

oped.

85

81 nodes, and finding even the best triplet of nodes exhaustively is possible. Including constraints,
only 31 nodes are differential kinases with ξic = +1. This reduces the search space at the cost of
increasing the minimum achievable mc∞ .
There is one important cycle cluster in the full network, and it is composed of 401 nodes. This
cycle cluster has an impact of 7948 for p = 1, giving a critical efficiency of at least ∼ 19.8, and
1 ≤ ncrit ≤ 401 by Eq. 5.30. The optimal efficiency for this cycle cluster is eopt = 29, but this is
achieved for fixing the first bottleneck in the cluster. Additionally, this node is the highest impact
size 1 bottleneck in the full network, and so the mixed efficiency-ranked results are identical to the
pure efficiency-ranked results for the unconstrained p = 1 lung network. The mixed efficiencyranked strategy was thus ignored in this case.
Fig. 5.13 shows the results for the unconstrained p = 1 model of the IMR-90/A549 lung cell
network. The unconstrained p = 1 system has the largest search space, so the Monte Carlo strategy
performs poorly. The best+1 strategy is the most effective strategy for controlling this network.
The seed set of nodes used here was simply the size 1 bottleneck with the largest impact. The
best+1 method works better than effeciency-ranked method because best+1 includes the synergistic
effects of fixing multiple nodes, while efficiency-ranked assumes that there is no overlap between
the set of nodes downstream from multiple bottlenecks. Importantly, however, the efficiencyranked method works nearly as well as best+1 and much better than Monte Carlo, both of which
are more computationally expensive than the efficiency-ranked strategy.
Fig. 5.14 shows the results for the unconstrained p = 2 model of the IMR-90/A549 lung cell
network. The search space for p = 2 is much smaller than that for p = 1. The largest weakly
connected differential subnetwork contains only 506 nodes (see Table 5.2) , and the remaining
differential nodes are islets or are in subnetworks composed of two nodes and are therefore unnecessary to consider. Of these 506 nodes, 450 are sinks. Fig. 5.15 shows the largest weakly connected
component of the differential subnetwork, and the top five bottlenecks in the unconstrained case
are shown in red. If limiting the search to differential kinases with ξic = +1 and ignoring all sinks,
p = 2 has 19 possible targets. There is only one cycle cluster in the largest differential subnetwork,

86

Figure 5.13 Final cancer magnetizations for an unconstrained search on the lung cell network
using p = 1. The efficiency-ranked strategy outperforms the relatively expensive Monte Carlo
strategy. The best+1 strategy works best, although it requires the largest computational time. Note
that the mixed efficiency-ranked curve is not shown because it is identical to the pure
efficiency-ranked curve. Key for magnetization curves: MC = Monte Carlo, B+1 = best+1, ERP =
pure efficiency-ranked.
containing 6 nodes. Like the p = 1 case, the optimal efficiency occurs when targeting the first
node, which is the highest impact size 1 bottleneck. Because the mixed efficiency-ranked strategy
gives the same results as the pure efficiency-ranked strategy, only the pure strategy was examined.
The Monte Carlo strategy fares better in the unconstrained p = 2 case because the search space
is smaller. Additionally, the efficiency-ranked strategy does worse against the best+1 strategy for
p = 2 than it did for p = 1. This is because the effective edge deletion decreases the average indegree of the network and makes nodes easier to control indirectly. When many upstream bottlenecks
are controlled, some of the downstream bottlenecks in the efficiency-ranked list can be indirectly
controlled. Thus, controlling these nodes directly results in no change in the magnetization. This
gives the plateaus shown for fixing nodes 9-10 and 12-15, for example.
The only case in which an exhaustive search is possible is for p = 2 with constraints, which is

87

Properties
Nodes
Edges
Sources and eff. sources
Sinks and eff. sinks
Av. outdegree
Max outdegree
Max indegree
Self-loops
Undirected edges
Diameter
Max cycle cluster size
Av. clustering coeff

Lung
I/A
I/H
506
667
846 1227
30
34
450
598
1.67 1.84
52
51
8
10
27
31
0
4
9
9
6
3
0.035 0.042

B
N/D
N/F
N/L M/D M/F M/L
684
511
841
621
457
742
2855 1717 3962 2525 1501 3401
12
11
9
9
9
12
286
198
369
275
204
333
4.17 3.36 4.71 4.07 3.28 4.58
155
143
336
138
132
292
40
29
49
35
27
44
0
0
0
0
0
0
1238
738 1468 1000
596 1214
12
15
12
13
14
12
351
280
397
305
199
337
0.188 0.197 0.245 0.175 0.194 0.239

Table 5.2 Properties of the largest weakly connected differential subnetworks for all cell types. I =
IMR-90 (normal), A = A549 (cancer), H = NCI-H358 (cancer), N = Naïve (normal), M =
Memory (normal), D = DLBCL (cancer), F = Follicular lymphoma (cancer), L =
EBV-immortalized lymphoblastoma (cancer).

Figure 5.14 Final cancer magnetizations for an unconstrained search on the lung cell network
using p = 2. As in the p = 1 case, the efficiency-ranked strategy outperforms the expensive
Monte Carlo search. The plateaus in the efficiency-ranked strategy when fixing 9-10, 12-15,
20-21, etc. nodes are a result of targeting bottlenecks that are already indirectly controlled.

88

Figure 5.15 Largest weakly connected differential subnetwork for IMR-90/A549 and p = 2. Out
of the 506 pictured nodes, 450 are sinks and therefore have an impact equal to one. The top five
bottlenecks are labeled with their gene names and colored orange.

89

Figure 5.16 Final cancer magnetizations for a constrained search on the lung cell network using
p = 2. This is the only case in which a limited exhaustive search is possible. Interestingly, the
exhaustive search locates the same nodes as the best+1 strategy for fixing up to eight nodes. The
efficiency-ranked strategy performs poorly compared to the Monte Carlo strategy because the
search space is small and a large portion of the available space is sampled by the Monte Carlo
search.
shown in Fig. 5.16. Note that the polynomial-time best+1 strategy identifies the same set of nodes
as the exponential-time exhaustive search. This is not surprising, however, since the constraints
limit the available search space. This means that the Monte Carlo also does well. The efficiencyranked method performs worst. The efficiency-ranked strategy is designed to be a heuristic strategy
that scales gently, however, and is not expected to work well in such a small space when compared
with more computationally expensive methods.

5.3.3.2

B Cell Network

The B cell network was derived from the B cell interactome of Ref. [89]. The reconstruction
method used in Ref. [89] removes edges from an initially complete network depending on pairwise gene expression correlation. Additionally, the original B cell network contains many protein-

90

protein interactions (PPIs) as well as transcription factor-gene interactions (TFGIs). TFGIs have
definite directionality: a transcription factor encoded by one gene affects the expression level of
its target gene(s). PPIs, however, do not have obvious directionality. These PPIs were filtered by
checking if the genes encoding these proteins interacted according to the PhosphoPOINT/TRANSFAC
network of the previous section, and if so, kept the edge as directed.
Because of the network construction algorithm and the inclusion of many undirected edges, the
B cell network is more dense (∼0.290% complete, see Table 5.1) than the lung cell network. This
higher density leads to many more cycles than the lung cell network, and many of these cycles overlap to form one very large cycle cluster containing about two thirds of all nodes in the full network.
All gene expression data used for B cell attractors was taken from Ref. [29]. Two types of normal B cells (naïve and memory) and three types of B cell cancers (diffuse large B-cell lymphoma
(DLBCL), follicular lymphoma, and EBV-immortalized lymphoblastoma) were analyzed, giving
six combinations in total. Results are presented for only the naïve/DLBCL combination below, but
Tables 5.2 and 5.3 list the properties of all normal/cancer combinations. Again, all gene expression
measurements for a given cell type were averaged together to produce a single attractor. The full
B cell network is composed of 4364 nodes. For p = 1, there is one cycle cluster C composed of
2886 nodes. This cycle cluster has 1 ≤ ncrit (C) ≤ 1460, I(C) = 4353, and 3.0 ≤ ecrit (C) ≤ 4353.
Finding Z(C) was deemed too difficult.
Fig. 5.17 shows the results for the unconstrained p = 1 case. Again, the pure efficiency-ranked
strategy gave the same results as the mixed efficiency-ranked strategy, so only the pure strategy was
analyzed. As shown in Fig. 5.17, the Monte Carlo strategy is out-performed by both the efficiencyranked and best+1 strategies. The synergistic effects of fixing multiple bottlenecks slowly becomes
apparent as the best+1 and efficiency-ranked curves separate.
Fig. 5.18 shows the results for the unconstrained p = 2 case. The largest weakly connected
subnetwork contains one cycle cluster with 351 nodes, with 1 ≤ ncrit ≤ 208. Although finding a
set of critical nodes is difficult, the optimal efficiency for this cycle cluster is 62.2 for fixing 10
bottlenecks in the cycle cluster. This makes targeting the cycle cluster worthwhile. The efficiency

91

I/A

UNC

CON

p=1
Gene
HNF1A
TMEM37
OR5I1
MAP3K14
MAP3K3
MAP3K14
SRC

I/H
p=2
Gene
OR5I1
TMEM37
HNF1A
POSTN
RORA
SRC
BMPR1B

I
29
22
20
19
18
19
14

I
35
25
23
21
18
15
7

p=1
Gene
HNF1A
MAP3K3
TP53
RUNX1
RORA
TTN
RIPK3

N/D

UNC

CON

p=1
Gene
BCL6
MEF2A
NCOA1
TGIF1
NFATC3
BUB1B
AAK1

Gene
NFIC
TGIF1
BCL6
FOXJ2
NFATC3
CSNK2A2
AKT1

I
22
19
14
12
12
2
2

p=1
Gene
BCL6
MEF2A
NCOA1
TGIF1
NFATC3
BUB1B
AAK1

M/D

UNC

CON

p=1
Gene
BCL6
MEF2A
NCOA1
NFATC3
SMAD4
AAK1
RIPK2

I
12
5
5
4
4
1
1

I
41
38
25
20
18
18
8

N/F
p=2

I
12
5
5
4
4
2
1

p=2
Gene
HMX1
PBX1
MYB
ITGB2
TNFRSF10A
BMPR1B
LCK

I
29
18
18
17
16
16
6

p=2
Gene
FOXJ2
NFIC
BCL6
NCOA1
MEF2A
RIPK2
MAST2

N/L
p=2

I
12
5
5
4
4
2
1

Gene
NCOA1
NFATC3
BCL6
CEBPD
RELA
WEE1
CSNK2A2

I
20
15
11
8
8
2
2

p=1
Gene
RBL2
FOXM1
ATF2
RXRA
NFATC3
BUB1B
AAK1

M/F
I
12
12
11
9
8
1
1

p=1
Gene
BCL6
MEF2A
NCOA1
NFATC3
RELA
AAK1
RIPK2

I
12
5
5
4
4
1
1

p=2
Gene
NCOA1
BCL6
E2F3
RUNX1
TFE3
ROCK2
RIPK2

I
11
8
7
5
4
2
1

p=2
Gene
RBL2
ATF2
NFATC3
RXRA
PATZ1
PRKCD
AURKB

M/L
I
18
13
9
9
7
2
1

p=1
Gene
RBL2
FOXM1
ATF2
RXRA
TGIF1
AAK1
SCYL3

I
11
8
7
5
4
1
1

p=2
Gene
RBL2
ATF2
ZNF91
STAT6
FOXM1
AURKB
RIPK2

Table 5.3 Best single genes and their impacts for the p=1 and p=2 models. The unconstrained
(UNC) and constrained (CON) case are shown. The constrained case refer to target that are
kinases and are expressed in the cancer case. I = IMR-90 (normal), A = A549 (cancer), H =
NCI-H358 (cancer), N = Naïve (normal), M = Memory (normal), D = DLBCL (cancer), F =
Follicular lymphoma (cancer), L = EBV-immortalized lymphoblastoma (cancer).
of this set of 10 nodes is larger than the efficiencies of the first 10 nodes from the pure efficiencyranked strategy, so the mc∞ from the mixed strategy drops earlier than the pure strategy. Both
strategies quickly identify a small set of nodes capable of controlling a significant portion of the
differential network, however, and the same result is obtained for fixing more than 10 nodes. The

92

I
16
10
8
8
8
2
1

I
22
12
11
9
8
2
2

Figure 5.17 Final cancer magnetizations for an unconstrained search on the B cell network using
p = 1. The Monte Carlo strategy is ineffective for fixing any number of nodes. The
efficiency-ranked and best+1 curves slowly separate because synergistic effects accumulate faster
for best+1.
best+1 strategy finds a smaller set of nodes that controls a similar fraction of the cycle cluster, and
fixing more than 7 nodes results in only incremental decreases in mc∞ . The Monte Carlo strategy
performs poorly, never finding a set of nodes adequate to control a significant fraction of the nodes
in the cycle cluster.

5.4

Conclusion

Signaling models for large and complex biological networks are becoming important tools for
designing new therapeutic methods for complex diseases such as cancer. Even if knowledge of
biological networks is incomplete, rapid progress is currently being made using reconstruction
methods that use large amounts of publicly available omic data [36, 64]. The Hopfield model allows mapping of gene expression patterns of normal and cancer cells into stored attractor states
of the signaling dynamics in directed networks. The role of each node in disrupting the network

93

Figure 5.18 Final cancer magnetizations for an unconstrained search on the B cell network using
p = 2. The rather sudden drop in the magnetization between controlling 5 and 10 nodes in the
efficiency-ranked strategies comes from flipping a significant portion of a cycle cluster. This is the
only network examined in which the mixed efficiency-ranked strategy produces results different
from the pure efficiency-ranked strategy.
signaling can therefore be explicitly analyzed to identify isolated genes or sets of strongly connected genes that are selective in their action. The concept of size k bottlnecks to identify such
genes led to the formulation of two heuristic strategies, efficiency-ranked and best+1, to find nodes
that efficiently reduce the cancer state’s magnetization. Using this approach, small sets of nodes in
lung and B cancer cells were located which, when forced away from their initial states with local
magnetic fields representing targeted drugs, disrupt the signaling of the cancer cells while leaving
normal cells in their original state. For networks with few targetable nodes, exhaustive searches
or Monte Carlo searches can locate effective sets of nodes. For larger networks, however, these
strategies become too cumbersome and the discussed heuristic strategies represent a feasible alternative. For tree-like networks, the pure efficiency-ranked strategy works well, whereas the mixed
efficiency-ranked strategy could be a better choice for networks with high-impact cycle clusters.
Two important assumptions are made in applying this analysis to real biological systems. First,

94

it is assumed that genes are either fully off or fully on, with no intermediate state. Modelling the
state of a neuron as “all-or-none” has long been accepted as a reasonable assumption [102], which
inspired the spin glass framework for the Hopfield model. While similar switch-like behavior in
gene regulatory networks has been proposed as an explanation of, for example, segmentation in
Drosophila embryos [81], assigning a Boolean value to gene expression may be overly simplistic
in many cases. A model which uses spins with more than two projections could prove to be more
realistic and predictive. Second, all nodes are assumed to update their status with a single timescale
and with a single interaction strength. If the signaling timescale τi j (i.e. how long it takes for a
gene j to produce a protein that regulates gene i) for each edge in the biological network were
known, simulations could be conducted in which a signal traveling along an edge ( j, i) reaches its
target after τi j time steps. This would amount to a “queue” of signals moving between nodes.
Despite these issues, the model shows promise. Some of the genes identified in Table 5.3 are
consistent with current clinical and cancer biology knowledge. For instance, one of the effective
targets in the lung cancer list is the well known tumor suppressor gene TP53 [12] that is frequently
mutated in many cancer types including lung cancer [143]. Mutations in PBX1 have recently been
detected in non-small-cell lung cancer and this gene is now being considered as a target for therapy
and prognosis [103]. MAP3K3 and MAP3K14 are in the MAPK/ERK pathway which is a target
of many novel therapeutic agents [104], and SRC is a well known oncogene and a candidate target
in lung cancer [128]. BCL6 (B-cell lymphoma 6) is the most common oncogene in DLBCL, and
it is known that its expression can predict prognosis and response to drug therapy [62, 127, 159].
BCL6 is also frequently mutated in follicular lymphoma [3, 43]. This analysis identified BCL6 as
an important drug target for both DLBCL and follicular lymphoma using either naive or memory
B-cells as a control for both p = 1 and p = 2. RBL2 disregulation has been recently associated
with many types of lymphoma [35, 123, 156]. FOXM1 is a potential therapeutic target in mature B
cell tumors [147] and ATF2 has been recently found to be highly disregulated in lymphoma [151,
155]. Besides BCL6 discussed above, the N/D list for DLBCL contains genes (MEF2A [11],
NCOA1 [49, 163], TGIF1 [17, 60, 92], and NFATC3 [57]) that are all known to have a functional

95

role in cancer, even if they have not been associated to the specific B-cell cancer types considered
here. All of these predictions were made from data from cell lines commonly used in in vitro
testing in many laboratories. RNAi and targeted drugs could be used in these cell lines against
the top scoring genes in Table 5.3 to test the disruption of survival or proliferative capacity. If
experimentally validated, this analysis based on attractor states and bottlenecks could be applied
to patient-derived cancer cells by integrating in the model patient gene expression data to identify
patient-specific targets.
The above unconstrained searches assume that there exists some set of “miracle drugs” which
can turn any gene on and off at will. This limitation can be partially taken into account by using constrained searches that restrict the set of nodes that can be addressed. However, even the
constrained search results are unrealistic, since most drugs directly target more than one gene. Inhibitors, for example, could target differential nodes with ξic = −1 and ξin = +1, which would
damage only normal cells. Additionally, drugs would not be restricted to target only differential
nodes, and certain combinations could be toxic to both normal and cancer cells. Few cancer treatments involve the use of a single drug, and the synergistic effects of combining multiple drugs
adds yet another level of complication to finding an effective treatment [52]. On the other hand,
the intrinsic nonlinearity of a cellular signaling network, with its inherent structure of attractor
states, enhances control [30] so that a properly selected set of druggable targets might be sufficient
for robust control.

96

Symbol
G
N
Ai j
V (G)
E(G)
deg+/− (i)
σi
ξa
ξ n/c
Ji j
hi
hext
i
T
Q
ma (t)
ma∞
p
S
D
L(B)
I(B)
C
B
Z(B, G)
ncrit (B, G)
R(C, G)
W (C, G)
Zred (C, G)
µ
ecrit (B)
eopt (B)

Explanation
Set of nodes and directed edges (network)
Number of nodes
Adjacency matrix
Set of nodes in G
Set of edges in G
Outdegree/indegree of node i
Spin of node i, = ±1
ath attractor
Normal/cancer attractor
Coupling matrix
Total field at node i
External field applied to node i
Temperature
Set of source and effective source nodes
Magnetization along attractor a at time t
Steady-state magnetization along attractor a
Number of attractors in coupling matrix
Set of similarity nodes
Set of differential nodes
Control set of bottleneck B
Impact of bottleneck B
Cycle cluster
Size k bottleneck, where k = |B|
Set of critical nodes for bottleneck B in network G
Critical number of nodes in bottleneck B in network G
Set of externally influenced nodes
Set of intruder connections
Reduced set of critical nodes
Minimum indegree of all nodes in a cycle cluster
Critical efficiency of bottleneck B
Optimal efficiency of bottleneck B

Table 5.4 Reference table of all symbols used in this chapter.

97

CHAPTER 6
GENERALIZED HOPFIELD MODELS

Figure 6.1 Ascending and Descending by MC
Escher.

Figure 6.2 Relativity by MC Escher.

The following chapter contains unpublished but exciting new directions for the Hopfield model
in general and for its application to gene regulation in particular, and is organized as follows.
• Section 6.1, Stability in Hopfield GRNs: a discussion of the starkly different behavior of
the magnetization of modular Erd˝os-Rényi networks and real gene regulatory networks as a
function of the level of noise in the Hopfield model
• Section 6.2, Programmable nonequilibrium Hopfield systems: a discussion of one known
extension of the Hopfield model which encodes cyclic attractors, and one novel extension
which enables more complex shapes to be programmed into state space
• Section 6.3, Cell cycle and the Hopfield model: an application of the cyclic Hopfield model
to cell cycle

98

Figure 6.3 Schematic of the magnetization phase transition for Hopfield systems on Erd˝os-Rényi
networks. Note that the symmetric (mirror) solution is also shown below the T axis.

6.1

Stability in Hopfield GRNs

The Hopfield model is designed to balance stability and flexibility, which is an indispensable attribute of all biological systems: a cell temporarily driven away from its native gene expression
state by temporary or weak external perturbations or noise (e.g. fluctuations in the environmental
temperature) is able to self-correct and return to its native state, but is also capable of switching its
state in response to particular sets of signals and stresses (e.g. an increased concentration of growth
hormones or other chemical signals in the cell’s environment may lead to upregulation of proliferation). Furthermore, the modular structure of gene regulatory networks means that individual
communities of genes may respond differently to the same stresses. This means that perturbations to genes in a particular community may transmit only weakly or not at all to neighboring
communities due to the smaller number of connections between them (by the definition of maximum modularity). Understanding how community structures affect signaling is thus important for
designing control techniques for GRNs.
The mean field solution for Hopfield systems on asymmetric Erd˝os-Rényi networks [40] is
rederived in Appendix F. Because in the limit N → ∞ the vast majority of nodes in Erd˝os-Rényi
networks are average nodes (they all have nearly identical degree and are wired together in a
uniformly random way), there is a sharp phase transition in the magnetization as a function of
temperature, m(T ), at a critical temperature Tc , as shown schematically in Fig. 6.3.

99

In the same appendix, I generalize this result to a class of model networks called stochastic
block model networks (SBMNs). Each of the Ω communities in a SBMN is its own Erd˝os-Rényi
network, and these communities are also randomly wired together. A SBMN is defined by a
connectivity matrix, CΩ×Ω , where CIJ is the average number of connections per node from nodes
in community J to nodes in community I (and so CII is the average number of connections per
node within community I). This allows for different levels of stability within communities and
asymmetric couplings between communities, meaning that the state of J may influence the state of
I more strongly than the reverse case.
Fig. 6.4 shows the mean field solution (dashed lines) and a numerical solution (circles with
error bars) for a SBMN constructed using the connectivity matrix from AML 2.3 (using level-1
communities only) and random attractor states, with Tc ≈ 20. There is good agreement between
the analytical and numerical solutions, and this agreement improves when larger networks are
used. However, simulating the dynamics on the real AML 2.3 network across the same range of
temperatures reveals very different behavior, as shown in Fig. 6.5 (which includes the dashed mean
field solution for reference). While the zero temperature system has lower magnetization than the
SBMN case, it maintains finite magnetization even for T

Tc , showing no sign of a phase transi-

tion. This is perhaps because AML 2.3 has a power law rather than a binomial degree distribution,
as verified by simulations using modular power law networks, which show qualitatively similar
m(T ) curves. The nested modular structure of AML 2.3 may also play a part in the long tailed
behavior of m(T ), as the lack of a phase transition could be caused by a level-by-level breakdown
of order as the temperature is increased. This is an open question that merits further investigation.

6.2

Programmable nonequilibrium Hopfield systems
µ

Note: in the previous chapter, the patterns ξi

were indexed according to µ = 1, 2, . . . , p. For

mathematical convenience, this chapter simply relabels these indices to µ = 0, 1, . . . , p − 1, and the
limits on the summations have been disregarded for simplicity. Additionally, the previous chapter
used an asynchronous update scheme, in which a node was randomly selected to be updated at
100

Figure 6.4 Mean field (dashed) and explicit
simulations (circles with error bars) of the
magnetization of communities in SBMNs over a
range of temperatures. There is a phase change
in the mean field solution at Tc ≈ 20, and the
explicit simulation is in good agreement.

Figure 6.5 Mean field SBMN (dashed) and
explicit AML 2.3 simulations (circles with error
bars) of the magnetization of communities over
a range of temperatures. Unlike the explicit
simulations from Fig. 6.4, the real network
shows no apparent phase change, instead
maintaining finite magnetization over all
temperatures examined.

each time step. The following models require using the synchronous scheme, in which the state of
all nodes is simultaneously updated at each time step. The background and original publications
for this section can be found in [67].
The original definition of the coupling matrix,
Ji j =

1
µ µ
ξi ξ j
∑
N µ

(6.1)

encodes each of the p patterns as point attractors for N, p → ∞, p/N ≡ α < αc ≈ 0.138, and
µ

ξi = ±1 with equal probability. As explained in Appendix G, a more stable form of the coupling
matrix which accomodates correlated patterns may be constructed according to
Ji j =

1
µ
∑ ξi Q−1 µν ξ jν
N µν

(6.2)

where
Qµν =

1
µ
ξi ξiν
∑
N i

101

(6.3)

A slight alteration to Eq. 6.2 enables cyclic attractors to be encoded as well. A simple cyclic
attractor of period p can be created by
Ji j =

1
mod p (µ+1) µ
ξi
ξj
∑
N µ

(6.4)

or, as futher explained in Appendix G, the matrix Qµν can be included to orthogonalize the patterns
in the cyclic case by using the form
Ji j =

1
mod p (µ+1)
ξν
Q−1
ξi
∑
N µν
µν j

(6.5)

At zero temperature, this coupling matrix cyclically maps through the p patterns (or the mirror
µ

cycle, depending on the initial configuration); in other words, if σi (t) = ±ξi , then σi (t + ∆t) =
mod p (µ+∆t)

±ξi

for all ∆t = 0, 1, 2, . . .. Cyclic attractors may be envisioned as a ball falling down a

flight of Penrose stairs, such as MC Escher’s Ascending and Descending shown in Fig. 6.1.
A delay may be introduced by combining the point attractor coupling matrix and the cyclic
attractor coupling matrix into a single coupling matrix,
Ji j = Ji j + λ Ji j

(6.6)
µ

for an adjustable transition strength parameter λ ≥ 0. If σ (t) = ξi , λ

1, and T = 0, the point

attractor term dominates and σ (t) = σ (t + 1) = σ (t + 2) = . . .. If T > 0, however, stochastic
fluctuations will eventually begin to push the configuration out of the basin of attraction of the
µ th attractor and initiate the cascade of spin flips that causes the configuration to transition to
mod p (µ+1)

ξi

. The system will remain there until another stochastic fluctuation pushes it into the

next basin of attraction, and so on. One of the possible applications of this delayed cycle model
will be covered in Section 6.3.
The point and cyclic attractor coupling matrices may be rewritten as
µ

Ji j = ∑ ξi
µ

1
ξ jν
Q−1
N∑
µν
ν

(6.7)

and
mod p (µ+1)

Ji j = ∑ λ ξi
µ

102

1
Q−1
ξ jν
N∑
µν
ν

(6.8)

I define the common bracketed term
µ

Sj =

1
ξ jν
Q−1
N∑
µν
ν
mod p (µ+1)

µ

to be the source term, and the ξi and λ ξi

(6.9)

terms to be particular examples of target

terms. A more general way to write Ji j is
Ji j =

µ

∑ ξiω Mω µ S j

(6.10)

µω

for some p × p mapping matrix M. The point attractor coupling matrix is constructed by setting
Mµω = δµω (the identity matrix). For p = 3, the cycle ξi0 → ξi1 → ξi2 → ξi0 is constructed by


 0 0 1 



M=
(6.11)
1
0
0




0 1 0
and for a delayed cyclic attractor, M takes the form


 1 0 λ 



M=
 λ 1 0 


0 λ 1

(6.12)

M may be used to store transition maps which are more complex than simple point and cyclic
attractors. For example, a system with p = 3 can encode the transitions ξi0 → ξi2 , ξi1 → ξi2 , and
ξi2 → ξi2 using





 0 0 0 



M=
 0 0 0 


1 1 1

(6.13)

This ensures that all states map to ξi2 . M can also be designed to store multiple point and cyclic
attractors in the same system. In fact, any target term may be used to generate complex transition
maps in state space, as long as each state transitions to exactly one other state (in addition to an
optional self-loop that delays transitions). This enables the encoding of maps such as deterministic
random maps [39], an example of which is shown in Fig. 6.6. Furthermore, different transition
103

Figure 6.6 An example deterministic random map that can be stored in a Hopfield coupling
matrix using an appropriate mapping matrix M. Each node µ = 0, 1, . . . , 9 represents a stored
pattern ξ µ and each edge represents a transition.
strengths can be set to values other than 0 and 1, meaning that transition rates can be tuned as well.
Similar to cyclic attractors, this complex landscape of attractors may be envisioned as a ball falling
down a labyrinth of stairs such as in MC Escher’s Relativity shown in Fig. 6.2, where doorways lead
to closets (point attractors) or flights of Penrose stairs (cyclic attractors). The concept of mapping
matrices could be used in the future to model gene expression changes in complex biological
processes such as cell differentiation.

6.3

Cell cycle and the Hopfield model

One of the most important misregulated processes in cancer is cell cycle (CC), in which a cell
replicates its DNA and divides into two daughter cells. Understanding CC in both normal and
cancer cells is undoubtedly important for medical research, and the ability to slow CC in cancer
cells without affecting CC in normal cells could be therapeutically beneficial. Below, I present
the preliminary results of encoding time series gene expression data (gene expression measured
at regular time intervals) for the process of CC in a delayed cyclic coupling matrix Ji j , with the
eventual goal of predicting effective perturbations which control CC.
Dominguez et al. [44] recently released time series bulk (i.e. many cells) RNA-seq data of
HeLa cells (human cervical cancer). The CC processes in the many cells in the sample were
104

synchronized using a double thymidine block, freezing all cells in the G1 phase and then releasing
the block. RNA-seq measurements were conducted on subsamples at regular intervals for the
duration of two full cycles, producing a raw (continuous) expression matrix Diµ ≥ 0.
I used the continuous matrix Diµ to construct Boolean patterns as follows. For convenience,
each row i of the raw matrix was linearly rescaled so that its minimum value was set to −1 and
its maximum was set to +1. Because CC is a periodic process, only genes whose expressions
were found to be sufficiently periodic were kept, and all non-periodic genes were discarded. The
resulting data set had N = 592 genes and p = 8 gene expression patterns per CC period. The
µ

raw expression was then converted to Boolean form by assigning ξi = +1 (overexpressed) for
µ

samples in which Diµ > 0 and ξi = −1 (underexpressed) otherwise. These patterns were encoded
cyclically in Ji j according to Eq. 6.5.
In order to conduct simulations, the free parameters λ and T must be defined. λ = 0.23 and
T = 0.07 were chosen here because it placed the system in a regime in which the fluctuations were
great enough to cause transitions from one attractor to the next in a delayed, somewhat periodic
manner. The system was initialized with σi (t = 0) = ξi0 (representing a single isolated cell in
the G1 phase) and was allowed to evolve. The resulting trajectory spanning nearly three periods
is shown in Fig. 6.7. The bottom plot shows the overlap (magnetization) between the system’s
configuration and each of the p = 8 patterns as a function of time, where an overlap of +1 with
attractor µ signifies perfect alignment. The top plot shows which pattern has maximum overlap at
any point in time, and represents the cell’s phenotype.
Unlike a purely cyclic system, the delay term in the coupling matrix causes the duration of
each pattern to fluctuate from cycle to cycle. Real bulk gene expression measurements of CC show
a decoherence of initially phase-matched cells because the length of each stage of CC for each
cell fluctuates somewhat. This decoherence is shown for the randomly selected gene RNFT1 in
Fig. 6.8. This effect can be observed using the Hopfield model by running parallel, independent
simulations representing many cells and averaging the expression results. For example, Fig. 6.9
shows the resulting mean expression of RNFT1 across 50 isolated cells, each of which is governed

105

Figure 6.7 Simulated trajectory of a single HeLa cell. The bottom plot shows the overlap between
the system’s configuration and each of the p = 8 patterns as a function of time. The top plot shows
which pattern has maximum overlap at any point in time, and represents the cell’s phenotype.

by the same coupling matrix and parameters but subject to its own unique stochastic fluctuations.
Although the expression of any given gene from any cell is either ±1, the mean expression of a
given gene across many cells reproduces the expected decaying sinusoid.
A heat map of Ji j is shown in Fig. 6.10, separated into symmetric (point attractor) and antisymmetric (transition) parts, and sorted into nested communities according to the algorithm from
Section 2.8 applied to the symmetric part. Efficient control of the entire CC system (for example,
arresting the system in the G1 phase) could begin by searching for control sets across different
levels of the hierarchy of nested communities. If this method proves to be effective, it could significantly reduce the combinatoral complexity of searching for optimal or nearly optimal sets of
genes to control.

106

Figure 6.8 Experimental bulk mean expression
for the gene RNFT1 as a function of time, taken
from [44]. Although all cells were initially
synchronized to the G1 phase, progression
through cell cycle takes slightly different
amounts of time for each cell and their gene
expression profiles becomes less and less
correlated, leading to a decaying sinusoid.

Figure 6.9 Mean of simulated Boolean
expression for the gene RNFT1 across 50
isolated cells as a function of time. All cells
began in identical states, but because transitions
occur stochastically, the 50 cells decohere over
time, leading a decaying sinusoid similar to
Fig. 6.8.

Figure 6.10 Cell cycle coupling matrix, separated into symmetric (point attractor) and
antisymmetric (transition) parts on the left and right, respectively, and sorted into nested
communities applied to the symmetric part. Examining the nested community structres may aid
in the search for sets of nodes which, when targeted, could control cell cycle.

107

CHAPTER 7
CONCLUSIONS

This dissertation covered a wide variety of topics, but all were unified by the theme of investigating
complex biological systems with relatively simple mathematical models. While my experiences
have taught me a great deal, much work remains to be done, and my research has led me to ask
more questions than I have answered.
The relationship between nested communities and Hopfield dynamics is intriguing. It has been
shown for coupled oscillator models that nested community structures create distinct timescales
wherein the lowest levels communities with the densest sets of connections synchronize first, followed by the next lowest, etc. up the hierarchy until all of the nodes in the entire network finally
synchronize [136]. This could also happen in Hopfield systems; perhaps starting in a random initial state at low temperature causes a similar hierarchical synchronization. This deserves futher
consideration.
As stated in Section 6.3, the ultimate goal of the cell cycle model is predicting the effects of perturbations such as drugs and searching for sets of genes which, when controlled, drive the system
toward a desired state. Our collaborators in San Diego can test these predictions experimentally,
and we can use the experimental results to further refine our mathematical models.
Another cell cycle data set using S. cerevisiae (yeast) cells [48] will also be examined in the
future, and its results will be compared with the human data set. Because cell cycle is a fundamental process in many forms of life, I expect to find similarities between the models for humans and
yeast, but it is possible that the evolutionary distance between humans and yeast resulted in very
different cell cycle mechanisms. This will also provide two separate systems to test experimentally.
I recently unearthed a publication which demonstrates that cyclic attractors can be stored using
an asynchronous update scheme [114], which is biologically more realistic than the synchronized
case discussed in Section 6.3. This will certainly be integrated into the cell cycle model if possible.

108

Finally, all of the material presented in this document assumes that the simulated cells are isolated, but it is well known that cells communicate with one another using chemical signals. It
may be possible to model cell-cell communication effects (such as the release of growth factors) in
addition to the intracellular process of gene regulation. Furthermore, cells may compete over the
limited space and resources in the environment. I have done some work with modelling population
dynamics using competitive Lotka-Volterra equations with additional terms representing the different effects a set of drugs has on, for example, normal and cancer cells. Understanding not only
the genetic heterogeneity of cells within populations, but how these different cells communicate
and compete is critical to designing therapies to fight cancer. I plan to continue studying these and
similar complex biological systems in my postdoctoral research.

109

APPENDICES

110

APPENDIX A
EULER’S SOLUTION TO THE KÖNIGSBERG BRIDGE PROBLEM

Assume an Eulerian path exists. Aside from the starting and ending nodes, every node that was
entered via one bridge must be exited via another bridge, guaranteeing that all of the intermediate
nodes have even degree. If the starting node was distinct from the ending node, then each of them
must have odd degree, and if the starting node was the same as the ending node, then that node
must have even degree. Because all four nodes in Fig. 2.2 have odd degree, no such path exists.

111

APPENDIX B
INTERSET EFFICIENCY NORMALIZATION

The interset efficiency,
EIJ =

1
1
,
∑
|I||J| − |I ∩ J| i∈I di j
j∈J
i= j

measures how efficiently nodes in set J signal to nodes in set I. Below is the derivation of the
number of terms in the summation, (|I||J| − |I ∩ J|).
Fig. B.1 shows an Euler diagram for the relationship between sets I and J, allowing for a
non-empty intersection between them. For convenience, define
A = J \I
B = J ∩I
C = I \J
All that is left is to count the number of distinct pairs (i, j) such that
1. j ∈ A and i ∈ B: |A| × |B| = |J \ I| × |I ∩ J|
2. j ∈ B and i ∈ C: |B| × |C| = |I ∩ J| × |I \ J|
3. j ∈ A and i ∈ C: |A| × |C| = |J \ I| × |I \ J|
4. j ∈ B, i ∈ B, and i = j: |B| × (|B| − 1) = |I ∩ J| × (|I ∩ J| − 1)
The total number of terms is the sum of these pieces,
Normalization = |J \ I| × |I ∩ J| +

(A to B)

|I ∩ J| × |I \ J| +

(B to C)

|J \ I| × |I \ J| +

(A to C)

|I ∩ J| × (|I ∩ J| − 1)

(B to B)

112

Figure B.1 Two overlapping sets, I and J.

Performing the substitutions
|I \ J| = |I| − |I ∩ J|
|J \ I| = |J| − |I ∩ J|
and canceling terms results in
Normalization = |I||J| − |I ∩ J|.

113

APPENDIX C
NESTED COMMUNITY PLOTS

This appendix contains the spy plots referenced in Section 2.8.

114

Figure C.1 Spy plot of Google’s internal web
page network [119]. Each node (row and
column indicies) is a web page owned by
Google, and each edge (a black dot located at
(i, j)) is a hyperlink from page j to page i.
Nodes were assigned a random index.

Figure C.2 Same network as Fig. C.1, but the
nodes indices have been sorted into level-1
communities (boxed in red). This is the
standard result from normal community
detection.

Figure C.3 Same network as Fig. C.1, but the
nodes indices have been sorted into level-2
communities. Note that the new level-2
communities are inset in the level-1
communities from Fig. C.2 (boxes within
boxes).

Figure C.4 Same network as Fig. C.1, but the
nodes indices have been sorted into level-3
communities.

115

Figure C.5 Same network as shown in
Fig. C.1, zoomed in to focus on the first 5,000
nodes.

Figure C.6 Same network as shown in
Fig. C.2, zoomed in to focus on the first 5,000
nodes.

Figure C.7 Same network as shown in
Fig. C.3, zoomed in to focus on the first 5,000
nodes.

Figure C.8 Same network as shown in
Fig. C.4, zoomed in to focus on the first 5,000
nodes.

116

Figure C.9 Spy plot of an arXiv theoretical
high energy physics citation network [90].
Each node (row and column indicies) is an
author of at least one high energy physics
publication, and an edge (a black dot located at
(i, j)) from author j to author i means that one
of j’s articles on arXiv cited at least one of i’s
articles. Nodes were assigned a random index.

Figure C.10 Same network as Fig. C.9, but the
nodes indices have been sorted into level-1
communities. This is the standard result from
normal community detection.

Figure C.11 Same network as Fig. C.9, but the
nodes indices have been sorted into level-2
communities.

Figure C.12 Same network as Fig. C.9, but the
nodes indices have been sorted into level-3
communities.

117

Figure C.13 Same network as shown in
Fig. C.9, zoomed in to focus on the first 5,000
nodes.

Figure C.14 Same network as shown in
Fig. C.10, zoomed in to focus on the first
5,000 nodes.

Figure C.15 Same network as shown in
Fig. C.11, zoomed in to focus on the first
5,000 nodes.

Figure C.16 Same network as shown in
Fig. C.12, zoomed in to focus on the first
5,000 nodes.

118

Figure C.17 Spy plot of AML 2.3, a gene
regulatory network, sorted into nested
communities. Each node (row and column
indicies) is gene, and an edge (a black dot
located at (i, j)) from gene j to gene i means
that j regulates the expression of i. The
maximum depth of this network is 7.

Figure C.18 Same network as shown in
Fig. C.17, zoomed in to focus on the first
5,000 nodes.

Figure C.19 Spy plot of HumanNet, a gene
regulatory network, sorted into nested
communities. Each node (row and column
indicies) is gene, and an edge (a black dot
located at (i, j)) from gene j to gene i means
that j regulates the expression of i. The
maximum depth of this network is 8.

Figure C.20 Same network as shown in
Fig. C.19, zoomed in to focus on the first
5,000 nodes.

119

APPENDIX D
MEASURING EXPRESSION

This appendix is intended as a basic introduction to two of the multitude of methods available
for measuring bulk (i.e. many cells pooled together) genome-wide expression profiles. The recent
breakneck speed of technological innovation in biological science will likely render this appendix
hopelessly obsolete before the author finishes writing it. Any reader interested in detailed coverage
of modern gene expression measurement techniques should consult the most recently published
review articles available.
Two of the most popular methods today for measuring gene expression profiles are microarrays and mRNA sequencing (RNA-seq). First introduced around 1983 [144] and firmly established
in 1995 [132], typical microarrays use a substrate, called a chip, which is subdivided into a lattice of spots. Each spot hosts many single-stranded DNA probes corresponding to a given gene
from a given organism. Cells of interest are lysed and their mRNA is extracted from the lysate.
Complementary DNA (cDNA) is synthesized from the mRNA sample by an enzyme derived from
retroviruses called reverse transcriptase in a process called reverse transcription (making cDNA
from mRNA, the reverse of making mRNA from DNA). Fluorescent tags are also attached to the
resulting cDNA, and the cDNA count is amplified via polymerase chain reaction. The resulting
single-stranded cDNA is then passed over the probes, and the cDNA binds to probes with complementary sequences. Finally, a light source is shone over the microarray and light intensities are
recorded for each spot, with a bright spot meaning higher expression for that gene. Because the
technology relies on bulk measurements of fluorescence rather than counting individual segments
of cDNA, microarrays can only quantify relative differences in gene expression. Thus, all measurements (e.g. the gene expression of a sample of acute myeloid leukemia cells) must be paired with
controls (e.g. healthy myeloblast cells) to identify which genes are over- or underexpressed. Additionally, different cDNA segments have different binding affinities with their matching probes,

120

and some can cross hybridize, or bind to mismatched probes. Even within a single experiment,
the relative expression of a pair of genes cannot be reliably compared: a bright spot can be explained as either a high concentration of mRNA from the given gene, or a low concentration with
high binding affinity. This experimental design makes combining microarray measurements from
different experiments quite difficult, although more modern microarrays use tens of probes per
gene to circumvent this issue. Furthermore, microarrays must be manufactured with a specific set
of probes ahead of time, making them unsuitable for searching for novel transcripts in a known
organism and unsuitable for less well studied organisms. Although there is a large amount of publicly available microarray data across many organisms and experimental conditions, microarrays
are fading in popularity due to their limited quality and the rise of RNA-seq methods.
RNA-seq was introduced around 2008 [93], and was designed to take advantage of the cheap
and highly accurate DNA sequencing methods and tools developed for the Human Genome Project
in the 1990s. As with microarray technology, RNA-seq requires that cDNA is derived from the
mRNA sample via reverse transcription. After the first round of cDNA synthesis and duplication
via PCR, however, a second round of PCR is performed in four batches. Each of these four batches
contains both normal free nucleotides as well as modified versions of one of the four DNA bases,
A, C, G, and T. If a modified base is added to a given cDNA template, sythesis of that strand is
halted. For example, a template strand of DNA in the T batch with the sequence ACTGCTTGA
would produce the strands ACT, ACTGCT, ACTGCTT, and ACTGCTTGA in the final round of
PCR, since there is a fixed probability that the DNA polymerase will pull a modified T from
its environment. The same sequence in the C batch would produce the strands AC, ACTGC,
and ACTGCTTGA. These reads are then sequenced by a machine via electrophoresis through
narrow capillaries, with shorter segments of DNA moving through the gel more quickly than longer
segments. To sequence with high fidelity, either the mRNA or the cDNA needs to be fragmented
into small pieces (hundreds or thousands of base pairs each, depending on the platform) before
being read. It then falls to computers to reconstruct the sequence of strands of cDNA, and thus the
original mRNA, from the millions of resulting fragmented reads. RNA-seq is much more accurate

121

than microarrays because there are no artifacts from varying binding affinity or cross hybridization
(but still may suffer from amplification and cDNA conversion artifacts), but the experimental and
computational costs are much higher than microarrays. RNA-seq has the added advantage of
being able to detect previously unknown transcripts. This has enabled researchers to learn about
the regulatory and other roles of non-coding RNAs such as long non-coding RNAs (lncRNA), short
interfering RNAs (siRNA), and micro RNAs (miRNA).

122

APPENDIX E
PROOF OF THEOREM FROM SECTION 5.3.2.2

Theorem: Suppose a network G contains a cycle cluster C. Define the set of externally influenced
nodes
R(C, G) = {i ∈ V (C) : j ∈ V (G \C), ( j, i) ∈ E(G)} ,

(E.1)

the set of intruder connections
W (C, G) = {( j, i) ∈ E(G) : i ∈ V (C), j ∈ V (G \C)} ,

(E.2)

and the reduced set of critical nodes
Zred (C, G) = Z(C, G \W ) .

(E.3)

µ ≡ min deg− (i) ,

(E.4)

If N = |V (C)| and
i∈V (C)

where deg− (i) is computed ignoring intruder connections, then
µ
≤ ncrit (C, G) ≤ ζ ,
2

(E.5)

where
ζ ≡ min

N
+ |R(C, G) \ Zred (C, G)|, N
2

.

(E.6)

Proof: First, prove the lower limit of Eq. E.5. Let C be a cycle cluster in a network G with
R(C, G) = {∅}. (A cycle cluster in a network with |R(C, G)| > 0 will have the same or higher
activation barrier for any node in the cluster than the same cycle cluster in a network with R = {∅}.
Since the lower limit of Eq. E.5 is being examined, the case with the lowest activation barrier is
considered. Any externally influenced nodes cause ncrit to either increase or remain the same.) For
any node i to be able to flip away from the cancer state (although not necessarily remain there), it
must be that hi = −aξic for a ≥ 0, meaning that at least half of the nodes upstream of i must point
123

away from the cancer state. The node i requiring the smallest number of upstream nodes to be in
the normal state is the node that satisfies deg− (i) = µ. Controlling less than µ/2 nodes will leave
all uncontrolled nodes with a field in the cancer direction, and no more flips will occur. Thus,
ncrit ≥

µ
2

.

(E.7)

For the upper limit of Eq. E.5, consider a complete clique on N nodes, C = KN (that is, Ai j = 1
for all i, j ∈ V (KN ), including self loops) in a network G. First, let there be no connections to any
nodes in C from outside of C so that R(C, G) = {∅}. For odd N, forcing (N + 1)/2 nodes away
from the cancer state will result in the field

∑ Ji j σ j =
j

N −1 N +1
−
ξic = −ξic
2
2

(E.8)

for all nodes i. After one time step, all nodes will flip away from the cancer state. For even N,
forcing N/2 nodes away from the cancer state will result in the field

∑ Ji j σ j =
j

N N
−
2 2

ξic = 0

(E.9)

for all nodes i. At the next time step, the unfixed nodes will pick randomly between the normal and
cancer state. If at least one of these nodes makes the transition away from the cancer state, the field
at all other nodes will point away from the cancer direction. The system will then require one more
time step to completely settle to σi = −ξic . Thus, for C = KN in a network G with R(C, G) = {∅},
ncrit (KN , G) =

N
2

.

(E.10)

KN with σi (0) = ξic gives the largest activation barrier for any cycle cluster on N nodes with
R(C, G) = {∅} to switch away from the cancer attractor state. A general cycle cluster C with any
topology on N nodes with R(C, G) = {∅} in a network G will have deg− (i) ≤ N for all nodes i,
and so the upper bound is
ncrit (C, G) ≤

N
2

,

thus proving Eq. E.5 for the special case of R(C, G) = {∅}.

124

(E.11)

Now consider a cycle cluster C on N nodes in a network G with |R(C, G)| ≥ 0. Suppose all
nodes in Zred (C, G) are fixed away from the cancer state. By Eq. E.11, |Zred (C, G)| ≤ N/2 .
For any node i ∈ (R(C, G) ∩ Zred (C, G)), σi (t → ∞) = −ξic is guaranteed because it has already
been directly controlled. Any node i ∈ (R(C, G) \ Zred (C, G)) has some incoming connections
from nodes j ∈
/ V (C), and these connections could increase the activation barrier enough such that
fixing Zred (C, G) is not enough to guarantee σi (t → ∞) = −ξic . To ensure that any node l ∈ V (C)
points away from the cancer state, it is sufficient to fix all nodes i ∈ (R(C, G) \ Zred (C, G)) as well
as Zred (C, G) away from the cancer state. This increases ncrit by at most |R(C, G) \ Zred (C, G)|,
leaving
ncrit (C, G) ≤

N
+ |R(C, G) \ Zred (C, G)| .
2

(E.12)

ncrit can never exceed N, however, because directly controlling every node results in controlling C.
It can thus be said that
ncrit (C, G) ≤ min

N
+ |R(C, G) \ Zred (C, G)|, N
2

.

(E.13)

Finally, combining the upper limit in Eq. E.13 with the lower limit from Eq. E.7 gives Eq. E.5.

125

APPENDIX F
HOPFIELD MEAN FIELD SOLUTION

F.1

Mean field for Erd˝os-Rényi networks

I now rederive Derrida’s mean field result [40]. Without loss of generality, define m to be the
overlap between attractor 1 and the state vector at time t,
m=

1 N 1
∑ ξi σi(t).
N i=1

(F.1)

where N is the number of nodes in the network, and N → ∞. Assume that {σi (t)} has macroscopic
overlap with attractor 1, and microscopic overlap with all other attractors µ = 1. Define the indegree of node i to be k, and label its upstream neighbors { j1 , j2 , . . . , jk }. At a given time t, let n be
the number of upstream nodes which have σ jr = −ξi1 (and so k − n have σ jr = +ξi1 ). Similarly,
µ=1

define s to be the number of attractors across all upstream nodes which have ξ j1r = −ξ jr
µ=1

(kp − k − s) have ξ j1r = +ξ jr

(and so

). The total field at node i is given by
hi = kp − 2n − 2s.

(F.2)

The probability for a randomly chosen node to have a given triplet (k, n, s) is given by
P(k, n, s) = P(k)P(n, s|k)
(F.3)
= P(k)P(n|k)P(s|k).
Directed Erd˝os-Rényi random networks have a Poisson indegree (and identical outdegree) distribution,
P(k) =

ck e−c
,
k!

(F.4)

where c is the mean indegree (which equals the mean outdegree). Since (m + 1)/2 is the expected
fraction of nodes in state ξ j1r , P(n|k) follows a binomial distribution,
k
P(n|k) =
n

1 + m k−n 1 − m n
.
2
2
126

(F.5)

µ

Each ξ jr is an independent random variable, so for the remaining p − 1 attractors, P(s|k) also
follows a binomial distribution,
1 k(p−1)
.
2

k(p − 1)
s

P(s|k) =

(F.6)

This means that the expected spin of node i can be computed from
∞

σi = m =

k k(p−1)

P(k, n, s)

∑∑ ∑

σi =±1

k=0 n=0 s=0
∞

=

∑

k k(p−1)

∑∑ ∑

P(k, n, s) tanh

k=0 n=0 s=0

σi
−(kp−2n−2s)σ
i /T
1+e

(F.7)

kp − 2n − 2s
T

This gives Derrida’s mean field equation for a Hopfield system on a dilute directed Erd˝os-Rényi
network,
∞

m=

k k(p−1) ck e−c

∑∑ ∑

k=0 n=0 s=0

k!

k (1 + m)k−n (1 − m)n k(p − 1)
kp − 2n − 2s
,
tanh
n
T
s
2kp

(F.8)

where c is the mean indegree, p is the number of attractors, and T is the temperature.
In principle, m can be computed from Eq. F.8 numerically. However, the nested summations
produce many non-negligible terms for even modest values of c and p (e.g. for c = 5 and p = 3,
Eq. F.8 is composed of 4940 separate terms when keeping all terms that satisfy P(k) > 10−3 ).
Some approximations are now made to solve for m. Recall that
2
1
lim √ e−x /a = δ (x),
a→0 πa

(F.9)

where δ (x) is the Dirac delta function. This means that for large c and p (but keeping the ratio
(p − 1)/c fixed),
+∞ dk
ck e−c
(k − c)2
√
≈
exp
−
∑ k!
2c
−∞
2πc
k=0
∞

( k − 1)2
(dk/c)
exp − c
2/c
−∞
2π/c
+∞

=

+∞

≈

−∞

d

k
δ
c

127

k
−1 .
c

(F.10)

Similarly for large k,
k

(n − kq)2
dk
exp −
2kq(1 − q)
2πkq(1 − q)

+∞
k
(1 − q)k−n qn ≈
−∞
n=0 n

∑

+∞

n
≈
d
δ
cq
−∞

(F.11)

n
−1 .
cq

The summation over s can also be approximated as
k(p−1)

∑

s=0

k(p − 1)
s

+∞
1 k(p−1)
≈
2
−∞

=

=

ds

exp −

c(p−1)
2
s
+∞ d
c
exp
(p−1)
−∞
π 2c
+∞ dy
−y2

−∞

π

√ e
π

−

c(p−1) 2
2 )
c(p−1)
2

(s −

(p−1) 2
2 )
(p−1)
2c

( cs −

(F.12)

,

where α = (p − 1)/c and a variable substitution was performed by setting
y=

p−1
s
c− 2 .

(F.13)

α/2
Combining these approximations results in Derrida’s mean field integral equation,
√
+∞
2
cm − cy 2α
1
−y
dy e
tanh
m= √
.
T
π −∞

F.2

(F.14)

Mean field for modular networks

The topology of a stochastic block model network (SBMN) with Ω communities is defined by the
nonnegative connectivity matrix CΩ×Ω , where CIJ is the mean number of edges from nodes in
community J to nodes in community I. Note that while AML 2.3’s CIJ is diagonally dominant
(i.e. there are more edges internal to communities than between communities), it is not necessary
for the following results to hold.
Eq. F.8 will now be generalized for modular networks. Consider a SBMN with connectivity
matrix CIJ . Without loss of generality, define the overlap between all spins i ∈ I and attractor 1 to
be
mI =

1
∑ ξi1σi(t),
NI i∈I
128

(F.15)

where NI is the number of nodes in community I, and NI → ∞. The variables k and n now gain
indices to signify the source and target communities for each edge. P(kIJ ) is the probability that
a randomly chosen node i ∈ I has kIJ upstream neighbors from community J, and is given by a
Poisson distribution,
k

C IJ e−CIJ
P(kIJ ) = IJ
.
kIJ !

(F.16)

nIJ is the number of these upstream neighbors with a current spin antialigned with attractor 1 (and
therefore kIJ − nIJ is the number of upstream neighbors with current spin aligned with attractor 1).
nIJ follows a binomial distribution which depends on the overlap of community J with attractor 1,
k
P(nIJ |kIJ ) = IJ
nIJ

mJ + 1 kIJ −nIJ
2

mJ − 1 nIJ
.
2

(F.17)

Since all attractors are composed of independent random variables, the number of upstream attractor diagreements follows a binomial distribution and depends only on the total indegree across all
communities,
P(sI |kItot ) =

kItot (p − 1)
sI

tot
1 kI (p−1)
,
2

(F.18)

where kItot = ∑Ω
J=1 kIJ . The joint probability distribution can now be written as
Ω

P(kI1 , . . . , kIΩ ; nI1 , . . . , nIΩ ; sI ) =

∏ P(kIJ , nIJ )

P(sI |kItot )

J=1
Ω

=

∏ P(kIJ )P(nIJ |kIJ )

(F.19)
P(sI |kItot ).

J=1

Substituting Eqs. F.16, F.17, and F.18 into Eq. F.19 gives the probability density


Ω CkIJ e−CIJ k
ktot (p − 1)
IJ
P({kIJ }, {nIJ }, sI ) =  ∏ IJ
(mJ + 1)kIJ −nIJ (mJ − 1)nIJ  I
kIJ !
nIJ
sI
J=1

tot
1 kI p
.
2

(F.20)

129

This means that the modular form of the mean field equation is given by




Ω k
(p−1)
∑
k
k
k
∞
∞
∞
J=1 IJ
I1
I2
IΩ


mI =  ∑ ∑ . . . ∑   ∑ ∑ . . . ∑  

∑
kI1 =0 kI2 =0

kIΩ =0



nI1 =0 nI2 =0

nIΩ =0

sI =0


k
CIJIJ e−CIJ kIJ
(1 + mJ )kIJ −nIJ (1 − mJ )nIJ 
×∏
kIJ !
nIJ
J=1
Ω

∑Ω
J=1 kIJ (p − 1)
×
sI

Ω
1 ∑J=1 kIJ p
tanh
2

∑Ω
J=1 (kIJ p − 2nIJ ) − 2sI . (F.21)
T

Note that for Ω = 1, Eq. F.21 reduces to Eq. F.8. Again, there are too many terms in Eq. F.21 for
computation of mI to be practical. For large values of p and CIJ for all I and J, the approximations
in Eqs. F.10, F.11, and F.12 hold. This means that Eq. F.21 becomes
dkIJ
(kIJ −CIJ )2
√
exp
−
∏
2CIJ
2πCIJ
J=1 −∞
Ω

mI =

+∞

×

−∞

+∞

+∞

dsI
π p−1
2 ∑J kIJ

exp −

−∞

2
(sI − p−1
2 ∑J kIJ )
p−1
2 ∑J kIJ

(n − kIJ qJ )2
dnIJ
exp − IJ
2kIJ qJ (1 − qJ )
2πkIJ qJ (1 − qJ )
tanh

∑J (kIJ p − 2nIJ ) − 2sI
T

. (F.22)

where qJ ≡ (1 − mJ )/2. Letting CI ≡ ∑J CIJ , αI ≡ (p − 1)/CI , and taking p → ∞, CIJ → ∞
such that αI remains constant and the ratios CIJ /CI J remain constant for all I, I , J, and J , the
Gaussians in kIJ and nIJ turn into delta functions. Only the sI integral remains, so


sI
2
−
p
+
1)
(2
+∞
s
2
CI
 tanh ∑J (kIJ p − 2nIJ ) − 2sI
mI =
d I
exp −
CI
παI
2αI
T
−∞
Performing the substitution

(F.23)

s

2 CI − p + 1
yI = I√
2αI
gives the final form for mI ,
+∞
2
1
mI = √
dyI e−yI tanh
π −∞

√
∑Ω
J=1 CIJ mJ −CI yI 2αI
T

(F.24)

.

(F.25)

This system of Ω coupled self-consistent integral equations can be solved numerically. The variance of the magnetization is given by
(∆mI )2 = σI2 − σI 2
= 1 − m2I .
130

(F.26)

Introducing a control term Qi , our system’s magnetization vector becomes
√
+∞
Cix mx − yCi 2αi
Qi + ∑Ω
2
1
−y
x=1
mi = √
dy e
tanh
.
T
π −∞

(F.27)

Note that
1
lim mi (T ) = √
π
T →0

wi
−∞

2

dy e−y −

∞

dy e−y

2

(F.28)

wi

where
Qi + ∑Ω
x=1 Cix mx .
√
wi ≡
Ci 2αi

(F.29)

Define χi j as the total derivative
χi j ≡

dmi
,
dQ j m =m∗ , Q =0 ∀x
x
x x

(F.30)

where m∗x is the numerical solution to (F.27). χi j is thus the expected change in the magnetization
of community i when community j is targeted by an external field. I define this using the total
derivative rather than the partial derivative because I’m making no assumptions about mi remaining
constant when m j changes for i = j. This is required to see off-target effects. I drop all of the
“evaluated at” bars hereafter, but keep in mind that all derivatives are evaluated at mx = m∗x , Qx = 0
∀x. In general, if f = f (x1 (t), x2 (t), . . . ,t), the total derivative is given by
∂f
∂ f dxi
df
=
+∑
,
(F.31)
dt
∂t
i ∂ xi dt
where I am being very careful to correctly distinguish between partial and total derivatives. In our
case, since mi = mi (m1 (Qi ), m2 (Qi ), . . . , mΩ (Qi ), Qi ),
Ω ∂ m dm
dmi
∂ mi
i
k
=
+∑
dQ j ∂ Q j k=1 ∂ mk dQ j
Ω ∂m
∂ mi
i
=
+∑
χ ,
∂ Q j k=1 ∂ mk k j

(F.32)

Using the definition from (F.27),
√
Qi + ∑Ω
C
m
−
yC
2αi
x
ix
i
x=1
T
Ω C m − yC √2α
+∞
Q
+
2
∑
1
i
i
i
x=1 ix x
√
dy e−y sech2
T
T π −∞

+∞
2 ∂
∂ mi
1
=√
dy e−y
tanh
∂Qj
∂Qj
π −∞

= δi j

= δi j ai ,
131

(F.33)

where, recalling that mx = m∗x , Qx = 0 ∀x,
√
∗
∑Ω
x=1 Cix mx − yCi 2αi
T

+∞
2
1
ai ≡ √
dy e−y sech2
T π −∞

.

(F.34)

Note that ai = ai (T ), so all of the temperature dependence is in a single term. Also note that in
general,
y − yγ
1
sech2 0
T
T →0 T

= 2δ (y0 − yγ) =

lim

so


lim ai (T ) =

T →0

Also,

2
Ci2 αi π

exp −

2
δ
|γ|

Cix m∗x
∑Ω
x=1
√
Ci 2αi

y0
−y ,
γ
2

(F.35)


.

√
Qi + ∑Ω
x=1 Cix mx − yCi 2αi
T
√
+∞
Cix mx − yCi 2αi
Qi + ∑Ω
2
1
2
−y
x=1
√
sech
dy e
T
T π −∞

(F.36)

+∞
2 ∂
∂ mi
1
dy e−y
=√
tanh
∂ mk
∂ mk
π −∞

= Cik

(F.37)

= Cik ai .
Putting everything together,
Ω

χi j = ai δi j +

∑ aiCik χk j .

(F.38)

k=1

If Ai j ≡ ai δi j and Bi j ≡ aiCi j (matrices which only depend on T , which is fixed) are defined, then
Ω

χi j = Ai j +

∑ Bik χk j ,

(F.39)

k=1

or in simpler notation,
χ = A + Bχ .

(F.40)

χ = (11 − B)−1 A.

(F.41)

This can be solved by

Since I defined χi j as the derivative evaluated at equilibrium, χi j is the solution to the linear
equation (F.41). However, χi j is a nonlinear function of the temperature T and the topology Ci j
because it depends on ai .

132

APPENDIX G
CORRELATED ATTRACTORS

µ

using the scheme

Note: for mathematical convenience, this appendix labels the attractors ξi
µ = 0, 1, . . . , p − 1.
µ

The basic Hopfield model was designed under the assumption that ξi = ±1 with equal probµ

αc ≈ 0.138, then ∑i ξi ξiν /N ≈ 0 for all

ability for any given (i, µ) pair, and if N → ∞ and α

µ = ν, and the crosstalk term from Eq. 5.13 can be ignored. However, the capacity for correlated
attractors is less than 0.138 and depends on the degree of the correlations. A simple trick from
linear algebra called a Moore-Penrose pseudoinverse can be used to stabilize correlated patterns.
If the coupling matrix is constructed as
Ji j =

1
µ
ξi Q−1
ξ jν
∑
N µν
µν

(G.1)

for the matix
Qµν =
then for σi (t) = ξiλ ,
hi (t) =

1
µ
ξi ξiν
∑
N i

(G.2)

1
µ
∑ ξi Q−1 µν ξ jν ξ jλ
N jµν
µ

= ∑ ξi

Q−1
µν

µν

Qνλ
(G.3)

µ

= ∑ ξi δµλ
µ

= ξiλ
Note that this only works if the matrix is invertible. If the patterns are uncorrelated, then Qµν =
δµν , reducing Ji j to the basic form from Eq. 5.6. The matrix Qµν may also be integrated into the
cyclic Hopfield matrix by
Ji j =

1
mod p (µ+1)
ξi
Q−1
ξν
∑
N µν
µν j
133

(G.4)

Again, if σi (t) = ξiλ ,
hi (t) =

1
mod p (µ+1)
ξi
Q−1
ξνξλ
∑
N jµν
µν j j
mod p (µ+1)

= ∑ ξi

Q−1
µν

µν

=∑

Qνλ
(G.5)

mod p (µ+1)
ξi
δµλ

µ
mod p (λ +1)

= ξi

In general, using this formulation for the coupling matrix ensures higher fidelity point and cyclic
attractors than the original form which omits Qµν .

134

BIBLIOGRAPHY

135

BIBLIOGRAPHY

[1]

West Wing, 28 Feb 2001. Season 2, Episode 16, 23:00 mark. CJ Cregg is shocked to learn
that the surface area of Africa is 14 times larger than that of Greenland.

[2]

V. Ágoston, P. Csermely, and S. Pongor. Multiple weak hits confuse complex systems: a
transcriptional regulatory network as an example. Phys. Rev. E, 71(5):051909, 2005.

[3]

T. Akasaka, I.S. Lossos, and R. Levy. Bcl6 gene translocation in follicular lymphoma: a
harbinger of eventual transformation to diffuse aggressive lymphoma. Blood, 102(4):1443–
1448, 2003.

[4]

T. Akutsu, M. Hayashida, W.K. Ching, and M.K. Ng. Control of boolean networks: hardness
results and algorithms for tree structured networks. J. Theor. Biol., 244(4):670–679, 2007.

[5]

S. Amari, H. Ando, T. Toyoizumi, and N. Masuda. State concentration exponent as a measure of quickness in kauffman-type networks. Phys. Rev. E, 87:022814, Feb 2013.

[6]

D.J. Amit, H. Gutfreund, and H. Sompolinsky. Spin-glass models of neural networks. Phys.
Rev. A, 32(2):1007, 1985.

[7]

Ron C. Anafi and Jason H. T. Bates. Balancing robustness against the dangers of multiple
attractors in a hopfield-type model of biological attractors. PLoS ONE, 5(12):e14413, 12
2010.

[8]

Vivek Anantharaman, Eugene V. Koonin, and L. Aravind. Comparative genomics and evolution of proteins involved in rna metabolism. Nucleic Acids Research, 30(7):1427–1464,
2002.

[9]

P. Ao, D. Galas, L. Hood, and X. Zhu. Cancer as robust intrinsic state of endogenous
molecular-cellular network shaped by evolution. Med. Hypotheses, 70(3):678–684, 2008.

[10] Gary D Bader and Christopher WV Hogue. An automated method for finding molecular
complexes in large protein interaction networks. BMC bioinformatics, 4(1):1, 2003.
[11] X. Bai, L. Wu, T. Liang, Z. Liu, J. Li, D. Li, H. Xie, S. Yin, J. Yu, Q. Lin, et al. Overexpression of myocyte enhancer factor 2 and histone hyperacetylation in hepatocellular carcinoma.
J. Canc. Res. Clinic. Oncol., 134(1):83–91, 2008.
[12] S.J. Baker, E.R. Fearon, J. M. Nigro, A.C. Preisinger, J.M. Jessup, D.H. Ledbetter, D.F.
Barker, Y. Nakamura, R. White, B. Vogelstein, et al. Chromosome 17 deletions and p53
gene mutations in colorectal carcinomas. Science, 244(4901):217–221, 1989.
[13] David Balchin, Manajit Hayer-Hartl, and F Ulrich Hartl. In vivo aspects of protein folding
and quality control. Science, 353(6294):aac4354, 2016.
[14] Paola Ballario, Paola Vittorioso, Armando Magrelli, Claudio Talora, Andrea Cabibbo, and
Giuseppe Macino. White collar-1, a central regulator of blue light responses in neurospora,
is a zinc finger protein. The EMBO Journal, 15(7):1650, 1996.
136

[15] Albert-Laszlo Barabasi and Zoltan N Oltvai. Network biology: understanding the cell’s
functional organization. Nature Reviews Genetics, 5(2):101–113, 2004.
[16] Katia Basso, Adam A Margolin, Gustavo Stolovitzky, Ulf Klein, Riccardo Dalla-Favera,
and Andrea Califano. Reverse engineering of regulatory networks in human b cells. Nature
genetics, 37(4):382–390, 2005.
[17] M.T. Bengoechea-Alonso and J. Ericsson. Tumor suppressor fbxw7 regulates tgfβ signaling
by targeting tgif1 for degradation. Oncogene, 29(38):5322–5328, 2010.
[18] N. Bhardwaj, M.B. Carson, A. Abyzov, K.-K. Yan, H. Lu, and M.B. Gerstein. Analysis
of combinatorial regulation: scaling of partnerships between regulators with the number of
governed targets. PLoS Comp. Biol., 6(5):e1000755, 2010.
[19] Remco Bouckaert, Philippe Lemey, Michael Dunn, Simon J Greenhill, Alexander V Alekseyenko, Alexei J Drummond, Russell D Gray, Marc A Suchard, and Quentin D Atkinson. Mapping the origins and expansion of the indo-european language family. Science,
337(6097):957–960, 2012.
[20] Dennis Bray. Protein molecules as computational elements in living cells.
376(6538):307–312, 1995.

Nature,

[21] L. Bullinger, K. Döhner, E. Bair, S. Fröhling, R.F. Schlenk, R. Tibshirani, H. Döhner, and
J.R. Pollack. Use of gene-expression profiling to identify prognostic subclasses in adult
acute myeloid leukemia. New Engl. J. Med., 350(16):1605–1616, 2004.
[22] Carlos Caldas. Cancer sequencing unravels clonal evolution.
30(5):408–410, 2012.

Nature biotechnology,

[23] D. Calzolari, G. Paternostro, P.L. Harrington Jr., C. Piermarocchi, and P.M. Duxbury. Selective control of the apoptosis signaling network in heterogeneous cell populations. PLoS
ONE, 2(6):e547, 2007.
[24] Neil A. Campbell and Jane B. Reece. Biology, 6th Edition. Benjamin Cummings, 2002.
[25] Han Chen, Fangqin Lin, Ke Xing, and Xionglei He. The reverse evolution from multicellularity to unicellularity during carcinogenesis. Nature Communications, 6, 2015.
[26] Ting Chen, Hongyu L He, George M Church, et al. Modeling gene expression with differential equations. In Pacific symposium on biocomputing, volume 4, page 4, 1999.
[27] A. Choudhary, A. Datta, M.L. Bittner, and E.R. Dougherty. Intervention in a family of
boolean networks. Bioinformatics, 22(2):226–232, 2006.
[28] S Ciftci-Yilmaz and R Mittler. The zinc finger network of plants. Cellular and Molecular
Life Sciences, 65(7-8):1150–1160, 2008.
[29] Mara Compagno, Wei Keat Lim, Adina Grunn, Subhadra V Nandula, Manisha Brahmachary, Qiong Shen, Francesco Bertoni, Maurilio Ponzoni, Marta Scandurra, Andrea Califano, et al. Mutations of multiple genes cause deregulation of nf-κb in diffuse large b-cell
lymphoma. Nature, 459(7247):717–721, 2009.
137

[30] S.P. Cornelius, W.L. Kath, and A.E Motter. Realistic control of network dynamics. Nature
Commun., 4:1–9, 2013.
[31] P. Creixell, E. M Schoof, J.T. Erler, and R. Linding. Navigating cancer network attractors
for tumor-specific therapy. Nature Biotechnol., 30(9):842–848, 2012.
[32] Péter Csermely, Vilmos Ágoston, and Sandor Pongor. The efficiency of multi-target
drugs: the network approach might help drug design. Trends in Pharmacological Sciences,
26(4):178–182, 2005.
[33] Simon Cutting, Michelle Anderson, Elena Lysenko, Anthony Page, Toshifumi Tomoyasu,
Kenji Tatematsu, Takashi Tatsuta, Lee Kroos, and Teru Ogura. spovm, a small protein
essential to development in bacillus subtilis, interacts with the atp-dependent protease ftsh.
Journal of Bacteriology, 179(17):5534–5542, 1997.
[34] Josephine T Daub, Isabelle Dupanloup, Marc Robinson-Rechavi, and Laurent Excoffier.
Inference of evolutionary forces acting on human biological pathways. Genome Biology
and Evolution, page evv083, 2015.
[35] G. De Falco, E. Leucci, D. Lenze, P.P. Piccaluga, P.P. Claudio, A. Onnis, G. Cerino,
J. Nyagol, W. Mwanda, C. Bellan, et al. Gene-expression analysis identifies novel
rbl2/p130 target genes in endemic burkitt lymphoma cell lines and primary tumors. Blood,
110(4):1301–1307, 2007.
[36] R. De Smet and K. Marchal. Advantages and limitations of current network inference methods. Nature Rev. Microbiol., 8(10):717–729, 2010.
[37] Paramvir Dehal and Jeffrey L Boore. Two rounds of whole genome duplication in the
ancestral vertebrate. PLoS biology, 3(10):1700, 2005.
[38] Glynn Dennis Jr, Brad T Sherman, Douglas A Hosack, Jun Yang, Wei Gao, H Clifford Lane,
Richard A Lempicki, et al. David: database for annotation, visualization, and integrated
discovery. Genome Biology, 4(5):P3, 2003.
[39] B Derrida and H Flyvbjerg. The random map model: a disordered model with deterministic
dynamics. Journal de Physique, 48(6):971–978, 1987.
[40] B. Derrida, E. Gardner, and A. Zippelius. An exactly solvable asymmetric neural network
model. Europhys. Lett.), 4(2):167, 1987.
[41] Bernard Derrida and Henrik Flyvbjerg. Multivalley structure in kauffman’s model: Analogy
with spin glasses. Journal of Physics A: Mathematical and General, 19(16):L1003, 1986.
[42] Bernard Derrida and Yves Pomeau. Random networks of automata: a simple annealed
approximation. EPL (Europhysics Letters), 1(2):45, 1986.
[43] A. Diaz-Alderete, A. Doval, F. Camacho, L. Verde, P. Sabin, R. Arranz-Saez, C. Bellas,
C. Corbacho, J. Gil, M. Perez-Martin, et al. Frequency of bcl2 and bcl6 translocations in
follicular lymphoma: relation with histological and clinical features. Leukemia Lymphoma,
49(1):95–101, 2008.
138

[44] Daniel Dominguez, Yi-Hsuan Tsai, Nicholas Gomez, Deepak Kumar Jha, Ian Davis, and
Zefeng Wang. A high-resolution transcriptome map of cell cycle reveals novel connections
between periodic genes and cancer. Cell Research, 2016.
[45] Federal Aviation Administration (2016). Passenger Boarding (Enplanement) and AllCargo Data for U.S. Airports for Calendar Year 2015. Retrieved on 22 Aug 2016 from
http://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger.
[46] K. Eppert, K. Takenaka, E.R. Lechman, L. Waldron, B. Nilsson, P. van Galen, K.H. Metzeler, A. Poeppl, V. Ling, J. Beyene, et al. Stem cell gene expression programs influence
clinical outcome in human leukemia. Nature Med., 17(9):1086–1093, 2011.
[47] Adam Ertel and Aydin Tozeren. Switch-like genes populate cell communication pathways
and are enriched for extracellular proteins. BMC genomics, 9(1):1, 2008.
[48] Philipp Eser, Carina Demel, Kerstin C Maier, Björn Schwalb, Nicole Pirkl, Dietmar E Martin, Patrick Cramer, and Achim Tresch. Periodic mrna synthesis and degradation co-operate
during cell cycle gene expression. Molecular systems biology, 10(1):717, 2014.
[49] S. Fabris, L. Mosca, G. Cutrona, M. Lionetti, L. Agnelli, G. Ciceri, M. Barbieri, F. Maura,
S. Matis, M. Colombo, et al. Chromosome 2p gain in monoclonal b-cell lymphocytosis and
in early stage chronic lymphocytic leukemia. Am. J. Hemat., 88(1):24–31, 2013.
[50] G. Fagiolo. Clustering in complex directed networks. Phys. Rev. E, 76:026107, Aug 2007.
[51] Jacob D Feala, Jorge Cortes, Phillip M Duxbury, Carlo Piermarocchi, Andrew D McCulloch, and Giovanni Paternostro. Systems approaches and algorithms for discovery of
combinatorial therapies. Wiley Interdisciplinary Reviews: Systems Biology and Medicine,
2(2):181–193, 2010.
[52] J.D. Feala, J. Cortes, P.M. Duxbury, A.D. McCulloch, C. Piermarocchi, and G. Paternostro.
Statistical properties and robustness of biological controller-target networks. PLoS ONE,
7(1):e29374, 2012.
[53] Hunter B Fraser, Aaron E Hirsh, Lars M Steinmetz, Curt Scharfe, and Marcus W Feldman.
Evolutionary rate in the protein interaction network. Science, 296(5568):750–752, 2002.
[54] Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry, pages
35–41, 1977.
[55] Pauline A. Fujita, Brooke Rhead, Ann S. Zweig, Angie S. Hinrichs, Donna Karolchik,
Melissa S. Cline, Mary Goldman, Galt P. Barber, Hiram Clawson, Antonio Coelho, Mark
Diekhans, Timothy R. Dreszer, Belinda M. Giardine, Rachel A. Harte, Jennifer HillmanJackson, Fan Hsu, Vanessa Kirkup, Robert M. Kuhn, Katrina Learned, Chin H. Li, Laurence R. Meyer, Andy Pohl, Brian J. Raney, Kate R. Rosenbloom, Kayla E. Smith, David
Haussler, and W. James Kent. The ucsc genome browser database: update 2011. Nucleic
Acids Research, 2010.
[56] M. Girvan and M. E. J. Newman. Community structure in social and biological networks.
Proc. Nat. Acad. Sci. USA, 99(12):7821–7826, 2002.
139

[57] S. Z. Glud, A. B. Sörensen, M. Andrulis, B. Wang, E. Kondo, R. Jessen, L. Krenacs,
E. Stelkovics, M. Wabl, E. Serfling, et al. A tumor-suppressor function for nfatc3 in t-cell
lymphomagenesis by murine leukemia virus. Blood, 106(10):3546–3552, 2005.
[58] Matthew W Hahn and Andrew D Kern. Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Molecular Biology and Evolution,
22(4):803–806, 2005.
[59] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data.
IEEE Intelligent Systems, 24(2):8–12, 2009.
[60] R. Hamid and S.J. Brandt. Transforming growth-interacting factor tgif regulates proliferation and differentiation of human myeloid leukemia cells. Mol. Oncol., 3(5):451–463, 2009.
[61] Douglas Hanahan and Robert A Weinberg. Hallmarks of cancer: the next generation. Cell,
144(5):646–674, 2011.
[62] C.P. Hans, D.D. Weisenburger, T.C. Greiner, R.D. Gascoyne, J. Delabie, G. Ott, H.K.
Müller-Hermelink, E. Campo, R.M. Braziel, E. S. Jaffe, et al. Confirmation of the molecular classification of diffuse large b-cell lymphoma by immunohistochemistry using a tissue
microarray. Blood, 103(1):275–282, 2004.
[63] Sandra L Harris and Arnold J Levine. The p53 pathway: positive and negative feedback
loops. Oncogene, 24(17):2899–2908, 2005.
[64] A.J. Hartemink. Reverse engineering gene regulatory networks.
23(5):554–555, 2005.

Nature Biotechnol.,

[65] Anne-Claire Haury, Fantine Mordelet, Paola Vera-Licona, and Jean-Philippe Vert. Tigress:
trustful inference of gene regulation using stability selection. BMC systems biology, 6(1):1,
2012.
[66] James Hays and Alexei A Efros. Scene completion using millions of photographs. Communications of the ACM, 51(10):87–94, 2008.
[67] John Hertz, Anders Krogh, and Richard G Palmer. Introduction to the theory of neural
computation, volume 1. Basic Books, 1991.
[68] J.J. Hopfield. Neural networks and physical systems with emergent collective computational
abilities. Proc. Nat. Acad. Sci. USA, 79(8):2554–2558, 1982.
[69] S. Huang, G. Eichler, Y. Bar-Yam, and D.E. Ingber. Cell fates as high-dimensional attractor
states of a complex gene regulatory network. Phys. Rev. Lett., 94:128701, Apr 2005.
[70] Mustafa Hussain, Mahadev Rao, Ashley E Humphries, Julie A Hong, Fang Liu, Maocheng
Yang, Diana Caragacianu, and David S Schrump. Tobacco smoke induces polycombmediated repression of dickkopf-1 in lung cancer cells. Cancer research, 69(8):3570–3578,
2009.

140

[71] Alexandre Irrthum, Louis Wehenkel, Pierre Geurts, et al. Inferring regulatory networks from
expression data using tree-based methods. PloS one, 5(9):e12776, 2010.
[72] Minoru Kanehisa and Susumu Goto. Kegg: kyoto encyclopedia of genes and genomes.
Nucleic acids research, 28(1):27–30, 2000.
[73] Stuart Kauffman, Carsten Peterson, Björn Samuelsson, and Carl Troein. Genetic networks
with canalyzing boolean rules are always stable. Proceedings of the National Academy of
Sciences of the United States of America, 101(49):17102–17107, 2004.
[74] Stuart A Kauffman. Metabolic stability and epigenesis in randomly constructed genetic
nets. Journal of theoretical biology, 22(3):437–467, 1969.
[75] Haseong Kim and Erol Gelenbe. Stochastic gene expression modeling with hill function
for switch-like gene responses. IEEE/ACM Transactions on Computational Biology and
Bioinformatics, 9(4):973–979, 2012.
[76] Scott Kirkpatrick and David Sherrington. Infinite-ranged models of spin-glasses. Physical
Review B, 17(11):4384, 1978.
[77] William S. Klug, Michael R. Cummings, and Charlotte A. Spencer. Concepts of Genetics,
8th Edition. Pearson Education, 2006.
[78] Eugene V Koonin. Comparative genomics, minimal gene-sets and the last universal common ancestor. Nature Reviews Microbiology, 1(2):127–136, 2003.
[79] Kirill S Korolev, Joao B Xavier, and Jeff Gore. Turning ecology and evolution against
cancer. Nature Reviews Cancer, 14(5):371–380, 2014.
[80] Johannes Krause, Qiaomei Fu, Jeffrey M Good, Bence Viola, Michael V Shunkov, Anatoli P
Derevianko, and Svante Pääbo. The complete mitochondrial dna genome of an unknown
hominin from southern siberia. Nature, 464(7290):894–897, 2010.
[81] Dmitry Krotov, Julien O. Dubuis, Thomas Gregor, and William Bialek. Morphogenesis at
criticality. Proceedings of the National Academy of Sciences, 2014.
[82] Sudhir Kumar, Michael P Suleski, Glenn J Markov, Simon Lawrence, Antonio Marco, and
Alan J Filipski. Positional conservation and amino acids shape the correct diagnosis and
population frequencies of benign and damaging personal amino acid mutations. Genome
Research, 19(9):1562–1569, 2009.
[83] K.E. Kürten. Correspondence between neural threshold networks and kauffman boolean
cellular automata. J. Phys. A, 21(11):L615, 1988.
[84] K.E. Kürten. Critical phenomena in model neural networks. Phys. Lett. A, 129(3):157 –
160, 1988.
[85] A. H. Lang, H. Li, J. J. Collins, and P. Mehta. Epigenetic landscapes explain partially reprogrammed cells and identify key reprogramming genes. ArXiv e-prints, page
arXiv:1211.3133v3, November 2012.
141

[86] Michael S Lawrence, Petar Stojanov, Craig H Mermel, James T Robinson, Levi A Garraway,
Todd R Golub, Matthew Meyerson, Stacey B Gabriel, Eric S Lander, and Gad Getz. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature, 505(7484):495–
501, 2014.
[87] Lori Layne, Elena Dimitrova, and Matthew Macauley. Nested canalyzing depth and network
stability. Bulletin of mathematical biology, 74(2):422–433, 2012.
[88] Insuk Lee, U Martin Blom, Peggy I Wang, Jung Eun Shim, and Edward M Marcotte. Prioritizing candidate disease genes by network-based boosting of genome-wide association data.
Genome Research, 21(7):1109–1121, 2011.
[89] C. Lefebvre, P. Rajbhandari, M.J. Alvarez, P. Bandaru, W.K. Lim, M. Sato, K. Wang,
P. Sumazin, M. Kustagi, B.C Bisikirska, et al. A human b-cell interactome identifies myb
and foxm1 as master regulators of proliferation in germinal centers. Mol. Syst. Biol., 6(1),
2010.
[90] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification
and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD),
1(1):2, 2007.
[91] Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and Michael W Mahoney. Statistical
properties of community structure in large social and information networks. In Proceedings
of the 17th International Conference on World Wide Web, pages 695–704. ACM, 2008.
[92] T.N. Libório, E. N. Ferreira, F. C. Aquino Xavier, D. M. Carraro, L. P. Kowalski, F. A.
Soares, and F.D. Nunes. Tgif1 splicing variant 8 is overexpressed in oral squamous cell carcinoma and is related to pathologic and clinical behavior. Oral Surg. Oral Med., 116(5):614–
625, 2013.
[93] Ryan Lister, Ronan C O’Malley, Julian Tonti-Filippini, Brian D Gregory, Charles C Berry,
A Harvey Millar, and Joseph R Ecker. Highly integrated single-base resolution maps of the
epigenome in arabidopsis. Cell, 133(3):523–536, 2008.
[94] Yang-Yu Liu, Jean-Jacques Slotine, and Albert-László Barabási. Controllability of complex
networks. Nature, 473(7346):167–173, 2011.
[95] Camilla Looman, Magnus Åbrink, Charlotta Mark, and Lars Hellman. Krab zinc finger
proteins: an analysis of the molecular mechanisms governing their increase in numbers and
complexity during evolution. Molecular Biology and Evolution, 19(12):2118–2130, 2002.
[96] N.M. Luscombe, M.M. Babu, H. Yu, M. Snyder, S.A. Teichmann, and M. Gerstein. Genomic analysis of regulatory network dynamics reveals large topological changes. Nature,
431(7006):308–312, 2004.
[97] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of
Machine Learning Research, 9(Nov):2579–2605, 2008.

142

[98] Lesley T MacNeil and Albertha JM Walhout. Gene regulatory networks and the role of
robustness and stochasticity in the control of gene expression. Genome research, 21(5):645–
657, 2011.
[99] Gerard Manning, Gregory D Plowman, Tony Hunter, and Sucha Sudarsanam. Evolution of
protein kinase signaling from yeast to man. Trends in Biochemical Sciences, 27(10):514–
520, 2002.
[100] V. Matys, E. Fricke, R. Geffers, E. Gössling, M. Haubrock, R. Hehl, K. Hornischer,
D. Karas, A. E. Kel, O. V. Kel-Margoulis, et al. Transfac: transcriptional regulation, from
patterns to profiles. Nucleic Acids Res., 31(1):374–378, 2003.
[101] Vea Matys, Ellen Fricke, R Geffers, Ellen Gößling, Martin Haubrock, R Hehl, Klaus Hornischer, Dagmar Karas, Alexander E Kel, Olga V Kel-Margoulis, et al. Transfac®: transcriptional regulation, from patterns to profiles. Nucleic acids research, 31(1):374–378,
2003.
[102] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous
activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
[103] M.-L. Mo, Z. Chen, H.-M. Zhou, H. Li, T. Hirata, D.M. Jablons, and B. He. Detection of
e2a-pbx1 fusion transcripts in human non-small-cell lung cancer. J. Exp. Clin. Canc. Res.,
32(1):29, 2013.
[104] C. Montagut and J. Settleman. Targeting the raf–mek–erk pathway in cancer therapy. Canc.
Lett., 283(2):125–134, 2009.
[105] Aslaug A Muggerud, Henrik Edgren, Maija Wolf, Kristine Kleivi, Emelyne Dejeux, Jörg
Tost, Therese Sørlie, and Olli Kallioniemi. Data integration from two microarray platforms
identifies bi-allelic genetic inactivation of ric8a in a breast cancer cell line. BMC medical
genomics, 2(1):26, 2009.
[106] Arne Müller, Robert M MacCallum, and Michael JE Sternberg. Structural characterization
of the human proteome. Genome research, 12(11):1625–1641, 2002.
[107] Patricia AJ Muller and Karen H Vousden. p53 mutations in cancer. Nature cell biology,
15(1):2–8, 2013.
[108] Javier Munoz, Teck Y Low, Yee J Kok, Angela Chin, Christian K Frese, Vanessa Ding,
Andre Choo, and Albert JR Heck. The quantitative proteomes of human-induced pluripotent
stem cells and embryonic stem cells. Molecular systems biology, 7(1), 2011.
[109] Katy A Muzikar, Nicholas G Nickols, and Peter B Dervan. Repression of dna-binding
dependent glucocorticoid receptor-mediated gene expression. Proceedings of the National
Academy of Sciences, 106(39):16598–16603, 2009.
[110] Mark Newman. Networks: An Introduction. Oxford university press, 2010.
[111] Mark EJ Newman. Assortative mixing in networks. Physical review letters, 89(20):208701,
2002.
143

[112] Mark EJ Newman. Modularity and community structure in networks. Proceedings of the
National Academy of Sciences, 103(23):8577–8582, 2006.
[113] Yoshihito Niimura. Olfactory receptor multigene family in vertebrates: from the viewpoint
of evolutionary genomics. Current Genomics, 13(2):103–114, 2012.
[114] Hidetoshi Nishimori, Tota Nakamura, and Masatoshi Shiino. Retrieval of spatio-temporal
sequence in asynchronous neural network. Physical Review A, 41(6):3346, 1990.
[115] Eric N Olson. Gene regulatory networks in the evolution and development of the heart.
Science, 313(5795):1922–1927, 2006.
[116] Edison Ong, Anthony Szedlak, Yunyi Kang, Peyton Smith, Nicholas Smith, Madison
McBride, Darren Finlay, Kristiina Vuori, James Mason, Edward D Ball, et al. A scalable
method for molecular network reconstruction identifies properties of targets and mutations
in acute myeloid leukemia. Journal of Computational Biology, 22(4):266–288, 2015.
[117] Christiane A Opitz, Michael Kulke, Mark C Leake, Ciprian Neagoe, Horst Hinssen, Roger J
Hajjar, and Wolfgang A Linke. Damped elastic recoil of the titin spring in myofibrils of
human myocardium. Proceedings of the National Academy of Sciences, 100(22):12688–
12693, 2003.
[118] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation
ranking: bringing order to the web. 1999.
[119] Gergely Palla, Illes J Farkas, Peter Pollner, Imre Derenyi, and Tamás Vicsek. Directed
network modules. New journal of physics, 9(6):186, 2007.
[120] Anoop P Patel, Itay Tirosh, John J Trombetta, Alex K Shalek, Shawn M Gillespie, Hiroaki
Wakimoto, Daniel P Cahill, Brian V Nahed, William T Curry, Robert L Martuza, et al.
Single-cell rna-seq highlights intratumoral heterogeneity in primary glioblastoma. Science,
344(6190):1396–1401, 2014.
[121] Ute Paul, Viktor Kaufman, and Barbara Drossel. Properties of attractors of canalyzing
random boolean networks. Physical Review E, 73(2):026118, 2006.
[122] Mihaela Pertea and Steven L Salzberg. Between a chicken and a grape: estimating the
number of human genes. Genome Biology, 11(5):1, 2010.
[123] P.P. Piccaluga, G. De Falco, M. Kustagi, A. Gazzola, C. Agostinelli, C. Tripodo, E. Leucci,
A. Onnis, A. Astolfi, M. R. Sapienza, et al. Gene expression analysis uncovers similarity
and differences among burkitt lymphoma subtypes. Blood, 117(13):3596–3608, 2011.
[124] M. D. Plummer and L. Lovász. Matching theory. Elsevier, 1986.
[125] E. Ravasz, A.L. Somera, D.A. Mongru, Z.N. Oltvai, and A.-L. Barabási. Hierarchical organization of modularity in metabolic networks. Science, 297(5586):1551–1555, 2002.
[126] T. Rohlf and S. Bornholdt. Self-organized criticality and adaptation in discrete dynamical
networks. In Adaptive Networks, pages 73–106. Springer, 2009.
144

[127] A. Rosenwald, G. Wright, W.C. Chan, J. M. Connors, E. Campo, R.I. Fisher, R.D. Gascoyne, H.K. Muller-Hermelink, E.B. Smeland, J. M. Giltnane, et al. The use of molecular
profiling to predict survival after chemotherapy for diffuse large-b-cell lymphoma. New
Engl. J. Med., 346(25):1937–1947, 2002.
[128] S.I. Rothschild, O. Gautschi, E.B. Haura, and F.M. Johnson. Src inhibitors in lung cancer:
current status and future directions. Clin. Lung Canc., 11(4):238–242, 2010.
[129] Maureen A Sartor, Vasudeva Mahavisno, Venkateshwar G Keshamouni, James Cavalcoli,
Zachary Wright, Alla Karnovsky, Rork Kuick, HV Jagadish, Barbara Mirel, Terry Weymouth, et al. Conceptgen: a gene set enrichment and gene set relation mapping tool. Bioinformatics, 26(4):456–463, 2010.
[130] Tatjana Sauka-Spengler, Daniel Meulemans, Matthew Jones, and Marianne Bronner-Fraser.
Ancient evolutionary origin of the neural crest gene regulatory network. Developmental
Cell, 13(3):405–420, 2007.
[131] Martin H Schaefer, Jean-Fred Fontaine, Arunachalam Vinayagam, Pablo Porras, Erich E
Wanker, and Miguel A Andrade-Navarro. Hippie: Integrating protein interaction networks
with experiment based quality scores. PloS One, 7(2):e31826, 2012.
[132] Mark Schena, Dari Shalon, Ronald W Davis, and Patrick O Brown. Quantitative monitoring
of gene expression patterns with a complementary dna microarray. Science, 270(5235):467,
1995.
[133] Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel
Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research,
13(11):2498–2504, 2003.
[134] S.S. Shen-Orr, R. Milo, S. Mangan, and U. Alon. Network motifs in the transcriptional
regulation network of escherichia coli. Nature Genet., 31(1):64–68, 2002.
[135] Ilya Shmulevich and Stuart A Kauffman. Activities and sensitivities in boolean network
models. Physical review letters, 93(4):048701, 2004.
[136] Sitabhra Sinha and Swarup Poria. Multiple dynamical time-scales in networks with hierarchically nested modular organization. arXiv preprint arXiv:1110.2906, 2011.
[137] E.D. Sontag. Mathematical control theory: deterministic finite dimensional systems, volume 6. Springer, 1998.
[138] Thomas A Steitz and Peter B Moore. Rna, the first macromolecular catalyst: the ribosome
is a ribozyme. Trends in Biochemical Sciences, 28(8):411–418, 2003.
[139] Susanna Stinson, Mark R Lackner, Alex T Adai, Nancy Yu, Hyo-Jin Kim, Carol O’Brien,
Jill Spoerke, Suchit Jhunjhunwala, Zachary Boyd, Thomas Januario, et al. Trps1 targeting
by mir-221/222 promotes the epithelial-to-mesenchymal transition in breast cancer. Science
Signaling, 4(177):ra41, 2011.
145

[140] H. Sui, I. Ernberg, and S. Kauffman. Cancer attractors: A systems view of tumors from a
gene network dynamics and developmental perspective. Sem. Cell Dev. Biol., 20(7):869 –
876, 2009.
[141] Anthony Szedlak, Giovanni Paternostro, and Carlo Piermarocchi. Control of asymmetric
hopfield networks and application to cancer attractors. PloS one, 9(8):e105842, 2014.
[142] Anthony Szedlak, Nicholas Smith, Li Liu, Giovanni Paternostro, and Carlo Piermarocchi.
Evolutionary and topological properties of genes and community structures in human gene
regulatory networks. PLOS Computational Biology, 12(6):e1005009, 2016.
[143] T. Takahashi, M.M. Nau, I. Chiba, M.J. Birrer, R.K. Rosenberg, M. Vinocour, M. Levitt,
H. Pass, A.F. Gazdar, and J.D. Minna. p53: a frequent target for genetic abnormalities in
lung cancer. Science, 246(4929):491–494, 1989.
[144] E TAUB, FLOYD, JAMES M DeLEO, and E BRAD THOMPSON. Sequential comparative hybridizations analyzed by computerized image processing can identify and quantitate
regulated rnas. DNA, 2(4):309–327, 1983.
[145] Diethard Tautz. Evolution of transcriptional regulation. Current Opinion in Genetics &
Development, 10(5):575–579, 2000.
[146] Sarah A Teichmann and M Madan Babu. Gene regulatory network growth by duplication.
Nature Genetics, 36(5):492–496, 2004.
[147] V.S. Tompkins, S.-S. Han, A. Olivier, S. Syrbu, T. Bair, A. Button, Laura Jacobus, Zebin
Wang, Samuel Lifton, Pradip Raychaudhuri, et al. Identification of candidate b-lymphoma
genes by cross-species gene expression profiling. PLoS ONE, 8(10):e76889, 2013.
[148] Cole Trapnell, Davide Cacchiarelli, Jonna Grimsby, Prapti Pokharel, Shuqiang Li, Michael
Morse, Niall J Lennon, Kenneth J Livak, Tarjei S Mikkelsen, and John L Rinn. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single
cells. Nature biotechnology, 32(4):381–386, 2014.
[149] Sonal Tuteja. Enhancement in weighted pagerank algorithm using vol. IOSR Journal of
Computer Engineering (IOSR-JCE), ISSN, pages 2278–0661, 2013.
[150] John J Tyson, Katherine C Chen, and Bela Novak. Sniffers, buzzers, toggles and blinkers:
dynamics of regulatory and signaling pathways in the cell. Current opinion in cell biology,
15(2):221–231, 2003.
[151] B.C. Valdez, A.R. Zander, G. Song, D. Murray, Y. Nieto, Y. Li, R.E. Champlin, and B.S. Andersson. Synergistic cytotoxicity of gemcitabine, clofarabine and edelfosine in lymphoma
cell lines. Blood Canc. J., 4(1):e171, 2014.
[152] Vera Van Noort, Berend Snel, and Martijn A Huynen. The yeast coexpression network
has a small-world, scale-free architecture and can be explained by a simple model. EMBO
Reports, 5(3):280–284, 2004.

146

[153] Bert Vogelstein, Nickolas Papadopoulos, Victor E Velculescu, Shibin Zhou, Luis A Diaz,
and Kenneth W Kinzler. Cancer genome landscapes. science, 339(6127):1546–1558, 2013.
[154] Klaus W Wagner, Elizabeth A Punnoose, Thomas Januario, David A Lawrence, Robert M
Pitti, Kate Lancaster, Dori Lee, Melissa von Goetz, Sharon Fong Yee, Klara Totpal, et al.
Death-receptor o-glycosylation controls tumor-cell sensitivity to the proapoptotic ligand
apo2l/trail. Nature medicine, 13(9):1070–1077, 2007.
[155] J. Walczynski, S. Lyons, N. Jones, and W. Breitwieser. Sensitisation of c-myc-induced
b-lymphoma cells to apoptosis by atf2. Oncogene, 33:1027–1036, 2013.
[156] L. Wang, S. Pal, and S. Sif. Protein arginine methyltransferase 5 suppresses the transcription
of the rb family of tumor suppressors in leukemia and lymphoma cells. Mol. Cell. Biol.,
28(20):6262–6277, 2008.
[157] Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’networks.
nature, 393(6684):440–442, 1998.
[158] Eugene P Wigner. The unreasonable effectiveness of mathematics in the natural sciences.
richard courant lecture in mathematical sciences delivered at new york university, may 11,
1959. Communications on pure and applied mathematics, 13(1):1–14, 1960.
[159] J.N. Winter, E.A. Weller, S.J. Horning, M. Krajewska, D. Variakojis, T.M. Habermann, R.I.
Fisher, P.J. Kurtin, W.R. Macon, M. Chhanabhai, et al. Prognostic significance of bcl-6
protein expression in dlbcl treated with chop or r-chop: a prospective correlative study.
Blood, 107(11):4207–4213, 2006.
[160] Gregory A Wray, Matthew W Hahn, Ehab Abouheif, James P Balhoff, Margaret Pizer,
Matthew V Rockman, and Laura A Romano. The evolution of transcriptional regulation in
eukaryotes. Molecular Biology and Evolution, 20(9):1377–1419, 2003.
[161] C.-Y. Yang, C.-H. Chang, Y.-L. Yu, T.-C. E. Lin, S.-A. Lee, C.-C. Yen, J.-M. Yang, J.-M.
Lai, Y.-R. Hong, T.-L. Tseng, K.-M. Chao, and C.-Y. F. Huang. Phosphopoint: a comprehensive human kinase interactome and phospho-protein database. Bioinformatics, 24(16):i14–
i20, 2008.
[162] Hong-Mei Zhang, Hu Chen, Wei Liu, Hui Liu, Jing Gong, Huili Wang, and An-Yuan Guo.
Animaltfdb: a comprehensive animal transcription factor database. Nucleic acids research,
40(D1):D144–D149, 2012.
[163] Y. Zhang, C. Duan, C. Bian, Y. Xiong, and J. Zhang. Steroid receptor coactivator-1: A
versatile regulator and promising therapeutic target for breast cancer. J. Steroid Biochem.,
138:17, 2013.

147