COMPUTATIONAL ANNOTATIONS OF CELL TYPE SPECIFIC TRANSCRIPTION
 FACTORS BINDING AND LONG-RANGE ENHANCER-GENE INTERACTIONS
                                       By
                                    Wenjie Qi
                              A DISSERTATION
                                   Submitted to
                          Michigan State University
                  in partial fulfillment of the requirements
                               for the degree of
              Biomedical Engineering – Doctor of Philosophy
     Computational Mathematics, Science, and Engineering – Dual Major
                                      2022


                                         ABSTRACT
Precise execution of cell-type-specific gene transcription is critical for cell differentiation
and development. The accurate lineage-specific gene regulation lies in the proper
combinatorial binding of transcription factors (TFs) to the cis-regulatory elements. TFs
bind to the proximal DNA sequences around the genes to exert control over gene
transcription. Recently, experimental studies revealed that enhancers also recruit TFs to
stimulate gene expression by forming long-range chromatin interactions, suggesting the
interplay between gene, enhancer, and TFs in the 3D space in specifying cell fates.
Identification of transcription factor binding sites (TFBSs) as well as pinpointing the long-
range chromatin interactions is pivotal for understanding the transcriptional regulatory
circuits. Experimental approaches have been developed to profile protein binding as well
as 3D genome but have their limitations. Therefore, accurate and highly scalable
computation methods are needed to comprehensively delineate the gene regulatory
landscape. Accordingly, I have developed a supervised machine learning model, TF-
wave, to predict TFBSs based on DNase-Seq data. By incorporating multi-resolutions
features generated by applying Wavelet Transform to DNase-Seq data, TF-wave can
accurately predict TFBSs at the genome-wide level in a tissue-specific way. I further
designed a matrix factorization model, EP3ICO, to jointly infer enhancer-promoter
interactions based on protein-protein interactions (PPIs) between TFs with combined
orders. Compared with existing algorithms, EP3ICO not only identifies underlying
mechanistic regulators that mediate the 3D chromatin interactions but also achieves
superior performance in predicting long-range enhancer-promoter links. In conclusion,
our models provide new computational approaches for profiling the cell-type specific TF
bindings and high-resolution chromatin interactions.


Copyright by
WENJIE QI
2022


                                 ACKNOWLEDGEMENTS
I would like to acknowledge and give the warmest thanks to my advisor and committee
chair, Professor Jianrong Wang. I have learned a lot from Dr. Wang about how to conduct
research with specific goals and directions. His guidance carried me through all the stage
of completing my projects as well as helped me to establish computational algorithm
development.
I would also like to express the appreciation to my committee members, Professor Adam
Alessio, Professor Yuehua Cui and Professor Jens Schmidt as well as Professor Sudin
Bhattacharya, for their insightful feedback in my research and professional guidance in
my career skill development.
Special thanks to Professor Adam Alessio. Dr. Alessio helped me go through the tough
stages during my graduate school. Working with him, I have not only developed my
technical skills, but also learned how to become a caring person who would always be
there and help others. I feel blessed that I have Dr. Alessio on my committee.
I would like to thank my lab members, Jiaxin Yang, Dr. Binbin Huang and Dr. Hao Wang
for their help in my research. I will remember the days we worked together.
Thanks to my friends, David Filipovic, Muneeza Amat for their listening, understanding
and mental support. Graduate school could be hard, I am more than grateful that I have
met them and have their accompany through this journey.
I want to express my deepest thanks to my parents, my brother and his family. They are
always behind me and encourage me to go further. Without their love and support, I would
never have the achievements today.
                                              iv


                                    TABLE OF CONTENTS
           INTRODUCTION ........................................................................................ 1
           PREDICTING TRANSCRIPTION FACTOR FOOTPRINT USING
WAVELET DECOMPOSED DNASE-SEQ DATA ............................................................ 4
  2.1  INTRODUCTION ............................................................................................... 4
  2.2  MATERIALS AND METHODS ........................................................................... 8
  2.3  RESULTS ........................................................................................................ 14
  2.4  DISCUSSION................................................................................................... 23
           PREDICTIVE MODELS OF GENOME-WIDE ARYL HYDROCARBON
RECEPTOR DNA BINDING REVEAL CELL SPECIFIC BINDING DETERMINANTS... 25
  3.1  INTRODUCTION ............................................................................................. 25
  3.2  MATERIALS AND METHODS ......................................................................... 29
  3.3  RESULTS ........................................................................................................ 31
  3.4  DISCUSSION................................................................................................... 37
           JOINT INFERENCE OF PROTEIN-PROTEIN INTERACTIONS AND
ENHANCER-GENE LINKS BY A MATRIX DECOMPOSITION MODEL ....................... 39
  4.1  INTRODUCTION ............................................................................................. 39
  4.2  MATERIALS AND METHODS ......................................................................... 45
  4.3  RESULTS ........................................................................................................ 59
  4.4  DISCUSSION................................................................................................... 67
           FUTURE DIRECTIONS ............................................................................ 69
BIBLIOGRAPHY............................................................................................................. 70
APPENDIX A SUPPLEMENTARY FIGURES FOR CHAPTER 2 ................................ 82
APPENDIX B SUPPLEMENTARY FIGURES FOR CHAPTER 3 ................................ 89
APPENDIX C SUPPLEMENTARY MATERIAL FOR CHAPTER 4 .............................. 90
                                                      v


                                       INTRODUCTION
Understanding the gene regulatory network is critical to gain clear insights of cellular
process as well as improve the understanding of human health. The formation of lineage-
specific gene regulatory network achieved by cell-type specific regulatory relationship
between transcription factors (TFs) and their target genes. Through the binding to cis-
regulatory elements including promoters and enhancers in the non-coding genomic
region, TFs exert control over target genes transcription activity such as enhance or inhibit
target genes expression level. On the one hand, TFs bind on promoters near their target
gene transcription start sites and directly recruit RNA polymerase or other accessary
factors to regulate gene transcription activity. On the other hand, TFs also bind to distal
enhancers that could be 1Mb away from target genes’ core promoters to regulate their
transcription activity. The regulatory input from enhancers is achieved by the proper
folding of chromatins that brings regulatory elements close to promoters in a three-
dimensional space. Therefore, identification of TFs binding sites (TFBSs) is the first step
for constructing the cell-type specific gene regulatory networks. Moreover, pinpointing the
long-range enhancer-promoter interactions is also critical for delineating the gene
regulatory network in the 3D space.
Experimental technologies including ChIP-Seq, ChIP-exo, and ChIP-nexus have been
developed to detect TFs bindings at genome-wide level in different cell types. However,
all of the ChIP-based methods are limited by the quality of antibodies. In addition, given
that there are approximate 1600 TFs in the human genome, profiling every TFs in every
cell line at every stage is currently not experimentally feasible. Experimental approaches
                                              1


have also been developed to profile the chromatin conformation across cell types. For
example, Hi-C and Capture-C can profile long-range chromatin contacts but it has high
false positive rates and lacks tissue variability. ChIA-PET can detect chromatin
interactions at high resolution. However, ChIA-PET can only detect chromatin interactions
facilitated by a specific protein and therefore has high false negative rates. Overall,
experimental methods used for detecting TFBSs as well as chromatin structures are cost-
expensive and not able to identify underlying mechanisms that determine transcription
factor binding or facilitate long-range chromatin interactions. Therefore, accurate and
highly scalable computational methods are needed to address the problems.
Diverse genomic datasets have been profiled by high throughput sequencing
technologies and collected by large data consortia such as ENCODE, Roadmap. All of
the previous efforts make it promising to develop computational methods that integrate
multi-omics data to predict TFBSs and infer long-range enhancer-promoter interactions.
Here, we developed two main computational models, TF-wave and EP3ICO to predict
cell-type specific TFBSs and long-range enhancer-promoter interactions respectively. TF-
wave uses a Gradient Boosting Tree model to predict based on DNase-Seq data. By
applying Wavelet Transform to DNase-Seq data to extract multi-resolution features as
input, TF-wave predicts cell-type TFBSs at genome-wide level accurately. TF-wave can
also be applied to distinguish different TFs binding accurately. Moreover, by using local
chromatin accessibility information, TF-wave can predict TF bind probability at single-
nucleotide level, which opens a new avenue in predicting TF footprinting at high-
resolution. As a case study, we have developed an XGBoost model that takes multi-omics
data to predict Aryl Hydrocarbon Receptor’s (AHR) binding sites in multiple tissues.
                                             2


Determinants of AHR binding are identified by extracting the important features from the
XGBoost model. To detect long-range enhancer-promoter interactions, we have
developed EP3ICO, which applies matrix factorization models to predict long-range
enhancer-promoter interactions based on protein-protein interactions (PPIs) between TFs.
By considering second-order PPIs between TFs, EP3ICO has boosted accuracy for
enhancer-promoter interactions that cannot be reconstructed by first-order PPIs. EP3ICO
can be further applied to infer higher-order PPIs between TFs that regulate chromatin
interactions. Dispute that predicting long-range enhancer-promoter interactions is
challenging, EP3ICO has achieved superior performance when compared with existing
methods. EP3ICO can identify PPIs between TFs as mechanistic regulators that mediate
3D chromatin interactions. The predicted long-range enhancer-promoter links are also
enriched with cis-eQTLs, providing insights of unrevealing how genetic variants affect
gene regulation, ultimately leading to different phenotypes.
                                              3


       PREDICTING TRANSCRIPTION FACTOR FOOTPRINT USING WAVELET
                            DECOMPOSED DNASE-SEQ DATA
This chapter is adapted from our in-preparation work: Qi W., Yang J., Wang J.. Predicting
transcription factor footprint using Wavelet decomposed DNase-seq data.
2.1     INTRODUCTION
Identification of site-specific transcription factors (TF) binding at cis-regulatory elements
is the key for elucidating regulatory mechanisms underlying transcriptional process and
disease progress1–4. Characterizing TF binding sites (TFBS) across the entire genome is
a monumental task2. Recently, high-throughput sequencing-based methods such as
chromatin immunoprecipitation followed by DNA-sequencing (ChIP-Seq) are widely used
to profile and detect TF binding sites at the genome-wide level5. However, ChIP-Seq has
low resolution and is cost-expensive6. Newer experimental methods such as ChIP-exo
and ChIP-nexus have been developed to detect DNA-binding TFs with higher resolution
and cost efficiency7,8. However, all of the ChIP-based techniques have several
shortcomings. Firstly, ChIP-based methods require high-quality antibodies to pull down
TFs and therefore are limited to TFs that have high-quality antibodies. Secondly, ChIP-
based methods cannot be applied to distinguish TF bindings between primary binding
and secondary binding2,6,9. Lastly, ChIP-based experiments can only characterize one TF
per experiment. Given that there are approximate 1600 TFs in the human genome, it is
currently not experimentally feasible to profile every TF in every tissue/cell line at different
stages.
                                                4


With the advent of high-throughput sequencing technologies, another experimental assay
that identifies DNase I Hypersensitive Sites (DHS) has been developed to detect the open
regions of chromatin10–12. In vivo binding of TFs shields bound DNA elements from the
DNase I’s attack. After deep-sequencing, the chromatin accessibility data as well as the
TF-specific DNase-I protection profiles at single-nucleotide resolution are collected in the
open chromatin region. A majority of TFBSs can be identified at single-nucleotide
resolution by detecting a footprint-like region with low DNase I cutting frequency13.
By taking advantage of the booming DNase-Seq data, computational methods have been
developed to predict TFBS by investigating the TF footprint patterns in open chromatin
regions. The computational footprinting methods can be grouped into two main categories,
including motif-centric models (BinDNase, CENTIPEDE, FLR, DeFCoM, PIQ) and de-
novo models (Wellington, TRACE, DNase2TF)2,13–19.
On the one hand, motif-centric methods usually scan the genome to generate candidate
binding sites for TF of interest and predict TF-specific binding activity at the candidate
sites. BinDNase takes up- and down-stream base pair resolution DNase-Seq data (100bp)
around the motifs as features and select the discriminatory features by a backward greedy
method. BinDNase then deploys supervised logistic regression model to predict TFBSs
based on the selected features. CENTIPEDE used position weight matrices (PWM) to
scan the genome and collect all positions with substantial sequence similarity with the TF
motif as the candidate sites. Then CENTIPEDE applied an unsupervised Bayesian
mixture model to integrate DNase-Seq data, DNA sequence data, and histone
modification data to infer which candidate site is likely to be bound by the TF. FLR detects
active TFBSs by a mixture of multinomial models. Firstly, FLR first learns the mixture of
                                              5


two multinomial distributions, representing TF footprints and background by the
expectation maximization algorithm. Secondly, footprints are scored by log-odds ration or
the footprint-versus-background model. One main advantage of FLR is that it uses a small
window size around the motif, i.e. ±25bp, which makes it a more targeted approach for
detecting footprints for TF of interest. DeFCoM extracts both local and global DNase-Seq
features by using different size of segments around the motifs. The DeFCoM applies
kernel SVM model with either a linear kernel or a radical kernel to integrate DNase-Seq
features in a non-linear way to predict TFBSs at their motif sites. PIQ uses Gaussian
process to model and smooth the footprint signals around candidate motif sites. Then PIQ
estimates the footprints with an expectation propagation algorithm. Finally, PIQ selects
the set of motifs whose footprints are distinguishable from the noise as the final
predictions.
On the other hand, the de-novo methods do not require the pre-generated candidate
binding sites. De-novo methods detect TFBSs in the open chromatin region based on the
DNase I digestion pattern. Wellington19 is a sliding window approach that detects TF
footprints based on the binomial test. For a given candidate footprint site, Wellington tests
the hypothesis that there are less reads within the footprint than its flanking regions. The
major novelty of Wellington is that it tests each strand independently as it considers that
different strands have different effect in inhibiting DNase I activity. However, Wellington
is unable to detect the binding sites for TFs of interest specifically. DNase2TF detects
footprints by calculating the significance of DNase I cut depletion around the motif region
based on a binomial Z score. Then DNase2TF interactively merge the candidate
footprints sites to identify the regions that produces the most significant depletion regions.
                                                6


TRACE uses a Hidden Markov Model (HMM)20,21 to predict footprints and label binding
sites for desired TFs by integrating PWMs and DNase-Seq Data in the open chromatin2.
However, most de-novo methods were not designed to predict TF-specific footprinting
and cannot label binding sites for TFs of interest.
Assigning TFs to their footprints on the basis of matching consensus sequences enables
the analysis of TF-mediated gene regulatory networks6,22. Experimental studies have
revealed that different TFs have different TF footprint shapes13,17. However, previous
computational footprinting methods mainly focus on binary prediction, i.e. if TFs will bind
on a given candidate motif or an accessible open region. Yet, existing methods haven’t
explored the different TFs footprint shape information and lack the ability to assign
different TFs to their footprinted regions when there exist multiple candidate regions.
Feature extraction methods such as short-time Fourier transform (STFT) and Discrete
Wavelet Transform have been widely applied in spectrum signal processing23–27. Wavelet
Transform applies a low pass filter, i.e. scaling and a high pass filter to a signal, i.e.
wavelet transforms and decomposes the original signals into signals with low resolution
and high resolution respectively. The low-resolution signals, which are the approximate
coefficients, represent a summary of the original signals. On the other hand, the high-
resolution signals, which are the detailed coefficients represent the fine details in the
original signals23,28. Both low-resolution coefficients and high-resolution coefficients are
half sizes of the original signals. Multi-resolution signals are obtained by repeating the
rounds of scaling and wavelet transform process to the low-resolution signals.
Inspired by the different TF footprint patterns and the successful feature extraction
application of Wavelet Transform, we develop a supervised machine-learning model, TF-
                                              7


wave, which applies Gradient Boosted trees to utilizes different frequency features
decomposed by Discrete Wavelet Transform (DWT) from DNase-Seq data to predict
specific TF bindings. By applying Wavelet decomposition to the DNase-Seq signal, TF-
wave considers the low and high-frequency signals contained in the spectrum in different
TF’s footprints and can accurately infer specific TF binding sites as well as distinguish
different TFs’ binding sites. Furthermore, we introduce a convolutional neuron network
(CNN) model that can predict TF footprint probability at single-nucleotide level based on
the local chromatin accessibility. Our results provide a framework for both binary and
nucleotide-level footprinting model of TFs could be applied as the first step in analyzing
TF-focused gene regulatory networks.
2.2     MATERIALS AND METHODS
2.2.1 Datasets and data generation
TF ChIP-Seq data in BED format and DNase-Seq data in BAM and BED format in K562
and GM12878 were collected from ENCODE. FRiP score for the TF ChIP-Seq files was
calculated to quantify the quality of ChIP-Seq experiments by R Libray ChIPQCsample29.
The ChIP-Seq Bed file with the highest FRiP score is kept for TFs with multiple datasets.
To filter out TFs ChIP-Seq data with low quality, TFs with ChIP-Seq FRiP score < 5 were
removed. Overall, 46 TFs in K562 and 19 TFs in GM12878 with ChIP-Seq FRiP score =>
5 were kept for the subsequent analysis in each cell line. TFs’ motifs are downloaded
from JASPAR30. For each TF, candidate motif sites were identified by MOODS31 with
threshold p < 0.0001 in open chromatin regions obtained from DNase-Seq BED files.
Motif sites under TF ChIP-Seq peaks are labeled as positive samples, while motif sites
under the DNase-Seq peak only are labeled as negative samples. Motifs located on both
                                             8


forward and reverse strands are collected. Using the motif midpoint as the center, we
extended each motif site to upstream 100bp and downstream 100bp to obtain the local
chromatin accessibility from corresponding DNase-Seq BAM files around the TF motifs
for the following analysis.
2.2.2 Data processing
The DNase-Seq read depth at each nucleotide for each motif site was obtained by
SAMtools32 based on the DNase-Seq BAM file. DNase-Seq read-depth at each base pair
for 200 bp was extracted by 5’-3’ orientation corresponding to the orientation of the TF
motif on the forward strand naturally. However, for the motifs located in the
complementary strand, we extracted the DNase-Seq read-depth at each nucleotide from
3’-5’ with respect to the orientation of the motifs. Then we inverted the DNase-Seq read-
depth vector for motifs on the complimentary strand to have the 5’-3’ direction. DNase-
Seq read-depth vectors on leading strands and inverted DNase-Seq read-depth vectors
on complementary strands were used to construct the feature matrix at 5’-3’ orientation.
2.2.3 Discrete Wavelet Transform and Gradient Boosted Trees model
To extract features with different frequencies contained in the TF footprints as well as
motifs sites in open chromatin regions without being footprinted by TFs, the DNase read-
depth data is first normalized by Z-Score normalization. Then we applied the Biorthogonal
wavelet to decompose the 200bp normalized DNase-Seq vector into approximate
coefficients and detailed coefficients with multiresolution. The approximate coefficients
and detailed coefficients are then concatenated together as features to train the Gradient
Boosting Trees (GBT). GBT is a fast and effective tree-boosting algorithm for
                                              9


classification and regression33. GBT model learns the data by adding additional trees into
the model to minimize the loss function of previous trees. Similar to the tree-based method,
GBT can learn the non-linear interactions between different features and is proved to
have high classification performance.
2.2.4 Model training and evaluation
In order to train the GBT model in TF-wave to obtain high accuracy, we tuned two kinds
of parameters during the training process. Firstly, previous studies have observed that
different TFs have different footprint shapes13,22, a single Wavelet Transform model with
one decomposition level is unable to capture the most informative features for all TFs.
Therefore, all of the 14 Biorthogonal wavelets model in pywt packages34 were used to
decompose the DNase-Seq. The parameter of decomposition level is set up to six. The
decomposed signals of a particular wavelet model with a specific decomposition level
were used to construct a feature matrix and train the model. Secondly, the
hyperparameters of GBT including the number of trees including [100, 500, 1000, 2000]
and learning rate including [0.05, 0.1, 0.5] were tuned during the training process. 5-fold
cross-validation with AUPR metric was used to evaluate the model performance. The
best-performing model with the highest AUPR is selected and it consists of features
constructed by a specific Biorthogonal wavelet model and the corresponding
decomposition level. Then the best-performing model is applied to predict TFBSs at
genome-wide level and in cell-specific way.
                                             10


2.2.5 Models performance comparison
Due to the limited access to other model, we can only compare the performance of our
model with Wellington (http://jpiper.github.io/pyDNase). We used the open chromatin
regions containing TFs motifs as evaluation data. Specifically, candidate binding sites
(motif sites) that overlap with DNase-seq peaks and ChIP-seq were used as the positive
set, and candidate sites that are under DNase-seq peaks but not under ChIP-seq peaks
are used as the negative set. We applied Wellington to DNase-seq peaks (with 100-bp
flanking regions to each side) containing the same sets of motif sites that were included
in our model and We compared our model with Wellington19. The fdrlimit in the output of
Wellington was set to 0 to have predictions for all input DNase-seq peaks. The absolute
values were used as scores in the evaluation. Only the predictions overlapping with motif
sites of tested TFs were included in the evaluation on the same dataset.
2.2.6 Distinguish different TFs footprints
TFs have unique footprint shapes as it is shown in Figure 2.1. To investigate if different
TFBSs is distinguishable by their DNase I footprinting shapes, we trained GBT models to
predict different TFs’ binding sites based on the Wavelet decomposed DNase-Seq signals.
Specifically, given two TFs, TF A and TF B, the DNase-Seq signal of these two TFs’
binding sites was collected and decomposed by Wavelet as input features for the GBT
model using the methods described above. Then the binding sites of TF A are labeled to
be 1 as the positive data and the binding sites of TF B are labeled to be 0 as the negative
data for training the GBT model. GBT models were then trained and evaluated by 5-fold
cross-validation. We also used original DNase-Seq signals to train the GBT model as
                                             11


baseline models and compared it with the model trained by Wavelet decomposed DNase-
Seq signals using AUCs on the same datasets.
2.2.7 Consensus footprint generation
For each TF, the trained GBT was used to predict TFBS for each candidate motif site. To
obtain the consensus footprints for each TF, the 200bp DNse-Seq read-depth of the top
5% predicted binding sites were used to generate the TF’s consensus footprints. The
consensus footprints were presented by aggregating DNase-Seq read-depth at each
nucleotide. The examples were obtained by the UCSC genome browser35.
Figure 2.1. TFs leave footprints to DNase-Seq signal. Four examples of DNase-Seq
footprints left by TF ZNF143. From the top to bottom for each panel: observed ZNF143
binding site (green) flanking the motif (yellow), predicted ZNF143 binding site by TF-wave
(red), DNase-Seq read depth of 200 base pairs flanking the ZNF143 motif (blue). ZNF143
ChIP-Seq signal from bigwig files (green). The orientation of the DNase-Seq data is
indicated by 5’-3’ or 3’-5’.
                                              12


2.2.8 Single-nucleotide level footprint probability prediction
Mutations in TFs motif results in the gain of function or loss of function binding, which will
lead to the cis-regulatory evolution22. Therefore, it is critical to determine the binding
probability at each nucleotide for a TF to understand how the variants influence the gene
regulatory networks mediated by the TF binding. In order to investigate the TF binding
probability, we trained a CNN model to predict the TF binding at each nucleotide centering
the TF motifs based on the 2kb flanking regions DNase-Seq signals. CNN has proven
highly effective in a number of diverse tasks including biological sequence analysis36–38.
The input layer of CNN has concatenated Wavelet decomposed DNase-Seq signals.
Following the input layer, there are three convolutional layers. In each convolutional layer,
batch normalization was applied to center the data. The kernel width is 3, and stride size
is 1. Then Relu activation function was used to the filter output. Finally, Maxpool was
applied to pool adjacent values by taking the maximum in a small window to reduce the
dimension of input for the next layer (Figure 2.7a).
Figure 2.2. TF-wave predicts TF binding sites based on wavelet-transformed
DNase-Seq signal. Left: schematic figure of binding of CTCF in open chromatin leaves
                                             13


                                   Figure 2.2 (cont’d)
footprint on DNase-Seq read-depth, centering around the CTCF motif, shown on the top.
Right: schematic figure of applying Wavelet Transform to DNase-Seq to predict TF
binding sites using gradient boosting tree model. Wavelet Transform is applied to
decompose DNase-Seq signal. The extracted approximate coefficients (cA! ) and detailed
coefficients (dA! ) are concatenated as input features to train the Gradient Boosting Tree
models (GBT), the trained GBT is applied to predict TF binding status for a candidate
motif in the open chromatin.
2.3     RESULTS
2.3.1 TF-wave predicts TFBS based on Wavelet decomposed DNase-Seq signals
The binding of TFs protects DNA from DNase I digestion and therefore leaves the TFs
footprints on the DNase-Seq signal (Figure 2.1). The footprints of different TFs are
distinguishable and contain unique features for specific TFs. We applied a Discrete
Wavelet Transform to extract the multi-resolution features underlying the DNase-Seq
signals centering the TF motifs. Gradient Boosting Tree model were applied to predict TF
binding probability at candidate motif sites by using the multi-resolution features (Figure.
2.2).
By using 5-fold cross-validation with auROC and auPR to evaluate the model’s
performance, we demonstrated that Wavelet decomposed DNase-Seq signal can predict
with-in TF binding sites at the genome-wide level accurately.
2.3.2 TF-wave accurately predicts TFBS at the genome-wide level
To assess the performance of TF-wave and Wellington which also uses DNase-Seq data
to predict TFBS, we evaluated the performance of TF-wave and Wellington on the same
datasets. To ensure a fair comparison, for a TF of interest, we used the same DNase-
                                             14


Seq datasets containing this TF’s motif as input data to predict its binding sites. The area
under the receiver operating characteristic curve and the area under the precision-recall
Figure 2.3. TF-wave predicts TF binding sites accurately and outperforms
Wellington in K562. TF-wave and Wellington were evaluated on the same datasets
using the averaged performance of 5-fold cross-validation. (a,b) auPRC of ZNF143 and
CTCF in K562 for TF-Wave and Wellington. (c,d) auROC of ZNF143 and CTCF in K562
for TF-Wave and Wellington. (e) Performance comparison in K562 for TF-Wave and
Wellington using auPR. The x-axis and y-axis are auPR of applying Wellington and TF-
                                            15


                                     Figure 2.3 (cont’d)
wave to predict TFBS on the same data. Each point represents a TF, points above the
diagonal line indicate TF-wave performs better than Wellington. (f) Performance
comparison in K562 for TF-Wave and Wellington using auROC. The x-axis and y-axis are
auROC of applying Wellington and TF-wave to predict TFBS on the same data. Each
point represents a TF, points above the diagonal line indicate TF-wave performs better
than Wellington.
curve was calculated for TF-wave and Wellington based on the predicted binding
probability and scores respectively. 5-fold cross-validation was used and the averaged
auROC and auPRC were reported to evaluate the final prediction performance of TF-
wave and Wellington. For all the TFs evaluated in the analysis, TF-wave achieves higher
performances in identifying the TF binding sites from the candidate motif sites than
Wellington in both K562 and GM12878 cell lines (Figure 2.3, Figure A.1). Examples of
observed CTCF binding sites and TF-wave accurately predicted CTCF binding sites are
shown in Figure 2.4.
2.3.3 Applying Wavelet Transform to DNase-Seq signal provides boosted
prediction performance
To justify that multi-resolution features extracted by Discrete Wavelet Transform indeed
contribute to the TFBSs prediction accuracy of TF-wave, we also trained the Gradient
Booting Trees model on the Z-score normalized original DNase-Seq data as the baseline
model and compared TF-wave with the baseline model on the same datasets. 5-fold
cross-validation was used to evaluate TF-wave and the baseline model’s performances.
Averaged auROC and auPR were reported as the final results. We observed that TF-
wave has higher auROC and auPR scores than the baseline model in both K562 and
GM12878 cell lines (Figure A.2). The results demonstrated that TF-wave achieves
                                              16


boosted performances for all TF binding sites predictions by using the multi-resolution
features extracted from Wavelet decomposed DNase-Seq signal, suggesting the
performance improvement of TF-wave is induced by applying Wavelet transformation
DNase-Seq data.
Figure 2.4. TF-wave identify TFBS accurately. Examples of observed CTCF binding
sites at K562 chr1: 15000000-2000000. From top to bottom: observed CTCF binding sites
(green), predicted CTCF binding sites by TF-wave (red), observed DNase-Seq indicating
the open chromatin (blue). One correctly predicted CTCF binding site is highlighted by
red rectangle and the CTCF footprinted DNase-Seq signal is shown below. Blue: CTCF
footprinted DNase-Seq signal centering the CTCF motif(yellow). The orientation of the
DNase-Seq read-depth data is 5’-3’. The CTCF ChIP-Seq track is shown at the bottom
(green).
2.3.4 Predicted TF binding sites reveal the TF footprint shapes
In order to investigate if the predicted TF binding can reveal the different footprint shapes
based on the DNase-Seq data, we ranked the predicted TF binding sites based on the
GBT provided binding probability and selected the top 5% predicted binding sites to
reconstruct the TF consensus footprints. Our results indicated that different TFs have
different footprint shapes as shown in Figure 2.5 and Figure A.3. Yet, the TFs footprint
                                              17


shapes are consistent with the pattern that there is one small DNase-Seq peak at each
shoulder of the motif, indicating the high cleavage of DNase I at the accessible chromatin
regions, and low cleavage of DNase I at the motif sites that are protected by TFs bindings.
On the other hand, DNase-Seq read counts are relatively low at the centering motif region,
suggesting the decreased DNase I cleavage rate at the motif sequences due to the
protection of TF binding.
Figure 2.5. Different TFs have different consensus footprint shapes. (a) Aggregated
                                              18


                                     Figure 2.5 (cont’d)
DNase-Seq read-depth for top 5% predictions for ATF2, CTCF, MAFF, REST. Each TF
has a unique footprint shape. The motif of ATF2 is palindromic, and the footprint of ATF2
is symmetric. (b) Examples of DNase-Seq read-depth for ATF2 and REST. The footprint
shape of the specific ATF2 binding site and REST binding site is similar to the aggregated
ATF2 DNase-Seq and REST DNase-Seq respectively. While the footprint shapes of
ATF2 and REST are different.
2.3.5 Using Wavelet decomposed DNase-Seq data can distinguish different TFs
binding
TFs bind at the accessible chromatin and different TFs leave different footprint shapes.
Examples are shown in Figure 2.5. To test if different DNase-Seq signals centering the
TFs footprinted region could distinguish different TFs binding, we applied Wavelet
Transform to REST and ATF2 footprinted DNase-Seq data to extract features at different
frequencies. The features at multi-resolution were used to train a Gradient Boosting Trees
model to distinguish REST and ATF2 binding in the open chromatin (see details in the
method section). The resulting AUCs are then compared to a baseline model that was
trained on the original REST and ATF2 footprinted DNase-Seq data. Figure 2.6 shows
that using Wavelet decomposed DNase-Seq signals can successfully distinguish the
binding of REST and ATF2 and has higher accuracy (auROC 0.75 and auPR 0.65) than
using DNase-Seq data only (auROC 0.72 and auPR 0.61). As it is shown in Figure. 2.6b,
Figure. 2.6c, even though both REST and ATF2 are in the open chromatin region, the
different footprinted DNase-Seq data decomposed by Wavelet have the capacity to
distinguish their binding sites precisely.
                                             19


Figure 2.6. Applying wavelet transformation to DNase-Seq signal can distinguish
TF binding sites accurately. (a) ROCs and PRC for predictions of REST-ATF2 binding
sites. ROCs for predictions by applying Wavelet transform to DNase-Seq signals (Red).
GBT is applied to DNase-Seq directly to predict TF binding sites as a baseline model
(Green). (b, c) Examples of correctly predicted REST and ATF2 binding sites. From top
to bottom: predicted ATF2 binding site, observed ATF2 binding site. Predicted REST
binding sites, observed REST binding site. DNase-Seq read-depth at each motif site,
DNase-Seq track in K562.
                                           20


2.3.6 Local chromatin accessibility can predict TF binding at single-nucleotide
level accurately
To investigate if local chromatin accessibility provides information for predicting TF
binding probability at the single-nucleotide level, we obtained more DNase-Seq data
including ±1000bp from the summit of the TF ChIP-Seq peaks as features, then we
trained a CNN model to learn the Wavelet decomposed DNase-Seq data and predict TF
binding probability at each nucleotide for ±25bp sequences flanking the summit of TF
ChIP-Seq peaks containing the TF motif. Using CTCF ChIP-Seq data in K562 as an
example data, we obtained multi-resolution features extracted by applying the Wavelet
Transform to the K562 DNase-Seq data. The input layer for the CNN is the concatenated
Wavelet decomposed DNase-Seq data. Subsequent convolutional layers use Maxpool
and Relu activation functions (see methods). The final layer outputs 51 predictions for the
probability of each nucleotide in the DNA sequence is being footprinted by the CTCF in
K562.
To synthesize sensitivity and specificity, we assessed the CNN model using auROC. As
the data is imbalanced, auPR is also used to measure the CNN model’s performance.
Using K562 CTCF ChIP-Seq as gold-standard, the CNN model achieves high accuracy
with auROC 0.74 and auPR 0.51 (Figure. 2.7a). For the CTCF motifs within ±25bp of the
ChIP-Seq summit, the CNN model predicts high TF footprinted probability (> 0.5) for each
nucleotide in the motifs, while for the flanking sequences around the motif, the predicted
footprinted probability (< 0.5) is low (Figure 2.7b). These results suggested that local
chromatin accessibilities can be applied to predict TF footprint probability at single-
nucleotide level.
                                              21


Figure 2.7. Schematic figure of applying wavelet transformation to DNase-Seq
signals and using CNN model to predict TF binding at single-nucleotide levels. (a)
The DNase-Seq signals are extracted and decomposed by wavelet transformation. The
approximate coefficients and detailed coefficients are concatenated as the input layer for
training the CNN model. The CNN model has three convolutional layers. In each
convolutional layer, batch normalization was applied, followed by Relu activation, and
Maxpool function. Finally, the output layer was the predicted footprinted probability for
each nucleotide around the CTCF motif. (b) AUCs for CTCF prediction at the single
nucleotide level. (c) Examples of predicted CTCF binding at every single nucleotide. From
bottom to top: the observed CTCF track, the DNA sequences containing the CTCF motif
(violet-red), and the predicted TF footprint probability for each nucleotide in the DNA
sequence. The predicted footprint probability is high for the motif sequences.
                                            22


2.4     DISCUSSION
Identifying TFBSs is the first-step for understanding the cell-specific gene regulatory
network. In this study, we introduced TF-wave, a supervised Gradient Boosting Trees
model that uses Wavelet decomposed DNase-Seq data to predict TF binding at candidate
motif sites in a cell-specific way. We implement TF-wave in K562 and GM12878 cell lines.
TF-wave demonstrates higher accuracy in TFBS prediction than other available
computational footprinting method, Wellington, which also uses DNase-Seq data as the
main feature. Moreover, we demonstrate that by applying Wavelet Transform to DNase-
Seq data to extract multi-resolution features underlying the TF footprints as the input
features, TF-wave achieved higher prediction performance than models trained on
original DNase-Seq data.
Although the binding of TF in open chromatin protects DNA sequences from the cleavage
of DNase I cleavage and therefore leaves footprints, different TFs yet have different
footprint shapes. Existing computational footprinting methods focus on binary prediction
of TFBS, i.e. if TFs bind or not for a given candidate motif or accessible chromatin region.
They lack the capacity to distinguish different TFs binding in the open chromatin region.
In this work, by using the Wavelet decomposed DNase-Seq signals, we demonstrate that
TF-wave could accurately distinguish different TFs bindings when their motifs are in the
open chromatin regions. This workflow provides the first step in understanding how TFs
compete in the binding at the same accessible region as well as how TFs cooperate the
binding.
DNase-Seq can profile TF footprint at high resolution by counting the read at 5’ end.
However, previous TFBSs footprinting methods mainly focus predict binary binding
                                              23


activity for a given DNA sequence. By taking advantage of the high-resolution DNase-
Seq data, we developed the CNN model that can predict TF footprint probability at the
single-nucleotide level. CNN model has been widely applied to biomedical filed and
proved to be successful in multi-task prediction. The results show that trained CNN model
can accurately predict TF binding at the motif site at single-nucleotide level. This single-
nucleotide footprinting strategy enables the understanding of how genetic variants located
in the TF footprints regions will result in the gain of function or loss of function TF binding
as well as elucidate how variants influence the gene regulatory network. Finally, this
workflow can open the revenue not only for identifying functional mutations but also for
understanding the underlying mechanism for phenotypic variations.
                                               24


   PREDICTIVE MODELS OF GENOME-WIDE ARYL HYDROCARBON RECEPTOR
         DNA BINDING REVEAL CELL SPECIFIC BINDING DETERMINANTS
This chapter is adapted from our in-preparation work: Filipovic D., Qi W., Kana O., Marri
D., LeCluyse E., Andersen M., Cuddapah S., Bhattacharya S.. Predictive Models of
Genome-Wide Aryl Hydrocarbon Receptor DNA Binding Reveal Tissue Specific Binding
Determinants.
Section 3.3.2 in this chapter is adapted from previously published work (Desmet N. et al,
2021):   Desmet N., Dhusia K., Qi W., Doseff A., Bhattacharya S., Gilad A. (2021)
Bioengineering of Genetically Encoded Gene Promoter Repressed by the Flavonoid
Apigenin for Constructing Intracellular Sensor for Molecular Events. Biosensors.
3.1    INTRODUCTION
The Aryl Hydrocarbon Receptor (AHR) is a ligand-activated transcription factor that
belongs to the basic-helix-loop-helix (bHLH) PER-ARNT-SIM (PAS) family39–42. The AHR
mediates    toxic    actions   of   environmental   contaminants,    such   as   2,3,7,8-
tetrachlorodibenzo-ρ-dioxin (TCDD)43,44. Before binding to ligands and being activated,
the AHR is sequestered in the cytoplasm by its chaperone proteins including a dimer of
the 90-kDa heat shock protein (HSP90), the AHR-interacting protein (AIP) and the
cochaperone p2345–48. When activated by its ligands, the AHR translocates to the nucleus
and forms a heterodimer with Aryl Hydrocarbon Nuclear Translocator (ARNT)49–51. The
AHR-ARNT heterodimer binds to specific DNA sequences termed Aryl Hydrocarbon
Response Elements (AHRE), Dioxin Response Elements (DREs), or xenobiotic response
                                            25


elements (XREs) containing a consensus motif, 5’-GCGTG-3’ to mediate genes
transcription activity.
Through the binding to cis-regulatory elements including gene promoters and distal
enhancers, the AHR regulates a variety of target genes including cytochrome p450 1A1
(CYP1A1), CYP1B1, and AHR Repressor. Identifying AHR binding sites is the first step
for constructing its gene regulatory network and understanding its gene regulatory circuits,
which is crucial for understanding the role of AHR in toxicity and disease, as well as its
role in physiological functions such as immune response52,53, circadian rhythm54,55 cell
cycle progression54,56, and embryonic development57,58. High throughput sequencing
technologies such as Chromatin Immunoprecipitation and Sequencing (ChIP-Seq), ChIP-
exo, and ChIP-nexus have enabled the profiling of transcription factors binding sites at a
genome-wide level7,8,59. However, these experimental approaches are limited by several
constraints including cost, time, or biological materials such as high-quality antibodies60–
62
   . AHR binding has been profiled in MCF-7 by ChIP-Seq experiment53. However, the
determinants for the tissue specificity of AHR binding remain poorly understood.
In recent years, a lot of pioneer efforts have been made to develop computational
approaches to predict the TF binding sites. One of the most commonly used models to
infer TF binding is the position weight matrix (PWM). PWM is derived from experimentally
validated DNA sequences bound by a particular TF and quantitively describes the binding
sites of the TF. The representation of a particular TF PWM is its motif, which can be
readily obtained from databases such as JASPAR, HOCOMOCO, TRABSFAC, or
estimated de novo30,63–66. For a potential binding site, the PWM generates a quantitative
score by adding up individual scores of each nucleotide making up the PWM and
                                             26


overlapping the potential binding site. PWM is commonly used to scan the genome for
candidate TF binding sites by a predefined threshold score67–70. The PWMs are often
derived from in vitro experiments, including high throughput systematic evolution of
ligands by exponential enrichment (HT-SELEX)69 , and in vivo experiments such as ChIP-
seq. However, many TFs exhibit high levels of in vivo binding to DNA sequences that do
not possess the in-vitro or even the in-vivo derived binding motif.
On the other hand, TFs in eukaryotes generally do not bind DNA in isolation but rather in
dense with other co-binding factors, which contributes to the tissue-specific binding
activity. TF clusters with binding sites of multiple TFs co-occurring in close proximity71,72.
Consequently, PWMs of co-bound TFs could potentially be used to predict the binding of
a TF of interest. However, models incorporating PWMs of co-binding TFs have shown
limited utility in improving model performance70. Nevertheless, given that TFs bind in
clusters and that PWMs are not necessarily representative of actual TF binding, ChIP-
seq signals of co-bound TFs, as a measure of their actual binding, are likely to provide
information that cannot be obtained from PWM. In addition, interpretable machine
learning models incorporating measures of co-bound TFs could provide mechanistic
insights into the determinants of tissue-specific binding for a TF of interest, such as AHR.
Computational models have been developed to predict TF binding sites by integrating
multi-omics data. Pique-Regi R et al applied Bayesian mixture models that integrate PWM,
chromatin accessibility, and histone modifications to predict TF binding sites in open
chromatin region73. TFBSImpute imputes missing TF binding directly from ChIP-Seq data
by using a three-mode tensor74, where the three dimensions are the TFs, cell lines and
genome locations respectively. Virtual ChIP-Seq makes the TF binding prediction by
                                             27


leveraging the associations between gene expression and TF bindings70. Deep learning
models are also applied to predict TF binding in specific tissue based on DNA sequence
and chromatin accessibility60,74,75. However, even though some of these models achieve
high prediction performance for some TFs in both intra- and inter-cell lines, they generally
lack interpretability and provide little mechanistic insight into what drives tissue-specific
or cell-type specific TF binding. This limitation is especially acute for TFs exhibiting highly
variable binding across tissues or cell lines. In addition, most computational models of TF
binding to date have been developed for and tested on constitutively active TFs. The
binding of inducible TFs, such as nuclear receptors or other ligand-activated TFs like the
AHR, remains largely unexplored.
In this study, we applied a Gradient Boosting Tree model, XGBoost76, to develop
supervised machine learning models predicting the AHR binding status of DREs in open
chromatin, i.e., DRE is bound or unbound, MCF-7 cell. Using AHR ChIP-Seq data derived
from TCDD-treated MCF-7 cell line and corresponding chromatin accessibility
experiments (DNase-Seq) in MCF-7 downloaded from ENCODE, we first detected tissue-
specific AHR-bound and AHR-unbound DREs in open chromatin of each tissue. Then,
we applied XGBoost to integrate multi-omics data to predict the binding status of DREs
in open chromatin in MCF-7. Our results demonstrate highly accurate and robust models
of within-tissue binding prediction. We identify several TFs as predictive features of AHR
binding in individual tissues, such as GATA3 as well as histone modifications (HMs) –
H3K4me1 and H3K4me3 in MCF-7 cells. Our tissue-specific models generalize well to
the prediction of AHR binding sites without DREs, demonstrating the robustness of the
models. In conclusion, we demonstrate that the patterns of TFs and HMs most predictive
                                               28


of AHR binding. Moreover, we show that AHR binding is driven by a complex interplay of
tissue-agnostic DNA sequences flanking the DRE and tissue-specific local chromatin
context. The approach used here can be further adapted to other inducible TFs, such as
steroid hormones and nuclear receptors.
3.2    MATERIALS AND METHODS
3.2.1 Identification of AHR bound and unbound DREs
To obtain the AHR bound or unbound DREs, all DREs in the human genome according
to the reference sequence of the hg19 human assembly were collected based on the
AHR motif firstly. Secondly, DNase-Seq experiment data for MCF-7 cell line was
downloaded from ENCODE https://encodeproject.org/. The BroadPeak DNase-seq files
for the hg19 genome assembly were used to mark the open chromatin. If there were
multiple replicates, the intersection of all replicates was used for downstream analysis.
Any DRE found under the peaks of DNase-seq intersection was considered to be in the
open chromatin of the corresponding cell line and was used in the determination of bound
and unbound DREs for the purposes of model training. DREs occurring in blacklisted
regions were ignored in subsequent analyses. Thirdly, the AHR ChIP-seq file was
downloaded from GEO (GEO: GSE90550)77, where the original sequencing files have
been processed uniformly following a standard processing pipeline. DREs in open
chromatin as well as under the AHR ChIP-Seq peaks are assigned to be the AHR bound
DREs and used as the positive data in the machine learning model. DREs in open
chromatin but outside of the AHR ChIP-Seq peaks are assigned to be the AHR unbound
DREs.
                                              29


3.2.2 Genomic and epigenetic features generation
For each DRE in the human genome, the genomic sequence of seven nucleotides 5' and
3' flanking the DRE (5'-GCGTG-3') were obtained. These nucleotides were then one-hot
encoded and used as genomic features for the machine learning models. There were
around 1.6 million DREs located in the entire the human genome. While only a small
fraction of DREs have fulfilled the criteria for bound and unbound DREs used in the model
training and testing.
The epigenetic signals profiled by DNase-Seq, ChIP-Seq bigwig files were downloaded
from ENCODE. The epigenetic signals extracted from bigwig files were then used as
features in the models training and testing. Specifically, for each bound and unbound
DRE, the epigenetic signal of 740 base pairs up and downstream from the DRE were
extracted from the bigwig files, for a total of 1495 base pairs of signals (including the 5-
base pairs DREs). The extracted signals were split into 15 genomic bins and each bin
contains 99 base pairs. The epigenetic signals within each bin were averaged to generate
15 features corresponding to the particular DRE-genomic signal combination. During the
averaging, any areas of missing signal are replaced with zeros.
3.2.3 XGBoost model training
For all the bound and unbound DREs appearing in open chromatin of that particular cell
line, we have constructed the features matrix including genomic sequence, epigenetic
features for all available DNase-Seq, histone mark as well as the co-binding activity of
transcription factors. In order to tune the parameters of the XGBoost model, grid search
was performed with the following of the hyperparameter space: max_depth = {3, 4, 5, 6,
7}, min_child_weight = {3, 4, 5, 6, 7}, subsample = {1.0, 0.9, 0.8, 0.7}, colsample_by_tree
                                               30


= {1.0, 0.9, 0.8, 0.7} and eta = {0.05, 0.075, 0.1, 0.125, 0.15, 0.2, 0.3}. The average
performances over all five folds were used to select best performing models in terms of
hyperparameter selection.
3.2.4 XGBoost model evaluation
Area under Receiver Operating Characteristic (auROC) as well as area under Precision
Recall curve (auPRC) was used to evaluate the performance of the model. The scores
are the output of the XGBoost algorithm in the form of probabilities of each particular
observation (DRE) belonging to a particular output class (bound or unbound). By using
different thresholds for these probabilities above which the model predicts a DRE as
bound, we obtain the number of true and false positives for each threshold, as well as
true and false negatives relative to the ground truth of DRE binding obtained from the
corresponding AHR ChIP-Seq experiment. Each threshold produces a point on the ROC
and PRC curves; the area under the curve was calculated using a line interpolated
through all the points.
3.3     RESULTS
3.3.1 Machine learning models accurately predict AHR binding
In order to detect AHR binding sites, we applied XGBoost to integrate multi-omics data to
predict the binding status of DREs in open chromatin. We trained the models on DREs
occurring under singleton, i.e., 1-DRE, peaks only. These DREs represent about one third
of all AHR peaks, across all binding experiments. To avoid noise in the training data,
multi-DRE peaks were not used for training since it is impossible to determine which
specific DRE among the cluster of DREs under the ChIP-Seq peak were indeed bound
                                            31


by AHR and can be used as positive labels for training the model. In the validation, multi-
DRE peaks were used for validating the performance of the model. We developed all our
machine learning models using the gradient boosted tree algorithm of the XGBoost family
of algorithms, which has been shown to handle non-linear data well78,79. These algorithms
also supply metrics of feature importance, i.e., the contribution of individual input features
to improving the model performance.
Local chromatin features were used as input features for the models and are trained and
validated to predict the binding status of singleton bound and isolated unbound DREs
(see Methods), limited to DREs found in open chromatin. Model evaluation is performed
using a 5-fold cross validation procedure (see Methods for details) (Figure 3.1). The local
                                                                                 AhR
                                 …                                        DNA
        Local chromatin                                                          DRE
                               +
           features                                      DRE binding
                                                                               OR
                        1 1 0 0 0 0 0 1
                                                           prediction
                        0 0 1 1 0 0 1 0
                                                                          DNA
                        0 0 0 0 1 0 0 0                                          DRE
                        0 0 0 0 0 1 0 0
                                        XGBoost 5-fold
                                        cross-validation
Figure 3.1. Machine learning models predicting AhR binding learn tissue specific
and agnostic rules. Schematic figure of XGBoost architecture. The input features
include epigenetic features and DNA sequence (method). XGBoost uses the local
chromatin features to predict the binding status of DRE in open chromatin.
chromatin input features that were used include 1) The DNA sequence immediately
flanking the DRE. We included the flanking sequence of up to 7 nucleotides directly up-
and down- stream from the DRE–previously proposed to be involved in AHR binding.
These sequences were one-hot encoded prior to being used as model inputs; 2) Binned
mean values of bigWig signals of the MCF-7 cell line. We used the bigWig files of i)
                                                  32


DNase-seq (as representative of chromatin accessibility), ii) histone modification, and iii)
transcription factor ChIP-seq signals from ENCODE – see methods for details. For each
bigWig signal and each DRE, we created 15 bins of width 99 base pairs. Each bin was
assigned a value representing the average bigWig signal across the width of that bin. The
mid-point of the central bin was positioned at the middle nucleotide of the 5-bp DRE; 3)
Indicator variables of whether the DRE is found in a strict (+/- 200 bp away from a
transcription start site - TSS) or loose (+/- 1500 bp away from the TSS) definition of a
promoter.
In order to achieve accurate prediction performance, an extensive hyperparameter search
were conducted for each binding experiment and input feature set (see Methods) and the
model with the highest performance were selected. In all instances, unless otherwise
stated, model performance was reported as the area under the Receiver Operating
Characteristic (ROC) and Precision Recall (PR), averaged over five folds using the 5-fold
cross validation procedure. For class imbalanced datasets such as the ones used here,
where unbound DREs far outnumber the bound (Supplementary Table 1), the area under
the PR curve (auPRC) is considered a more appropriate metric of model performance.
Therefore, the model producing the highest auPRC was selected as the best performing
model. However, the area under the ROC curve (auROC) was still a useful metric to
distinguish between poorly and well performing models when comparing between binding
experiments (see Methods).
To better understand the tissue-specific determinants of AHR binding beyond the core
DRE motif and chromatin accessibility, we developed a series of interpretable classifiers
that use different combination of features to determine the binary binding status of DREs
                                             33


in open chromatin. The model performance was investigated as a function of the input
feature set. Different feature sets consisting of the following features were - 1) DNase-
seq only, 2) flanking sequence only, 3) flanking sequence and DNase-seq, 4) flanking
sequence, DNase-seq and histone modifications, 5) flanking sequence, DNase-seq,
histone modifications and transcription factor binding (referred to as the full model).
(Figure 3.2)
Figure 3.2. Performance of models predicting the binding status of DREs in open
chromatin of the MCF-7 cells, with five different sets of input features. Performance
of each set of features is represented as a mean line with a 95% confidence interval
shaded around the line resulting from 5-fold cross-validation. The legend shows the
features used, as well as area under the curve. Both receiver operating characteristic
(ROC) - left panel, and precision-recall curves (PRC) – right panel, are shown.
XGBoost was extracted to determine, for each binding experiment, the total feature
importance of non-sequence (i.e., TF binding and epigenomic) features (Figure 3.3A),
relative feature importance of sequence features per flanking sequence nucleotide
position (Figure 3.3B), and relative importance of individual bins of non-sequence
features (Figure 3.3C). Further examination of non-sequence feature importance scores
                                             34


revealed that the specific models are predominantly learning and making AHR-DRE
binding predictions by relying on different features across different binding experiments.
For example, within each binding experiment we observed three to six bigWig signals
with feature importance 2-5 times higher than that of any other signal. These were in
H3K4me1, H3K4me3, GATA3, CTCF, ZNF217 and FOXA1.
Figure 3.3. Feature importance of all local chromatin context features excluding
DNA sequence flanking the DRE, measured as feature importance gain in XGBoost
classifier model trained on a particular cell type. A Feature importance for each
chromatin context feature is calculated as the average feature importance of all bins for
that particular chromatin context feature. B Feature importance of DNA sequence flanking
the DRE, measured as feature importance gain in XGBoost classifier model trained on a
particular cell line or type. Feature importance of each nucleotide type at a particular
position relative to the DRE is normalized to the nucleotide type with the highest feature
importance at that nucleotide position. C Feature importance of all local chromatin context
features excluding DNA sequence flanking the DRE in models predicting the DRE binding
status of all DREs in open chromatin. Feature importance measured as feature
importance gain in XGBoost classifier model trained on a particular cell line or type.
Feature importance for each bin of a particular chromatin context feature is normalized to
the bin with the highest feature importance for that chromatin context feature.
The importance of DNA sequence immediately flanking the DRE was evaluated via
examining the importance scores of nucleotides flanking the DRE produced by two ways
including 1) flanking sequence-only models, and 2) full models (inclusive of flanking
                                             35


sequence, DNase-seq, histone mark and transcription factor features). And the
importance of every nucleotide is shown in Figure 3.3B
3.3.2 Sequence-only AHR model can identify the Optimal Promoter Sequence for
designing gene circuits
As a ligand-inducible transcription factor, the AHR could be activated by its ligand
including flavonoids, and bind to DNA to turn on its target genes expression. Therefore,
the AHR system could be used to design reporter genes or gene circuits controlled by
AHR ligands.
Figure 3.4. Western blot analysis of AHR expression in HEK293FT (1) and Hela (2).
In order to find the promoter with valid AHR binding sites, the XGBoost model trained
based DNA sequence only was applied to predict the optimal promoter sequence for
designing a reporter gene in response to flavonoids since the gene circuits don’t contain
any epigenetic features. Specifically, the XGBoost model was trained on the AHR binding
sites found in human breast cancer MCF-7 cells. The trained model was then applied to
predict the binding probabilities for the 10 base pair segments containing the AHR core
motif (5’-GCGTG-3’) are displayed in Table 1. Screening of a 2Kb promoter region
upstream of the CYP1A1 transcription start site revealed 5 locations with putative binding
probability of the AHR/ARNT complex greater than 0.5, suggesting that the complex
                                             36


bound to these locations would initiate transcription of the reporter gene. Western blot
confirmed that the reporter gene successfully expressed the AHR protein (Figure 3.4)
3.4    DISCUSSION
The binding of transcription factors (TFs) to DNA has been extensively studied by
experimental studies, computational prediction models have been developed for TF
binding imputation. However, the determinants and mechanisms underlying tissue-
specific TF binding are still not well understood. In addition, unlike many constitutively
active TFs, tissue-specific binding of ligand-inducible TFs such as AHR cannot be fully
determined through chromatin accessibility, the extended binding motif, the motifs of
other co-bound TFs, or any combination of these features.
PWM has been applied to scan the genome to identify candidate AHR binding sites,
however, this method has high-false positive rate and cannot give us insights of
determinants of AHR binding. In order to accurately identify AHR binding sites as well as
as well as gain insights of mechanisms that mediate AHR binding, we developed, to the
best of our knowledge, the first machine learning models that integrate multi-omics data
to predict cell-specific and tissue-specific AHR binding sites. The machine learning model,
XGBoost has demonstrated robust prediction performance across different tissue.
Moreover, XGBoost can integrate predictive features in a non-linear way, and the
important features can be extracted based on the Gradient Boosting tree structures,
which provides us the probability of deciphering factors that determine AHR bindings.
Furthermore, XGBoost trained on DNA sequence only has high scalability, and can be
applied to predict AHR binding for building synthetic genes.
                                              37


One of the limitations of predicting ligand-activated TFs is inconsistency in the data. In
this study, the AHR ChIP-Seq is derived from TCDD treated MCF-7 cell line, while the
epigenetics data used for training the machine-learning model collected from untreated
cell lines. The treatment of TCDD might change the epigenetic environment in the
genome, which results in the inconsistency of data used to generate features and labels
for the model. Therefore, for future studies, comprehensive profiling of ligand-activated
transcription factors is required not only to provide more biological information but also
improve the predictive model’s performance designed specifically for such transcription
factors.
                                             38


   JOINT INFERENCE OF PROTEIN-PROTEIN INTERACTIONS AND ENHANCER-
                GENE LINKS BY A MATRIX DECOMPOSITION MODEL
This chapter is adapted from our in-preparation work: Qi W., Yang J., Wang H., Wang J..
Joint Inference of Protein-Protein Interactions and Enhancer-Gene Links by A Matrix
Decomposition Model.
4.1     INTRODUCTION
One of the fundamental questions in human biology is how one genome sequence can
give rise to so many different cell fates. The answer to this question lies in the accurate
execution of cell-type-specific gene transcription during cell differentiation and
development80–84. Such lineage-specific regulation of gene transcription requires core
promoters and proximal elements that locate around the genes’ transcription start sites.
Recent experimental studies have revealed that, in addition to the proximal regulation
elements, distal cis-regulatory elements, e.g. enhancers, also appear to be as major
contributors in regulation of gene transcription activity85–87. Far away from their target
genes as the enhancers could be in the genome, they can stimulate the gene transcription
through the formation of chromatin loops to that bring the enhancers close to their target
gene promoters in the 3D space88–91.
Proper folding of chromatin is critical for gene regulation. The alterations in chromatin
structure could lead to developmental abnormalities or human disease88–91. For example,
disruptions including deletions, inversions, and duplications cause the TAD boundaries
spanning EPHA4 brachydactyly, F syndrome, and polysyndactyly in humans. Further 4C
analyses in a CRISPR-Cas9 edited mouse model revealed that a cluster of limb-specific
                                             39


enhancers associated with EPHA4 was misplaced and abnormally activated neighboring
genes including Pax3, Wnt6, and Ihh92. In addition to developmental disorders, mutations
in chromatin structures such as enhancer adoption or enhancer hijacking by oncogenes
could also lead to tumorigenesis91,93–96. For instance, variations are frequently found at
cohesin and CTCF binding sites and, perturbing the TAD boundaries in non-malignant
cells, which upregulates the proto-oncogenes97. A study using 7,416 cancer genomes
across 26 tumor types also revealed that even one single duplication of the genome could
result in a formation of a new chromatin domain and lead to the overexpression of the
IGF2 gene in colorectal cancer98. Therefore, delineating the genome in the 3D space is
critical to expanding our understanding of cell development and disease progression.
Experimental methods have been developed to profile long-range chromatin interactions.
For instance, chromatin conformation capture (3C) and 3C-derived techniques (4C and
5C) perform high-throughput sequencing to examine spatial topology at a regional scale
in the genome99–103. Moreover, Hi-C was later developed to profiling chromatin
interactions at a genome-wide level103. A similar method, Capture Hi-C104 was developed
to further improve the resolution of C based method. Chromatin interaction analysis with
paired-end-tag sequencing (ChIA-PET) can detect cell-type specific long-range
chromatin interactions at high resolution by targeting one protein out of interest105.
Furthermore, CRISPR-dCase9-based techniques were developed to unbiasedly capture
long-range DNA interactions as well as identify locus-specific chromatin-regulating
protein complexes106,107. All of these technologies have provided large-scale chromatin
contact maps in a diversity of cell types and tissues in the human genome as well as
model species108. Imagining-based methods such as Fluorescence in situ hybridization
                                            40


of DNA (DNA-FISH)109 use fluorescently labeled probes to hybridize to their
complementary target loci within the nucleus. The chromatin contacts are therefore
inferred by setting an arbitrary threshold, which is usually 50 nm to 1 μm110 , to the spatial
distance based on the scale of genomic distances between the regions of interest and
the resolution of the microscope.
However, even though the experimental methods have generated comprehensive
references of the long-range chromatin interactions maps for a number of cell types and
tissue in both humans and model species108,111–114, there are limitations on in-depth
profiling of enhancer-promoter interaction in a cell-type specific way. First, the resolution
of existing C-methods including Hi-C and Capture Hi-C is usually 5kb-40kb, which is still
relatively low and wherefore hard to precisely pinpoint specific enhancers that regulate
the promoters104,108. Second, the chromatin interactions detected by Hi-C lack tissue
variability and therefore have low specificity and too many false positive discoveries115,116.
Third, mapping resolution is directly related to sequencing depth for Hi-C assays, hence,
the cost of laboratories seriously hinders the popularity of Hi-C. Fourth, although ChIA-
PET and CRISPR-dCas9 have higher resolutions and can detect cell-type specific
interactions, ChIA-PET relies on the antibodies to specific proteins such as CTCF,
RAD21114, CRISPR-dCas9 can only generate locus-specific long-range interactions.
Therefore, these two methods can only provide a subset of long-range chromatin
interaction, which lead to high false negative discoveries. Imagining-based methods can
detect chromatin contacts at all scales of chromosome folding, including contacts
between chromosomes, while DNA-FISH remains to be limited to pre-selected genomic
                                              41


regions and often used to validate findings, herein, not widely applied at the genome-wide
level110.
Due to the limitations of experimental technologies, computational methods have been
developed to predict cell-type and tissue-type specific long-range enhancer-promoter
interactions by integrating multi-omics signatures, including genomics, transcriptomics,
and epigenomics117–120. There are in general two categories of computational methods of
predicting chromatin interactions, e.g. unsupervised and supervised machine learning
models. Unsupervised machine learning methods usually predict chromatin interactions
by assigning each enhancer-promoter pair a score, ranking the pair based on the scores,
and selecting the top-ranked pairs as the predicted interacting enhancer-promoter
pairs121,122. Supervised models predict enhancer-promoter interactions by incorporating
multi-omics signature features. The features include i) genomic sequences and genomic
distance between enhancer-promoter pair, ii) gene expression activity profiled by RNA-
Seq, iii) enhancer activity profiled by epigenetics signals such as H3K4me1, DNase-Seq.
iv) epigenetic features at the genomic window region between enhancer and promoter.
By integrating these or some of these features, supervised machine-learning models are
trained on labels generated by experimental technologies and predict enhancer-promoter
interactions   on   unknown      data.   Two   top-performing   methods     IM-PET123  and
TargetFinder124 were developed to infer long-range enhancer gene links. IM-PET
predicts EP pairs by training a Random Forest model that takes four kinds of features 1)
correlation between enhancer and promoter activity. 2) The correlation between the
expression of transcription factor binding on enhancers and the promoters’ activity. 3) the
coevolution between the enhancer and its target promoter. 4) genomic distance between
                                              42


enhancer and promoter. TargetFinder employs a gradient boosting tree model and
integrates hundreds of genomic features including 1) chromatin accessibility, 2)
methylation status of DNA, 3) gene expression, 4)ChIP-Seq signal of TFs, architectural
proteins, modified histones, 5) quantified signal as well as the genomic distance between
enhancer and promoter. 6) conserved synteny of the enhancer and promoter, 7) similarity
of TF and target gene annotations. Despite the supervised machine learning models have
                                                   125
better performance than unsupervised models            , these models suffer from overfitting
problems due to the large feature dimension 126. In addition, these models didn’t provide
insights into the mechanisms that mediate long-range enhancer-promoter interactions127.
In addition to the predictive features used in previous studies as mentioned above, recent
experimental studies revealed that protein-protein interactions (PPIs) between TFs also
have been identified to participate in the formation of long-range chromatin interactions
and herein facilitate interactions between distal enhancers with their target genes128–
131
    . The most well-known example is the DNA loop extrusion formed by the cohesin
complex, where cohesin loads onto chromatin, leading to the formation and enlargement
of DNA loops that are eventually arrested at boundary elements such as CTCF132–134. In
addition, instead of being an insulator to maintain the TAD boundary, intra-TAD CTCF
binding, together with the cohesin complex, is also reported to stabilize the enhancer-
promoter interactions and maintain robust gene expression. Deleting the CTCF binding
sites compromises the interactions between enhancers and promoters135,136. Moreover,
The zinc-finger transcription factor Yin Yang 1 (YY1) has also been identified as a
structural regulator of enhancer-promoter loops. YY1 could form dimers to promote DNA
                                              43


interactions by binding to active enhancers and promoters identified by chromatin
immunoprecipitation with mass spectrometry (ChIP-MS)131,137.
Furthermore, experimental studies have reported that TFs bind together to secure long-
range enhancer-promoter interactions. For example, the zinc-finger protein, ZNF143, is
also demonstrated to be involved in CTCF-mediated chromatin interaction loops by
overlapping with cohesin binding sites and CTCF binding sites. Deletion of the ZNF143
binding sites is reported to result in the loss of CTCF-mediated chromatin loops. Together,
these observations implicate that ZNF143 functions as a partner of CTCF to establish
and stabilize the chromatin structures by cooperating with the cohesin complex138–142.
These observations from previous experimental work establish a mechanistic hypothesis
that PPIs between specific TFs may mediate long-range enhancer regulation. Therefore,
integrating PPIs between TFs as a new set of features into the computational models is
expected to improve the prediction accuracy of long-range enhancer-promoter links. By
interpreting the computational model to extract the most predicted PPIs, we can gain
insights of cell/tissue-type specific mechanisms that regulate long-range enhancer-gene
links. Furthermore, by considering TF PPIs with combined orders, the computational
models are expected to achieve boosted accuracy in predicting long-range enhancer-
gene links.
In this study, we developed a matrix factorization model to infer enhancer-promoter
interactions based on TF PPIs with combined order (EP3ICO). EP3ICO jointly predicts
long-range enhancer-promoter interactions and optimizes first-order TF-TF interactions
that regulate enhancer-promoter links. Prioritized first-order TF-TF interactions can be
applied to accurately to predict enhancer-promoter links. For the enhancer-promoter pairs
                                                44


that cannot be reconstructed by the first-order TF-TF interactions, EP3ICO further
decomposes these enhancer-promoter pairs and identify second-order PPIs between
TFs that facilitate the long-range interaction between enhancer-promoter pairs that
cannot be explained by the first-order TF PPIs only. By using TAD-split cross-validation
and controlling confounding factors including genomic distance between enhancer and
promoter, we demonstrated the superior performance of EP3ICO compared to existing
models. EP3ICO provides a workflow to prioritize novel multi-order PPIs between TFs that
regulate long-range enhancer-promoter interactions, making it possible to understand the
underlying mechanisms that organize the 3D chromosome.
4.2     MATERIALS AND METHODS
4.2.1 Datasets
EP3ICO takes the continuous chromatin contact frequency matrices derived from Hi-C
data as input. High-resolution Hi-C data from K562 was downloaded from GEO (GEO:
GSE63525)103,104,108. Knight-Ruiz normalization (KR) was first applied to remove the
biases such as library size, fragment length, GC content, sequence mappability, copy
number variations, and other unknown factors143,144. ChIA-PET in K562 was used as
ground truth to evaluate the model performance111. Enhancer-promoter pairs were
labeled as positive samples if they overlap with ChIA-PET interactions. Otherwise, they
will be labeled as negative samples.
Gene promoters are defined as up-stream and down-stream 1kb regions of the gene
transcriptional start sites (TSS). The TSS data is obtained based on the annotation from
GENCODE v17145. Gene expressions in K562 were measured by RPKM values of the
RNA-seq dataset from the Roadmap Epigenomics project70. Enhancer coordinates were
                                            45


obtained based on ENCODE and Roadmap enhancer annotations. Enhancer activities in
K562 were quantified by the DNase-Seq signals146,147. In order to have high-activity
enhancers and promoters, enhancers or promoters that have activity < 0.05 in K562 are
removed. Enhancers that locate on the same Hi-C anchors with promoters are also
filtered out. Correlation coefficients between enhancer-gene pairs were calculated based
on the enhancer activity and gene expression across 111 cell types.
The ChIP-Seq IDR narrow peaks datasets for TF binding in K562 were downloaded from
ENCODE148. The ChIP-Seq experiments with treatments were removed to maintain the
consistency with the cell-specific Hi-C and ChIA-PET data used in this work. If multiple
datasets exit for one TF, ChIP-Seq file with the highest FRiP score149 is selected first,
then ChIP-Seq datasets with FRiP score < 5 are further filtered out to maintain the high
quality of ChIP-Seq data. The significant peaks identified by MACS2 were used to label
the TF binding sites in the genome. Overall, 114 TFs in K562 were used for constructing
the TF-TF interaction matrix for the following analysis.
Protein-protein interactions are collected from STRING dataset v11150. To obtain the high-
quality PPIs, only PPIs with the confidence score greater than 100 in the ‘Experiments’
were used to for the following analysis.
4.2.2 Generation of data matrices
EP3ICO applies matrix factorization models to optimize first-order PPIs and second-order
PPIs between TFs that regulate long-range enhancer-promoter interactions. Five kinds
of matrices containing genomic information are required for the matrix factorization model:
1) intra-TAD enhancer-promoter interaction frequency matrix, 2) TF binding on enhancer
matrix and TF binding on promoter matrix, 3) first-order TF-TF interaction matrix and
                                              46


second-order TF-TF interaction matrix, 4) TF binding correlation matrices, 5) genomic
distance between enhancer-promoter pairs.
Hi-C data is used to construct the enhancer-promoter contact frequency matrix108. For
each enhancer-promoter pair, the enhancer is overlapped with one Hi-C anchor, the
promoter is overlapped with the other Hi-C anchor, KR normalized Hi-C contacts between
these two anchors were used as the contact frequency between the enhancer-promoter
pair. Previous studies have reported that inter-TAD Hi-C interactions' data quality is
substantially reduced compared with intra-TAD interactions151,152. Therefore, only intra-
TAD enhancer-promoter pairs were used to construct the enhancer-promoter pair contact
matrix.
ChIA-PET interactions were used as golden-standard to evaluate the model153.
Specifically, for each intra-TAD enhancer-promoter pair where the enhancer overlaps with
one ChIA-PET fragment and the promoter overlaps with the other ChIA-PET fragment,
this enhancer-promoter pair will be assigned with 1 if there exists validated ChIA-PET
interaction between these two fragments. Otherwise, the enhancer-promoter pair will be
assigned 0. The binary matrix was then used as a label to evaluate model performance.
To construct the binary TF binding on the enhancer matrix as well as TF binding on the
promoter matrix, we overlapped the TF ChIP-Seq peak coordinates with enhancers’ or
promoters’ coordinates, respectively. If the TF peak overlaps with enhancer coordinates,
the TF-enhancer entry value is labeled to be 1, otherwise, 0. Similarly, if the TF peak
overlaps with promoter coordinates, the TF-promoter entry value is labeled to be 1,
otherwise, 0.
                                             47


The first-order TF PPI matrix was constructed by labeling the TF-TF links according to
the PPIs data, i.e. for each TF-TF pair, if there exist high-quality PPIs between them, then
the value between these two TFs is labeled as 1, otherwise 0. All TFs with available ChIP-
Seq data from ENCODE that meet the criteria described above were used to construct
the first-order TF PPI matrix.
To obtain the second-order TF PPI matrix, the correlation coefficient between two TFs
was firstly calculated based on their binding activities on enhancer and promoter
respectively. Only correlation coefficients that are greater than 0.6 were used to make the
E, and P matrices. Correspondingly, experimental validated PPIs between TF-TF pairs
with a correlation greater than 0.6 on both enhancer binding and promoter binding sites
were used in the second-order PPIs matrix.
4.2.3 Cluster topological associated domains
It is observed that TFs have high binding to enhancers or promoters in certain domains
while low bindings in others. To reduce the variances caused by the TF binding,
hierarchical clustering with the walds method were applied to cluster the TAD based on
the TFs binding in promoters and enhancers. Specifically, for each TF, the proportion of
enhancers or promoters this TF bind to was calculated, and a vector is used to represent
the TFs binding abundance on promoters and enhancers for each domain. Ward
minimum variance method is applied for the clustering154. Four clusters were identified so
that there are enough training data (> 100 TADs) and similarity is maintained within each
cluster.
                                               48


4.2.4 Remove genomic distance variance from Hi-C data
To remove the distance variance of Hi-C data, we first fit a linear model between
enhancer-promoter contacts and their genomic distance. 𝒀 = 𝜸𝑫 + 𝜸𝟎 , where 𝒀 is the
enhancer-promoter contacts matrix, 𝑫 is the genomic distance between enhancers and
promoters. 𝜸 is solved by fitting a linear model. Residuals 𝑹𝟏 between the reconstructed
enhancer-promoter interaction frequency using genomic distance and observed
enhancer-promoter interaction frequency is calculated: 𝑹𝟏 = 𝒀 − (𝜸𝑫 + 𝜸𝟎 ). 𝑹𝟏 is used
as input for EP3ICO to optimize the TF-TF interactions that regulate enhancer-promoter
interactions and predict enhancer-promoter links (Figure 4.1D).
Figure 4.1. EP3ICO infers long-range enhancer-promoter interaction based on TF-
TF PPI features. A The Enhancer-promoter links are mediated by multi-order PPIs. Left,
                                             49


                                     Figure 4.1 (cont’d)
enhancer-promoter interactions are regulated by first order PPIs between enhancer-
binding TFs (TF1, yellow) and promoter-binding TFs (TF2, beige). Right, enhancer-
promoter interactions are regulated by second-order PPIs. The co-binding of TFs on
promoters (TF1 and TF3, purple) or enhancers (TF2 and TF3, dark olive-green) facilitate
the second-order PPIs between TF2 and TF3 and TF1 and TF4 respectively. The second-
order PPIs regulate long-range enhancer-promoter links. B Examples of second-order
PPI between IKZF1-CTCF-RAD21 mediated Hi-C interactions. CTCF and RAD21 have
first-order PPIs, the co-binding of IKZF1 and CTCF on promoters provides second-order
PPIs between IKZF1 and RAD21. The second-order PPI between IKZF1-CTCF-RAD21
regulates interactions between ZMYND10 and enhancer located 100kb away. C
Examples of second-order PPI between CTCF-YY1-JUND mediated Hi-C interactions.
YY1 and JUND has first-order PPIs, the co-binding of YY1 and CTCF on enhancer
provides second-order PPIs between CTCF and JUND. The second-order PPI between
CTCF-YY1-JUND regulates interactions between SRP90 and enhancer located 200kb
away. D The flexible framework of EP3ICO for jointly inferring multiple orders of PPIs that
mediate chromatin loops and predicting long-range enhancer-promoter links using
prioritized TF-TF PPIs. Bottom: a matrix optimization model that prioritizes first-order TF-
TF PPIs that regulate enhancer-promoter interactions, 𝑋 is the first-order PPI
matrix. Middle: a matrix optimization model that prioritizes second-order TF-TF PPIs that
regulate enhancer-promoter interactions, 𝑋′ is the second-order PPI matrix. Top: EP3ICO
can be further applied to identify higher-order PPIs that mediate chromatin interactions.
4.2.5 Matrix decomposition model is applied to optimize first-order TF-TF
interactions
EP3ICO applies the model decomposition model 𝑹𝟏 = 𝑨𝑿𝑩 to jointly infer the first-order
TF-TF interactions that regulate long-range enhancer-promoter links as well as predict
long-range enhancer-promoter interactions (Figure 4.1D), where 𝑨 is the TF-enhancer
matrix, 𝑩 is the TF-promoter matrix, and 𝑿 is the TF-TF interaction matrix that need to be
optimized. Biologically, only a subset of TF PPIs mediates the long-range enhancer-
promoter interactions. Therefore, the optimized TF-TF interaction matrix is expected to
be a sparse matrix. To make the optimized matrix be sparse, L1-regularization is added
to the loss function to increase the sparsity in the optimized PPI matrix as well as avoid
over-fitting of the models 𝑳(𝑿) = ||𝑹𝟏 − 𝑨𝑿𝑩||𝟐𝑭 + 𝝀|𝑿|. The gradient is derived for the
                                             50


loss function and is used in the gradient descent method to minimize the loss function.
We aim to identify TF-TF interactions from experimentally validated PPIs, thus, only
entries with experimentally validated TF-TF interactions will be updated during the
optimization progress, and remaining entries will be kept zero. In order to quickly achieve
the global optimum, the Barzilai-Borwein method is used to calculate the step size at each
iteration in the optimization process155.
4.2.6 Matrix decomposition model to optimize second-order TF-TF interactions
The first-order TF-TF interaction matrix is optimized in each cluster. Then the prioritized
TF-TF interactions are applied to reconstruct long-range enhancer-promoter contact
frequency together the genomic distance model in each TAD within the cluster. The
residuals between observed and reconstructed enhancer-promoter interaction frequency
is calculated: 𝑹𝟐 = 𝒀 − (𝜸𝑫 + 𝜸𝟎 ) − 𝑨𝑿𝑩. 𝑹𝟐 is used to optimize the second-order TF-
TF interactions 𝑿& : 𝑹𝟐 = 𝑨(𝑬𝑿& + 𝑿& 𝑷)𝑩, where 𝑿& is the second-order TF-TF interaction
matrix. Specifically, enhancer-promoter pairs with the top as well as bottom 5% residuals
are considered the ones that cannot be solely explained by the first-order TF-TF
interactions and are therefore used to prioritize the second-order PPIs. Similarly, the L1-
regularization is applied in the matrix factorization model to increase the sparsity and
                                                      𝟐
avoid over-fitting 𝑳(𝑿′) = ;|𝑹𝟐 − ( 𝑨(𝑬𝑿& + 𝑿& 𝑷)𝑩)|;𝑭 + 𝜷|𝑿′|.
4.2.7 Model evaluation and performance comparison
Area under the Receiver Operating Characteristic curve (AUROC) and area under the
precision-recall curve (AUPR) are used to evaluate both first-order TF-TF interactions and
second-order TF-TF interactions model’s performance by 5-fold cross-validation. In each
                                            51


cluster, TADs were randomly split into training and testing sets. EP3ICO was trained on
the training TADs and was applied to predict EP pairs in the testing TADs.
Figure 4.2. Performance comparison in K562 cells. EP3ICO, ProTECT, TargetFinder,
                                           52


                                   Figure 4.2 (cont’d)
and IM-PET are applied on the same input datasets and are evaluated based on the
averaged performance of 5-fold cross-validation. As a baseline comparison, enhancer-
gene activity correlations are also included in the analysis. A ROC curves and B PR
curves of first-order PPIs performance in K562 cells. C ROC curves and D PR curves of
second-order PPIs performance in K562 cells. E, F Examples of enhancer-promoter
interactions predicted by EP3ICO second-order PPIs. In each example, the highlighted
enhancer (red) is predicted to interact with the highlighted promoter (green) by EP3ICO.
Both predictions are supported by cell-type specific Hi-C interactions (blue paired lines).
The prioritized TF PPIs mediating the interactions are second-order PPIs RAD21-CTCF-
YY1 (E) and YY1-JUND-CTCF (F), respectively.
The training and testing data have high imbalance and the imbalanced dataset might
inflate the model’s performance. To justify the robustness of EP3ICO, a balanced dataset
with the same number of positive and negative samples is also generated by
downsampling for cross-validation. Furthermore, the genomic distance between the
enhancer and promoter pair might be a confounding factor that dominates the model
performance. Thus, a balanced dataset with genomic distance control is further generated
to evaluate the model performance.
The performance of our EP3ICO were also evaluated and compared with two state of art
models, TargetFinder and IM-PET, which also leverage TF binding features in their
model124,156. Both TargetFinder and IM-PET integrate activity-based features, genomic
distance, as well as TF binding information in enhancers and promoters. Additionally,
TargetFinder also used the TF binding information in the windows region between
enhancers and promoters. Using the same set of TF ChIP-Seq peaks to generate the TF
features in the window between enhancer-promoter pair, TargetFinder is trained on the
Hi-C data according to the data processing procedure in the paper. The trained model is
then applied to the same testing dataset to make the predictions. IM-PET is implemented
                                             53


to the same dataset. As IM-PET automatically predicts enhancer-promoter pairs with
genomic distance < 2Mb, only common enhancer-gene pairs are used to evaluate
performances, resulting in a fair comparison among the three models. By comparing our
first-order PPIs model TargetFinder and IM-PET, we demonstrate that PPIs information
can greatly improve the accuracy in the predicting of enhancer-promoter links.
The prioritized second-order PPIs are used to predict enhancer-promoter interactions on
the subset of enhancer-promoter pairs (see methods). By comparing the second-order
PPIs with the first-order PPI's performance on the same dataset, we demonstrate that
second-order PPIs can provide additional information for predicting enhancer-promoter
links. ProTECT applies Random Forest model to integrate TF-TF interactions modules to
predict enhancer-promoter interactions157. In order to demonstrate the additional
information provided by the second-order PPIs, we further compare our second-order PPI
model with ProTECT. Only the common enhancer gene pairs among all models were
used to evaluate the model performance. 5-fold cross-validation is applied and the
average AUCs are used to report the final performances.
4.2.8 TF shuffling
In order to quantitively check if TFs binding, as well as TF PPIs, indeed contribute to the
high accuracy of EP3ICO, we randomly permute the TFs binding on the enhancers and
genes as well as permute the PPIs between TFs with the degrees of each TFs unchanged
and the numbers of TFs that bind on enhancer or promoter maintained respectively. The
First-order PPI model was trained and tested on the same datasets using permuted TF
bindings and TF PPIs. The AUCs were used to compare the first-order PPI model trained
with the original dataset. For the second-order PPIs model, the TF binding on the
                                             54


enhancer correlation matrix, i.e, 𝐸 matrix, as well as the TF binding on the promoter, i.e
𝑃 matrix is calculated based on the permuted TF binding on the enhancers and promoters,
respectively. Then, the second-order PPI model was trained using permuted 𝐸, 𝑃, and
second-order PPI matrix on the same training set used for training the original second-
order PPI matrix, and the AUCs were compared to the original second-order PPI model
on the same testing dataset. Balanced dataset as well as distance-controlled balanced
data were also generated to evaluate the model performance based on shuffled PPIs.
4.2.9 EP3ICO predict enhancer-promoter interactions based on imputed cell-
specific TF bindings
In order to assess if EP3ICO can be transferred to cell lines or tissues when TF ChIP-Seq
data is not accessible, we impute cell-line specific TF bindings in K562 and implement
EP3ICO based on the imputed TF bindings.
To impute the TFs binding, the TF motifs coordinates discovered from ENCODE ChIP-
Seq data were collected158. Firstly, TF motif coordinates located in close chromatin were
filtered out and motifs in accessible chromatin were used to impute the TF binding.
Secondly, TFs that have lower gene expressions than 0.6 were further removed to avoid
the high false positives. Finally, imputed TF binding sites at enhancers and promoters are
generated with TFs pass the criteria by overlapping these TFs’ imputed binding sites with
enhancers and promoters through BEDtools159 (see methods). These TFs are then used
to construct first-order and second-order TF-TF interactions matrices and applied in
optimization models.
                                             55


Figure 4.3. TF PPI features provide additional information beyond TF bindings. A
Schematic figure of the permutation test on TF PPI features. The shuffled PPIs are
                                           56


                                    Figure 4.3 (cont’d)
generated by randomly pairing two interacting TFs from the original pool of TF PPIs, while
the degrees of PPI partners and TF binding sites in enhancers and promoters are
maintained. B Correspondingly, the co-occurrence matrices of TFs, E, and P are
generated based on the shuffle TF bindings. Based on the shuffled PPIs, new matrix
factorization models were trained and then evaluated by the same cross-validation
procedure for the first-order and second-order PPIs respectively. C, D ROC, and PR plots
for the models based on the original first-order TF PPI features (red), the models based
on the shuffled first-order TF PPI features (green). E, F ROC and PR plots for the models
based on the original second-order TF PPI features (purple), the models based on the
shuffled second-order TF PPI features (green).
4.2.10 cis-eQTL enrichment analysis for predicted long-range enhancer-gene
interactions
To predict the enhancer-promoter interactions, the prioritized first-order PPIs were
applied to reconstruct the score of the enhancer-promoter pairs. The top 10% predicted
enhancer-promoter pairs were selected to be the final predictions. Furthermore, optimized
second-order PPIs were applied to predict enhancer-promoter links on the subset of data
we used to learn the second-order PPIs. And the top 10% predictions in the testing data
from the subset were used as the final predictions.
cis-eQTL from matched cell lines also provides orthogonal information in validating the
accuracy of genome-wide predictions generated by our model. As we made the genome-
wide enhancer-promoter interactions in the K562 cell line, we collected three eQTL
datasets that are profiled from either lymphoblastoid cells or blood tissue156,160,161. A
predicted enhancer-promoter interaction is considered to be supported by the cis-eQTL
if the SNP overlaps with the enhancer and the promoter matches with the gene. The
fraction of predicted enhancer-promoter links was calculated for each eQTL dataset as
the enrichment score. The overlapping fraction of SNP-gene pairs of the top 10%
predictions of IM-PET were also used as a comparison for the first-order PPIs predictions.
                                             57


The top 5% predictions of IM-PET were used to calculate the overlapping fractions of
eQTLs as a comparison to our second-order PPIs predicted enhancer-gene pairs.
Wilcoxon signed-rank test was used to check the statistical significance.
Figure 4.4. Imputed TF bindings can accurately predict enhancer-promoter
interactions. EP3ICO uses imputed TF binding to predict long-range enhancer-promoter
interactions and is compared with IM-PET on the same input datasets. Performance is
                                         58


                                    Figure 4.4 (cont’d)
evaluated based on the averaged performance of 5-fold cross-validation. As a baseline
comparison, enhancer-gene activity correlations are also included in the analysis. A ROC
curves and B PR curves of first-order PPIs performance in K562 cells based on imputed
TF binding. C ROC curves and D PR curves of second-order PPIs performance in K562
cells based on imputed TF binding.
4.3      RESULTS
4.3.1 Predict long-range enhancer-promoter interactions based on multi-order
PPIs between TFs
Previous experimental studies have identified that PPIs between specific TFs mediate
long-range enhancer-promoter interactions through the binding to promoters and
enhancers respectively. Two examples of first-order PPI-mediated Hi-C enhancer-gene
interactions are shown in Figure C.1. In the first example, PPI between RAD21 and CTCF
regulates gene TTC31’s promoter with a distal enhancer located 100kb away. In the
second example, PPI between CTCF and YY1 mediates the long-range interaction
between gene E2F2’s promoter and an enhancer located 20kb away.
Moreover, higher-order PPIs between TFs are also found to participate in the regulation
                                138–142
of long-range chromatin loops           . Two second-order PPI facilitated Hi-C enhancer-
promoter interaction examples are shown in (Figure 4.1B, 4.1C). The first example shows
that IKZF1 maintains a second-order PPI with RAD21 through the co-binding with CTCF
on enhancer. The second-order PPI between IZKF1 and RAD21 facilitates long-range
links between gene ZMYND10 with an enhancer located 100kb away. Without the co-
binding of IKZF1 on enhancer, PPI between CTCF and RAD21 alone is not able to
facilitate the long-range chromatin loops for certain enhancer-promoter pairs. The second
example shows second-order PPI between CTCF-YY1-JUND mediated Hi-C interactions.
                                             59


YY1 and JUND has first-order PPIs, the co-binding of YY1 and CTCF on enhancer
provides second-order PPIs between CTCF and JUND. The second-order PPI between
CTCF-YY1-JUND regulates interactions between SRP90 and enhancer located 200kb
away.
A predictive model, especially a linear model can easily suffer from overfitting problems
due to the large feature dimensions. To effectively avoid the overfitting problems, we
applied two steps to increase the feature sparsity. First, before the training of the model,
we removed PPIs with a confidence score < 100 to manually reduce the feature
dimension as well as control the input feature quality. Secondly, we added L1-
regularization to the matrix factorization model we developed. L1-regularization shrinks
the less important features’ weights to zero, resulting in a sparse optimized matrix. The
sparse features will avoid over-fitting issues effectively as well as preserve the robustness
of the model162.
The first-order PPI models are trained on a high-resolution Hi-C dataset from human K562
cell line. The Hi-C data was normalized by KR normalization method (see methods). The
enhancer-gene contact frequency matrix in each TAD was constructed from the KR
normalized matrix, and the variance caused by genomic distance was removed by fitting
a linear model between enhancer-promoter distance and contacts (see methods). The
residual matrix was computed by subtracting the prediction from the observation. The
residual matrix was decomposed to into three matrices including the TF-enhancer matrix,
the first-order PPIs interaction matrix, and the TF-promoter matrix. The learned first-order
PPIs were optimized through an optimization process. Prioritized first-order TF-TF
interactions were applied to predict genome-wide enhancer gene interaction contacts.
                                               60


Figure 4.5. EP3ICO identifies functional TF PPIs that regulate enhancer-promoter
interactions. A,B Prioritized first-order (A) and second-order (B) TF PPIs in four clusters.
                                              61


                                    Figure 4.5 (cont’d)
Min-Max normalization is applied to the TF PPIs optimized weights in each cluster, top
20 optimized TF PPIs are selected based on the normalized weights in each cluster. The
union of PPIs is used to plot the heatmap. The higher the number, the more important
TF PPI is in regulating long-range enhancer-promoter interactions. C D cis-eQTLs from
multiple datasets (x- axis) are significantly enriched in first-order TF PPIs (C) predicted
enhancer-promoter interactions and second-order PPIs (D) predicted enhancer-promoter
interactions in K562 (red). The fractions of enhancer-promoter interactions overlapping
with cis-QTLs (y-axis) are compared with IM-PET. Error bars represent sd. E Example of
a cis-eQTL, i.e. the rs16889678-RPS10 pair, overlapping with a second-order PPI
predicted enhancer-promoter interaction. The predicted interaction is supported by ChIA-
PET (blue paired lines). The prioritized PPI feature is YY1-CTCF-RAD21, consistent with
the ChIP-seq signal tracks (black signals).
Differences between the observed enhancer-promoter pair contacts and predicted
enhancer-promoter contacts were calculated. Enhancers-promoter pairs with top 5% and
bottom 5% residuals were then selected to be further decomposed to learn the second-
order PPIs information. The prioritized second-order PPIs were then applied to boost the
prediction for enhancer-promoter pairs in the subset (Figure 4.1D). The prioritized first-
order PPIs as well as second-order PPIs between TFs provide new insights for
understanding the complex interplay between TFs, enhancers, and genes. Further cis-
eQTL analysis based on the predicted first-order TF PPIs as well as second-order TF PPI
predicted enhancer-promoter interactions provide a new platform for understanding the
human genetic analyses.
4.3.2 TF PPIs features provide boosted performance
Using the genome TAD-split cross-validation strategy, we tested the prediction accuracy
of the first-order PPI model as well as compared our model with two other supervised
machine learning models, TagetFinder and IM-PET in K562 cell line. Our first-order PPI
model achieved the highest performance, with AUROCs are 0.86. IM-PET ranked second
                                             62


with AUROC being 0.57. TargetFinder has the lowest AUROC value 0.52. Since the
testing data has a high imbalance, the AUPR is also used to assess the performance of
all of the models. Our model has the highest performance again, with an AUPR are 0.13.
As a baseline comparison, the activity correlation between the enhancer and gene pair is
also calculated as a prediction score and used in the comparison. The AUROC and AUPR
of activity correlation are 0.57 and 0.03 respectively, similar to IM-PET, which uses activity
correlation as the main feature in their model (Figure 4.2A, 4.2B). As the data set is highly
imbalanced, we also downsampled balanced data in two ways and evaluate the model
performance on the balanced dataset. First, we randomly select a negative dataset that
have the same number of samples as the positive dataset. By assessing the models’
performance on the balanced data, EP3ICO still shows the highest accuracy (Figure
C.4A). Secondly, we performed the downsampling with a more stringent strategy by
controlling the genomic distance. Specifically, instead of randomly selecting negative
pairs, we selected the negative pairs that have the same genomic distance distribution
with the positive pairs. By evaluating the models’ performance on the genomic-distance
controlled data, EP3ICO again shows the highest accuracy (Figure C.6A).
Two examples of predicted enhancer-promoter interactions are shown in Figure C.2. The
first enhancer-gene pair example, where the enhancer locates 50kb away from the
enhancer, has Hi-C validated interactions. The activity correlation between enhancer and
gene is 0, which indicates a false negative prediction based on the activity correlation.
However, our prioritized PPI between CTCF and RAD21 reconstructed the long-range
enhancer-promoter interaction with CTCF binding on the enhancer and RAD21 binding
on the promoter. The second enhancer-gene pair example have a low activity correlation,
                                              63


i.e. 0.4 while with validated Hi-C interaction. Again, the prioritized PPIs between EP300
and RAD21 have reconstructed this long-range enhancer-promoter pair with EP300
binding on the enhancer and RAD21 binding on the promoter.
A subset of enhancer-promoter pairs can’t be reconstructed by our first-order PPI model,
we further developed a matrix factorization model and trained the model on the subset
enhancer-gene pairs to identify second-order PPIs required for enhancer-promoter
regulation (Figure 4.1D). By comparing the second-order PPI model with the first-order
PPI model, we further demonstrated that second-order PPIs boosted the accuracy in
predicting long-range enhancer-promoter links that could not be predicted by first-order
PPIs only (Figure C.3). By generating the balanced dataset and evaluate the models on
balanced data, we further proved the robust performance of second-order PPI model
(Figure C. 4B, Figure C. 6B). ProTECT157 is another supervised random forest model that
uses PPI modules as main features to predict long-range enhancer-promoter interactions.
However, ProTECT doesn’t use higher order PPIs as their features. To justify that higher-
order PPIs can provide additional information, we compared our second-order PPI model
with ProTECT, TargetFinder, IM-PET, and the baseline model on the common data. The
results show that our second-order PPI model overperforms ProTECT as well as other
three models, suggesting there are more information provided by the second-order PPIs
(Figure C. 8).
Two predicted examples using second-order PPIs are presented in Figure 4.2E, 4.2F. In
the first example, YY1 maintains a second-order PPI with RAD21 through the co-binding
with CTCF on the enhancer and leading to the regulation of this enhancer’s interaction
with a gene located 20kb away. In the second example, JUND has second-order PPI in
                                             64


CTCF through the co-binding with YY1. The second-order PPIs regulate the long-range
enhancer-promoter interaction where the enhancer is 10kb away from the promoter.
As we used a subset to learn the second-order PPIs, our second-order PPI model was
compared with TargetFinder and IM-PET on the same testing dataset. Still, our second-
order PPI model has the highest performance with AUROC and AUPR are 0.81 and 0.14
respectively (Figure 4.2C, 4.2D). The results not only demonstrate that our second-order
PPIs model has superior performance, but also our matrix factorization model is robust
on different datasets.
4.3.3 TF PPI features provide additional information beyond TF bindings and
activity-based features
To further justify the superior performance of indeed result from the information of first-
order TF PPIs features, we permuted the TF bindings on the enhancer and promoters
with the numbers of TFs on the enhancers or promoters being strictly maintained.
Moreover, the TF-TF interactions were shuffled randomly with the degree of PPI partner
remained (Figure 4.3A). Correspondingly, the E and P matrices were calculated based
on the shuffled TF bindings to have the shuffled second-order PPIs (Figure 4.3B). We
trained EP3ICO on the data y using the permuted TF bindings and the shuffled first-order
as well as shuffled second-order TF-TF interactions, and compared the model
performance with using the original data. We observed that using the shuffle TF-TF
features decreases the accuracy of both first-order’s and second-order’s predictions in
the K562 cell line (Figure 4.3 C-F). Balanced dataset was also generated by downsample
same number of enhancer-promoter pairs that are not supported by called ChIP-PET
interactions. Balanced dataset was generated with distance controlled as well as in a
                                            65


random way, i.e distance is not controlled. The models’ performance was further
compared on these two kinds of balanced dataset. The results further demonstrated that
original TF bindings and TF-TF interactions can reconstruct the enhancer-promoter
interactions more accurately (Figure C.5, Figure C.7). The difference in prediction
accuracy suggests the boosted performance of EP3ICO is contributed by the TFs bindings
and the TF PPI features.
4.3.4 cis-eQTL are enriched in the first-order PPIs and second-order PPIs
predicted long-range enhancer-promoter interactions.
The prioritized first-order PPIs are applied to predict enhancer-promoter interactions on
the testing dataset. We selected the top 10% predictions of enhancer-promoter links
based on the reconstructed interaction scores as the final predictions. As we trained the
model on the Hi-C data, cis-eQTLs were used as orthogonal evidence to evaluate the
accuracy of predicted enhancer-promoter interactions. By calculating the fractions of
predicted enhancer-promoter pairs supported with SNP-gene pairs of significant eQTLs,
we compared the overlapping enrichment scores of EP3ICO with IM-PET. Compared with
IM-PET, the enhancer-promoter links predicted by first-order PPIs have significantly
higher fractions overlapping with eQTLs in K562 for all three eQTL datasets (p-values <
5.50e-08, 6.03e-04, and 7.45e-04 respectively, Figure 4.5C). Moreover, the prioritized
second-order PPIs were also applied to predicted enhancer-promoter interactions on the
testing set from the subset data. The top 10% reconstructed enhancer-promoter links
were selected as the predictions. The same cis-eQTL datasets were applied to check the
enrichment scores of EP3ICO with IM-PET. EP3ICO has significantly higher overlapping
                                            66


fractions in K562, the p-values < 2.54e-06, 0.0171, 0.0187 respectively (Figure 4.5D).
The results suggest high accuracy of EP3ICO predictions.
4.4     DISCUSSION
In this study, we develop a matrix factorization model, EP3ICO to infer multi-order PPIs
between TFs that regulate long-range enhancer-promoter interactions. Using the
optimized multi-order TF-TF interactions, EP3ICO accurately predicts long-range
enhancer-promoter interactions and achieves superior performance compared with
existing methods. By incorporating higher-order PPIs between TFs, EP3ICO further
improved the prediction performance on a subset of enhancer-promoter interactions that
can’t be reconstructed accurately by first-order TF-TF interactions.
cis-eQTLs are also applied to validate the predictions of EP3ICO as the orthogonal
evidence. Enrichment scores of cis-eQTLs in EP3ICO predicted enhancer-promoter links
are compared with enrichment scores of IM-PET predicted enhancer-promoter
interactions. The results show that prioritized first-order PPIs and second-order PPIs
reconstructed enhancer-promoter links have significantly higher fractions than IM-PET.
The promising enrichment analysis further indicates the predictions of EP3ICO can be
used as a platform to characterize the non-coding SNP’s effects propagated through 3D
chromatin interactions.
The K562 cell line has comprehensive profiling of TF bindings sites, which provides the
data for EP3ICO. However, there remain cell lines or tissues that have limited data sets.
To address the data limitation issue, we impute TF bindings based on the TF motifs, TF
gene expression level, and chromatin accessibility. By using the imputed TF bindings,
EP3ICO still achieves superior performance for both first-order PPIs and second-order
                                            67


PPIs predicted enhancer-promoter links. The results further demonstrate the accuracy of
EP3ICO and its generalizability.
The major novelty of EP3ICO is the inclusion of multi-order TF PPIs as features. The first-
order TF PPI can reconstruct and explain the majority of enhancer-promoter links.
However, a subset of enhancer-promoter interactions cannot be predicted by first-order
TF PPIs only, second-order TF PPIs can be applied to improve the predictions. Moreover,
EP3ICO is a flexible workflow and can be further applied to identify higher-order PPIs that
regulate the 3D genome.
                                           68


                                    FUTURE DIRECTIONS
Identification of transcription factors binding sites (TFBSs) and charactering chromatin
interactions in the 3D space are critical for understanding the gene regulatory networks.
In this dissertation, we presented the studies that were conducted in both directions to
decode the gene regulatory rhythms by integrating the multi-omics data. We developed
a supervised machine learning model, TF-wave, that uses DNase-Seq data to predict
TFBSs within cell lines. Next, we designed a matrix factorization model, EP3ICO, that
jointly infers long-range enhancer-promoter interactions and multi-order PPIs that
mediate the 3D genome.
Although TF-wave has achieved superior performance in predicting intra-cell transcription
factors footprints, inter-cell line prediction is not investigated yet and will be one of the
future directions we will focus on. Another future direction is applying ATAC-Seq data as
features to predict TFBSs. Moreover, single-cell ATAC-Seq (scATAC-Seq) data is
available in many cell lines, we will future take use of the scATAC-Seq to predict TFBS
at single-cell level.
 EP3ICO achieves high accuracy in predicting long-range enhancer-promoter links as well
as provides mechanistic insights of 3D genome. We will continue working on exploring
more on the roles of epigenetic features including histone modifications in shaping the 3D
-genome. By integrating new epigenetics features, we will further improve the accuracy
of mathematical models in predicting long-range enhancer-promoter links.
                                               69


                                    BIBLIOGRAPHY
1.  Lambert, S. A. et al. The Human Transcription Factors. Cell (2018)
    doi:10.1016/j.cell.2018.01.029.
2.  Ouyang, N. & Boyle, A. P. TRACE: Transcription factor footprinting using
    chromatin accessibility data and DNA sequence. Genome Res. (2020)
    doi:10.1101/gr.258228.119.
3.  Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. Genome-wide mapping of
    in vivo protein-DNA interactions. Science (80-. ). (2007)
    doi:10.1126/science.1141319.
4.  Levy, S. & Hannenhalli, S. Identification of transcription factor binding sites in the
    human genome sequence. Mamm. Genome (2002) doi:10.1007/s00335-002-
    2175-6.
5.  Mikkelsen, T. S. et al. Genome-wide maps of chromatin state in pluripotent and
    lineage-committed cells. Nature (2007) doi:10.1038/nature06008.
6.  Li, H., Quang, D. & Guan, Y. Anchor: Trans-cell type prediction of transcription
    factor binding sites. Genome Res. (2019) doi:10.1101/gr.237156.118.
7.  Rhee, H. S. & Pugh, B. F. ChIP-exo: A Method to Identify Genomic Location of
    DNA-binding proteins at Near Single Nucleotide Accuracy. Curr. Protoc. Mol. Biol.
    (2008).
8.  He, Q., Johnston, J. & Zeitlinger, J. ChIP-nexus enables improved detection of in
    vivo transcription factor binding footprints. Nat. Biotechnol. (2015)
    doi:10.1038/nbt.3121.
9.  Ernst, J., Plasterer, H. L., Simon, I. & Bar-Joseph, Z. Integrating multiple evidence
    sources to predict transcription factor binding in the human genome. Genome
    Res. (2010) doi:10.1101/gr.096305.109.
10. Boyle, A. P. et al. High-Resolution Mapping and Characterization of Open
    Chromatin across the Genome. Cell (2008) doi:10.1016/j.cell.2007.12.014.
11. Zhong, J. et al. Mapping nucleosome positions using DNase-seq. Genome Res.
    (2016) doi:10.1101/gr.195602.115.
12. Funk, C. C. et al. Atlas of Transcription Factor Binding Sites from ENCODE
    DNase Hypersensitivity Data across 27 Tissue Types. Cell Rep. (2020)
    doi:10.1016/j.celrep.2020.108029.
13. Yardimci, G. G., Frank, C. L., Crawford, G. E. & Ohler, U. Explicit DNase
    sequence bias modeling enables high-resolution transcription factor footprint
    detection. Nucleic Acids Res. (2014) doi:10.1093/nar/gku810.
                                            70


14. Kähärä, J. & Lähdesmäki, H. BinDNase: A discriminatory approach for
    transcription factor binding prediction using DNase I hypersensitivity data.
    Bioinformatics (2015) doi:10.1093/bioinformatics/btv294.
15. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA
    sequence and chromatin accessibility data. Genome Res. (2011)
    doi:10.1101/gr.112623.110.
16. Quach, B. & Furey, T. S. DeFCoM: Analysis and modeling of transcription factor
    binding sites using a motif-centric genomic footprinter. Bioinformatics (2017)
    doi:10.1093/bioinformatics/btw740.
17. Sherwood, R. I. et al. Discovery of directional and nondirectional pioneer
    transcription factors by modeling DNase profile magnitude and shape. Nat.
    Biotechnol. (2014) doi:10.1038/nbt.2798.
18. Gusmao, E. G., Allhoff, M., Zenke, M. & Costa, I. G. Analysis of computational
    footprinting methods for DNase sequencing experiments. Nat. Methods (2016)
    doi:10.1038/nmeth.3772.
19. Piper, J. et al. Wellington: A novel method for the accurate identification of digital
    genomic footprints from DNase-seq data. Nucleic Acids Res. (2013)
    doi:10.1093/nar/gkt850.
20. Rabiner, L. R. A Tutorial on Hidden Markov Models and Selected Applications in
    Speech Recognition. Proc. IEEE (1989) doi:10.1109/5.18626.
21. Speed, T. P. Biological sequence analysis. in Selected Works of Terry Speed
    (2012). doi:10.1007/978-1-4614-1347-9_14.
22. Vierstra, J. et al. Global reference mapping of human transcription factor
    footprints. Nature (2020) doi:10.1038/s41586-020-2528-x.
23. Wickerhauser, M. V. The Discrete Wavelet Transform. in Adapted Wavelet
    Analysis (2020). doi:10.1201/9781439863619-11.
24. Kim, J. et al. Stacked auto-encoder based CNC tool diagnosis using discrete
    wavelet transform feature extraction. Processes (2020) doi:10.3390/PR8040456.
25. Bailey, D. H. & Swarztrauber, P. N. A Fast Method for the Numerical Evaluation of
    Continuous Fourier and Laplace Transforms. SIAM J. Sci. Comput. (1994)
    doi:10.1137/0915067.
26. Mateo, C. & Talavera, J. A. Short-time Fourier transform with the window size
    fixed in the frequency domain. Digit. Signal Process. A Rev. J. (2018)
    doi:10.1016/j.dsp.2017.11.003.
27. Mateo, C. & Talavera, J. A. Short-Time Fourier Transform with the Window Size
                                           71


    Fixed in the Frequency Domain (STFT-FD): Implementation. SoftwareX (2018)
    doi:10.1016/j.softx.2017.11.005.
28. Liu, C.-L. A Tutorial of the Wavelet Transform. History (2010).
29. de Santiago, I. & Carroll, T. Analysis of ChIP-seq data in R/Bioconductor. in
    Methods in Molecular Biology (2018). doi:10.1007/978-1-4939-7380-4_17.
30. Castro-Mondragon, J. A. et al. JASPAR 2022: The 9th release of the open-access
    database of transcription factor binding profiles. Nucleic Acids Res. (2022)
    doi:10.1093/nar/gkab1113.
31. Korhonen, J., Martinmäki, P., Pizzi, C., Rastas, P. & Ukkonen, E. MOODS: Fast
    search for position weight matrix matches in DNA sequences. Bioinformatics
    (2009) doi:10.1093/bioinformatics/btp554.
32. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics
    (2009) doi:10.1093/bioinformatics/btp352.
33. Natekin, A. & Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot.
    (2013) doi:10.3389/fnbot.2013.00021.
34. Lee, G., Gommers, R., Waselewski, F., Wohlfahrt, K. & O’Leary, A. PyWavelets:
    A Python package for wavelet analysis. J. Open Source Softw. (2019)
    doi:10.21105/joss.01237.
35. Navarro Gonzalez, J. et al. The UCSC genome browser database: 2021 update.
    Nucleic Acids Res. (2021) doi:10.1093/nar/gkaa1070.
36. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: Learning the regulatory code of the
    accessible genome with deep convolutional neural networks. Genome Res.
    (2016) doi:10.1101/gr.200535.115.
37. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep
    learning-based sequence model. Nat. Methods (2015) doi:10.1038/nmeth.3547.
38. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence
    specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol.
    (2015) doi:10.1038/nbt.3300.
39. Dere, E., Lo, R., Celius, T., Matthews, J. & Zacharewski, T. R. Integration of
    Genome-Wide Computation DRE Search, AhR ChIP-chip and Gene Expression
    Analyses of TCDD-Elicited Responses in the Mouse Liver. BMC Genomics (2011)
    doi:10.1186/1471-2164-12-365.
40. Mimura, J. & Fujii-Kuriyama, Y. Functional role of AhR in the expression of toxic
    effects by TCDD. in Biochimica et Biophysica Acta - General Subjects (2003).
    doi:10.1016/S0304-4165(02)00485-3.
                                           72


41. Sogawa, K. & Fujii-Kuriyama, Y. Ah receptor, a novel ligand-activated
    transcription factor. Journal of Biochemistry (1997)
    doi:10.1093/oxfordjournals.jbchem.a021864.
42. Beischlag, T. V., Morales, J. L., Hollingshead, B. D. & Perdew, G. H. The aryl
    hydrocarbon receptor complex and the control of gene expression. Critical
    Reviews in Eukaryotic Gene Expression (2008)
    doi:10.1615/CritRevEukarGeneExpr.v18.i3.20.
43. Poland, A. & Glover, E. Comparison of 2,3,7,8 tetrachlorodibenzo p dioxin, a
    potent inducer of aryl hydrocarbon hydroxylase, with 3 methylcholanthrene. Mol.
    Pharmacol. (1974).
44. Nebert, D. W. & Gelboin, H. V. The in vivo and in vitro induction of aryl
    hydrocarbon hydroxylase in mammalian cells of different species, tissues, strains,
    and developmental and hormonal states. Arch. Biochem. Biophys. (1969)
    doi:10.1016/0003-9861(69)90253-7.
45. Chen, H. S. & Perdew, G. H. Subunit composition of the heteromeric cytosolic aryl
    hydrocarbon receptor complex. J. Biol. Chem. (1994) doi:10.1016/s0021-
    9258(18)47020-2.
46. Petrulis, J. R. & Perdew, G. H. The role of chaperone proteins in the aryl
    hydrocarbon receptor core complex. Chem. Biol. Interact. (2002)
    doi:10.1016/S0009-2797(02)00064-9.
47. Larigot, L., Juricek, L., Dairou, J. & Coumoul, X. AhR signaling pathways and
    regulatory functions. Biochimie Open (2018) doi:10.1016/j.biopen.2018.05.001.
48. Guyot, E., Chevallier, A., Barouki, R. & Coumoul, X. The AhR twist: Ligand-
    dependent AhR signaling and pharmaco-toxicological implications. Drug
    Discovery Today (2013) doi:10.1016/j.drudis.2012.11.014.
49. Murray, I. A., Patterson, A. D. & Perdew, G. H. Aryl hydrocarbon receptor ligands
    in cancer: Friend and foe. Nature Reviews Cancer (2014) doi:10.1038/nrc3846.
50. Denison, M. S., Fisher, J. M. & Whitlock, J. P. The DNA recognition site for the
    dioxin-Ah receptor complex. Nucleotide sequence and functional analysis. J. Biol.
    Chem. (1988).
51. Korkalainen, M., Lindén, J., Tuomisto, J. & Pohjanvirta, R. Effect of TCDD on
    mRNA expression of genes encoding bHLH/PAS proteins in rat hypothalamus.
    Toxicology (2005) doi:10.1016/j.tox.2004.11.003.
52. Esser, C., Rannug, A. & Stockinger, B. The aryl hydrocarbon receptor in
    immunity. Trends in Immunology (2009) doi:10.1016/j.it.2009.06.005.
53. Yang, S. Y., Ahmed, S., Satheesh, S. V. & Matthews, J. Genome-wide mapping
                                            73


    and analysis of aryl hydrocarbon receptor (AHR)- and aryl hydrocarbon receptor
    repressor (AHRR)-binding sites in human breast cancer cells. Arch. Toxicol.
    (2018) doi:10.1007/s00204-017-2022-x.
54. Jaeger, C. & Tischkau, S. A. Role of Aryl Hydrocarbon Receptor in Circadian
    Clock Disruption and Metabolic Dysfunction. Environmental Health Insights (2016)
    doi:10.4137/EHI.S38343.
55. Shimba, S. & Watabe, Y. Crosstalk between the AHR signaling pathway and
    circadian rhythm. Biochemical Pharmacology (2009)
    doi:10.1016/j.bcp.2008.09.040.
56. Kalmes, M., Hennen, J., Clemens, J. & Blömeke, B. Impact of aryl hydrocarbon
    receptor (AhR) knockdown on cell cycle progression in human HaCaT
    keratinocytes. Biol. Chem. (2011) doi:10.1515/BC.2011.067.
57. Nacarino-Palma, A. et al. The aryl hydrocarbon receptor promotes differentiation
    during mouse preimplantational embryo development. Stem Cell Reports (2021)
    doi:10.1016/j.stemcr.2021.08.002.
58. Gialitakis, M. et al. Activation of the Aryl Hydrocarbon Receptor Interferes with
    Early Embryonic Development. Stem Cell Reports (2017)
    doi:10.1016/j.stemcr.2017.09.025.
59. Kim, T. H. & Dekker, J. ChIP-seq. Cold Spring Harb. Protoc. (2018)
    doi:10.1101/pdb.prot082644.
60. Quang, D. & Xie, X. FactorNet: A deep learning framework for predicting cell type
    specific transcription factor binding from nucleotide-resolution sequential data.
    Methods (2019) doi:10.1016/j.ymeth.2019.03.020.
61. Gilfillan, G. D. et al. Limitations and possibilities of low cell number ChIP-seq.
    BMC Genomics (2012) doi:10.1186/1471-2164-13-645.
62. Carroll, T. S., Liang, Z., Salama, R., Stark, R. & de Santiago, I. Impact of artifact
    removal on ChIP quality metrics in ChIP-seq and ChIP-exo data. Front. Genet.
    (2014) doi:10.3389/fgene.2014.00075.
63. Matys, V. et al. TRANSFAC®: Transcriptional regulation, from patterns to profiles.
    Nucleic Acids Research (2003) doi:10.1093/nar/gkg108.
64. Kulakovskiy, I. V. et al. HOCOMOCO: A comprehensive collection of human
    transcription factor binding sites models. Nucleic Acids Res. (2013)
    doi:10.1093/nar/gks1089.
65. Wasserman, W. W. & Sandelin, A. Applied bioinformatics for the identification of
    regulatory elements. Nature Reviews Genetics (2004) doi:10.1038/nrg1315.
                                             74


66. Kiesel, A. et al. The BaMM web server for de-novo motif discovery and regulatory
    sequence analysis. Nucleic Acids Res. (2018) doi:10.1093/nar/gky431.
67. GuhaThakurta, D. & Stormo, G. D. Identifying target sites for cooperatively
    binding factors. Bioinformatics (2001) doi:10.1093/bioinformatics/17.7.608.
68. Kulakovskiy, I. et al. From binding motifs in chip-seq data to improved models of
    transcription factor binding sites. in Journal of Bioinformatics and Computational
    Biology (2013). doi:10.1142/S0219720013400040.
69. Atherton, J. et al. A model for sequential evolution of ligands by exponential
    enrichment (SELEX) data. Ann. Appl. Stat. (2012) doi:10.1214/12-AOAS537.
70. Karimzadeh, M. & Hoffman, M. M. Virtual ChIP-seq: predicting transcription factor
    binding by learning from the transcriptome. bioRxiv (2019) doi:10.1101/168419.
71. Gotea, V. et al. Homotypic clusters of transcription factor binding sites are a key
    component of human promoters and enhancers. Genome Res. (2010)
    doi:10.1101/gr.104471.109.
72. Yan, J. et al. XTranscription factor binding in human cells occurs in dense clusters
    formed around cohesin anchor sites. Cell (2013) doi:10.1016/j.cell.2013.07.034.
73. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA
    sequence and chromatin accessibility data. Genome Res. (2011)
    doi:10.1101/gr.112623.110.
74. Guo, W. L. & Huang, D. S. An efficient method to transcription factor binding sites
    imputation: Via simultaneous completion of multiple matrices with positional
    consistency. Mol. Biosyst. (2017) doi:10.1039/c7mb00155j.
75. Keilwagen, J., Posch, S. & Grau, J. Accurate prediction of cell type-specific
    transcription factor binding. Genome Biol. (2019) doi:10.1186/s13059-018-1614-y.
76. Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. in
    Proceedings of the ACM SIGKDD International Conference on Knowledge
    Discovery and Data Mining (2016). doi:10.1145/2939672.2939785.
77. Yang, S. Y., Ahmed, S., Satheesh, S. V. & Matthews, J. Genome-wide mapping
    and analysis of aryl hydrocarbon receptor (AHR)- and aryl hydrocarbon receptor
    repressor (AHRR)-binding sites in human breast cancer cells. Arch. Toxicol.
    (2018) doi:10.1007/s00204-017-2022-x.
78. Elith, J., Leathwick, J. R. & Hastie, T. A working guide to boosted regression
    trees. Journal of Animal Ecology (2008) doi:10.1111/j.1365-2656.2008.01390.x.
79. Gregorutti, B., Michel, B. & Saint-Pierre, P. Correlation and variable importance in
    random forests. Stat. Comput. (2017) doi:10.1007/s11222-016-9646-1.
                                            75


80. Gorkin, D. U., Leung, D. & Ren, B. The 3D genome in transcriptional regulation
    and pluripotency. Cell Stem Cell (2014) doi:10.1016/j.stem.2014.05.017.
81. Denholtz, M. & Plath, K. Pluripotency in 3D: Genome organization in pluripotent
    cells. Current Opinion in Cell Biology (2012) doi:10.1016/j.ceb.2012.11.001.
82. Salomoni, P. & Pandolfi, P. P. Transcriptional regulation of cellular transformation.
    Nature Medicine (2000) doi:10.1038/77459.
83. Atlasi, Y. & Stunnenberg, H. G. The interplay of epigenetic marks during stem cell
    differentiation and development. Nature Reviews Genetics (2017)
    doi:10.1038/nrg.2017.57.
84. Li, Y., Hu, M. & Shen, Y. Gene regulation in the 3D genome. Human Molecular
    Genetics (2018) doi:10.1093/hmg/ddy164.
85. Zabidi, M. A. et al. Enhancer-core-promoter specificity separates developmental
    and housekeeping gene regulation. Nature (2015) doi:10.1038/nature13994.
86. Ay, F. & Noble, W. S. Analysis methods for studying the 3D architecture of the
    genome. Genome Biol. 16, 1–15 (2015).
87. Bonev, B. & Cavalli, G. Organization and function of the 3D genome. Nature
    Reviews Genetics (2016) doi:10.1038/nrg.2016.112.
88. Zheng, H. & Xie, W. The role of 3D genome organization in development and cell
    differentiation. Nature Reviews Molecular Cell Biology (2019)
    doi:10.1038/s41580-019-0132-4.
89. Mirabella, A. C., Foster, B. M. & Bartke, T. Chromatin deregulation in disease.
    Chromosoma (2016) doi:10.1007/s00412-015-0530-0.
90. Boltsis, I., Grosveld, F., Giraud, G. & Kolovos, P. Chromatin Conformation in
    Development and Disease. Frontiers in Cell and Developmental Biology (2021)
    doi:10.3389/fcell.2021.723859.
91. Kaiser, V. B. & Semple, C. A. When TADs go bad: Chromatin structure and
    nuclear organisation in human disease. F1000Research (2017)
    doi:10.12688/f1000research.10792.1.
92. Lupiáñez, D. G. et al. Disruptions of topological chromatin domains cause
    pathogenic rewiring of gene-enhancer interactions. Cell (2015)
    doi:10.1016/j.cell.2015.04.004.
93. Valton, A. L. & Dekker, J. TAD disruption as oncogenic driver. Current Opinion in
    Genetics and Development (2016) doi:10.1016/j.gde.2016.03.008.
94. Gröschel, S. et al. A single oncogenic enhancer rearrangement causes
                                           76


     concomitant EVI1 and GATA2 deregulation in Leukemia. Cell (2014)
     doi:10.1016/j.cell.2014.02.019.
95.  Northcott, P. A. et al. Enhancer hijacking activates GFI1 family oncogenes in
     medulloblastoma. Nature (2014) doi:10.1038/nature13379.
96.  Lotta, L. A. et al. Integrative genomic analysis implicates limited peripheral
     adipose storage capacity in the pathogenesis of human insulin resistance. Nat.
     Genet. (2017) doi:10.1038/ng.3714.
97.  Hnisz, D. et al. Activation of proto-oncogenes by disruption of chromosome
     neighborhoods. Science (80-. ). (2016) doi:10.1126/science.aad9024.
98.  Weischenfeldt, J. et al. Pan-cancer analysis of somatic copy-number alterations
     implicates IRS4 and IGF2 in enhancer hijacking. Nat. Genet. (2017)
     doi:10.1038/ng.3722.
99.  Pucéat, M. Capturing chromosome conformation. in Methods in Molecular Biology
     (2021). doi:10.1007/978-1-0716-0664-3_1.
100. Zhao, Z. et al. Circular chromosome conformation capture (4C) uncovers
     extensive networks of epigenetically regulated intra- and interchromosomal
     interactions. Nat. Genet. (2006) doi:10.1038/ng1891.
101. Simonis, M. et al. Nuclear organization of active and inactive chromatin domains
     uncovered by chromosome conformation capture-on-chip (4C). Nat. Genet.
     (2006) doi:10.1038/ng1896.
102. Belton, J. M. & Dekker, J. Chromosome conformation capture carbon copy (5C) in
     budding yeast. Cold Spring Harb. Protoc. (2015) doi:10.1101/pdb.prot085191.
103. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions
     reveals folding principles of the human genome. Science (80-. ). (2009)
     doi:10.1126/science.1181369.
104. Mifsud, B. et al. Mapping long-range promoter contacts in human cells with high-
     resolution capture Hi-C. Nat. Genet. (2015) doi:10.1038/ng.3286.
105. Wei, C. L. et al. A global map of p53 transcription-factor binding sites in the
     human genome. Cell (2006) doi:10.1016/j.cell.2005.10.043.
106. Liu, X. et al. In Situ Capture of Chromatin Interactions by Biotinylated dCas9. Cell
     (2017) doi:10.1016/j.cell.2017.08.003.
107. Morgan, S. L. et al. Manipulation of nuclear architecture through CRISPR-
     mediated chromosomal looping. Nat. Commun. (2017)
     doi:10.1038/ncomms15993.
                                            77


108. Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals
     principles of chromatin looping. Cell (2014) doi:10.1016/j.cell.2014.11.021.
109. Boyle, S., Rodesch, M. J., Halvensleben, H. A., Jeddeloh, J. A. & Bickmore, W. A.
     Fluorescence in situ hybridization with high-complexity repeat-free oligonucleotide
     probes generated by massively parallel synthesis. Chromosom. Res. (2011)
     doi:10.1007/s10577-011-9245-0.
110. Kempfer, R. & Pombo, A. Methods for mapping 3D chromosome architecture.
     Nature Reviews Genetics (2020) doi:10.1038/s41576-019-0195-2.
111. Dunham, I. et al. An integrated encyclopedia of DNA elements in the human
     genome. Nature (2012) doi:10.1038/nature11247.
112. Jung, I. et al. A compendium of promoter-centered long-range chromatin
     interactions in the human genome. Nat. Genet. (2019) doi:10.1038/s41588-019-
     0494-8.
113. Li, X. et al. Long-read ChIA-PET for base-pair-resolution mapping of haplotype-
     specific chromatin interactions. Nat. Protoc. (2017) doi:10.1038/nprot.2017.012.
114. Li, G. et al. ChIA-PET tool for comprehensive chromatin interaction analysis with
     paired-end tag sequencing. Genome Biol. (2010) doi:10.1186/gb-2010-11-2-r22.
115. Yardımcı, G. G. et al. Measuring the reproducibility and quality of Hi-C data.
     Genome Biol. (2019) doi:10.1186/s13059-019-1658-7.
116. Smith, E. M., Lajoie, B. R., Jain, G. & Dekker, J. Invariant TAD Boundaries
     Constrain Cell-Type-Specific Looping Interactions between Promoters and Distal
     Elements around the CFTR Locus. Am. J. Hum. Genet. (2016)
     doi:10.1016/j.ajhg.2015.12.002.
117. Roy, S. et al. A predictive modeling approach for cell line-specific long-range
     regulatory interactions. Nucleic Acids Res. (2015) doi:10.1093/nar/gkv865.
118. Hait, T. A., Amar, D., Shamir, R. & Elkon, R. FOCS: A novel method for analyzing
     enhancer and gene activity patterns infers an extensive enhancer-promoter map.
     Genome Biol. (2018) doi:10.1186/s13059-018-1432-2.
119. Gao, T. & Qian, J. Eagle: An algorithm that utilizes a small number of genomic
     features to predict tissue/ cell type-specific enhancer-gene interactions. PLoS
     Comput. Biol. (2019) doi:10.1371/journal.pcbi.1007436.
120. Cao, Q. et al. Reconstruction of enhancer-target networks in 935 samples of
     human primary cells, tissues and cell lines. Nat. Genet. (2017)
     doi:10.1038/ng.3950.
121. Fishilevich, S. et al. GeneHancer: Genome-wide integration of enhancers and
                                             78


     target genes in GeneCards. Database (2017) doi:10.1093/database/bax028.
122. Corradin, O. et al. Combinatorial effects of multiple enhancer variants in linkage
     disequilibrium dictate levels of gene expression to confer susceptibility to common
     traits. Genome Res. (2014) doi:10.1101/gr.164079.113.
123. He, B., Chen, C., Teng, L. & Tan, K. Global view of enhancer-promoter
     interactome in human cells. Proc. Natl. Acad. Sci. U. S. A. (2014)
     doi:10.1073/pnas.1320308111.
124. Whalen, S., Truty, R. M. & Pollard, K. S. Enhancer-promoter interactions are
     encoded by complex genomic signatures on looping chromatin. Nat. Genet. 48,
     488–496 (2016).
125. Moore, J. E., Pratt, H. E., Purcaro, M. J. & Weng, Z. A curated benchmark of
     enhancer-gene interactions for evaluating enhancer-target gene prediction
     methods. Genome Biol. (2020) doi:10.1186/s13059-019-1924-8.
126. Cao, F. & Fullwood, M. J. Inflated performance measures in enhancer–promoter
     interaction-prediction methods. Nature Genetics (2019) doi:10.1038/s41588-019-
     0434-7.
127. Whitaker, J. W., Nguyen, T. T., Zhu, Y., Wildberg, A. & Wang, W. Computational
     schemes for the prediction and annotation of enhancers from epigenomic assays.
     Methods (2015) doi:10.1016/j.ymeth.2014.10.008.
128. Nolis, I. K. et al. Transcription factors mediate long-range enhancer-promoter
     interactions. Proc. Natl. Acad. Sci. U. S. A. (2009) doi:10.1073/pnas.0902454106.
129. Maksimenko, O. & Georgiev, P. Mechanisms and proteins involved in long-
     distance interactions. Frontiers in Genetics (2014) doi:10.3389/fgene.2014.00028.
130. Dall’Agnese, A. et al. Transcription Factor-Directed Re-wiring of Chromatin
     Architecture for Somatic Cell Nuclear Reprogramming toward trans-
     Differentiation. Mol. Cell (2019) doi:10.1016/j.molcel.2019.07.036.
131. Stadhouders, R., Filion, G. J. & Graf, T. Transcription factors and 3D genome
     conformation in cell-fate decisions. Nature (2019) doi:10.1038/s41586-019-1182-
     7.
132. Arnould, C. et al. Loop extrusion as a mechanism for formation of DNA damage
     repair foci. Nature (2021) doi:10.1038/s41586-021-03193-z.
133. Fudenberg, G. et al. Formation of Chromosomal Domains by Loop Extrusion. Cell
     Rep. (2016) doi:10.1016/j.celrep.2016.04.085.
134. Kim, Y., Shi, Z., Zhang, H., Finkelstein, I. J. & Yu, H. Human cohesin compacts
     DNA by loop extrusion. Science (80-. ). (2019) doi:10.1126/science.aaz4475.
                                              79


135. Ren, G. et al. CTCF-Mediated Enhancer-Promoter Interaction Is a Critical
     Regulator of Cell-to-Cell Variation of Gene Expression. Mol. Cell (2017)
     doi:10.1016/j.molcel.2017.08.026.
136. Guo, Y. et al. CRISPR Inversion of CTCF Sites Alters Genome Topology and
     Enhancer/Promoter Function. Cell (2015) doi:10.1016/j.cell.2015.07.038.
137. Weintraub, A. S. et al. YY1 Is a Structural Regulator of Enhancer-Promoter
     Loops. Cell (2017) doi:10.1016/j.cell.2017.11.008.
138. Bailey, S. D. et al. ZNF143 provides sequence specificity to secure chromatin
     interactions at gene promoters. Nat. Commun. (2015) doi:10.1038/ncomms7186.
139. Wen, Z., Huang, Z. T., Zhang, R. & Peng, C. ZNF143 is a regulator of chromatin
     loop. Cell Biol. Toxicol. (2018) doi:10.1007/s10565-018-9443-z.
140. Ye, B. Y. et al. ZNF143 is involved in CTCF-mediated chromatin interactions by
     cooperation with cohesin and other partners. Mol. Biol. (2016)
     doi:10.1134/S0026893316030031.
141. Ye, B. et al. ZNF143 in Chromatin Looping and Gene Regulation. Frontiers in
     Genetics (2020) doi:10.3389/fgene.2020.00338.
142. Zhou, Q. et al. ZNF143 mediates CTCF-bound promoter–enhancer loops required
     for murine hematopoietic stem and progenitor cell function. Nat. Commun. (2021)
     doi:10.1038/s41467-020-20282-1.
143. Knight, P. A. & Ruiz, D. A fast algorithm for matrix balancing. IMA J. Numer. Anal.
     (2013) doi:10.1093/imanum/drs019.
144. Forcato, M. et al. Comparison of computational methods for Hi-C data analysis.
     Nat. Methods 14, 679–685 (2017).
145. Harrow, J. et al. GENCODE: producing a reference annotation for ENCODE.
     Genome Biol. (2006).
146. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference
     human epigenomes. Nature (2015) doi:10.1038/nature14248.
147. Ernst, J. & Kellis, M. ChromHMM: Automating chromatin-state discovery and
     characterization. Nature Methods (2012) doi:10.1038/nmeth.1906.
148. Feingold, E. A. et al. The ENCODE (ENCyclopedia of DNA Elements) Project.
     Science (2004) doi:10.1126/science.1105136.
149. Ji, H. et al. An integrated software system for analyzing ChIP-chip and ChIP-seq
     data. Nat. Biotechnol. (2008) doi:10.1038/nbt.1505.
                                            80


150. Szklarczyk, D. et al. STRING v11: Protein-protein association networks with
     increased coverage, supporting functional discovery in genome-wide
     experimental datasets. Nucleic Acids Res. (2019) doi:10.1093/nar/gky1131.
151. Dixon, J. R. et al. Topological domains in mammalian genomes identified by
     analysis of chromatin interactions. Nature (2012) doi:10.1038/nature11082.
152. Akdemir, K. C. et al. Disruption of chromatin folding domains by somatic genomic
     rearrangements in human cancer. Nat. Genet. (2020) doi:10.1038/s41588-019-
     0564-y.
153. Fullwood, M. J. & Ruan, Y. ChIP-based methods for the identification of long-
     range chromatin interactions. J. Cell. Biochem. 107, 30–39 (2009).
154. Koehler, A. B. Journal of the American Statistical Association. Int. J. Forecast.
     (1995) doi:10.1016/0169-2070(95)90067-5.
155. Barzilai, J. & Borwein, J. M. Two-point step size gradient methods. IMA J. Numer.
     Anal. (1988) doi:10.1093/imanum/8.1.141.
156. Aguet, F. et al. Genetic effects on gene expression across human tissues. Nature
     (2017) doi:10.1038/nature24277.
157. Wang, H., Huang, B. & Wang, J. Predict long-range enhancer regulation based on
     protein-protein interactions between transcription factors. Nucleic Acids Res.
     (2021) doi:10.1093/nar/gkab841.
158. Kheradpour, P. & Kellis, M. Systematic discovery and characterization of
     regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Res. (2014)
     doi:10.1093/nar/gkt1249.
159. Quinlan, A. R. & Hall, I. M. BEDTools: A flexible suite of utilities for comparing
     genomic features. Bioinformatics (2010) doi:10.1093/bioinformatics/btq033.
160. Battle, A. et al. Characterizing the genetic basis of transcriptome diversity through
     RNA-sequencing of 922 individuals. Genome Res. (2014)
     doi:10.1101/gr.155192.113.
161. Nica, A. C. et al. The architecture of gene regulatory variation across multiple
     human tissues: The muTHER study. PLoS Genet. (2011)
     doi:10.1371/journal.pgen.1002003.
162. Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. J. R. Stat. Soc.
     Ser. B (1996) doi:10.1111/j.2517-6161.1996.tb02080.x.
                                            81


                          APPENDIX A
          SUPPLEMENTARY FIGURES FOR CHAPTER 2
             Table A.1. List of TFs in K562 cell line
     ID                  TF                FRiP         Motif
ENCFF002DBD            CTCF             35.5666667    MA0139.1
ENCFF926FUM             YY1               11.395      MA0095.2
ENCFF664XPS             SPI1               8.045      MA0080.5
ENCFF495MHZ            NFE2                7.095      MA0841.1
ENCFF465JKF             MYC                9.53       MA0147.3
ENCFF002CVW             FOS             6.25333333    MA0476.1
ENCFF497XOD             MITF               8.69       MA0620.3
ENCFF178MOP           SMAD5                7.55       MA1557.1
ENCFF182QDI            EGR1                13.28      MA0162.4
ENCFF958KNK            ATF3                14.1       MA0605.2
ENCFF010UHD            RFX1               12.235      MA0509.2
ENCFF948TXN            CREM                12.2       MA0609.2
ENCFF973LDQ           ZNF24                6.835      MA1124.1
ENCFF544XKC          PKNOX1                23.15      MA0782.2
ENCFF334FMW            USF1                10.75      MA0093.3
ENCFF895QLA            REST               11.385      MA0138.2
 ENCFF710IEF           ATF4                12.3       MA0833.2
                                82


              Table A.1 (cont’d)
ENCFF917COW     E2F6             15.925   MA0471.2
ENCFF833FCO   GABPA               15.25   MA0062.3
ENCFF422NGZ     MAX               24.15   MA0058.3
ENCFF564WEB     ELF1              16.3    MA0473.3
ENCFF492GXZ   FOXK2               5.31    MA1103.2
 ENCFF706ISJ ZBTB7A               10.21   MA0750.2
ENCFF059ONJ     MNT                9.9    MA0825.1
ENCFF968JVX    IKZF1              15.85   MA1508.1
ENCFF213EPU   ESRRA               6.62    MA0592.3
ENCFF493ABN    NRF1               38.9    MA0506.1
ENCFF106DAY     E2F1             10.375   MA0024.3
ENCFF502KHR     ELF4              5.43    MA0641.1
ENCFF670ZCR     MGA               6.805   MA0801.1
ENCFF664ZGR   NR2C1               5.91    MA1535.1
ENCFF558DSF  HMBOX1            7.31333333 MA0895.1
ENCFF209MQX   TEAD4               9.16    MA0809.2
ENCFF370ENX     NFIC             10.585   MA0161.2
ENCFF255EOB   NR2F2               5.02    MA1111.1
ENCFF613RNG   MEIS2               10.8    MA0774.1
ENCFF968KBN     ATF2             12.265   MA1632.1
ENCFF408FQC  ZBTB33               12.35   MA0527.1
 ENCFF175IIE  NR2F1               10.36   MA0017.2
                      83


                         Table A.1 (cont’d)
 ENCFF868QLL              ATF7              10.555       MA0834.1
 ENCFF308IXJ              MAFF               5.19        MA0495.3
 ENCFF337DKJ              JUND               11.4        MA0491.2
 ENCFF114IWY            ZNF143              6.595        MA0088.2
 ENCFF113PMT              NFYB              13.65        MA0502.2
 ENCFF179NDS           BHLHE40                5.9        MA0464.2
 ENCFF429XKT             CEBPB              10.855       MA0466.2
             Table A.2. List of TFs in GM12878 cell line
     ID                    TF                FRiP          Motif
ENCFF002DAJ              CTCF            28.9333333      MA0139.1
ENCFF967ACD               YY1                14.6        MA0095.2
ENCFF911BYP             MEF2B               9.845        MA0660.1
ENCFF405NFV             TBX21               11.21        MA0690.1
ENCFF810CEL             NR2F1                8.62        MA0017.2
ENCFF141SAU            ZNF143                12.4        MA0088.2
ENCFF095GMM           BHLHE40                19.7        MA0464.2
ENCFF726VEK              ATF7               11.85        MA0834.1
ENCFF628QJU              NFIC               10.41        MA0161.2
ENCFF339KUO              PAX5                6.28        MA0014.3
                                  84


             Table A.2 (cont’d)
ENCFF456FQB  RELB               10.385 MA1117.1
ENCFF609KOY  ATF2               7.045  MA1632.1
ENCFF593FBF  BATF               11.54  MA1634.1
ENCFF394DLH  EBF1                14.1  MA0154.4
ENCFF807AKG  ELF1                8.64  MA0473.3
ENCFF939TZS  JUNB               7.935  MA0490.2
ENCFF328QLX MEF2A                7.16  MA0052.4
ENCFF467NRS  NFYB                8.26  MA0502.2
ENCFF248QFF RUNX3                21.1  MA0684.2
                     85


Figure A.1. TF-wave predicts TF binding sites accurately and outperforms
Wellington in GM12878. TF-wave and Wellington were evaluated on the same datasets
using the averaged performance of 5-fold cross-validation. (a) Performance comparison
in GM12878 for TF-Wave and Wellington using auPR. The x-axis and y-axis are auPR of
applying Wellington and TF-wave to predict TFBS on the same data. Each point
represents a TF, points above the diagonal line indicate TF-wave performs better than
Wellington. (f) Performance comparison in GM12878 for TF-Wave and Wellington using
auROC. The x-axis and y-axis are auROC of applying Wellington and TF-wave to predict
TFBS on the same data. Each point represents a TF, points above the diagonal line
indicate TF-wave performs better than Wellington.
                                          86


Figure A.2. Wavelet transform can boot the TFBS prediction performance in K562
and GM12878. TF-wave and GBT trained on the original DNase-Seq (baseline model)
signal were evaluated on the same datasets using the averaged performance of 5-fold
cross-validation. (a, c) Performance comparison in K562 GM12878 for TF-Wave and
baseline model using auPR. The x-axis and y-axis are auPR of applying baseline model
and TF-wave to predict TFBSs on the same data. Each point represents a TF, points
above the diagonal line indicate TF-wave performs better than baseline model. (b, d)
Performance comparison in K562 and GM12878 for TF-Wave and baseline model using
auROC. The x-axis and y-axis are auROC of applying Wellington and TF-wave to predict
TFBS on the same data. Each point represents a TF, points above the diagonal line
indicate TF-wave performs better than baseline model.
                                         87


Figure A.3. Different TFs have different consensus footprint shapes. Aggregated
DNase-Seq read-depth for top 5% predictions for NRF2, ATF7, YY1, SPI1, CREM, and
FOXK2. Each TF has a unique footprint shape. Palindrome motifs such as CREM and
ATF7 leave a symmetric footprint. NRF1 is similar to palindrome motifs, and also leaves
a symmetric footprint. Motifs that are not palindrome such as SPI1, YY1, FOXK2 leaves
asymmetric footprints.
                                            88


                                  APPENDIX B
                  SUPPLEMENTARY FIGURES FOR CHAPTER 3
Figure B.1. Ten putative AHR binding sites and their predicted binding probabilities
                                        89


                       APPENDIX C
   SUPPLEMENTARY MATERIAL FOR CHAPTER 4
      Table C. 1. List of TFs used in K562 cell line
     ID                                          TF
ENCFF002DBD                                   CTCF
ENCFF926FUM                                     YY1
ENCFF254YOX                                  NR2C2
ENCFF002CXU                                  RAD21
ENCFF664XPS                                    SPI1
ENCFF911VSD                                   NFE2
ENCFF465JKF                                    MYC
ENCFF002CVW                                    FOS
ENCFF947KPB                                 POLR2A
ENCFF497XOD                                    MITF
ENCFF178MOP                                  SMAD5
ENCFF076IFG                                    ZEB2
ENCFF701MXF                                  TFDP1
ENCFF116DIO                                  ZNF148
ENCFF182QDI                                   EGR1
ENCFF958KNK                                    ATF3
ENCFF490VVG                                  HDAC2
ENCFF644WLI                                   RNF2
                            90


             Table C.1 (cont’d)
ENCFF948TXN                       CREM
ENCFF232KAH                        ATF1
ENCFF973LDQ                       ZNF24
ENCFF511QHY                      ZMYM3
ENCFF284LRP                        TAL1
ENCFF440JJW                       MLLT1
ENCFF822FWM                        TAF1
ENCFF122QSN                     HNRNPLL
ENCFF544XKC                      PKNOX1
 ENCFF294HEI                     PRDM10
ENCFF334FMW                        USF1
ENCFF895QLA                       REST
ENCFF517KRT                      GABPB1
 ENCFF710IEF                       ATF4
ENCFF622RBW                     SMARCE1
 ENCFF417LEJ                     ZBTB40
ENCFF917COW                        E2F6
ENCFF788EBS                      ZNF316
ENCFF320THV                       NR2C1
ENCFF685KAG                      GABPA
ENCFF493TZM                       TCF12
ENCFF186QUP                       VEZF1
                    91


             Table C.1 (cont’d)
ENCFF248AOD                     ZNF766
ENCFF152VMJ                      CTBP1
ENCFF422NGZ                       MAX
ENCFF823SYE                       DPF2
ENCFF692PVV                      FOSL1
ENCFF076MSV                       AFF1
ENCFF273TYA                     ZNF592
ENCFF168HAG                       PBX2
ENCFF273EYJ                     POLR2G
ENCFF564WEB                       ELF1
ENCFF366UBB                     NCOR1
ENCFF492GXZ                      FOXK2
 ENCFF872JJJ                    POLR2B
 ENCFF706ISJ                    ZBTB7A
ENCFF462JFW                     ZBTB11
ENCFF996ZGL                     C11orf30
ENCFF417TXD                       USF2
ENCFF738WCE                     KDM1A
ENCFF019ALY                      HDAC1
ENCFF059ONJ                       MNT
ENCFF968JVX                      IKZF1
ENCFF801YZV                     POLR2H
                    92


            Table C.1 (cont’d)
ENCFF429XKT                     CEBPB
ENCFF388YOB                     ZNF395
ENCFF213EPU                     ESRRA
ENCFF606CCB                     CEBPG
ENCFF493ABN                      NRF1
ENCFF120WOF                     ZNF639
ENCFF096XMD                     RAD51
ENCFF314ULQ                    L3MBTL2
ENCFF106DAY                       E2F1
ENCFF883TOD                    SMARCA4
ENCFF602AXP                     ZNF589
ENCFF502KHR                       ELF4
ENCFF157UUF                     NFRKB
ENCFF993GXU                    CBFA2T3
ENCFF642BNC                    CBFA2T2
ENCFF356ASJ                       E2F5
ENCFF670ZCR                       MGA
ENCFF083YCQ                       E4F1
ENCFF881QBT                       PML
ENCFF558DSF                    HMBOX1
ENCFF209MQX                     TEAD4
ENCFF521CRG                      ZFP36
                   93


             Table C.1 (cont’d)
ENCFF241TBP                      SOX6
ENCFF680WBN                     RBM25
ENCFF370ENX                       NFIC
ENCFF057ZUY                      BCOR
ENCFF925ANU                      EP400
ENCFF255EOB                      NR2F2
ENCFF927JBT                      MAFG
ENCFF225MPC                     ARID1B
ENCFF613RNG                      MEIS2
ENCFF968KBN                       ATF2
ENCFF408FQC                     ZBTB33
ENCFF484HCG                     SUPT5H
ENCFF646VQW                     TRIM24
ENCFF350YXB                      MTA3
ENCFF433RKB                        ZFX
 ENCFF175IIE                     NR2F1
ENCFF868QLL                       ATF7
ENCFF002CVU                     CCNT2
ENCFF626KTJ                       PHF8
ENCFF002CXW                     POLR3A
ENCFF379MPS                      RBBP5
ENCFF474NLG                     HCFC1
                    94


                                  Table C.1 (cont’d)
            ENCFF671DOL                                     UBTF
            ENCFF549TYR                                     EP300
             ENCFF308IXJ                                    MAFF
            ENCFF337DKJ                                     JUND
            ENCFF114IWY                                    ZNF143
            ENCFF113PMT                                     NFYB
            ENCFF179NDS                                   BHLHE40
            ENCFF041YQC                                     SMC3
            ENCFF380FJL                                      TBP
Figure C.1. A,B Examples of first-order PPI mediated long-range enhancer
promoter interactions in K562 cell line. A, RAD21 binds on TTC31 promoter, CTCF
binds on enhancer, the PPI between RAD21 and CTCF regulates the long-range
enhancer-promoter interactions. B, YY1 binds on the E2F2 promoter, CTCF binds on the
                                          95


                                   Figure C.1 (cont’d)
enhancer, the PPI between YY1 and CTCF regulates this long-range enhancer-promoter
interactions. C, matrices generation within each TAD, clustering of TAD based on the TF
binding on enhancers and promoters within each TAD. A linear model was fitted between
the enhancer-promoter interaction frequency and genomic distance.
Figure C.2. Prioritized first-order PPIs can predict long-range enhancer-promoters
accurately. A, RAD21 binds on S100A2 promoter, CTCF binds on enhancer, the
prioritized PPI between RAD21 and CTCF regulate this long-range enhancer-promoter
interactions. B, RAD21 binds on HNRNPUL1 promoter, RAD21 binds on enhancer, the
prioritized PPI between EP300 and RAD21 regulate this long-range enhancer-promoter
interactions.
                                            96


Figure C.3. Performance comparison between first-order PPI model and second-
order PPI model. A. model comparison on the dataset used for second-order PPI model.
B. model comparison on balanced-distance controlled dataset used for second-order PPI
model
                                        97


Figure C.4. Performance comparison in K562 cells on balanced data. EP3ICO,
ProTECT, TargetFinder, and IM-PET are applied on the same input datasets and are
evaluated based on the averaged performance of 5-fold cross-validation on a balanced
data. As a baseline comparison, enhancer-gene activity correlations are also included in
the analysis. A ROC curves and PR curves of first-order PPIs performance in K562 cells.
B ROC curves PR curves of second-order PPIs performance in K562 cells.
                                         98


Figure C.5. Comparison with shuffled PPIs on balanced datasets. A. ROC, and PR
plots for the models based on the original first-order TF PPI features (red), the models
based on the shuffled first-order TF PPI features (green). B. ROC and PR plots for the
models based on the original second-order TF PPI features (purple), the models based
on the shuffled second-order TF PPI features (green).
                                           99


Figure C.6. Performance comparison in K562 cells on balanced data with distance
control. EP3ICO, ProTECT, TargetFinder, and IM-PET are applied on the same input
datasets and are evaluated based on the averaged performance of 5-fold cross-validation
on a balanced data with distance control. As a baseline comparison, enhancer-gene
activity correlations are also included in the analysis. A ROC curves and PR curves of
first-order PPIs performance in K562 cells. B ROC curves PR curves of second-order
PPIs performance in K562 cells.
                                            100


Figure C.7. Comparison with shuffled PPIs on balanced datasets with distance
control. A. ROC, and PR plots for the models based on the original first-order TF PPI
features (red), the models based on the shuffled first-order TF PPI features (green). B.
ROC and PR plots for the models based on the original second-order TF PPI features
(purple), the models based on the shuffled second-order TF PPI features (green).
                                          101


Figure C.8. Comparison with ProTECT, TargetFinder, IM-PET, and baseline model
based on a common dataset. A. ROC, and PR plots for the models on the common
dataset. B. ROC and PR plots for the models on the balanced dataset. C. ROC and PR
plots for the models on the balanced dataset with distance control.
                                          102