UNDERSTANDING THE GENETIC BASIS OF HUMAN DISEASES BY COMPUTATIONALLY MODELING THE LARGE-SCALE GENE REGULATORY NETWORKS

By

Hao Wang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computational Mathematics, Science and Engineering - Doctor of Philosophy

2022

ABSTRACT

Many severe diseases, including breast cancer and Alzheimer's disease, are known to be caused by genetic disorders of the human genome. Understanding the genetic basis of human diseases plays a vital role in personalized medicine and precision therapy. However, the pervasive spatial correlations between disease-associated SNPs have hindered the ability of traditional GWAS to discover causal SNPs and have obscured the underlying mechanisms of disease-associated SNPs. Recently, diverse biological datasets generated by large data consortia have provided a unique opportunity to fill the gap between genotypes and phenotypes using biological networks, which represent the complex interplay between genes, enhancers, and transcription factors (TFs) in 3D space. The comprehensive delineation of the regulatory landscape calls for highly scalable computational algorithms to reconstruct the 3D chromosome structures and mechanistically predict the enhancer-gene links.

In this dissertation, I first developed two algorithms, FLAMINGO and tFLAMINGO, to reconstruct high-resolution 3D chromosome structures. The algorithmic advancements of FLAMINGO and tFLAMINGO enable the reconstruction of 3D chromosome structures at an unprecedented resolution from highly sparse chromatin contact maps. I further developed two integrative algorithms, ComMUTE and ProTECT, to mechanistically predict long-range enhancer-gene links by modeling the TF profiles. In extensive evaluations, these two algorithms demonstrate superior performance over existing algorithms in predicting enhancer-gene links and decoding TF regulatory grammars. The successful application of ComMUTE and ProTECT in 127 cell types not only provides a rich resource of gene regulatory networks but also sheds light on the mechanistic understanding of QTLs, disease-associated genetic variants, and high-order chromatin interactions.

ACKNOWLEDGEMENTS

I want to give my deepest thanks to my advisor, Dr. Jianrong Wang, for his support, guidance, and encouragement during my Ph.D. program. He led me into the field of bioinformatics and guided me to be a successful Ph.D. student in everything from detailed algorithm design to high-level project design. I am very grateful for his focus on my development, as I can clearly feel the consideration behind every project we worked on together. I will never forget the discussions we had during the late nights and on the way home. It is a privilege to have an advisor who cares more about my development than about himself, and I deeply appreciate it. Beyond the academic development, his enthusiasm, dedication, and meticulous attitude toward work and life profoundly influenced me. I will remember the 'principles of doing things' that he taught me and continuously practice them no matter where I am. No words can fully express my appreciation, and I feel blessed to have learned from him for five years.

I am also grateful to my committee members, Dr. Jianliang Qian, Dr. Yuehua Cui, and Dr. Carlo Piermarocchi, for their insightful guidance during my Ph.D. career. Discussions with them brought me to several brand-new research areas, which significantly expanded my skill set.

I want to thank my colleagues and friends, Jiaxin Yang, Wenjie Qi, Dr. Binbin Huang, Hongjie Ke, Zhongjie Ji, and many others, for their support and companionship.
I want to give my deepest thanks to my parents, Haihong Liu and Jianhua Wang, for their support and understanding. Without their love and support, I could never have achieved so much on my own.

TABLE OF CONTENTS

LIST OF FIGURES
CHAPTER 1 INTRODUCTION
CHAPTER 2 RECONSTRUCT HIGH-RESOLUTION 3D GENOME STRUCTURES FOR DIVERSE CELL-TYPES USING FLAMINGO
  2.1 INTRODUCTION
  2.2 RESULTS
    2.2.1 FLAMINGO algorithm to reconstruct high-resolution 3D genome architectures
    2.2.2 Benchmark performance based on simulated structures
    2.2.3 Superior reconstruction accuracy across diverse cell-types
    2.2.4 Advanced scalability for large-scale chromosome conformations
    2.2.5 Analysis of multi-way interactions and QTLs by FLAMINGO beyond 2D Hi-C contact maps
    2.2.6 Geometrical property of chromatin structures
    2.2.7 Reference structure to interpret single-cell variabilities
    2.2.8 Robust performance to handle missing data in Hi-C datasets
    2.2.9 Cross cell-type prediction of 3D structures
    2.2.10 Boost the resolution of 3D structures from low-resolution Hi-C
  2.3 DISCUSSION
  2.4 METHODS
    2.4.1 Chromatin contact maps and epigenomics datasets
    2.4.2 Model framework of FLAMINGO
    2.4.3 Reconstruct 3D genome structures based on low-rank matrix completion
    2.4.4 Assemble predicted structures from different scales
    2.4.5 Benchmark performance using simulated genome structures
    2.4.6 Performance comparison based on experimental Hi-C data
    2.4.7 Analysis of multi-way chromatin interactions and QTLs
    2.4.8 Curvature analysis for predicted 3D genome structures
    2.4.9 Comparison with image-based single-cell structures
    2.4.10 Cross cell-type prediction of 3D genome structures
    2.4.11 Improve the resolution of 3D genome structures
CHAPTER 3 PREDICT HIGH-RESOLUTION SINGLE-CELL 3D CHROMOSOME STRUCTURES USING TFLAMINGO
  3.1 INTRODUCTION
  3.2 RESULTS
    3.2.1 tFLAMINGO reconstructs high-resolution single-cell 3D chromosome structures
    3.2.2 Performance validation based on the simulation analyses
    3.2.3 Performance comparison based on the STORM dataset
    3.2.4 Performance comparison based on the bulk tissue chromatin contact maps
    3.2.5 Performance comparison in imputing high-resolution chromatin contact maps
    3.2.6 Single-cell compartment and TAD analyses of tFLAMINGO
    3.2.7 Spatial analysis of gene activities in 3D space by tFLAMINGO
    3.2.8 Dynamic single-cell chromatin interaction landscape identified by tFLAMINGO
    3.2.9 Relationship between single-cell chromatin interactions and bulk Capture-C interactions
    3.2.10 Interpreting genetic variants based on single cell chromatin interactions
    3.2.11 Predicting functional gene regulatory links in single cells
    3.2.12 Analysis of single-cell multi-way interactions by tFLAMINGO
  3.3 DISCUSSION
  3.4 METHODS
    3.4.1 Model framework of tFLAMINGO
    3.4.2 Chromatin contact maps and data preprocessing
    3.4.3 Complete single-cell chromatin contact maps based on the low-rank tensor completion
    3.4.5 Reconstruct the single cell 3D chromatin structure based on low-rank matrix completion
    3.4.6 Performance evaluation based on simulated chromatin structures
    3.4.7 Performance comparison based on the STORM 3D genome imaging data
    3.4.8 Performance comparison in reconstructing 3D chromatin structures based on experimental single cell Hi-C data
    3.4.9 Performance comparison with Higashi in imputing high-resolution single cell chromatin contact maps
    3.4.10 Identification of the single-cell compartment A/B and TAD boundaries
    3.4.11 Differential methylated gene analysis across clusters of single cells
    3.4.12 Analyses of single-cell chromatin interactions and genetic variants
CHAPTER 4 DECIPHER THE COMBINATORIAL GRAMMAR OF TRANSCRIPTION FACTORS IN LONG-RANGE MULTI-ENHANCER REGULATION
  4.1 INTRODUCTION
  4.2 RESULTS
    4.2.1 ComMUTE predicts long-range multi-enhancer regulations based on TF regulatory grammars
    4.2.2 Robust performance in predicting enhancer-gene links
    4.2.3 Integration of the TF regulatory grammar boost the predictive accuracy
    4.2.4 ComMUTE captures direct enhancer-enhancer interactions
    4.2.5 Superior accuracy in predicting multi-enhancer regulations
    4.2.6 ComMUTE decodes the TF regulatory grammars of gene expression
    4.2.7 Predicted enhancer-gene links are enriched with QTLs and GWAS SNPs
    4.2.8 Multi-enhancer regulations unravel the regulatory basis of epistasis-QTLs
  4.3 DISCUSSION
CHAPTER 5 PREDICT LONG-RANGE ENHANCER REGULATION BASED ON PROTEIN-PROTEIN INTERACTIONS BETWEEN TRANSCRIPTION FACTORS
  5.1 INTRODUCTION
  5.2 MATERIALS AND METHODS
    5.2.1 Chromatin contact maps and multi-omics datasets
    5.2.2 Generation of the training dataset and the matrix of features
    5.2.3 Hierarchical TF community detection on the PPI network
    5.2.4 Predictive model of long-range enhancer-promoter interactions
    5.2.5 Feature selection
    5.2.6 Cross-validation and performance comparison
    5.2.7 Genome-wide prediction of long-range enhancer-promoter interactions
    5.2.8 Feature interpretation for mechanistic insights
    5.2.9 Pathway enrichment analysis for genes regulated by specific TF PPIs
    5.2.10 cis-eQTL enrichment analysis for predicted long-range enhancer-promoter interactions
    5.2.11 cis-eQTL enrichment around TF binding sites
    5.2.12 trans-eQTL enrichment analysis for enhancer-mediated TF-gene pairs
  5.3 RESULTS
    5.3.1 Long-range enhancer-promoter interaction prediction based on PPIs among TFs
    5.3.2 Boosted performance based on features of TF PPIs
    5.3.3 Genome-wide prediction of long-range enhancer-promoter interactions
    5.3.4 Important protein-protein interactions regulating chromatin interactions
    5.3.5 Genes regulated by different TF PPIs are enriched in distinct pathways
    5.3.6 Predicted enhancer-promoter interactions are enriched with cis-eQTLs
    5.3.7 cis-eQTLs are enriched in binding sites of prioritized TFs
    5.3.8 trans-eQTLs are enriched in enhancer-mediated TF-gene pairs
  5.4 DISCUSSION
CHAPTER 6 DISCUSSION
  6.1 SUMMARY
  6.2 FUTURE DIRECTION
APPENDICES
  APPENDIX A SUPPLEMENTARY FIGURES FOR CHAPTER 2
  APPENDIX B SUPPLEMENTARY FIGURES FOR CHAPTER 3
  APPENDIX C SUPPLEMENTARY FIGURES FOR CHAPTER 4
  APPENDIX D SUPPLEMENTARY FIGURES FOR CHAPTER 5
BIBLIOGRAPHY

LIST OF FIGURES

Figure 2.1 Overview of FLAMINGO.
Figure 2.2 Simulation analyses of FLAMINGO.
Figure 2.3 Superior accuracy and scalability of FLAMINGO.
Figure 2.4 Interpretation of multi-way chromatin interactions and QTLs.
Figure 2.5 Geometrical signature of predicted chromatin conformations.
Figure 2.6 Robust performance of FLAMINGO under different missing rates.
Figure 2.7 Cross cell-type predictions by iFLAMINGO.
Figure 2.8 iFLAMINGO improves the resolution of predicted 3D structures.
Figure 3.1 Overview of tFLAMINGO.
Figure 3.2 Simulation analyses of tFLAMINGO.
Figure 3.3 Performance validation based on the STORM dataset.
Figure 3.4 Systematic performance comparison in reconstructing single-cell chromosome structures.
Figure 3.5 Systematic performance comparison in imputing high-resolution single-cell chromatin contact maps.
Figure 3.6 Compartment analyses and TAD analyses in single cells.
Figure 3.7 Dynamic single-cell 3D chromosome structures reflects distinct methylation landscape of genes.
Figure 3.8 Analyses of single-cell chromatin interactions.
Figure 3.9 Identification of the single-cell multi-way chromatin interactions based on the predicted chromosome structures.
Figure 4.1 Bayesian framework of ComMUTE in predicting multi-enhancer regulations.
Figure 4.2 Performance comparisons with JEME across 35 gold-standards support the superior performance of ComMUTE.
Figure 4.3 Integration of TF modules improve the predictive accuracy.
Figure 4.4 Direct functional and physical interactions between predicted co-regulating enhancers.
Figure 4.5 Validation of the predicted multi-enhancer regulations based on SPRITE.
Figure 4.6 Accurate predictions of cooperative TF modules.
Figure 4.7 Predicted enhancer-gene interactions are enriched with eQTLs.
Figure 5.1 Schema of ProTECT in predicting PPI mediated enhancer-gene links.
Figure 5.2 Performance comparison in GM12878 and K562.
Figure 5.3 TF PPI features provide additional information beyond TF bindings and activity-based features.
Figure 5.4 Genome-wide prediction of enhancer-promoter interactions reveals functional roles of TF PPIs in gene regulation.
Figure 5.5 Predicted enhancer-promoter interactions are enriched with cis-QTLs and trans-eQTLs.
Figure A.1 5kb-resolution 3D structures for 23 chromosomes predicted by FLAMINGO.
Figure A.2 1kb-resolution 3D structures for 23 chromosomes predicted by FLAMINGO.
Figure A.3 Overview of the assembly algorithm of FLAMINGO.
Figure A.4 High similarity of predicted structures using different conversion factors.
Figure A.5 Convergence and model performance under different down-sampling rates based on simulated structures.
Figure A.6 Model performance under different number of loci and down sampling rates based on simulated structures.
Figure A.7 Validation of the assembly algorithm based on simulations.
Figure A.8 Performance validation using low-resolution Hi-C data and FISH data.
Figure A.9 Predicted 3D structures of chr1 by FLAMINGO in six cell-types at 5-kb resolution.
Figure A.10 The observed long-range chromatin interactions are supported by TF ChIP-seq and Capture-C interactions.
Figure A.11 Performance comparison in GM12878 based on off-diagonal distances.
Figure A.12 Performance comparison in the additional five cell-types.
Figure A.13 Example of 3D chromatin loops reconstructed by FLAMINGO.
Figure A.14 High scalability of FLAMINGO over existing algorithms.
Figure A.15 FLAMINGO leads to the discovery of multi-way chromatin interactions.
Figure A.16 FLAMINGO provides structural basis of long-range QTLs.
Figure A.17 Comparison between the predicted structures with single-cell chromosome structures.
Figure A.18 FLAMINGO robustly reconstructs the high-resolution 3D structures using a small fraction of observed Hi-C data.
Figure A.19 The imputation of 3D distances based on 1D epigenomics data in iFLAMINGO.
Figure A.20 Performance of cross cell-type predictions using iFLAMINGO.
Figure A.21 Convergence and parameter tuning of FLAMINGO.
Figure B.1 3D structures of chromosome 19 in 10kb-resolution for 351 mESC cells predicted by tFLAMINGO based on snm3C data.
Figure B.2 3D structures of chromosome 19 in 10kb-resolution for 7 mESC cells predicted by tFLAMINGO based on scHi-C data.
Figure B.3 3D structures of chromosome 21 in 10kb-resolution for 16 K562 cells predicted by tFLAMINGO based on scHi-C data.
Figure B.4 Differential linear relationships between single-cell 3C datasets and bulk Hi-C datasets.
Figure B.5 Schema of the band wise log-regression method to rescale the single-cell interaction frequencies.
Figure B.6 3D Validation of the transformed single-cell interaction frequencies based on three additional datasets.
Figure B.7 Robust performance of tFLAMINGO under different settings based on simulations.
Figure B.8 Accurate reconstruction of a simulated structure with 3000 loci under the 0.5% down sampling rate.
Figure B.9 Systematic performance evaluation based on simulations.
Figure B.10 Convergence of tFLAMINGO.
Figure B.11 tFLAMINGO identifies underlying structural variations.
Figure B.12 Single-cell compartment and TAD analyses in GM12878.
Figure B.13 Justification of the optimal number of clusters.
Figure B.14 Pathway enrichments of differential methylated genes.
Figure B.15 Dynamic 3D structures across 15 single cells.
Figure B.16 Simulation-based methods fail to handle long-range interactions.
Figure B.17 Simulation analyses confirms the limitation of simulation-based models.
Figure C.1 Predictive power of the features used in ComMUTE.
Figure C.2 Parameter selection based on the optimal AUROC.
Figure C.3 Summary statistics of the predicted enhancer-gene links.
Figure C.4 Summary of the input epigenomic datasets.
Figure C.5 Convergence of ComMUTE.
Figure C.6 Performance comparison with JEME based on the enrichment analyses.
Figure C.7 Performance comparison with existing methods based on the enrichment of experimental chromatin interactions.
Figure C.8 Cross-cell-type comparison with TargetFinder.
Figure C.9 Evaluating the accuracy of predicted enhancer-gene links based on different epigenomic datasets.
Figure C.10 Example of predicted multi-enhancer regulations.
Figure C.11 Example of predicted multi-enhancer regulations.
Figure C.12 Co-binding analysis based on TF motif occurrence.
Figure C.13 Example of direct chromatin interactions between co-regulating enhancers.
Figure C.14 ComMUTE discovers clear TF grammars for gene regulations.
Figure C.15 Convergence of ComMUTE.
Figure D.1 Summary of training dataset generation and confounding factor controls.
Figure D.2 Predictive power of features are supported by the differential distributions of features.
Figure D.3 Advanced feature dimension reduction is needed due to the risk of overfitting.
Figure D.4 Hierarchical network-community detection based on the PPI network to construct model-level TF PPI features.
Figure D.5 PPI community detection based on the MCL.
Figure D.6 Enrichment analysis and PPI support analysis for TF module pairs.
Figure D.7 Model performance as a function of the number of decision trees.
Figure D.8 Performance of ProTECT using different epigenomic signals.
Figure D.9 Performance comparison based on the imbalanced training data and the genomic bin-split cross-validation.
Figure D.10 Performance comparison using five Hi-ChIP datasets.
Figure D.11 Performance comparison using four different ChIA-PET datasets.
Figure D.12 Performance comparison based on different combinations of Hi-C data and TF ChIP-seq data.
Figure D.13 Summary of genome-wide predictions by ProTECT in GM12878 and K562.
Figure D.14 Validation of ProTECT predicted enhancer-gene links with enhancer degree greater than one.
Figure D.15 Performance comparison with the ABC model in the whole genome-wide.
Figure D.16 Comparing the TF PPI abundance score in the Hi-C supported enhancer-gene links and the ProTECT predictions.
Figure D.17 Examples of prioritized module-level TF PPIs features.
Figure D.18 Identification of the directions of TF PPI features.
Figure D.19 Differential pathway enrichments of genes regulated by different module-level TF PPIs based on the ProTECT predictions.
Figure D.20 QTL enrichment analysis in K562.
Figure D.21 ProTECT predicts enhancer-gene links based on the imputed TF binding sites.

CHAPTER 1 INTRODUCTION

Genetic disorders have been shown to be closely related to individual disease risk, and a change in a single nucleotide of the human genome may cause disease. Therefore, understanding the relationships between genetic variants and diseases plays a central role in designing individualized clinical therapy. Over the past 20 years, genome-wide association studies (GWAS) have been widely applied and have identified millions of disease-associated single-nucleotide polymorphisms (SNPs).
Traditionally, researchers focused mainly on SNPs within the coding regions of genes and used the genes containing the SNPs as mediators to explain the SNP-disease associations. However, coding regions account for only about 2% of the human genome, and the mechanisms of SNPs within the non-coding regions remain unclear.

Recently, the development of Next Generation Sequencing (NGS) techniques has been the driving force in studies of functional genomics, generating large-scale, genome-wide epigenomic datasets that measure gene expression, chromatin accessibility, and transcription factor (TF) binding sites across diverse cell types. Taking advantage of these big biological data, over a million enhancers have been discovered within the non-coding regions. Enhancers can be bound by transcription factors and regulate the expression of both local and distal genes through 3D chromatin loops. The complex interplay between genes, enhancers, and TFs is summarized in the gene regulatory network, where nodes represent the biological factors and edges represent the regulatory links. The gene regulatory network provides a clear roadmap of how genetic variants contribute to diseases by disturbing genes, enhancers, and TFs.

Predicting the interactions between genes and enhancers is challenging. Although one can assign the nearest gene in 1D space as the target of an enhancer, it has been shown that enhancers can be brought into proximity with distal target genes in 3D space through long-range chromatin loops and thereby regulate gene expression. Experimentally, chromosome conformation capture techniques, including Hi-C and Capture-C, have been used to profile the chromatin contacts between DNA fragments. However, the experimental data can only resolve chromatin interactions at low resolution. Furthermore, the experimental data mainly capture short-range chromatin interactions (<500 kb) and have low power for detecting long-range chromatin interactions.
Therefore, a robust computational model for predicting the long-range interactions between enhancers and genes is greatly needed. To address this problem, we developed a series of machine learning models to predict and utilize 3D chromosome structures and enhancer-gene interactions: FLAMINGO, tFLAMINGO, ComMUTE, ProTECT, and APRIL. In Chapters 2 and 3, we introduce FLAMINGO and tFLAMINGO, which reconstruct 3D chromosome structures based on Hi-C contact maps. FLAMINGO is a highly scalable and accurate algorithm for predicting high-resolution 3D chromosome structures from Hi-C contact maps. Using FLAMINGO, we reconstructed the 3D structures of all 23 human chromosomes at 5kb resolution in six cell types and at 1kb resolution in GM12878. tFLAMINGO further extends the reconstruction of 3D chromosome structures to single cells using low-rank tensor completion. The application of tFLAMINGO to four single-cell chromatin interaction datasets provides a unique opportunity to study the dynamic 3D chromosome structures across single cells and their relationship with gene regulation. In Chapters 4 and 5, we introduce ComMUTE and ProTECT, which predict functional regulatory interactions between enhancers and genes. ComMUTE models the joint regulatory effect of multiple enhancers and TFs using a graphical statistical model. We applied ComMUTE to 127 cell types/tissues to predict enhancer-gene links and provided a mechanistic explanation of high-order chromatin interactions and epistasis QTLs. ProTECT predicts the enhancer-gene links by modeling the protein-protein interactions (PPIs) between TFs. The elucidation of TF-mediated enhancer-gene links provides new insights into understanding trans-QTLs.

CHAPTER 2 RECONSTRUCT HIGH-RESOLUTION 3D GENOME STRUCTURES FOR DIVERSE CELL-TYPES USING FLAMINGO

A modified version of this chapter was previously published (Wang H. et al., 2022): Wang H., Yang J., Zhang Y., Qian J. and Wang J.
(2022) Reconstruct high-resolution 3D genome structures for diverse cell-types using FLAMINGO. Nature Communications.

2.1 INTRODUCTION

The three-dimensional (3D) architecture of genomes plays pivotal roles in DNA replication, genome stability and tissue differentiation [1-3]. Quantitative characterization of spatial chromosome conformations is crucial for deciphering the complex systems of spatially coordinated transcriptional and epigenetic activities [4-6], leading to the understanding of gene regulation mechanisms. Genome-wide high-throughput chromosome conformation capture techniques such as Hi-C [7, 8] have been one of the driving forces in studies of 3D genome structures. The Hi-C datasets profiled from different cell-types and species [7, 9-13] have revealed structural components of genome organization [7, 10, 14], such as chromatin loops, topologically associated domains (TADs), and chromatin compartments. Although these findings have provided powerful insights into the governing rules of chromosome folding at large scales (~100kb-1Mb), such as the loop extrusion model [15, 16], it is still computationally difficult to accurately reconstruct high-resolution spatial conformations, such as at ~5kb resolution, for all chromosomes in large genomes.

Since the collection of Hi-C experiments is growing, the resulting massive Hi-C data call for efficient computational algorithms for modeling 3D genomes. Previous algorithms for 3D reconstruction using Hi-C data have been able to predict spatial distances mainly at low resolutions or within specific genomic segments [17]. Typically, based on experimentally estimated conversion functions [14], the observed Hi-C contact frequency is converted into spatial distances, which we term observed Hi-C distances in this chapter.
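To make the contact-to-distance conversion concrete, the following is a minimal sketch that assumes a power-law form d = f^(-alpha). The function name `contacts_to_distances` and the value alpha = 0.25 are illustrative assumptions rather than the conversion fitted in the Methods; pairs without observed contacts are kept as missing values.

```python
import numpy as np

def contacts_to_distances(contacts, alpha=0.25):
    """Convert contact frequencies to distances with d = f ** (-alpha).

    alpha is a hypothetical illustrative value, not a fitted conversion factor.
    Pairs with no observed contact are returned as NaN (missing data).
    """
    contacts = np.asarray(contacts, dtype=float)
    distances = np.full(contacts.shape, np.nan)
    measured = np.isfinite(contacts) & (contacts > 0)
    distances[measured] = contacts[measured] ** (-alpha)
    np.fill_diagonal(distances, 0.0)  # a locus is at distance 0 from itself
    return distances

# Toy 3-locus contact map: loci 0 and 2 have no observed contact.
toy_contacts = np.array([[0.0, 16.0, 0.0],
                         [16.0, 0.0, 1.0],
                         [0.0, 1.0, 0.0]])
toy_distances = contacts_to_distances(toy_contacts)
```

Higher contact frequencies map to shorter distances, and the unmeasured pair stays missing instead of being assigned an arbitrary large distance.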
In general, a consensus structure or an ensemble of structures is inferred by maximizing the similarity between predicted and observed Hi-C distances using optimization-based approaches (such as MDS-type or manifold learning techniques) [18-27] or probabilistic approaches (such as MCMC strategies) [28-32]. Representative state-of-the-art algorithms that have been shown to outperform other methods, along with some recent developments, include ShRec3D [33], GEM-FISH [34], Hierarchical3DGenome [35], RPR [36], SuperRec [37], ShNeigh [38] and PASTIS [28] (Methods, Supplementary Note 1). The accuracy of a predicted structure is mainly evaluated by its capability of recapitulating the measured pairwise distances between genomic loci from Hi-C. Spearman correlation is one of the widely used metrics to quantify the accuracy. However, four fundamental challenges still need addressing in developing an efficient algorithm: (1) high scalability to reconstruct high-resolution spatial configurations for all chromosomes from massive Hi-C datasets; (2) superior performance in handling large fractions of missing data, which is a common drawback of Hi-C experiments; (3) capability to make accurate cross cell-type structure predictions, since the vast majority of cell-types lack Hi-C data; and (4) capability to predict high-resolution structures from low-resolution Hi-C contact maps.

To address the above four challenges, we have developed a low-rank matrix completion based methodology for reconstructing 3D genome structures from Hi-C data. Low-rank matrix completion has been found to be a powerful modeling framework for 3D shape inferences in different scientific fields [39-41].
One of the unique advantages of such a modeling method is that it is able to explicitly leverage the low-rank property of a pairwise-distance matrix (rank≤5 for Euclidean distance matrix, see Methods) [42] in an objective function for optimization, and such a low-rank property has not been explicitly utilized in previous approaches, such as multidimensional scaling based methods. Efficient incorporation of the low-rank constraint into the modeling process allows fast structure reconstruction from just a small subset of Hi-C data, making the algorithm scalable for high-resolution structure predictions for large chromosomes with high fractions of missing data. Our efforts have led us to create a Fast Low-rAnk Matrix completion algorithm for reconstructINg high-resolution 3D Genome Organizations from Hi-C data, FLAMINGO (https://github.com/wangjr03/FLAMINGO), which has been implemented to generate both 5kb- and 1kb-resolution 3D chromosomal structures for the human genome. Based on extensive performance evaluations using data from both simulated structures and experimental Hi-C datasets from the human genome, the high-resolution chromosome structures generated by FLAMINGO demonstrate substantially improved accuracy, compared with other state-of-the-art methods. The predicted high-resolution spatial distances in 3D space are further justified by orthogonal experiments (such as ChIA-PET [43], Capture-C [44, 45] and SPRITE [46]), providing biological insights into long-range chromatin interactions in gene regulation. Beyond 2D contact maps, the predicted 3D structures by FLAMINGO can help to identify higher-order multi-way chromatin interactions, interpret potential mechanisms of genetic QTLs, characterize the geometrical patterns of chromatin folding, and facilitate the understanding of structural variations. Moreover, even using only 10% of down-sampled Hi-C contacts, FLAMINGO still achieves higher accuracy than other methods, demonstrating its superior capability of handling missing data in Hi-C. In addition, an integrative version of our algorithm, iFLAMINGO, is built to further combine 1D epigenomics data, such as DNase-seq signals, with Hi-C data, which allows us to make cross cell-type predictions of 3D genome architectures and boost the resolution of predictions. These algorithmic advantages will not only expand the coverage of cell-types for 3D genome modeling but also improve the information extraction from the fast-growing collection of experimental Hi-C data.

Figure 2.1 Overview of FLAMINGO. (a) Schematic figure of FLAMINGO. Biologically, the distance matrix (size 𝑁 by 𝑁) is induced by the 3D coordinate matrix of DNA fragments (size 𝑁 by 3), which guarantees that the rank of the distance matrix is no more than five (upper panel). The low-rank property suggests the potential of information compression (𝑁² entries to 5𝑁 entries), and enables FLAMINGO to efficiently reconstruct structures from incomplete distance matrices and to remain accurate with large portions of missing data. Equipped with high scalability, FLAMINGO can quickly predict the optimal coordinate matrix that reproduces the observed distances from Hi-C data (middle panel), leading to the high-resolution 3D genome structure and the completed distance matrix (lower panel). (b) Reconstructed 5kb-resolution structure of chromosome 1 in the human genome by FLAMINGO. Chromatin compartments (A: orange; B: blue) demonstrate polarized positioning in the predicted structure. A representative example of predicted loop structures is shown in the zoom-in view, where both anchors interact with each other (supported by ChIA-PET interactions) and are bound by CTCF and Rad21. Color gradients represent consecutive TADs within each type of compartment.
2.2 RESULTS

2.2.1 FLAMINGO algorithm to reconstruct high-resolution 3D genome architectures

Based on the ‘beads on a string’ polymer model [47], every chromosome is modeled as a chain of ‘beads’ consisting of DNA fragments or loci, and the pairwise distances between genomic loci are biologically induced from the Gram matrix of their 3D coordinates (Figure 2.1.a). To reconstruct the 3D spatial structure, the normalized chromatin contact maps from Hi-C experiments can be converted into an observed distance matrix as suggested by previous studies [10, 14], whose validity and robustness are justified by both computational model selections and empirical comparisons with image-based data (see Methods). The observed distance matrix typically contains large portions of unmeasured distances (namely, missing data), especially for high-resolution genomic loci (~5kb fragments) [10]. FLAMINGO predicts the optimal genome structure based on a low-rank matrix completion framework (Figure 2.1.a). The objective function contains three terms: (1) a term to impose the low-rank constraint on the Gram matrix of predicted 3D coordinates, since the 3D distance matrix has a rank at most five; (2) a term measuring the differences between predicted and observed distances, which is evaluated on the measured subset of pairs of loci; and (3) a penalty term penalizing unrealistic distances between adjacent DNA fragments. FLAMINGO uses the alternating-direction method of multipliers [48] to solve the optimization problem. At convergence, the optimal 3D structure that minimizes the objective function is identified, along with the completed pairwise distance matrix (Figure 2.1.a).

The key feature of FLAMINGO is to incorporate the low-rank constraint (rank≤5) of the 3D distance matrix of size 𝑁 × 𝑁 into the optimization process, where 𝑁 is the number of genomic loci, such as the number of 5kb DNA fragments.
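The three-term structure of the objective can be illustrated with a small sketch. This is not FLAMINGO's actual formulation (which is given in the Methods): the nuclear norm as a surrogate for rank, the use of squared distances, the hinge-style adjacency penalty, and the names `objective_terms` and `max_adjacent_sq` are all assumptions made here for illustration.

```python
import numpy as np

def gram_to_sq_distances(gram):
    """Squared distances implied by a Gram matrix: D_ij = G_ii + G_jj - 2 G_ij."""
    g = np.diag(gram)
    return g[:, None] + g[None, :] - 2.0 * gram

def objective_terms(gram, observed_sq_dist, measured_mask, max_adjacent_sq):
    """Return three illustrative terms: low-rank, data fit, adjacency penalty."""
    sq_dist = gram_to_sq_distances(gram)
    # (1) nuclear norm as a convex surrogate for rank (an assumption of this sketch)
    low_rank = np.linalg.norm(gram, ord="nuc")
    # (2) squared error evaluated only on the measured pairs of loci
    residual = sq_dist[measured_mask] - observed_sq_dist[measured_mask]
    data_fit = float(np.sum(residual ** 2))
    # (3) hinge penalty on neighboring fragments that are implausibly far apart
    adjacent = np.diag(sq_dist, k=1)
    adjacency = float(np.sum(np.maximum(adjacent - max_adjacent_sq, 0.0) ** 2))
    return low_rank, data_fit, adjacency

rng = np.random.default_rng(0)
coords = rng.normal(size=(50, 3))        # hypothetical 3D coordinates
gram = coords @ coords.T
exact = gram_to_sq_distances(gram)
mask = rng.random(exact.shape) < 0.1     # pretend only ~10% of pairs were measured
terms = objective_terms(gram, exact, mask, max_adjacent_sq=exact.max())
```

For a structure that reproduces every measured distance, the data-fit and adjacency terms vanish and only the low-rank term remains, which is the quantity a solver such as ADMM would trade off against the other two.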
Since the pairwise spatial distances are generated by the 3D coordinate matrix of genomic loci (rank≤3), the resulting symmetric Euclidean distance matrix has a rank at most 5 [42]. This is because the squared Euclidean distance matrix is a sum of three matrices: one being the Gram matrix of rank at most 3 and each of the other two being of rank at most 1 (see Methods). Thus, it has at most 5𝑁 intrinsic degrees of freedom, which, compared to the size 𝑁 × 𝑁 of the full matrix, is extremely small when 𝑁 is large. Therefore, in order to recover the entire distance matrix, we may just need the number of measurements of the distance matrix to be proportional to the intrinsic degrees of freedom. In fact, as long as the information of the underlying distance matrix is not concentrated on a few entries, each randomly selected measurement of pairwise distances will be equally informative, suggesting that the information can be substantially compressed [49] (Figure 2.1.a). Hence, by minimizing the rank of the inferred Gram matrix, low-rank matrix completion models [49] offer at least two benefits (Methods): (1) accurate 3D structures can be reconstructed from subsets of observed distances; and (2) fast matrix calculations can be carried out based on sparsity and low-rankness of the underlying matrices. Remarkably, both benefits are heavily needed for high-resolution structure predictions. By dividing the genome into high-resolution DNA fragments such as at 5kb-resolution, the size of the distance matrix becomes huge, and many of its entries have no data due to the limited sequencing depth of Hi-C experiments. Thus, FLAMINGO is able to build high-resolution 3D structures from the fast-growing collection of Hi-C datasets with decent scalability at computational complexity 𝑂(𝑁²) without demanding increased sequencing depths (Figure A.1 and Figure A.2).
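The rank argument can be checked numerically. The sketch below uses randomly generated coordinates (an assumption for illustration, not real chromosome data), builds the squared distance matrix as the sum of two rank-1 terms and the Gram matrix, and compares the number of matrix entries with the 5𝑁 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(1)
n_loci = 200
coords = rng.normal(size=(n_loci, 3))    # hypothetical 3D coordinates

gram = coords @ coords.T                 # Gram matrix, rank at most 3
sq_norms = np.diag(gram)
ones = np.ones(n_loci)
# squared distances = (rank-1 term) + (rank-1 term) - 2 * (Gram matrix)
sq_dist = np.outer(sq_norms, ones) + np.outer(ones, sq_norms) - 2.0 * gram

# An explicit tolerance keeps floating-point noise from inflating the rank.
rank_gram = np.linalg.matrix_rank(gram, tol=1e-8)
rank_sq_dist = np.linalg.matrix_rank(sq_dist, tol=1e-8)

full_entries = n_loci ** 2               # entries of the full distance matrix
free_parameters = 5 * n_loci             # upper bound on intrinsic degrees of freedom
print(rank_gram, rank_sq_dist, full_entries, free_parameters)
```

The gap between `full_entries` and `free_parameters` grows linearly with the number of loci, which is why a small random subset of measured distances can suffice at high resolution.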
To enable parallel computations, FLAMINGO also employs a hierarchical strategy by dividing each chromosome into 1Mb domain-level fragments that are further divided into 5kb DNA fragments, where we define a 1Mb fragment as a domain (Methods). The same low-rank matrix completion algorithm is applied on both the inter-domain hierarchy consisting of 1Mb fragments, which leads to a basic structural skeleton, and the intra-domain hierarchy of 5kb fragments, which results in intra-domain structures. Different from other methods that only align the endpoints of domain fragments [34] or whose refinement processes are dominated by intra-domain distances [35], an iterative rotation algorithm along the three spatial directions is developed to assemble intra-domain structures into the inter-domain skeleton, by aligning all measured off-diagonal distances so as to maximize the consistency with inter-domain 5kb-resolution Hi-C contacts (Figure A.3, Methods). At convergence, the iterative rotation algorithm leads to the full high-resolution structures for each chromosome.

FLAMINGO has been applied on the normalized Hi-C datasets from six human cell-types (GSE63525 [10]) to generate 3D structures for chromosomes 1-22 and X at 5kb-resolution (Figure A.3), which are the largest resources of reconstructed 3D structures for the human genome at high-resolution (https://github.com/wangjr03/FLAMINGO). For example, at 5kb-resolution, chromosome 1 contains 44,027 DNA fragments, excluding the centromere and telomere regions, and 94.5% of the entries of the observed distance matrix in GM12878 are missing data. The structure of chromosome 1 can be predicted quickly by FLAMINGO (Figure 2.1.b). The two types of chromatin compartments (A/B) are organized into separable positions in the predicted structure, consistent with the polarized architecture observed from the multiplexed FISH [14].
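The two-level hierarchy can be sketched as a simple index partition. The helper `partition_into_domains` is hypothetical and assumes contiguous 5kb bins; the actual implementation also has to handle excluded centromere and telomere regions, which this sketch ignores.

```python
def partition_into_domains(n_bins, bin_size=5_000, domain_size=1_000_000):
    """Split bin indices 0..n_bins-1 into consecutive half-open (start, end) domain ranges."""
    if domain_size % bin_size != 0:
        raise ValueError("domain_size must be a multiple of bin_size")
    bins_per_domain = domain_size // bin_size   # 200 bins of 5kb per 1Mb domain
    return [(start, min(start + bins_per_domain, n_bins))
            for start in range(0, n_bins, bins_per_domain)]

# A toy chromosome with 450 bins yields two full domains and one partial domain.
domains = partition_into_domains(450)
# domains == [(0, 200), (200, 400), (400, 450)]
```

Each range can then be reconstructed independently (and in parallel) before the intra-domain structures are assembled onto the domain-level skeleton.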
By zooming into the high-resolution structure, predicted loop structures are found corresponding to previously annotated TADs (Figure 2.1.b), where the pairs of CTCF-associated Hi-C loop anchors (CTCF-CTCF pairs) are predicted with significantly shorter spatial distances, compared to genomic-distance controlled pairs in two cases: 1) pairs between a CTCF-anchor and a random anchor with the same genomic separation (CTCF-random pairs, Figure A.1 boxplot, right, p-value=5.21×10⁻⁴, one-sided Wilcoxon test), and 2) pairs between random anchors with the same genomic separation (random-random pairs, Figure A.1 boxplot, left, p-value=2.78×10⁻⁵, one-sided Wilcoxon test).

Figure 2.2 Simulation analyses of FLAMINGO. (a) Given a benchmark structure, the distance matrix is down-sampled using different down-sampling rates and mixed with different levels of noise (Noise level 1: low-level; Noise level 2: high-level; see Methods). The incomplete noisy distance matrices are used as inputs for FLAMINGO. The reconstructed 3D structures are compared with the benchmark structure by calculating relative errors and correlations. (b) One example of the reconstructed structure by FLAMINGO (down-sampling rate=0.5, noise level 1, see Methods), which aligns with the benchmark structure almost identically (correlation=0.9999999, relative error=0.0037). (c-d) The performance of FLAMINGO (relative errors: the y-axis) under various down-sampling rates and noise levels, with respect to the accuracy of 3D distance matrices (c) and 3D coordinates of DNA fragments (d). Error bars represent the standard deviations of relative errors examined based on n=10 independently down-sampled distance matrices under each down-sampling rate. Data are presented as mean values +/- SD. Source data are provided as a Source Data file.
In addition, FLAMINGO has also generated 3D chromosomal structures at 1kb-resolution in GM12878 for all chromosomes (Figure A.2), which represent spatial reconstructions with the highest resolution to date. Moreover, FLAMINGO is robust to the choice of conversion factors for converting interaction frequency to distance, where the conversion factor is chosen within the range suggested by previous studies [14, 28] (Figure A.4).

2.2.2 Benchmark performance based on simulated structures

The performance of FLAMINGO was benchmarked on simulated structures. The distance matrix generated from the benchmark structure was randomly down-sampled and then mixed with noise (Figure 2.2.a, Methods). By applying FLAMINGO on the noisy incomplete distance matrices, the reconstructed 3D structures can be identified with fast convergence (Figure A.5), and they are in strong agreement with the original benchmark structures (relative error<0.03 and correlation>0.999) (Figure 2.2.b). In addition, the accuracy is robust against a wide range of down-sampling rates and different levels of noise (Figure 2.2.c and 2.2.d, Figure A.5, correlation>0.999), demonstrating that FLAMINGO is capable of handling missing data. The high accuracy is also found to be robust when FLAMINGO is applied to a series of simulated structures with different sizes (Figure A.6), suggesting the performance is not affected by the number of genomic loci along chromosomes. Furthermore, to validate the iterative assembly algorithm for organizing intra-domain structures, we partitioned the benchmark structure into different domains and then reconstructed the whole structure using the assembly algorithm. The assembled structures recapitulate the benchmark structure with high accuracy (relative error < 0.005, correlation > 0.999) and are independent of specific choices of domain partitions (Figure A.7.a and A.7.b).

Figure 2.3 Superior accuracy and scalability of FLAMINGO.
(a) The reconstructed structure of chromosome 21 at 5kb-resolution (left). The color gradient represents the genomic distance to the centromere (flanking centromere: yellow; flanking telomere: black). As an example, FLAMINGO recovers the chromatin loop formed by two TADs (chr21:31,375,000-32,985,000; middle), corresponding to inter-TAD hotspots in the reconstructed 3D distance matrix (right). (b) Robust performance of FLAMINGO across six cell-types at 5kb-resolution. Correlations between predicted and observed distance matrices are calculated for all 5kb fragments (all-points: blue) and fragments within domains (intra-domain: salmon). Error bars represent the standard deviations across n=23 chromosomes. (c) Performance comparison with the state-of-the-art algorithms based on Hi-C data in GM12878 at 5kb-resolution (all-points: left; intra-domain: right). Error bars represent the standard deviations across chromosomes with complete predictions (n=23 for FLAMINGO, Hierarchical3DGenome and SuperRec; n=10 for ShRec3D; n=9 for ShNeigh; n=6 for RPR). GEM-FISH does not have error bars because it can only complete the prediction for chromosome 21. (d) Orthogonal chromatin interaction data provide additional evaluation metrics: anchors of chromatin interactions are expected to have short 3D distances. (e-g) FLAMINGO predicts significantly shorter distances between anchors of chromatin interactions profiled by Capture-C (n=3,692) (e), ChIA-PET (n=214) (f) and SPRITE (n=871) (g). The statistical significance (***) is calculated by one-sided Mann-Whitney test: (e) p-value=9.4×10⁻²⁵ (orange) and p-value=7.6×10⁻²⁴ (blue); (f) p-value=2.8×10⁻²² (orange) and p-value=5.1×10⁻²⁰ (blue); (g) p-value=7.4×10⁻³¹ (orange) and p-value=6.5×10⁻⁴² (blue). The 3D structures of different methods are normalized for fair comparison. The center lines of boxplots show the median, the upper and lower box limits show the 25th and 75th percentiles respectively.
The whiskers extend up to 1.5 times the interquartile range away from the limits of the boxes. Outliers outside this range were removed from the figure. (h) One example of chromatin loops predicted by FLAMINGO for a significant ChIA-PET interaction (red links) linking the KCNA2 promoter (red) with a distal enhancer (orange). (i) Comparison of the computational scalability by measuring the runtime (y-axis) as a function of different numbers of genomic loci (x-axis). Source data are provided as a Source Data file.

2.2.3 Superior reconstruction accuracy across diverse cell-types

The performance of FLAMINGO on experimental Hi-C data in the human genome was then systematically evaluated and compared with the state-of-the-art methods. As demonstrated in Figure 2.1.b and Figure A.1, FLAMINGO is able to quickly reconstruct 3D chromosome structures at 5kb-resolution, which are qualitatively consistent with both large-scale chromatin properties, such as compartments and TADs, and small-scale structural details, such as chromatin loops and CTCF/cohesin binding. The predicted structural skeletons (1Mb-resolution) of chromosomes are strongly supported by results from both Hi-C (average correlation=0.95, Figure A.8.a) and FISH [14] (average correlation=0.80, Figure A.8.b), consistently higher than other methods. The reconstructed structures also vary across different cell-types, consistent with cell-type specific chromatin contact patterns from Hi-C (Figure A.9). Taking the predicted structure of chromosome 21 in GM12878 as an example, FLAMINGO reconstructs clear loop structures for TADs and predicts short 3D distances for inter-TAD chromatin contacts (Figure 2.3.a). Compared to the fuzzy input distance matrix converted from Hi-C (Figure A.10), the distance matrix derived from the predicted 3D structure shows substantially improved resolution (Figure 2.3.a), and the reconstructed long-range inter-TAD contacts are supported by experimental Capture-C interactions (Figure A.10).
To quantitatively evaluate the genome-wide accuracy at 5kb-resolution, the predicted 3D chromosome structures were evaluated according to their consistency with the observed distance matrices derived from Hi-C (Methods). Similarities between structures are quantified by Spearman correlations, which have been widely used as accuracy metrics in structure analysis. Notably, achieving high correlations at 5kb-resolution is a much harder problem than at low resolutions (e.g. 100kb- or 1Mb-resolution), because Hi-C signals at 5kb bins are much noisier and the number of high-resolution constraints in optimization is huge. Remarkably, the Spearman correlations between the predicted and observed 3D distances at 5kb-resolution, including both diagonal sub-matrices for intra-domain structures and off-diagonal sub-matrices for inter-domain structures, are robustly high across all six cell-types (Figure 2.3.b, left). The predicted structure in IMR90 shows the highest correlation (average correlation=0.603 across 23 chromosomes), followed by structures predicted in GM12878 and K562 (average correlations=0.512 and 0.525 respectively). The Spearman correlations based on off-diagonal points alone (i.e. inter-domain distances) also show similar levels (correlations>0.42), except for HUVEC (correlation=0.32). These results are significant achievements, considering the extensive noisy constraints imposed by the huge number of pairwise distances at 5kb-resolution. For example, in chromosome 1, there are 6.7 × 10⁷ pairs of 5kb fragments with measured Hi-C contacts as constraints. Furthermore, the predicted intra-domain structures demonstrate higher correlations across the six cell-types (Figure 2.3.b, right), especially in GM12878, K562 and IMR90 (average correlation>0.73).
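The evaluation metric described above can be sketched as a Spearman correlation restricted to measured pairs. The function name `spearman_on_measured` is hypothetical, the toy coordinates are random, and the dense pairwise computation is only practical for small examples; it is an illustration of the metric, not the evaluation pipeline used in this chapter.

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_on_measured(coords, observed_dist):
    """Spearman correlation between predicted and observed distances on measured pairs."""
    coords = np.asarray(coords, dtype=float)
    observed_dist = np.asarray(observed_dist, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    predicted = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(coords.shape[0], k=1)   # count each pair once
    pred, obs = predicted[upper], observed_dist[upper]
    measured = np.isfinite(obs)                     # skip missing (NaN) entries
    rho, _ = spearmanr(pred[measured], obs[measured])
    return float(rho)

rng = np.random.default_rng(2)
toy_coords = rng.normal(size=(30, 3))               # hypothetical structure
toy_obs = np.sqrt(((toy_coords[:, None] - toy_coords[None, :]) ** 2).sum(-1))
toy_obs[rng.random(toy_obs.shape) < 0.5] = np.nan   # hide about half of the entries
rho = spearman_on_measured(toy_coords, toy_obs)
```

Because the correlation is rank-based, it is insensitive to the overall scale of the predicted structure, which makes it a convenient metric when structures from different methods are normalized differently.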
In addition, even at 1kb-resolution, the reconstructed 3D structures achieve high correlations with the observed spatial distances for both whole chromosomal structures and intra-domain structures (Figure A.2, all-points correlations ~0.4 and intra-domain correlations ~0.6). These consistently high correlations indicate that FLAMINGO is able to capture both long-range genome folding patterns and detailed structures within domains.

FLAMINGO was then compared with other methods, GEM-FISH [34], ShRec3D [33], Hierarchical3DGenome [35], ShNeigh [38], RPR [36] and SuperRec [37], which are state-of-the-art and recently developed algorithms representing different modeling strategies (Methods, Supplementary Note 1). Strikingly, FLAMINGO achieved substantially higher correlations than all the other methods, for both whole chromosome structures and intra-domain structures, at 5kb-resolution (Figure 2.3.c, Figure A.11 and A.12). For example, FLAMINGO achieved a correlation of 0.53 for whole chromosome structures in GM12878, while the other methods only achieved correlations below 0.45 (Figure 2.3.c, left). A similar advantage of FLAMINGO is also observed when the performance comparison is restricted to off-diagonal long-range inter-domain distances (Figure A.11). Moreover, focusing on detailed intra-domain structures, FLAMINGO achieved a correlation of 0.76 in GM12878, while the other methods only achieved correlations below 0.6 (Figure 2.3.c, right). Similarly, FLAMINGO outperformed the other methods across all the other five cell-types at 5kb-resolution (Figure A.12).

To further leverage orthogonal data for performance comparisons, high-resolution chromatin interactions profiled by Capture-C [45], ChIA-PET [50] and SPRITE [46] experiments were used to evaluate whether the reconstructed structures assign short 3D distances between interacting anchors (Figure 2.3.d, Methods).
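The orthogonal validation amounts to a one-sided two-sample test of whether anchor pairs receive shorter predicted distances than controls. A minimal sketch with simulated distances follows; the helper `anchors_closer_pvalue` and the toy distributions are assumptions for illustration, not the actual Capture-C, ChIA-PET, or SPRITE data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def anchors_closer_pvalue(anchor_distances, control_distances):
    """One-sided Mann-Whitney test: are anchor distances stochastically smaller?"""
    result = mannwhitneyu(anchor_distances, control_distances, alternative="less")
    return float(result.pvalue)

rng = np.random.default_rng(3)
# Hypothetical normalized 3D distances for interacting anchors and matched controls.
anchor_d = rng.normal(loc=1.0, scale=0.2, size=200)
control_d = rng.normal(loc=1.5, scale=0.2, size=200)

p_less = anchors_closer_pvalue(anchor_d, control_d)       # small when anchors are closer
p_reversed = anchors_closer_pvalue(control_d, anchor_d)   # large in the opposite direction
```

Using a rank-based test avoids assuming any particular distribution for the predicted distances, and the one-sided alternative matches the directional hypothesis that interacting anchors are closer in 3D.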
Remarkably, FLAMINGO consistently demonstrated higher accuracy than other methods across all three sets of experimental metrics (Figure 2.3.e-g). The reconstructed structures from FLAMINGO assign significantly shorter 3D distances between anchors of chromatin interactions (p-value < 2×10⁻¹⁶, one-sided Mann-Whitney test), while other methods are less likely to capture the structural proximity for chromatin interactions. As an example (Figure 2.3.h), a long-range ChIA-PET interaction (~130kb) on chromosome 1 links a distal enhancer element (anchor 2) to the promoter region of the gene KCNA2 (anchor 1), where both anchors are bound by CTCF and Rad21. Interestingly, in the reconstructed high-resolution structure by FLAMINGO, the enhancer and the promoter are in close proximity with each other and the genomic region in between forms a smooth chromatin loop. As a comparison, the Hierarchical3DGenome algorithm does not assign a short spatial distance between the interacting enhancer and the KCNA2 promoter (Figure A.13.a). Additional examples can be found in Figure A.13.b-c. These results not only provide rigorous evidence to validate the superior accuracy, but they also underscore the impacts of FLAMINGO on decoding the mechanisms underlying orchestrated gene regulation in 3D space.

2.2.4 Advanced scalability for large-scale chromosome conformations

High-resolution 3D structure modeling places stringent demands for performance, reliability, and more importantly, scalability on algorithms, since a large number of genomic loci and pairwise distances are used in the optimization procedure. Based on efficient information compression and matrix computation, the computational complexity of FLAMINGO is 𝑂(𝑘𝑁²), where 𝑁 is the number of genomic loci, such as the number of 5kb DNA fragments, and 𝑘 is a small constant.
For example, it only took 42 minutes and 2.2GB memory for FLAMINGO to reconstruct the 5kb-resolution 3D structure for chromosome 1, the largest chromosome in the human genome. For chromosomes 2-22 and chromosome X, FLAMINGO was able to predict their structures even faster (Figure A.14.a). As a comparison, the state-of-the-art algorithms all have inferior scalability. The running times for Hierarchical3DGenome and ShRec3D increase rapidly when the number of genomic loci becomes large (Figure 2.3.i), while the other methods (i.e. SuperRec, ShNeigh, RPR and GEM-FISH) are even slower (Figure A.14.b). Most of these methods can only make predictions for short chromosomes (e.g. chr12-22) at 5kb-resolution. Furthermore, because FLAMINGO can accurately predict the 3D structures based on a small subset of pairwise distances, the scalability of FLAMINGO can be improved further by down-sampling the distance matrix from Hi-C (Figure A.14.c). In addition, based on our tests of 1kb-resolution reconstruction for all chromosomes in GM12878 (Figure A.2), FLAMINGO can generate complete predictions for large chromosomes quickly. For the largest chromosome (chr1), it takes less than 25 hours using 200GB memory to reconstruct the 1kb-resolution 3D structure. Therefore, FLAMINGO provides drastic improvements in computational scalability, which is much desired since a large number of Hi-C datasets are to be generated in the near future [51, 52].

Figure 2.4 Interpretation of multi-way chromatin interactions and QTLs. (a) SPRITE multi-way interactions on chr21 are predicted with shorter spatial distances than the genomic-distance controlled background (**: p-value < 10⁻², ***: p-value < 10⁻³; one-sided Wilcoxon test). The x-axis corresponds to 3-way (p-value=2.3×10⁻⁹, n=302), 4-way (p-value=1.9×10⁻⁴, n=17), and 5-way interactions (p-value=7.8×10⁻³, n=7). The center lines of boxplots show the median, the upper and lower box limits show the 25th and 75th percentiles respectively.
The whiskers extend up to 1.5 times the interquartile range away from the limits of the boxes. Outliers outside this range were removed from the figure. (b) FLAMINGO captures more 3-way interactions across different distance thresholds, compared to using normalized Hi-C contact map derived distance matrix. (c) One example of a SPRITE 3-way chromatin interaction captured by FLAMINGO. (d) The SNP-promoter pairs of long-range eQTLs (>900kb) are assigned with significantly shorter spatial distances by FLAMINGO, compared to genomic-distance controlled random pairs (**: p-value=4.1×10⁻³; one-sided Wilcoxon test, n=1,227). The center lines of boxplots show the median, the upper and lower box limits show the 25th and 75th percentiles respectively. The whiskers extend up to 1.5 times the interquartile range away from the limits of the boxes. Outliers outside this range were removed from the figure. (e) One example of long-range eQTLs interpreted by FLAMINGO. The SNP rs77725975 (blue) and the promoter of FERMT3 (red) are placed in close 3D proximity. (f) The SNP-H3K4me1 pairs of distal hQTLs are assigned with significantly shorter spatial distances by FLAMINGO, compared to genomic-distance controlled random pairs (**: p-value=6.3×10⁻³; one-sided Wilcoxon test, n=20,950). The center lines of boxplots show the median, the upper and lower box limits show the 25th and 75th percentiles respectively. The whiskers extend up to 1.5 times the interquartile range away from the limits of the boxes. Outliers outside this range were removed from the figure. (g) One example of distal H3K4me1-QTLs interpreted by FLAMINGO. The SNPs rs79377415, rs56369941, rs7518642 (blue) and the H3K4me1 ChIP-seq peak (red) are placed in close 3D proximity. Source data are provided as a Source Data file.
2.2.5 Analysis of multi-way interactions and QTLs by FLAMINGO beyond 2D Hi-C contact maps

To demonstrate the biological discoveries enabled by FLAMINGO that are not directly visible from 2D contact maps, the reconstructed 3D chromatin structures are used to resolve two important questions. First, we analyzed the predicted 3D structure’s capability of capturing multi-way chromatin interactions. Spatially coordinated molecular processes frequently form multi-way interactions (e.g. 3-way, 4-way or 5-way interactions) in 3D space [46, 53, 54], which play pivotal roles in coupled transcriptional and epigenetic activities [55]. However, Hi-C contact maps can only reveal pairwise 2-way chromatin interactions. Moreover, the high rates of missing data in Hi-C result in large genomic regions with almost no measured interactions, further limiting the capability of finding multi-way interactions from 2D contact maps. Since FLAMINGO recovers the whole spatial structure, we hypothesize that the predicted 3D structures can improve the identification of multi-way interactions. The multi-way chromatin interactions profiled by SPRITE experiments in GM12878 [46] are used to test this hypothesis. In addition to pairwise interacting anchors (Figure 2.3.g), the GM12878 structure predicted by FLAMINGO consistently assigns significantly shorter spatial distances among anchors of multi-way interactions in SPRITE (Figure 2.4.a, p-value<10⁻², one-sided Wilcoxon test), compared to genomic-distance controlled random samples, suggesting the predicted 3D structures are in strong agreement with the higher-order organizations of multi-way interactions. More importantly, compared to using the Hi-C contact map derived distance matrix, the predicted 3D structure by FLAMINGO can capture more multi-way interactions (Figure 2.4.b, Figure A.15.a-b).
Here, a multi-way interaction is considered to be captured if all interacting anchors are located in the same 3D spatial neighborhood, i.e. all pairwise spatial distances between anchors are smaller than a specified threshold. As shown in Figure 2.4.b, across a wide range of thresholds on normalized spatial distances, FLAMINGO consistently captures more 3-way interactions. Even if relaxed distance thresholds are used, 28.5% of the 3-way interactions from SPRITE experiments are captured by FLAMINGO but cannot be identified from the Hi-C contact map derived distance matrix (Figure 2.4.b). This is because these 3-way interactions involve distal interacting anchors across very long-range genomic regions (median genomic distance=2.32Mb), where Hi-C contact maps suffer from high rates of missing data. Similar results are also found for 4-way and 5-way interactions (Figure A.15.a-b), where FLAMINGO achieves much larger advantages. Figure 2.4.c shows a representative example of a 3-way interaction that has been identified by SPRITE experiments46. The three interacting anchors are brought into spatial proximity based on the predicted loop structures, which are also highlighted in the predicted distance matrix (Figure 2.4.c, right). In comparison, the distance matrix based on the Hi-C contact map shows no signals of spatial closeness for the three anchors. As another interesting example, a candidate 4-way interaction mediated by CTCF across a 12Mb genomic region in chr1 is discovered by FLAMINGO, while the Hi-C based distance matrix shows no spatial patterns (Figure A.15.c). These results suggest that, by reconstructing 3D spatial structures, FLAMINGO can help to identify multi-way chromatin interactions and reveal higher-order genome organizations, beyond 2D Hi-C contact maps.
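The capture criterion described above (all pairwise spatial distances among the anchors fall below a threshold) can be sketched in a few lines. This is only a minimal illustration, not the dissertation's implementation: the function names, the toy coordinates and the threshold value are all hypothetical, and the distance matrix is assumed to be a dense array of normalized spatial distances indexed by genomic bin.

```python
import itertools
import numpy as np

def is_captured(dist, anchors, threshold):
    """Return True if every pair of anchors is closer than `threshold`.

    dist      : (n, n) array of pairwise spatial distances between bins
    anchors   : iterable of bin indices forming one multi-way interaction
    threshold : distance cutoff defining the 3D neighborhood (assumed value)
    """
    return all(dist[i, j] < threshold
               for i, j in itertools.combinations(anchors, 2))

def fraction_captured(dist, interactions, threshold):
    """Fraction of multi-way interactions captured at the given threshold."""
    if not interactions:
        return 0.0
    hits = sum(is_captured(dist, a, threshold) for a in interactions)
    return hits / len(interactions)

# Toy example: four bins placed in 3D; bins 0-2 are close, bin 3 is far away.
coords = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5]], dtype=float)
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

interactions = [(0, 1, 2), (0, 1, 3)]   # two hypothetical 3-way interactions
frac = fraction_captured(dist, interactions, threshold=2.0)
```

The same function applies unchanged to 4-way and 5-way interactions, since the criterion only iterates over anchor pairs. Sweeping `threshold` over a grid would give a curve comparable in spirit to the threshold analysis in Figure 2.4.b.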
Figure 2.5 Geometrical signature of predicted chromatin conformations. (a) TAD boundaries demonstrate lower curvatures than flanking genomic regions. The center lines of boxplots (n=11,208) show the median of normalized curvatures, the upper and lower box limits show the 25th and 75th percentiles respectively. The whiskers extend up to 1.5 times the interquartile range away from the limits of the boxes. Outliers outside this range were removed from the figure. (b) The regions with high curvatures show higher GC-content compared with genomic background. One-sided Mann-Whitney test (***): p-value=2.7×10⁻²⁹ (green, n=5,261) and p-value=3.4×10⁻³⁴ (blue, n=5,261). The center lines of boxplots (n=5,261) show the median, the upper and lower box limits show the 25th and 75th percentiles respectively. The whiskers extend up to 1.5 times the interquartile range away from the limits of the boxes. Outliers outside this range were removed from the figure. (c) The consensus structure predicted by FLAMINGO consistently aligns with the average structure across single cells in K562. Right: the errors between the predicted consensus structure and the average structure (blue) are smaller than the intrinsic standard deviations among single cells (orange). (d) The consensus structure predicted by FLAMINGO is in strong agreement with the average structure across the subset of cells in cluster 2. Right: the errors between the predicted consensus structure and the cluster-2 specific average structure (blue) are smaller than the intrinsic standard deviations among single cells in cluster 2 (orange). Source data are provided as a Source Data file.

Second, we analyzed the utility of the predicted 3D structure in interpreting genetic associations, such as long-range expression QTLs (eQTL) and distal histone QTLs (hQTL) in matched cell-types or tissues. QTLs statistically link genetic variants to molecular phenotypes and facilitate the understanding of disease genetics, but it has been challenging to delineate the underlying molecular mechanisms of genetic associations. Spatial proximity between genetic variants and target genes or histone modification peaks has been suggested to mediate genetic associations56, 57. Similar to the approach of multi-way chromatin interaction analysis, the predicted 3D structure is evaluated with respect to its ability to interpret QTLs58-62 based on predicted short spatial distances, compared to using the Hi-C contact map derived distance matrix. Interestingly, across a wide range of thresholds on normalized spatial distances, substantially higher fractions of eQTLs and hQTLs are found to have their genetically associated loci (i.e. SNP-promoter or SNP-histone pairs) placed into small 3D neighborhoods by FLAMINGO (Figure A.16). Focusing on the long-range eQTLs61 whose SNPs and target gene promoters are >900kb away, these SNP-promoter pairs are found to be assigned significantly shorter spatial distances, compared to genomic-distance controlled random pairs (p-value=1.3×10⁻³, one-sided Wilcoxon test, Figure 2.4.d), suggesting the effectiveness of FLAMINGO in interpreting genetic associations. For each specific long-range eQTL (>900kb), a random set of SNP-promoter pairs with the same genomic distance from the same chromosome is generated (Methods). Among these long-range eQTLs (n=1,227), 671 of them (54.7%) are predicted to have spatial distances that are at least 2-fold shorter than the median spatial distances of genomic-distance controlled random pairs. As a representative example (Figure 2.4.e), the SNP rs77725975 is a significant long-range eQTL to the gene FERMT3 (p-value=2.6×10⁻⁴) in whole blood cells61 with a genomic distance of 983kb.
This eQTL is placed into 3D proximity by FLAMINGO in GM12878, where the SNP rs77725975 and FERMT3’s promoter are located spatially close to each other, while the Hi-C based distance matrix fails to provide a structural basis to interpret this eQTL. Similarly, distal hQTLs62 are also found to be assigned significantly shorter spatial distances by FLAMINGO, compared to genomic-distance controlled random pairs (p-value=2.84×10⁻³, one-sided Wilcoxon test, Figure 2.4.f). Among the distal hQTLs (n=20,950), 11,797 of them (56.3%) are predicted to have spatial distances that are at least 2-fold shorter than the median spatial distances of genomic-distance controlled random pairs. As shown in Figure 2.4.g for a set of distal hQTLs (p-value<1.8×10⁻⁴), FLAMINGO reconstructs a loop structure which brings the SNPs close to the specific target H3K4me1 peak that is ~75kb away. In contrast, the distance matrix derived from Hi-C contact maps shows no signal of long-range interactions in this region. These results strongly support FLAMINGO’s ability to interpret the potential mechanisms of distal QTLs by leveraging the reconstructed spatial proximity information, a critical step toward deciphering genetic associations with molecular phenotypes.

2.2.6 Geometrical property of chromatin structures

To gain additional insights into genome folding, 3D geometrical metrics are needed to describe the complex shapes of chromatin structures, which cannot be directly obtained from Hi-C contact maps. The reconstructed 3D structures provide a systematic platform for dissecting geometrical signatures of chromatin organization. To do this, we calculated the curvatures for every 5kb genomic bin along the 3D curves of chromosomes. A larger curvature around a genomic region indicates the chromatin bends more sharply, while a smaller curvature suggests the region is relatively straight.
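As an illustration of how a per-bin curvature can be computed from a chain of 3D coordinates, the sketch below uses the Menger curvature of three consecutive bins (the inverse radius of the circle passing through them). The text above does not state which discrete curvature estimator was used, so this choice, the function name and the toy coordinates are assumptions made only for the example.

```python
import numpy as np

def menger_curvature(coords):
    """Discrete curvature at each interior point of a 3D polyline.

    coords : (n, 3) array of bin coordinates ordered along the chromosome.
    Returns an array of length n; the two end points are NaN because they
    lack a neighbor on one side. Assumed estimator: Menger curvature,
    4 * triangle_area / (|ab| * |bc| * |ca|).
    """
    coords = np.asarray(coords, dtype=float)
    a, b, c = coords[:-2], coords[1:-1], coords[2:]
    # |(b - a) x (c - a)| equals twice the area of triangle abc
    twice_area = np.linalg.norm(np.cross(b - a, c - a), axis=1)
    denom = (np.linalg.norm(b - a, axis=1)
             * np.linalg.norm(c - b, axis=1)
             * np.linalg.norm(c - a, axis=1))
    curv = np.full(len(coords), np.nan)
    with np.errstate(divide="ignore", invalid="ignore"):
        curv[1:-1] = np.where(denom > 0, 2.0 * twice_area / denom, np.nan)
    return curv

# Toy check: points on a circle of radius 2 have curvature 1/2,
# and points on a straight line have curvature 0.
theta = np.linspace(0.0, np.pi, 10)
circle = np.column_stack([2 * np.cos(theta), 2 * np.sin(theta), np.zeros(10)])
line = np.column_stack([np.arange(10.0), np.zeros(10), np.zeros(10)])

circle_curv = menger_curvature(circle)
line_curv = menger_curvature(line)
```

In practice the coordinates would come from the reconstructed structure at 5kb resolution, and the resulting values could be normalized before comparing boundary bins with flanking bins.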
Interestingly, TAD boundaries show significantly lower curvatures than flanking genomic regions (Figure 2.5.a, p-value=2.2×10⁻¹⁶, one-sided Mann-Whitney test). Considering the loop extrusion model16, this suggests that, when a loop is established and the extrusion complex stops sliding, the DNA located around the extrusion complex is kept rigid. In addition, genomic regions with large curvatures show significantly higher GC-contents (Figure 2.5.b), consistent with the increased flexibility of GC-rich DNA sequences63, 64 that may facilitate intra-TAD interactions.

2.2.7 Reference structure to interpret single-cell variabilities

Based on observations from recent single-cell Hi-C and imaging data65-68, chromatin structure is dynamic and varies across individual cells. The optimal consensus structure reconstructed from bulk tissue Hi-C by FLAMINGO thus provides a reference of chromatin folding aggregated from a pool of cells, which can be used as a basis to delineate and interpret the ensemble of chromatin configurations69, 70. We compared FLAMINGO’s predicted consensus structure to the single-cell structures profiled by diffraction-limited 3D imaging68 to analyze their relationship. The image-based dataset68 contains an ensemble of single-cell structures for a specific genomic region in chr21 at 30kb-resolution. The averaged structure is calculated from the ensemble and is then compared with FLAMINGO’s prediction. Figure 2.5.c shows the comparison for a loop structure in this region. Both 5kb- and 30kb-resolution predictions from FLAMINGO align well with the averaged structure of single cells (Figure 2.5.c, left). More importantly, the differences between these structures are consistently smaller than the intrinsic standard deviations among single cells within the ensemble (Figure 2.5.c, right), suggesting that the consensus structure can sufficiently quantify the major patterns of structural configurations.
In addition, it suggests that the distance information derived from Hi-C contact frequency is overall consistent with the spatial configurations obtained from imaging techniques. To further analyze the structural variations relative to the consensus structure, the single-cell structures are classified into five different clusters, where individual cells belonging to the same cluster have similar structures. Structural variabilities are observed across distinct single-cell clusters. Interestingly, for the subset of cells in cluster 2, the cluster-specific average structure is highly similar to the predicted consensus structure (Figure 2.5.d, left), with the differences largely smaller than the intrinsic standard deviations among single cells within this cluster (Figure 2.5.d, right), further supporting the biological relevance of the predicted structure. The other four clusters similarly demonstrate the overall folding patterns, each containing specific variations relative to the predicted consensus structure (Figure A.17). Across all five clusters, the differences between the consensus structure and the cluster-specific average structures are consistently smaller than the intrinsic standard deviations of single cells within each cluster (Figure A.17). These results suggest that the consensus structures predicted by FLAMINGO can facilitate improved interpretation of the structural heterogeneity in ensembles of single-cell structures.

2.2.8 Robust performance to handle missing data in Hi-C datasets

Figure 2.6 Robust performance of FLAMINGO under different missing rates. (a) Reconstructed 3D structures and completed distance matrices by FLAMINGO and Hierarchical3DGenome in chr21:34,000,000-35,000,000 using down-sampled data. As inputs, the observed distance matrix from Hi-C is down-sampled with different down-sampling rates (columns). Four TADs within this genomic region are annotated by colors. The inter-TAD interaction recovered only by FLAMINGO is highlighted by the black arrow. (b) FLAMINGO correctly recovers the short 3D distance between the two distal TAD boundaries (5’ of the blue TAD and 3’ of the brown TAD) as highlighted in (a), with 70% down-sampled data. After normalization, FLAMINGO predicts a 3D distance of 0.156 (p-value=3.8×10⁻², n=1,000, permutation test, genomic distance controlled), while Hierarchical3DGenome predicts 0.173 (p-value=0.1862, n=1,000, permutation test, genomic distance controlled). (c) The observed distance matrix from Hi-C data, along with TAD annotations and the highlighted inter-TAD interaction. (d) The inter-TAD interactions recovered by FLAMINGO (zoom-in view of the blue and brown TADs within chr21:34,100,000-34,850,000) are supported by CTCF and cohesin bindings and the convergent CTCF motifs (red arrows). The inter-TAD interactions are missed by Hierarchical3DGenome. (e) FLAMINGO achieves higher reconstruction accuracy against missing data. Correlations between predicted and observed intra-domain structures (the y-axis) are calculated for FLAMINGO and the state-of-the-art methods under different down-sampling rates (the x-axis). The dots show the average correlations based on n=10 independently down-sampled input matrices and error bars correspond to the standard deviations across the ten random samples. Smaller down-sampling rates represent larger fractions of missing data. Source data are provided as a Source Data file.

Due to limited sequencing depths of typical Hi-C experiments and low mappabilities of certain genomic regions, the observed distance matrices from Hi-C usually contain large portions of missing data10, 71, which present a very challenging problem for high-resolution modeling. For instance, considering the same Hi-C dataset for chromosome 1, the rate of missing data is 21% at 100kb-resolution but quickly increases to 94.5% at 5kb-resolution.
Overall, the rate of missing data is >80% across chromosomes 1-22 and X in the human genome at 5kb-resolution (Figure A.14.a). By incorporating the low-rank property of the distance matrix into the optimization procedure, FLAMINGO is well suited to handling high rates of missing data. To demonstrate this capability, the observed distances derived from Hi-C were further down-sampled to check whether FLAMINGO can still reproduce the same high-resolution structures (Methods). As a representative example on chromosome 21 (chr21:34,000,000-35,000,000), FLAMINGO was able to robustly reconstruct the structure even when only 50% of the observed pairwise distances from Hi-C were retained (Figure 2.6.a). By further down-sampling the dataset to the levels with only 20% and 5% of observed data remaining, FLAMINGO was still able to infer the loop structures formed by the four TADs in this region, with slightly increased intra-TAD fluctuations. In contrast, Hierarchical3DGenome predicted fuzzy structures with substantial fluctuations across all down-sampling rates. In addition, specific intra-TAD chromatin contacts were also captured by FLAMINGO, as shown by the specific hotspots within the TAD blocks in the predicted distance matrices at down-sampling rates of 50% and 70% (Figure 2.6.a), while Hierarchical3DGenome only generated vague distance matrices without detailed structures within TAD blocks. More interestingly, FLAMINGO was also able to predict the short 3D distance for long-range inter-TAD contacts in the loop structure using only 70% of observed data (p-value=0.038, permutation test, genomic distance controlled) (Figure 2.6.b), while Hierarchical3DGenome predicted a much longer distance (p-value=0.186). The predicted inter-TAD distance is in agreement with the original Hi-C distance matrix (Figure 2.6.c) and demonstrates a higher level of specificity, although it was inferred from down-sampled data.
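The down-sampling experiment can be mimicked on a toy matrix as follows. This is a hedged sketch rather than the procedure in Methods: it assumes missing entries are encoded as NaN in a symmetric distance matrix, that the down-sampling rate is the fraction of observed pairwise distances retained, and that entries are dropped uniformly at random. The names `downsample_observed` and `missing_rate` are hypothetical.

```python
import numpy as np

def missing_rate(dist):
    """Fraction of off-diagonal pairs (upper triangle) with no observed distance."""
    iu, ju = np.triu_indices_from(dist, k=1)
    return float(np.mean(np.isnan(dist[iu, ju])))

def downsample_observed(dist, keep_fraction, rng=None):
    """Keep a random `keep_fraction` of the observed pairwise distances.

    dist          : symmetric (n, n) array, NaN marks missing entries
    keep_fraction : down-sampling rate, i.e. fraction of observed pairs retained
    rng           : seed or numpy Generator, for reproducibility
    """
    rng = np.random.default_rng(rng)
    dist = np.asarray(dist, dtype=float)
    out = dist.copy()
    iu, ju = np.triu_indices_from(dist, k=1)
    observed = np.flatnonzero(~np.isnan(dist[iu, ju]))
    n_keep = int(round(keep_fraction * observed.size))
    keep = rng.choice(observed, size=n_keep, replace=False)
    drop = np.setdiff1d(observed, keep)
    # Mask both triangles so the matrix stays symmetric.
    out[iu[drop], ju[drop]] = np.nan
    out[ju[drop], iu[drop]] = np.nan
    return out

# Toy example: a fully observed 6x6 distance matrix (15 pairs), keep 20%.
full = np.abs(np.subtract.outer(np.arange(6.0), np.arange(6.0)))
sparse = downsample_observed(full, keep_fraction=0.2, rng=0)
```

Repeating the call with different seeds gives independently down-sampled inputs, analogous to the ten random samples summarized in Figure 2.6.e.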
As additional support for the structure predicted with missing data (down-sampling rate = 70% or 50%), the specific intra- and inter-TAD chromatin contacts recovered by FLAMINGO, but not predicted by Hierarchical3DGenome, are supported by CTCF and cohesin bindings, along with convergent pairs of CTCF motifs (Figure 2.6.d, Figure A.18.a).

As global quantitative evaluations, the recovered 3D structures and predicted distances by FLAMINGO using different down-sampled input matrices are compared with the originally observed distances. Strikingly, for the whole 5kb-resolution distance matrix including both inter- and intra-domain structures, the correlation coefficients remain stable and high (~0.49), until less than 30% of observed distances from Hi-C are kept for predictions (Figure A.18.b). Focusing on detailed intra-domain structures, the correlation coefficients remain robustly high (>0.74), until less than 50% of observed distances are kept (Figure 2.6.e). Across the wide range of down-sampling rates, FLAMINGO robustly achieves higher accuracy than other algorithms, based on comparisons using observed Hi-C contact maps (Figure 2.6.e, Figure A.18.b) and also other chromatin interaction datasets, such as Capture-C, ChIA-PET and SPRITE (Figure A.18.c-e). For example (Figure 2.6.e), using only 10% of observed data, FLAMINGO achieved better accuracy than the state-of-the-art method, Hierarchical3DGenome, which used all of the observed data (Figure 2.3.c). These results clearly demonstrate FLAMINGO’s ability to accurately reproduce high-resolution structures based on Hi-C with large fractions of missing data, which will significantly relax the demand on sequencing depths in Hi-C experiments and thus promote wide implementations of Hi-C in practice.

Figure 2.7 Cross cell-type predictions by iFLAMINGO.
(a) Hi-C data from the source cell-type and 1D epigenomics data from the target cell-type are integrated by iFLAMINGO to predict the 3D genome structure in the target cell-type (left). An example of the 3D structure of chromosome 21 for K562 predicted from GM12878 is shown (GM12878->K562). K562-specific structural properties that iFLAMINGO correctly captures are highlighted by arrows, while the GM12878-specific structure shows substantial differences. Two intra-domain structures are further highlighted in the three structures (orange and purple). (b) Comparison of 3D distances between interacting ChIA-PET anchors based on the predicted 3D structures of GM12878 (blue), GM12878->K562 (orange) and K562 (pink). P-value=3.0×10⁻⁴ (n=1,562, one-sided Mann-Whitney test). The center lines of boxplots show the median, the upper and lower box limits show the 25th and 75th percentiles respectively. The whiskers extend up to 1.5 times the interquartile range away from the limits of the boxes. Outliers outside this range were removed from the figure. (c-d) Performance comparisons between iFLAMINGO (the y-axis) and FLAMINGO (the x-axis) on cross cell-type predictions for n=30 source-target pairs. Source-target pairs are colored by source cell-types. The performance is quantified by correlations between predicted and observed distances for all DNA fragments, i.e. all-points, in (c) and fragments within the same domains, i.e. intra-domain, in (d). (e) Performance estimation (correlation of 3D distances, y-axis) for cross cell-type predictions of intra-domain structures as a function of 1D epigenomic similarities between cell-types (correlations of genome-wide DNase-seq data, x-axis). The regression line is fitted based on cross cell-type predictions from GM12878 and K562 (p-value=0.02, n=12, two-sided Student’s t-test). Source data are provided as a Source Data file.
2.2.9 Cross cell-type prediction of 3D structures

Currently, experimental Hi-C data have been collected only for a limited number of cell-types, due to the cost of the experiments or the difficulty of collecting sufficient numbers of cells for certain cell-types71. To enlarge the coverage of cell-types for 3D genome modeling, FLAMINGO is further extended to iFLAMINGO, an integrative version of the algorithm that can make cross cell-type predictions. To predict the 3D structure for a cell-type without Hi-C data, defined as the target cell-type, iFLAMINGO combines two pieces of information (Figure 2.7.a, Methods): (1) Hi-C data from another cell-type, defined as the source cell-type, which provides the overall structural backbone of the genome; and (2) chromatin accessibility data, such as DNase-seq, from the target cell-type, which provides the cell-type specific 1D epigenomic landscape. DNase-seq data are widely available across a large panel of cell-types and can characterize chromatin accessibilities at base pair resolution6. Since the levels of DNase-seq signals of a pair of genomic loci are associated with their 3D distances, for instance, co-accessible loci being significantly closer to each other in 3D space (Figure A.19.a), a regression model is built to impute approximate 3D distances based on DNase-seq signals in the target cell-type (Figure A.19.b). The imputed cell-type specific distances are then incorporated into iFLAMINGO to predict the 3D genome structure in the target cell-type (Methods).

iFLAMINGO was applied to the Hi-C data from GM12878 to predict the 3D genome structure in K562 by integrating K562-specific DNase-seq data into the modeling process. The resulting structure of chromosome 21 is shown in Figure 2.7.a (GM12878->K562). The 3D structure predicted based on GM12878 Hi-C alone is shown as the negative control, and the structure predicted directly from K562 Hi-C is included as the positive control (Figure 2.7.a).
The GM12878->K562 structure not only captures the global structural signatures of the K562 genome but also reconstructs detailed loop structures more similar to K562, both of which are highlighted in Figure 2.7.a. By comparing with K562-specific chromatin interactions profiled by independent ChIA-PET experiments72, the predicted 3D distances between interaction anchors from the GM12878->K562 structure are significantly shorter than the distances from the GM12878 structure (Figure 2.7.b, p-value=0.0003, one-sided Mann-Whitney test), suggesting the quantitatively improved similarity between the GM12878->K562 and K562 structures. Furthermore, the predicted spatial distances in the GM12878->K562 structure achieve a higher correlation with the experimentally-derived spatial distances of K562 Hi-C (correlation=0.62, Figure A.19.c) than the experimentally-derived spatial distances of GM12878 Hi-C do (correlation=0.55), suggesting that the GM12878->K562 structure predicted by iFLAMINGO captures the cell-type specificity of K562.

Figure 2.8 iFLAMINGO improves the resolution of predicted 3D structures. (a) Scheme of the high-resolution 3D structure prediction. In the low-resolution distance matrix from Hi-C, 𝑁 large DNA fragments of size 10kb are divided into smaller DNA fragments of size 5kb, resulting in a 2𝑁 by 2𝑁 distance matrix, where the small DNA fragments inherit the same distances to other fragments from the original large fragment. The high-resolution 1D epigenomics signals in each small DNA fragment are integrated into iFLAMINGO to predict the high-resolution 3D genome structures. As one example, the 3D structure of chromosome 5 at 5kb-resolution predicted from the 10kb-resolution distance matrix is shown.
(b) Example of the predicted 5kb-resolution 3D structure of chromosome 10 from the 25kb-resolution distance matrix (middle, 25kb->5kb), compared with the 25kb-resolution structure (left) and the 5kb-resolution structure (right). The large-scale structural differences are highlighted by red boxes. The comparisons of detailed intra-domain structures (red) are shown in inset. The red arrows represent the boundaries. (c-e) Performance comparison of predicting 5kb-resolution structures from 10kb-resolution (c), 25kb-resolution (d), and 50kb-resolution distance matrices (e). Correlations between predicted and observed 5kb-resolution distances are calculated for all DNA fragments, i.e. all-points, and for fragments within the same domains, i.e. intra-domain. The bar plot shows the average correlations across n=23 chromosomes and the error bars show the standard deviations across 23 chromosomes. Data are presented as mean values +/-SD. Source data are provided as a Source Data file.

iFLAMINGO was further applied to all source-target pairs from the six cell-types with Hi-C data, and the performance was evaluated based on the correlations between predicted and observed distance matrices in target cell-types (Methods). For comparison, the optimal structures predicted by FLAMINGO without using DNase-seq data are included as negative controls. Among all the 30 source-target cell-type pairs, iFLAMINGO achieved a higher accuracy for almost all the cross cell-type predictions (Figure A.20), not only for the whole distance matrices (Figure 2.7.c) but also for intra-domain structures (Figure 2.7.d). These consistent improvements underscore iFLAMINGO’s ability to make cross cell-type structure predictions and highlight the importance of 1D epigenomic information in 3D genome modeling.
To further demonstrate iFLAMINGO’s potential for enlarging the cell-type coverage of 3D structure reconstructions, the accuracy of cross cell-type 3D predictions is plotted as a function of 1D epigenomic similarities between the source and target cell-types (Figure 2.7.e). Using GM12878 or K562 as source cell-types, the accuracy of predicted intra-domain 3D structures in target cell-types is significantly associated with the 1D epigenomic correlations to the source cell-types (p-value=0.02). Based on the fitted linear function, to obtain a cross cell-type prediction with accuracy>0.6, which is a level already higher than the state-of-the-art methods using Hi-C directly from the target cell-types (Figure 2.3.c), iFLAMINGO only requires Hi-C data available from a source cell-type with medium 1D epigenomic similarities (correlation>0.65). Combined with the ongoing experimental efforts of chromatin characterizations, such as the 4D Nucleome Consortium51, iFLAMINGO will substantially expand the catalog of cell-types with high-resolution 3D structures.

2.2.10 Boost the resolution of 3D structures from low-resolution Hi-C

Another limiting factor of experimental Hi-C data is the low resolution of contact maps73, 74, a tradeoff against the genome-wide coverage of sequencing reads. It is therefore highly desirable to predict high-resolution 3D structures from low-resolution Hi-C contact maps. By incorporating high-resolution 1D epigenomic data, such as DNase-seq, iFLAMINGO is able to boost the resolution of the predicted 3D genome structures (Figure 2.8.a, Figure A.19). After splitting low-resolution DNA fragments into high-resolution bins, DNase-seq signals help resolve the distance ambiguity across consecutive bins and fine-tune the structures through optimization (Methods). As a representative example, FLAMINGO was applied to the 25kb-resolution distance matrix for chromosome 10, resulting in a 25kb-resolution 3D structure (Figure 2.8.b, left).
On the other hand, based on the 5kb-resolution distance matrix, the 5kb-resolution structure was generated by FLAMINGO as the benchmark structure (Figure 2.8.b, right). Finally, applying iFLAMINGO to the 25kb-resolution distance matrix, along with the DNase-seq data, led to a 5kb-resolution structure, the 25kb->5kb structure (Figure 2.8.b, middle), which shows increased similarity to the 5kb-resolution benchmark structure. The 25kb->5kb structure not only captures large-scale structural properties but also recovers the detailed high-resolution loops in the 5kb-resolution structure, which are missing in the 25kb-resolution structure (Figure 2.8.b). To quantitatively evaluate the accuracy of boosted resolution genome-wide, a series of low-resolution distance matrices, at 10kb, 25kb and 50kb resolution, respectively, were generated from the same Hi-C datasets. The reconstructed structures were then compared with the original 5kb-resolution distance matrix. Across all the tests, iFLAMINGO achieved the highest correlations to the benchmark structures (Figure 2.8.c-e). For instance, using 10kb-resolution distance matrices as inputs, iFLAMINGO achieved an average correlation of 0.37 for the whole reconstructed 5kb-resolution matrices and an average correlation of 0.79 for intra-domain matrices, both of which are higher than the state-of-the-art methods even when they were directly applied to 5kb-resolution input matrices (Figure 2.3.c). Therefore, iFLAMINGO not only substantially improves the information extraction from low-resolution Hi-C data but will also widely facilitate the implementation of Hi-C protocols without stringent constraints on resolution.

2.3 DISCUSSION

In this study, we have developed an algorithm, FLAMINGO, to reconstruct high-resolution spatial conformations for large genomes in 3D space. Using low-rank matrix completion techniques, FLAMINGO is able to substantially improve data mining efficiency for Hi-C experiments.
Based on a series of rigorous performance evaluations, FLAMINGO consistently demonstrates superior accuracy and advanced scalability compared to other state-of-the-art methods. The strong agreements between the predicted genome architectures and orthogonal experimental evidence, such as Capture-C, ChIA-PET and SPRITE, further highlight FLAMINGO’s ability to capture high-resolution spatial signatures of chromatin. Biologically, the reconstructed 3D structures facilitate additional discoveries and understandings, beyond 2D contact maps, such as higher efficiency of identifying multi-way chromatin interactions, interpretation of long-range QTLs, geometrical properties associated with TAD boundaries, and structural references to analyze single-cell variabilities of chromatin folding. Furthermore, FLAMINGO, along with its integrative version iFLAMINGO, addresses four fundamental challenges in 3D genome modeling: (1) high scalability to reconstruct high-resolution 3D structures for all chromosomes from massive Hi-C datasets; (2) robust performance to handle large portions of missing data in Hi-C; (3) accurate cross cell-type prediction of 3D structures for cell-types lacking Hi-C datasets; and (4) boosting the resolution of reconstructed 3D structures from low-resolution Hi-C contact maps. Given all these advantages, FLAMINGO will be an important tool for both computational and experimental studies on 3D genomes. The reconstructed high-resolution structures across different cell-types will significantly facilitate biological insights into the spatial organization of chromatin and its underlying mechanisms. As one of the major benefits of FLAMINGO, the generated high-resolution 3D structures can serve as a platform to understand how transcriptional regulation is modulated in 3D space.
Overlaid with functional genomics data, FLAMINGO predictions provide high-resolution structural support for long-range regulatory links between enhancers and promoters (Figure 2.3.e-h), and recover the short 3D distances between CTCF-associated boundaries of chromatin loops (Figure 2.6.a-d, Figure A.1). Moreover, beyond 2D chromatin contact maps, FLAMINGO can help to analyze higher-order multi-way interactions (Figure 2.4.a-c) and long-range cis-regulatory QTLs (Figure 2.4.d-g), and characterize geometrical signatures of chromatin shapes (Figure 2.5.a-b). In recent years, deep learning models have been developed to predict regulatory interactions in gene regulation and TAD organization from DNA sequences75, 76. Since FLAMINGO and deep learning models have complementary algorithmic strengths, combining FLAMINGO with these deep learning algorithms is expected to yield system-level knowledge on the relationship between gene regulation and chromatin organization.

The optimized consensus structure provides an efficient representation of the 3D genome for biologists with the advantage of high interpretability. Another type of method aims to infer variations of the underlying chromatin structures, namely ensemble structures, using either polymer simulation models77-79 or machine learning algorithms69, 70. While modeling structural variations is important, it is sometimes difficult to biologically interpret an individual structure from a pool of predictions and to distinguish experimental cell-to-cell variations from noisy fluctuations. As shown in the comparisons between the reconstructed structure and the ensemble of single-cell structures, including both ensemble average structures and variable cluster-specific structures (Figure 2.5.c-d), FLAMINGO’s predictions can serve as effective reference structures to standardize the relative variabilities across single cells.
Equipped with the complementary advantages of accuracy and robustness against noise, FLAMINGO can help ensemble-structure learning algorithms to improve both their predictive performance and the interpretation of structures.

There are currently two limitations of FLAMINGO, which require future methodological development. First, although the transformation function from Hi-C contact frequency to spatial distance has been justified for intra-chromosomal contacts by previous studies [14, 34] and our analyses (Figure 2.5.c-d, Figure A.4, Figure A.17), there is currently no systematic estimation of the function for inter-chromosomal contacts. Thus, FLAMINGO can only reconstruct 3D structures for each chromosome separately, and it remains difficult to assemble the structure of the whole genome including inter-chromosomal distances. Similarly, due to the lack of sequencing reads, centromere and telomere regions are excluded from the reconstruction of spatial chromosome conformations. These regions, especially centromere regions that have been demonstrated to be important in regulating chromatin organization by previous studies [69, 80], are components that should not be excluded if the organization of the whole genome is to be assembled. To achieve complete reconstructions of the 3D genome, future algorithmic developments will be needed to overcome this limitation. Second, the consensus structure predicted by FLAMINGO represents the population-average architecture from large numbers of cells, which cannot capture the highly dynamic properties of 3D chromatin [81, 82] (such as dynamic chromatin loops and TADs). The multi-scale spatial conformation of chromosomes varies from cell to cell [83], and this variability plays important roles in epigenetics, gene regulation and DNA damage repair [84]. A series of ensemble-structure prediction algorithms have been developed to explore the dynamic conformations [69, 70, 77-79].
As a future development that can help to overcome this limitation, single-cell Hi-C datasets will be needed to predict 3D structures for individual cells. Single-cell Hi-C datasets are highly sparse and raise significant challenges in handling missing data. Although FLAMINGO demonstrates superior robustness to missing data for bulk tissue Hi-C datasets even with a ~98% missing rate at 5kb-resolution (Figure 2.6.e, corresponding to a 50% down-sampling rate), typical single-cell Hi-C experiments have >99.99% missing rates at 100kb-resolution. Therefore, the highly sparse single-cell Hi-C datasets require further algorithmic improvements in order to characterize the detailed structural variations across individual cells.

Overall, the combined strengths of handling large rates of missing data, making cross cell-type predictions, and boosting resolution suggest a high impact of FLAMINGO on 3D genome analyses. High-resolution structures can be inferred for diverse panels of cell-types spanning different differentiation lineages, without increasing sequencing depths or requiring closely similar cell-types. Thus, FLAMINGO will not only improve the data mining of existing Hi-C data but also address the urgent need arising from large-scale Hi-C data resources to be generated in the near future, such as those of the 4D Nucleome Consortium. Together with the recent image-based 3D genome information [4] and the high-dimensional epigenomics data [6, 85], FLAMINGO is expected to substantially expand our understanding of the spatially orchestrated genome architectures across cell-types.

2.4 METHODS

2.4.1 Chromatin contact maps and epigenomics datasets

We collected the Hi-C chromatin contact maps of six human cell-types, including GM12878, K562, IMR90, HMEC, HUVEC, and NHEK, from the GEO database [10] (GEO: GSE63525). To remove potential biases in the Hi-C data, we normalized the chromatin interaction-frequency matrices using the Knight-Ruiz normalization method, as suggested by previous studies [10].
The normalized Hi-C interaction frequencies are then transformed into 3D Euclidean distances based on the power-law function [14]:

$$D_{ij} = IF_{ij}^{-\eta},$$

where $D_{ij}$ represents the squared pairwise 3D distance between DNA fragments $i$ and $j$, $IF_{ij}$ represents the interaction frequency, and $\eta$ is a free parameter. After testing our model with different values of $\eta$ in the range suggested by previous experimental estimates [14, 28], we found that the accuracy of reconstruction is robust to the choice of $\eta$ (Figure A.4). Therefore, by default, $\eta$ is set to 0.5 ($\eta/2 = 0.25$) as suggested by previous literature [14]. The validity of the 3D distances converted from Hi-C contact maps, which are termed observed distances from Hi-C in this paper, is also supported by the high similarity between the reconstructed structure and the averaged structures of single-cell clusters, whose 3D configurations are directly obtained from imaging data (Figure 2.5.c and 2.5.d, Figure A.17). The genome-wide DNase-seq datasets of chromatin accessibility for the six cell-types were collected from the ENCODE and Roadmap consortia [50, 86]. In a specific cell-type, for each DNA fragment, the averaged DNase-seq signal (namely, fold-change over genomic background) within the fragment is used to represent the cell-type specific chromatin accessibility at that genomic locus. Additional details on data collection and preprocessing are given in Supplementary Note 1.

2.4.2 Model framework of FLAMINGO

FLAMINGO reconstructs 3D genome structures based on Hi-C chromatin contact maps using the low-rank matrix completion technique (Figure 2.1.a), which can efficiently delineate underlying low-rank structures from large and noisy pairwise distance matrices.
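As an illustration of the contact-frequency-to-distance transformation described in Section 2.4.1, the following minimal sketch applies $D_{ij} = IF_{ij}^{-\eta}$ to a toy matrix. The function name, the toy values, and the choice to mark zero-frequency entries as missing (`None`) are assumptions made for this example and are not taken from the FLAMINGO implementation.

```python
def contact_to_squared_distance(contact_freq, eta=0.5):
    """Convert normalized contact frequencies into squared 3D distances.

    Applies D_ij = IF_ij ** (-eta) entry by entry. Entries with zero contact
    frequency carry no distance information in this sketch and are returned
    as None so that they can be treated as missing data (an assumption).
    """
    squared = []
    for row in contact_freq:
        squared.append([freq ** (-eta) if freq > 0 else None for freq in row])
    return squared


# Toy normalized interaction-frequency matrix (hypothetical values; the zero
# diagonal is only a placeholder).
toy_if = [
    [0.0, 16.0, 1.0],
    [16.0, 0.0, 4.0],
    [1.0, 4.0, 0.0],
]

toy_sq_dist = contact_to_squared_distance(toy_if, eta=0.5)

# Taking the square root gives the distance itself, i.e. IF ** (-eta / 2).
toy_dist = [[None if d is None else d ** 0.5 for d in row] for row in toy_sq_dist]
```

With $\eta = 0.5$, a contact frequency of 16 maps to a squared distance of 0.25 and a distance of 0.5, consistent with the exponent $\eta/2 = 0.25$ quoted above.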
The cell-type specific 3D coordinates of high-resolution DNA fragments for each chromosome are predicted by solving a constrained rank-minimization problem using the augmented Lagrangian method [48], which converges quickly and robustly handles large amounts of missing data. To enable parallel computation, a hierarchy of two scales (1Mb and 5kb) is used to model each chromosome, and an integrative assembly strategy is designed to build optimal high-resolution chromosomal structures from these two scales (Figure A.3). Based on simulated benchmark analysis, the performance of FLAMINGO does not rely on specific choices of resolutions or domain partitions (Figure A.7). In addition, an integrative variant of FLAMINGO, iFLAMINGO (Figure 2.7.a, Figure A.19), is also developed to incorporate cell-type specific DNase-seq datasets into the model so as to (1) enable cross cell-type predictions and (2) boost the resolution of predicted 3D genome structures.

2.4.3 Reconstruct 3D genome structures based on low-rank matrix completion

Each chromosome is modeled as a ‘beads-on-a-string’ polymer chain, where each DNA fragment is modeled as a bead, and the centromere and telomere regions are removed from the analysis as suggested by previous studies [33-35]. Structure reconstruction requires inferring the optimal 3D coordinates of consecutive DNA fragments along a chromosome, which maximally align with the pairwise 3D distances between DNA fragments observed from Hi-C data. A unique property of FLAMINGO is its capability to leverage the low-rank nature of a pairwise distance matrix from Hi-C; namely, the high-dimensional pairwise distance matrix is generated by the underlying low-rank coordinate matrix of DNA fragments (rank ≤ 3). Defined by the coordinate matrix ($\mathbf{P}$), the Gram matrix ($\mathbf{X} = \mathbf{P}\mathbf{P}^T$) has rank ≤ 3. The squared Euclidean distance matrix ($\mathbf{D}$) is a sum of three matrices:

$$\mathbf{D} = \mathrm{diag}(\mathbf{X})\mathbf{1}^T + \mathbf{1}\,\mathrm{diag}(\mathbf{X})^T - 2\mathbf{X},$$

where $\mathrm{rank}(\mathbf{X}) \le 3$, $\mathrm{rank}(\mathrm{diag}(\mathbf{X})\mathbf{1}^T) \le 1$, and $\mathrm{rank}(\mathbf{1}\,\mathrm{diag}(\mathbf{X})^T) \le 1$.
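The relationship between the coordinate matrix, the Gram matrix and the squared distance matrix can be checked numerically. The sketch below is only an illustration with hypothetical toy coordinates and helper names; it computes $D_{ij} = X_{ii} + X_{jj} - 2X_{ij}$, which is the entry-wise form of the decomposition above.

```python
def gram_matrix(coords):
    """Gram matrix X = P P^T for an N x 3 list of coordinates P."""
    return [[sum(a * b for a, b in zip(p, q)) for q in coords] for p in coords]


def squared_distances_from_gram(gram):
    """Entry-wise form of the decomposition: D_ij = X_ii + X_jj - 2 * X_ij."""
    n = len(gram)
    return [
        [gram[i][i] + gram[j][j] - 2.0 * gram[i][j] for j in range(n)]
        for i in range(n)
    ]


# Four hypothetical fragments forming a simple 3D path.
toy_coords = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0],
    [1.0, 2.0, 2.0],
]

toy_gram = gram_matrix(toy_coords)
toy_sq = squared_distances_from_gram(toy_gram)
```

For these toy coordinates, the squared distance between the first and last fragments is $1 + 4 + 4 = 9$, which is what the Gram-based formula returns. Because every entry of $\mathbf{D}$ is determined by the rank ≤ 3 matrix $\mathbf{X}$, observing only a subset of the entries can be sufficient to recover the structure.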
Due to the property of ranks under matrix addition, the Euclidean distance matrix has rank ≤ 5. Based on the theory of matrix completion [42], the low-rank property of both the pairwise Euclidean distance matrix (rank ≤ 5) and the Gram matrix (rank ≤ 3) guarantees that, under certain randomness assumptions on measurements, the underlying 3D structure can be predicted using a small fraction of data from Hi-C (Figure 2.1.a). We define $\mathbf{P}$ as the $N \times 3$ coordinate matrix for $N$ consecutive DNA fragments along a chromosome. We also define $D_{i,j}$ as the squared 3D spatial distance between DNA fragments $i$ and $j$. Thus, the objective function for 3D genome reconstruction is:

$$\min \|\mathbf{X}\|_* \quad \text{subject to} \quad X_{i,i} + X_{j,j} - 2X_{i,j} = D_{i,j},\ (i,j) \in \Omega;\quad \mathbf{X}\mathbf{1} = 0;\quad \mathbf{X} = \mathbf{X}^T;\quad \mathbf{X} \text{ is positive semidefinite}, \tag{1}$$

where $\mathbf{X} = \mathbf{P}\mathbf{P}^T$ is the Gram matrix, $\|\mathbf{X}\|_*$ represents the nuclear norm $\mathrm{Tr}(\sqrt{\mathbf{X}^T\mathbf{X}})$, which is related to the rank of matrix $\mathbf{X}$, and the measurement set $\Omega$ represents a subset of indices of DNA fragment pairs. We further introduce a linear sampling operator $A$ as:

$$A(\mathbf{X}) = f \in \mathbb{R}^{|\Omega| \times 1}, \quad f_i = \langle \mathbf{X}, \boldsymbol{\omega}_{\alpha_i} \rangle \ \text{for } \alpha_i \in \Omega, \tag{2}$$

where $\alpha_i = (\alpha_{i,1}, \alpha_{i,2})$ is the index of a DNA fragment pair. The matrix basis $\boldsymbol{\omega}_{\alpha_i}$ is defined as:

$$\boldsymbol{\omega}_{\alpha_i} = \mathbf{e}_{\alpha_{i,1},\alpha_{i,1}} + \mathbf{e}_{\alpha_{i,2},\alpha_{i,2}} - \mathbf{e}_{\alpha_{i,1},\alpha_{i,2}} - \mathbf{e}_{\alpha_{i,2},\alpha_{i,1}}, \tag{3}$$

where $\mathbf{e}_{i,j}$ represents a matrix which has 1 at entry $(i,j)$ and 0 otherwise. For later use, we define the adjoint of $A$ as $A^*$, where $A^*\mathbf{y} = \sum_i y_i \boldsymbol{\omega}_{\alpha_i}$. The subset of DNA fragment pairs ($\Omega$ and $\alpha_i$) is randomly down-sampled from all measured pairs of DNA fragments with specified down-sampling rates. Intuitively, by defining $\boldsymbol{\omega}$ and $\alpha_i$, the linear operator $A$ summarizes all the constraints in one notation so that the objective function can be rewritten in a compact form:

$$\min_{\mathbf{P}} \mathrm{Trace}(\mathbf{P}\mathbf{P}^T), \quad \text{subject to } A(\mathbf{P}\mathbf{P}^T) = \mathbf{b}, \tag{4}$$

where $\mathbf{b} = A(\mathbf{M})$ and $\mathbf{M}$ represents the true underlying low-rank Gram matrix from Hi-C data satisfying $M_{i,i} + M_{j,j} - 2M_{i,j} = D_{i,j}$.
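To make the sampling operator $A$ and its adjoint $A^*$ concrete, the following sketch implements both for small dense matrices stored as nested lists. The function names and toy inputs are hypothetical; a practical implementation would presumably use sparse data structures, which is not shown here.

```python
def sample_operator(X, omega):
    """A(X): for each pair (i, j) in omega, <X, w_ij> = X_ii + X_jj - X_ij - X_ji."""
    return [X[i][i] + X[j][j] - X[i][j] - X[j][i] for i, j in omega]


def adjoint_operator(y, omega, n):
    """A*(y) = sum_k y_k * w_{alpha_k}, returned as a dense n x n matrix."""
    out = [[0.0] * n for _ in range(n)]
    for value, (i, j) in zip(y, omega):
        out[i][i] += value
        out[j][j] += value
        out[i][j] -= value
        out[j][i] -= value
    return out


def inner(U, V):
    """Matrix inner product <U, V> = sum of entry-wise products."""
    return sum(u * v for row_u, row_v in zip(U, V) for u, v in zip(row_u, row_v))


# Gram matrix of four hypothetical fragments.
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 2.0, 2.0]]
X = [[sum(a * b for a, b in zip(p, q)) for q in coords] for p in coords]

# Observed subset of fragment pairs (the measurement set).
omega = [(0, 1), (0, 3), (2, 3)]
f = sample_operator(X, omega)  # squared distances of the observed pairs

# The adjoint satisfies <A(X), y> = <X, A*(y)> for any vector y.
y = [1.0, -2.0, 0.5]
lhs = sum(fk * yk for fk, yk in zip(f, y))
rhs = inner(X, adjoint_operator(y, omega, len(X)))
```

Applied to a Gram matrix, $A$ returns exactly the squared distances of the sampled pairs, which is why the constraint $A(\mathbf{P}\mathbf{P}^T) = \mathbf{b}$ encodes all observed distances at once.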
A penalization term is further added to the objective function to control unexpectedly large distances predicted between adjacent DNA fragments, which are caused by low Hi-C data quality at certain genomic locations. Therefore, the final objective function is:

$$\min_{\mathbf{P}} \mathrm{Trace}(\mathbf{P}\mathbf{P}^T) + \frac{\lambda}{2}\left\|B(\mathbf{P}\mathbf{P}^T) - d_t\mathbf{1}\right\|_2^2, \quad \text{subject to } A(\mathbf{P}\mathbf{P}^T) = \mathbf{b}, \tag{5}$$

where $\lambda$ represents the penalization parameter, and the scalar $d_t$ represents the maximal allowed distance between adjacent DNA fragments. The linear measurement operator $B$ projects the Gram matrix to the sub-diagonal elements:

$$B(\mathbf{X}) = g(\mathbf{X}) \in \mathbb{R}^{(N-1) \times 1}, \quad \text{where } g_i(\mathbf{X}) = \langle \mathbf{X}, \boldsymbol{\omega}_{\beta_i} \rangle \ \text{for } \beta_i = (i, i+1), \text{ and } \mathbf{1} \in \mathbb{R}^{(N-1) \times 1}. \tag{6}$$

The adjoint of $B$ is denoted as $B^*$, where $B^*\mathbf{y} = \sum_i y_i \boldsymbol{\omega}_{\beta_i}$. Intuitively, the low-rank matrix completion model only needs a subset of the whole set of pairwise distances, which is indexed by $\Omega$, to reconstruct the Gram matrix $\mathbf{P}\mathbf{P}^T$, and it requires the optimal matrix $\mathbf{P}\mathbf{P}^T$ to satisfy three properties (Figure 2.1.a): (1) The rank of matrix $\mathbf{P}\mathbf{P}^T$ should be as small as possible, which is achieved by minimizing the trace of $\mathbf{P}\mathbf{P}^T$. This property is consistent with the low-rank assumption for 3D chromatin structures; (2) The pairwise distances based on the reconstructed 3D coordinates of DNA fragments should align with the subset of 3D distances indexed by $\Omega$ by satisfying the optimization constraints. This ensures that the model can accurately reconstruct 3D genome structures consistent with the observed pairwise distances; (3) The 3D distances between adjacent DNA fragments are bounded. This constraint removes unrealistically stretched structures of chromatin and guarantees a smooth genome structure. Since the trace function $\mathrm{Trace}(\mathbf{P}\mathbf{P}^T)$ is convex with respect to $\mathbf{P}$, we solve the optimization problem by the alternating-direction method of multipliers [49].
The augmented Lagrangian is given by:

$$L(\mathbf{P}; \Lambda) = \mathrm{Trace}(\mathbf{P}\mathbf{P}^T) + \frac{\lambda}{2}\left\|B(\mathbf{P}\mathbf{P}^T) - d_t\mathbf{1}\right\|_2^2 + \frac{r}{2}\left\|A(\mathbf{P}\mathbf{P}^T) - \mathbf{b} + \Lambda\right\|_2^2, \tag{7}$$

where $\lambda$ is the penalty parameter, $r$ is the regularization parameter, and $\Lambda$ is the Lagrangian multiplier. The gradient of the augmented Lagrangian with respect to $\mathbf{P}$ is given by:

$$2\mathbf{P} + 2\lambda B^*\left(B(\mathbf{P}\mathbf{P}^T) - d_t\mathbf{1}\right)\mathbf{P} + 2rA^*\left(A(\mathbf{P}\mathbf{P}^T) - \mathbf{b} + \Lambda\right)\mathbf{P}. \tag{8}$$

Starting from $\Lambda = 0$ and a random initial guess for $\mathbf{P}$, the following iteration continues until the error between the reconstructed and observed distances indexed by $\Omega$ is smaller than a specified threshold (default = $10^{-3}$): $\mathbf{P}$ is updated with the Barzilai-Borwein steepest descent method using the current $\Lambda$, and then $\Lambda$ is updated using the current $\mathbf{P}$ [49]. The accuracy of the model does not rely on the values of $r$ and $\lambda$, and we set the parameters $r = 1$ and $\lambda = 10$ based on the previous study of low-rank reconstruction of Euclidean geometry [49]. To tune the only free parameter of the model, $d_t$, which is the maximal allowed distance between adjacent DNA fragments, we test FLAMINGO on experimental Hi-C data using different values of $d_t$ and select the distance yielding the smallest objective function as the default value (Figure A.21.b), which is found to be robust across different chromosomes and cell types (Figure A.21.c). This model demonstrates fast convergence when applied to both simulated data and experimental Hi-C data (Figure A.5; Figure A.21.b). FLAMINGO has an intrinsic computational complexity of $O(kN^2)$, where $k$ is the down-sampling rate that defines the subset ($\Omega$) of DNA fragment pairs (Supplementary Note 1). Thus, FLAMINGO has sufficiently high scalability to predict high-resolution structures for large genomes, where $N$ is large. Moreover, by using the low-rank property of a 3D distance matrix, FLAMINGO can reconstruct 3D genome structures using a small down-sampling rate $k$, such as 0.2, which can substantially accelerate the optimization.
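The gradient in Equation (8) can be sanity-checked against a finite-difference approximation of the augmented Lagrangian in Equation (7). The sketch below does this for a tiny random problem in pure Python; all names, the toy measurement set and the toy values of $\mathbf{b}$, $\Lambda$ and $d_t$ are assumptions chosen for illustration, and the code is a verification aid rather than the optimization routine itself (no Barzilai-Borwein step or multiplier update is shown).

```python
import random


def gram(P):
    """Gram matrix P P^T of an N x 3 coordinate list."""
    return [[sum(a * b for a, b in zip(p, q)) for q in P] for p in P]


def apply_op(X, pairs):
    """<X, w_ij> = X_ii + X_jj - X_ij - X_ji for each listed pair (operators A and B)."""
    return [X[i][i] + X[j][j] - X[i][j] - X[j][i] for i, j in pairs]


def adjoint(y, pairs, n):
    """Adjoint operator: sum_k y_k * w_{pair_k} as a dense n x n matrix."""
    out = [[0.0] * n for _ in range(n)]
    for value, (i, j) in zip(y, pairs):
        out[i][i] += value
        out[j][j] += value
        out[i][j] -= value
        out[j][i] -= value
    return out


def residuals(P, omega, b, multiplier, d_t):
    """Residuals B(PP^T) - d_t*1 and A(PP^T) - b + Lambda, plus shared intermediates."""
    X = gram(P)
    adjacent = [(i, i + 1) for i in range(len(P) - 1)]
    res_b = [v - d_t for v in apply_op(X, adjacent)]
    res_a = [v - bk + mk for v, bk, mk in zip(apply_op(X, omega), b, multiplier)]
    return X, adjacent, res_b, res_a


def lagrangian(P, omega, b, multiplier, d_t, penalty=10.0, reg=1.0):
    """Augmented Lagrangian of Equation (7); penalty is lambda and reg is r."""
    X, _, res_b, res_a = residuals(P, omega, b, multiplier, d_t)
    trace = sum(X[i][i] for i in range(len(P)))
    return (trace + penalty / 2.0 * sum(v * v for v in res_b)
            + reg / 2.0 * sum(v * v for v in res_a))


def gradient(P, omega, b, multiplier, d_t, penalty=10.0, reg=1.0):
    """Gradient of Equation (8): 2P + 2*lambda*B*(.)P + 2*r*A*(.)P."""
    n = len(P)
    _, adjacent, res_b, res_a = residuals(P, omega, b, multiplier, d_t)
    GB = adjoint(res_b, adjacent, n)
    GA = adjoint(res_a, omega, n)
    grad = []
    for i in range(n):
        row = []
        for c in range(3):
            gb = sum(GB[i][k] * P[k][c] for k in range(n))
            ga = sum(GA[i][k] * P[k][c] for k in range(n))
            row.append(2.0 * P[i][c] + 2.0 * penalty * gb + 2.0 * reg * ga)
        grad.append(row)
    return grad


# Tiny hypothetical problem: 5 fragments, 3 observed pairs.
random.seed(0)
P0 = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]
omega = [(0, 2), (1, 4), (0, 3)]
b = [1.0, 2.0, 0.5]
multiplier = [0.1, -0.2, 0.0]
d_t = 0.3

analytic = gradient(P0, omega, b, multiplier, d_t)
eps = 1e-6
max_diff = 0.0
for i in range(5):
    for c in range(3):
        plus = [row[:] for row in P0]
        minus = [row[:] for row in P0]
        plus[i][c] += eps
        minus[i][c] -= eps
        numeric = (lagrangian(plus, omega, b, multiplier, d_t)
                   - lagrangian(minus, omega, b, multiplier, d_t)) / (2.0 * eps)
        max_diff = max(max_diff, abs(numeric - analytic[i][c]))
```

The factor of 2 in front of the two penalty terms arises because each adjoint matrix is symmetric and $\mathbf{P}$ appears on both sides of $\mathbf{P}\mathbf{P}^T$; `max_diff` compares the two gradients entry by entry.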
Furthermore, the parallelized computation enabled by the hierarchical prediction strategy further boosts the reconstruction speed.

2.4.4 Assemble predicted structures from different scales

The same low-rank matrix completion algorithm is applied separately at two scales: (1) the 1Mb domain-level scale; and (2) the 5kb intra-domain scale. To construct the final 3D structure, the predicted intra-domain structures are assembled into the skeleton specified by the domain-level structures. At each 1Mb domain-level DNA fragment, the center of the corresponding intra-domain structure is placed at the 3D coordinates predicted for the domain-level fragment. The placed intra-domain structures are then rotated to minimize the overall reconstruction error between the predicted and the observed pairwise distances over DNA fragments across adjacent domains (inter-domain fragment distances) (Figure A.3). To identify the optimal 3D rotation matrices and control the corresponding computational cost, we search for a series of optimal 3D Givens rotation matrices on each dimension. The 3D rotation matrices are then approximated by the product of the 3D Givens rotation matrices. Denote the predicted intra-domain structure for domain $i$ as $\mathbf{S}_i$. The optimal 3D Givens rotation matrices for the $x$-axis across domains are identified by:

$$\min_{\theta_i^x} \sum_{j,k} \left( \left\| \mathbf{r}_{\theta_i^x}(\mathbf{S}_{i,j} - \mathbf{C}_i) + \mathbf{C}_i - \mathbf{S}_{i+1,k} \right\|_2 - D_{i,j;i+1,k} \right)^2, \tag{9}$$

where $\mathbf{r}_{\theta_i^x}$ is the 3D Givens rotation matrix of $\mathbf{S}_i$ for the $x$-axis with parameter $\theta_i^x$, $\mathbf{S}_{i,j}$ represents the DNA fragment $j$ within domain $i$, $\mathbf{C}_i$ represents the center of domain $i$ (which is inferred from the domain-level prediction), and $D_{i,j;i+1,k}$ represents the observed squared 3D distance between two inter-domain DNA fragments (fragment $j$ of domain $i$ and fragment $k$ of domain $i+1$) from adjacent domains. The same algorithm is applied to all domains consecutively to search for the rotation matrices of the $x$-axis for all domains.
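The rotation search in Equation (9) can be sketched for a single domain and a single axis as follows. This is a simplified illustration under several assumptions that are not stated in the text: the angle is found by a plain grid search, the observed inter-domain distances are expressed on the same scale as the Euclidean norm, and all names and toy coordinates are hypothetical.

```python
import math


def rotate_about_x(point, center, theta):
    """Rotate a 3D point by theta around the axis through center parallel to x."""
    x, y, z = (p - c for p, c in zip(point, center))
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (
        x + center[0],
        cos_t * y - sin_t * z + center[1],
        sin_t * y + cos_t * z + center[2],
    )


def rotation_error(theta, domain, center, next_domain, observed):
    """Sum over (j, k) of (||r(S_ij - C_i) + C_i - S_{i+1,k}|| - observed[j][k])^2."""
    total = 0.0
    for j, point in enumerate(domain):
        rotated = rotate_about_x(point, center, theta)
        for k, other in enumerate(next_domain):
            total += (math.dist(rotated, other) - observed[j][k]) ** 2
    return total


def best_x_rotation(domain, center, next_domain, observed, steps=360):
    """Grid search over `steps` equally spaced angles (an assumed search strategy)."""
    angles = [2.0 * math.pi * s / steps for s in range(steps)]
    return min(
        angles,
        key=lambda t: rotation_error(t, domain, center, next_domain, observed),
    )


# Toy check: rotate a known domain away from its true orientation, then recover
# the angle that restores the inter-domain distances.
theta_true = 2.0 * math.pi * 40 / 360
center = (0.5, 0.0, 0.5)
true_domain = [(0.0, 1.0, 0.0), (0.5, 0.0, 1.0), (1.0, -1.0, 0.5)]
next_domain = [(2.0, 1.0, 0.3), (2.5, 0.2, -0.7), (3.0, -0.5, 0.9)]
observed = [[math.dist(p, q) for q in next_domain] for p in true_domain]

misrotated = [rotate_about_x(p, center, -theta_true) for p in true_domain]
theta_hat = best_x_rotation(misrotated, center, next_domain, observed)
```

In this toy setting the recovered angle matches the 40-degree rotation that was applied, because it is the only grid angle that reproduces all nine inter-domain distances.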
Intuitively, the objective function searches for the best rotation $\mathbf{r}_{\theta_i^x}$ of domain $i$ around its center $\mathbf{C}_i$ to match the distances between fragments across adjacent domains observed from the Hi-C data. The rotation matrices for the $y$-axis and $z$-axis are obtained similarly. Therefore, a series of 3D Givens rotation matrices are identified iteratively for the three axes. Multiplying the converged 3D Givens rotation matrices together yields the optimal 3D rotation matrices, which are used to rotate the intra-domain structures, leading to the final genome structure. Since it jointly models all inter-domain distances between adjacent domains (i.e. off-diagonal points) and robustly identifies the globally optimal rotation matrices for all intra-domain structures, the rotation algorithm better aligns the reconstructed structures with the Hi-C data and boosts the accuracy of reconstruction.

2.4.5 Benchmark performance using simulated genome structures

To quantitatively benchmark the accuracy of FLAMINGO, we simulated 3D genome structures and generated matrices of squared pairwise distances between DNA fragments. The FLAMINGO algorithm was then applied to the squared pairwise distance matrices to reconstruct the 3D structures. The model performance was evaluated by comparing the reconstructed structure with the original structure in two ways. (1) The relative error between the reconstructed 3D coordinates ($\mathbf{C}_{\mathrm{re}}$) and the benchmark coordinates ($\mathbf{C}_{\mathrm{benchmark}}$) of DNA fragments was calculated:

$$RE_{\mathrm{coord}} = \left\|\mathbf{C}_{\mathrm{re}} - \mathbf{C}_{\mathrm{benchmark}}\right\|_2^2 \, / \, \left\|\mathbf{C}_{\mathrm{benchmark}}\right\|_2^2.$$

(2) The relative error between the reconstructed pairwise distance matrix ($\mathbf{R}$) and the original squared distance matrix ($\mathbf{D}$) was calculated:

$$RE = \left\|\mathbf{R} - \mathbf{D}^{(1/2)}\right\|_2^2 \, / \, \left\|\mathbf{D}^{(1/2)}\right\|_2^2.$$

Moreover, Spearman correlations between predicted and benchmark structures were also calculated to quantify the accuracy.
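The two relative-error metrics can be sketched as below. The sketch assumes that the matrix norm is the entry-wise (Frobenius) norm, that $\mathbf{D}^{(1/2)}$ denotes the element-wise square root of the squared distance matrix, and that the reconstructed coordinates have already been aligned to the benchmark; none of these details is stated explicitly in the text, and the names and toy coordinates are hypothetical.

```python
import math


def frobenius_sq(matrix):
    """Sum of squared entries of a matrix stored as nested lists."""
    return sum(v * v for row in matrix for v in row)


def relative_error(estimate, reference):
    """||estimate - reference||^2 / ||reference||^2 with the entry-wise norm."""
    diff = [
        [e - r for e, r in zip(est_row, ref_row)]
        for est_row, ref_row in zip(estimate, reference)
    ]
    return frobenius_sq(diff) / frobenius_sq(reference)


def pairwise_distances(coords):
    """Matrix of Euclidean distances between all pairs of 3D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


# Hypothetical benchmark structure and a slightly perturbed reconstruction.
benchmark = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
reconstructed = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [1.0, 0.9, 0.0], [1.0, 1.0, 1.0]]

# (1) Relative error of the 3D coordinates.
re_coord = relative_error(reconstructed, benchmark)

# (2) Relative error of the pairwise distances, starting from the squared
# benchmark distance matrix D and taking its element-wise square root.
squared_d = [[d * d for d in row] for row in pairwise_distances(benchmark)]
sqrt_d = [[math.sqrt(v) for v in row] for row in squared_d]
re_dist = relative_error(pairwise_distances(reconstructed), sqrt_d)
```

Unlike the coordinate-based error, the distance-based error is invariant to rotations and translations of the reconstructed structure, so it does not depend on the alignment step.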
To test the performance of FLAMINGO with respect to missing data, we randomly down-sampled subsets of the squared pairwise distances as inputs and considered the other squared pairwise distances as missing. Multiple down-sampled datasets were generated with different fractions of missing data, corresponding to different down-sampling rates. FLAMINGO was applied to these down-sampled squared pairwise distance matrices, and the resulting 3D coordinates of DNA fragments were used to calculate the relative errors and correlations. To further test the performance of FLAMINGO on noisy inputs, we added two levels of white noise separately to the down-sampled squared pairwise distance matrices. As suggested by previous research [49], the first level of noise (Noise level 1) was generated from the normal distribution $N(\delta, \delta)$, where $\delta$ represents the minimum value of the down-sampled squared pairwise distances. Similarly, the second level of noise (Noise level 2) was generated from the normal distribution $N(2\delta, \delta)$. In this way, the noisy down-sampled squared pairwise distances remain positive with high probability, consistent with the basic property of Euclidean distances. The simulations and down-sampling procedures were repeated 10 times for each benchmark setting. To test the assembly algorithm, we divided the benchmark structure into different domains or fragments. The intra-domain structures were reconstructed separately and then assembled into the final structures, which were compared with the benchmark structure. The relative errors of pairwise distances and 3D coordinates were calculated to demonstrate the high accuracy of the assembly algorithm and its robustness with respect to different choices of domain partitions (Figure A.7).

2.4.6 Performance comparison based on experimental Hi-C data

For each of the six cell-types, we reconstructed the 3D structures using FLAMINGO at 5kb-resolution for each of the 23 chromosomes, based on the normalized Hi-C input datasets.
To quantitatively evaluate the global reconstruction accuracy of FLAMINGO, we calculated the Spearman correlation coefficients between reconstructed and observed 3D distances for all pairs of DNA fragments, which are defined as all-points correlations. To further evaluate the accuracy of reconstructed intra-domain structures, we also calculated intra-domain correlations based on pairs of DNA fragments within the same domains. An accurately reconstructed structure is expected to demonstrate high correlations at both the all-points and intra-domain levels, which suggests that the reconstructed structure quantitatively aligns with the observed Hi-C datasets.

We compared the performance of FLAMINGO with six representative state-of-the-art algorithms: ShRec3D [33], GEM-FISH [34], Hierarchical3DGenome [35], SuperRec [37], ShNeigh [38] and RPR [36]. These methods were selected because they have been shown in previous studies to perform better than other methods using similar modeling strategies. Other existing methods are not included in the comparison because either they have been shown to be less accurate by previous studies or they do not practically converge at 5kb-resolution in our tests. All these methods were applied, with their suggested parameters, to all of the 23 chromosomes in the six cell-types at 5kb-resolution (Supplementary Note 1). GEM-FISH only finished for chromosome 21. ShRec3D, ShNeigh and RPR finished predictions only for short chromosomes (ShRec3D: chr13-22, ShNeigh: chr15-22 and chrX, and RPR: chr17-22). Hierarchical3DGenome and SuperRec finished predictions for all 23 chromosomes. The correlation coefficients for those chromosomes with complete predictions were calculated using the same method as explained above. At 5kb-resolution, the run-times on an AMD EPYC processor with 25 cores were recorded. The maximum memory was set to 100GB, which is sufficient for all algorithms.
To further quantify the performance of FLAMINGO with respect to large fractions of missing data, we randomly down-sampled the squared pairwise distance matrix with different down-sampling rates. Using the down-sampled input data, we tested the performance of FLAMINGO and other methods based on the correlation metrics described above. For each down-sampling rate, ten random samples with missing data were generated. The correlation coefficients were calculated for each random sample to evaluate the model performance. Because of the impractically long computational times needed by other methods for large chromosomes, only the chromosomes with complete predictions from these methods are included in this comparison. As orthogonal biological information for model comparisons, we also collected significant long-range chromatin interactions profiled by different experiments, including ChIA-PET [72], Capture-C [45], and SPRITE [46]. For each chromatin interaction, we calculated the predicted 3D distances between the interacting DNA fragments from the different reconstruction algorithms. Since interacting DNA fragments (anchors) are close to each other in 3D space, an algorithm is considered to have higher accuracy if it yields shorter predicted distances between interacting DNA fragments.

2.4.7 Analysis of multi-way chromatin interactions and QTLs

The multi-way chromatin interactions in GM12878 are collected from a dataset of SPRITE experiments [46]. To identify significant multi-way interactions, the Market-Basket algorithm is used to search for higher-order associations of multiple genomic regions that are supported by SPRITE sequencing reads. Significant 3-way, 4-way and 5-way interactions are called based on a confidence threshold of 0.1 and support thresholds of $3 \times 10^{-4}$, $2 \times 10^{-4}$ and $1.7 \times 10^{-4}$, respectively.
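The notion of support in market-basket analysis can be illustrated with a small sketch in which each SPRITE cluster is treated as a "basket" of genomic regions. Only the support threshold is illustrated; the confidence criterion is not reproduced because its exact definition for multi-way interactions is not given in the text. The function names, region labels and toy clusters are hypothetical.

```python
from collections import Counter
from itertools import combinations


def multiway_support(clusters, order):
    """Fraction of clusters containing each `order`-way combination of regions."""
    counts = Counter()
    for cluster in clusters:
        # De-duplicate and sort so each combination has one canonical key.
        for combo in combinations(sorted(set(cluster)), order):
            counts[combo] += 1
    total = len(clusters)
    return {combo: count / total for combo, count in counts.items()}


def significant_interactions(clusters, order, min_support):
    """Keep the combinations whose support reaches the chosen threshold."""
    support = multiway_support(clusters, order)
    return {combo: s for combo, s in support.items() if s >= min_support}


# Four hypothetical SPRITE clusters over regions labeled A-D.
toy_clusters = [
    ["A", "B", "C"],
    ["A", "B", "C", "D"],
    ["A", "B"],
    ["B", "C", "D"],
]

three_way = multiway_support(toy_clusters, order=3)
called = significant_interactions(toy_clusters, order=3, min_support=0.5)
```

Here the combinations (A, B, C) and (B, C, D) each appear in two of the four clusters, giving a support of 0.5. Enumerating all combinations grows quickly with cluster size, so this brute-force counting is only meant to convey the definition.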
The support thresholds are selected based on the curves of the number of called significant multi-way interactions as a function of the threshold, and the values corresponding to the elbow points are chosen. Genomic-distance controlled random samples of multi-way interactions are used to generate the background null distribution for statistical testing of the spatial distances among multi-way interacting anchors from the SPRITE data. To compare the fractions of SPRITE multi-way interactions captured by short predicted distances from FLAMINGO versus the fractions captured by short distances converted from Hi-C contact maps, distances are normalized by the F-norm to guarantee fair comparisons. A variety of distance thresholds are used to define 3D spatial neighborhoods. A multi-way interaction is considered to be captured if all interacting anchors are located in the same 3D spatial neighborhood.

The eQTL datasets [58-61] and hQTL datasets [62] are collected from matched cell-types, including whole blood cells and lymphoblastoid cells. The same normalization procedure is applied to compare the capability of assigning short spatial distances to QTLs based on the predicted distances versus the distances converted from Hi-C contact maps. Similarly, a variety of distance thresholds are used to define 3D spatial neighborhoods. Long-range eQTLs (>900 kb) and distal hQTLs are then evaluated for whether they can be interpreted by the predicted spatial proximity, that is, whether the SNP and its target region (i.e. a gene’s promoter or a histone modification peak) are predicted to have a shorter spatial distance than genomic-distance controlled random pairs. For every QTL, 1,000 random genomic-distance controlled pairs from the same chromosome are generated for comparison.
2.4.8 Curvature analysis for predicted 3D genome structures

To calculate the curvature in each 5kb genomic bin, a quadratic parametric function was fitted locally based on the specific genomic bin and the two neighboring upstream/downstream bins. Assume the parametric representation of the curve is $\vec{\mathbf{r}}(t) = (x(t), y(t), z(t))$, where each dimension can be written as a quadratic function, e.g. $x(t) = a_0 + a_1 t + a_2 t^2$. By fitting the curve locally, the curvature is calculated as $\kappa = |\vec{\mathbf{r}}\,'' \times \vec{\mathbf{r}}\,'| \, / \, |\vec{\mathbf{r}}\,'|^3$. To have a fair comparison across different chromosomes, curvatures are normalized by the median values of each chromosome. Curvature is then calculated around TAD boundaries [10].

2.4.9 Comparison with image-based single-cell structures

3D coordinates of genomic bins at 30kb-resolution across single cells for a 2Mb region in chromosome 21 are collected [68] and compared with FLAMINGO’s predictions. In K562, 797 single cells are kept for comparison after filtering out cells with >10% of bins having no data (missing data). Linear interpolation is used to fill in the missing coordinates in each single cell. To normalize the scales of structures, the 3D coordinate matrix ($\mathbf{P}$) of every single cell (30kb-resolution) is centered, and then scaled by the F-norm: $\mathbf{P}_{\mathrm{scaled}} = \mathbf{P} / \|\mathbf{P}\|_F$. Singular value decomposition (SVD) is then used to rotate and align the normalized single-cell structures (Supplementary Note 1). The average structure across single cells is calculated by taking the mean coordinates for each genomic bin. The consensus structure predicted by FLAMINGO (5kb-resolution) is centered, scaled and rotated using the same procedure, and is then aligned with the average structure of single cells or cluster-specific average structures. A 30kb-resolution version of the consensus structure is calculated by taking the average coordinates of six consecutive 5kb-resolution bins.
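The centering, F-norm scaling and bin-averaging steps described above can be sketched as follows. The SVD-based rotation and alignment step is not shown, trailing bins that do not fill a complete group are simply dropped (an assumption), and the function names and toy coordinates are hypothetical.

```python
import math


def center_and_scale(coords):
    """Center an N x 3 coordinate list, then divide by its Frobenius norm."""
    n = len(coords)
    mean = [sum(p[c] for p in coords) / n for c in range(3)]
    centered = [[p[c] - mean[c] for c in range(3)] for p in coords]
    fnorm = math.sqrt(sum(v * v for p in centered for v in p))
    return [[v / fnorm for v in p] for p in centered]


def average_bins(coords, group_size=6):
    """Average consecutive groups of bins, e.g. six 5kb bins into one 30kb bin.

    Bins at the end that do not fill a complete group are dropped.
    """
    merged = []
    for start in range(0, len(coords) - group_size + 1, group_size):
        block = coords[start:start + group_size]
        merged.append([sum(p[c] for p in block) / group_size for c in range(3)])
    return merged


# Twelve hypothetical 5kb bins lying on a straight line along the x-axis.
toy_structure = [[float(i), 0.0, 0.0] for i in range(12)]

coarse = average_bins(toy_structure, group_size=6)   # two 30kb bins
scaled = center_and_scale(toy_structure)             # unit Frobenius norm
```

After scaling, structures from different cells (or from FLAMINGO) have zero mean and unit F-norm, so their coordinates can be compared on a common scale once they are rotated into alignment.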
Hierarchical clustering is applied to single-cell structures based on Euclidean distance to classify the ensemble of single cells into clusters, which can systematically represent the structural variabilities across single cells. After aligning the predicted consensus structure with the variable single-cell structures, the differences of coordinates along the genomic region are calculated and compared to the intrinsic standard deviations among single cells.

2.4.10 Cross cell-type prediction of 3D genome structures

To predict 3D genome structures in cell-types without Hi-C datasets, which are defined as target cell-types, we further expand the FLAMINGO algorithm to combine the Hi-C dataset from a source cell-type and the DNase-seq dataset from the target cell-type, resulting in an integrative variant of FLAMINGO named iFLAMINGO. Intuitively, the Hi-C data from the source cell-type facilitate the inference of an approximate structure, which is fine-tuned by the cell-type specific DNase-seq data from the target cell-type. Based on the observation that 3D distances between interacting DNA fragments are associated with chromatin accessibilities (Figure A.19a), we impute the 3D distances between any two DNA fragments in the target cell-type ($D_{i,j}$) based on DNase-seq signals and 1D genomic distances (Figure A.19b). The imputation is achieved by fitting a linear regression model in the source cell-type:

$$D_{i,j} = \alpha_1 S_i + \alpha_2 S_j + \alpha_3 G_{i,j},$$

where $\alpha_1$, $\alpha_2$, and $\alpha_3$ are fitting parameters to be determined, $D_{i,j}$ represents the observed distance, $S_i$ represents the DNase-seq signal of DNA fragment $i$, and $G_{i,j}$ represents the 1D genomic distance between DNA fragments $i$ and $j$. Based on the fitted regression model, 3D distances between DNA fragments can be imputed in the target cell-type using the target cell-type specific DNase-seq data, and are then summarized into a matrix $\mathbf{E}$.
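The regression-based imputation can be sketched as an ordinary least-squares fit without an intercept, followed by prediction with the target cell-type's signals. The sketch below is a toy illustration: the solver (normal equations with Gauss-Jordan elimination), the use of bin offsets as the 1D genomic distance, and all names and numeric values are assumptions made for the example.

```python
def fit_three_coefficients(rows, targets):
    """Least squares for y ~ a1*x1 + a2*x2 + a3*x3 via the normal equations."""
    ata = [[sum(r[p] * r[q] for r in rows) for q in range(3)] for p in range(3)]
    aty = [sum(r[p] * y for r, y in zip(rows, targets)) for p in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3 x 4 augmented system.
    aug = [ata[p] + [aty[p]] for p in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(3):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * c for a, c in zip(aug[row], aug[col])]
    return [aug[p][3] / aug[p][p] for p in range(3)]


# Source cell-type: hypothetical DNase-seq signals and noise-free "observed"
# distances generated from known coefficients, so the fit can be checked.
source_signal = [1.0, 3.0, 2.0, 5.0, 4.0]
true_alpha = (-0.1, -0.1, 1.0)
pairs = [(i, j) for i in range(5) for j in range(i + 1, 5)]
rows = [(source_signal[i], source_signal[j], float(j - i)) for i, j in pairs]
targets = [sum(a * x for a, x in zip(true_alpha, row)) for row in rows]

alpha = fit_three_coefficients(rows, targets)

# Target cell-type: impute distances from its own DNase-seq signals.
target_signal = [2.0, 2.5, 1.0, 4.0, 3.0]
imputed = {
    (i, j): alpha[0] * target_signal[i] + alpha[1] * target_signal[j] + alpha[2] * (j - i)
    for i, j in pairs
}
```

Because the coefficients are learned in the source cell-type and applied to target-specific accessibility signals, the imputed matrix differs between target cell-types even when the same source Hi-C dataset is used.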
Therefore, the imputed 3D distance matrix $\mathbf{E}$ represents the target cell-type specific information, which can be used to improve the reconstruction of the corresponding 3D structure. The imputed 3D distance matrix is integrated into the original objective function as a penalization term, so that we solve the following problem to reconstruct the 3D structure:

$$\min_{\mathbf{P}} \mathrm{Trace}(\mathbf{P}\mathbf{P}^T) + \frac{\lambda}{2}\left\|B(\mathbf{P}\mathbf{P}^T) - d_t\mathbf{1}\right\|_2^2 + \gamma\left\|A(\mathbf{P}\mathbf{P}^T) - A(\mathbf{E}_M)\right\|_2^2, \quad \text{subject to } A(\mathbf{P}\mathbf{P}^T) = \mathbf{b}, \tag{10}$$

where $\gamma$ is the penalization parameter and $\mathbf{E}_M$ is the Gram matrix of the imputed 3D distance matrix ($\mathbf{E}$) for the target cell-type. The penalization term tunes the reconstructed 3D structure in the target cell-type to align with the imputed 3D distances from DNase-seq. Hence, by borrowing information from the source cell-type Hi-C data, iFLAMINGO predicts the cell-type specific 3D genome structures in the target cell-type.

To validate the performance of cross cell-type predictions, iFLAMINGO was applied to 30 source-target cell-type pairs, based on the six cell-types with Hi-C data available. For each source-target cell-type pair, we predicted the 3D genome structure for the target cell-type based on the Hi-C data from the source cell-type and the DNase-seq data from the target cell-type. The reconstructed 3D structures for target cell-types were evaluated by calculating the correlation coefficients between the reconstructed 3D distance matrix and the observed one based on the Hi-C dataset from the target cell-type. For comparison, we also evaluated the performance of the 3D distance matrices reconstructed solely from the Hi-C data of the source cell-type, without incorporating the DNase-seq information from the target cell-type.
2.4.11 Improve the resolution of 3D genome structures

iFLAMINGO integrates the high-resolution chromatin accessibility data to improve the resolution of predicted 3D genome structures, such as 5kb-resolution, based on relatively low-resolution Hi-C contact maps, such as 10kb-resolution. Given a Hi-C contact map at 10kb-resolution, we divide each 10kb genomic fragment into two consecutive 5kb fragments. The 5kb fragments inherit the same pairwise 3D distances from the original 10kb fragment. In this way, the $m \times m$ 3D distance matrix at 10kb-resolution is expanded into a $2m \times 2m$ 3D distance matrix at 5kb-resolution, which serves as the initial structure for high-resolution reconstruction. The high-resolution DNase-seq datasets of chromatin accessibility are then incorporated to impute the 3D distances between 5kb DNA fragments, following the same method described above (Figure A.19b). By applying the iFLAMINGO algorithm to the expanded 3D distance matrix from a low-resolution Hi-C contact map and the imputed one from a high-resolution DNase-seq dataset, the 3D genome structure at 5kb-resolution is then reconstructed. We applied the model to the Hi-C dataset in GM12878 for all 23 chromosomes at resolutions of 10kb, 25kb, and 50kb, respectively. The model performance is evaluated using the correlation coefficients (all-points and intra-domain) between the reconstructed and the observed 3D distance matrices at 5kb-resolution.

CHAPTER 3 PREDICT HIGH-RESOLUTION SINGLE-CELL 3D CHROMOSOME STRUCTURES USING TFLAMINGO

3.1 INTRODUCTION

The 3D chromosome structures provide the structural foundation of gene regulation, DNA replication, and cell differentiation. Comprehensive profiling of the 3D chromosome structures is important for understanding the structural basis of the interplay between genes, regulatory elements, and genetic variants.
Chromosome conformation capture-based methods, including Hi-C and Capture-C, have been widely used to profile the contacts between DNA fragments in different cell types and tissues, and have revealed important features of genome structure, such as chromatin loops, topologically associating domains (TADs), and chromatin compartments. However, the chromatin contact maps generated from bulk tissue only represent the average structure of millions of cells and thus cannot reflect the dynamic 3D chromatin structures across single cells. In recent years, the toolbox for measuring chromosome conformations in single cells has expanded considerably, including Dip-C, single-nucleus Hi-C (snHi-C), single-nucleus methyl-3C sequencing (snm3C), and single-cell Hi-C (scHi-C). These experimental methods measure contact frequencies between DNA fragments in individual cells and generate massive single-cell chromatin contact maps. These datasets extend the understanding of chromosome conformation from bulk tissue to single cells and reveal the cell-to-cell variability of chromosome structures. However, limited by the low sequencing depth, the single-cell chromatin contact maps are highly sparse at high resolution (e.g., >99.9% missing rate at 10kb resolution), making it highly challenging to study high-resolution 3D chromosome structures. To address this emerging challenge, computational methods that predict high-resolution single-cell 3D chromosome structures are highly desired.

In general, previous efforts in computationally predicting 3D chromosome structures can be classified into two categories: MDS-based methods and simulation-based methods. In the first category, the observed interaction frequencies from the single-cell chromatin contact maps are first converted to spatial distances. The 3D chromosome structures are then reconstructed from the derived distance matrices using MDS-based methods or recurrence plots.
ShRec3D and RPR are representative methods in this category. These methods are purely data-driven and make no additional assumptions about the 3D structures. However, they cannot handle a large fraction of missing data and show relatively low accuracy in reconstructing high-resolution single-cell 3D chromosome structures. In the second category, the observed contacts from the single-cell chromatin contact maps are used as constraints in simulating 3D chromosome structures based on polymer simulation models. The representative algorithms include isdHi-C, Si-C, and NucDynamics. In the simulation process, the algorithms simulate a 3D chromosome structure based on the biophysical properties of the DNA sequences and further refine the structure to maximize the contact probabilities of the observed interacting anchors. Benefiting from the polymer simulation, these methods have been successfully applied to reconstruct single-cell chromosome structures. However, the simulation-based methods make strong prior assumptions about the chromosome structures. In the objective functions of these methods, the invariant biophysical properties of single cells are treated as equally important as the dynamic single-cell chromatin contact maps, which may reduce the ability to capture structural variations across single cells. Additionally, these algorithms are configured with pre-defined parameters, requiring additional parameter selection procedures for different datasets.

To address these problems, we developed a low-rank tensor completion-based method, tFLAMINGO, to reconstruct high-resolution 3D chromatin structures from single-cell chromatin contact maps. As a powerful tool in video reconstruction and compression, low-rank tensor completion leverages the similarity between frames to infer the missing pixels.
Similarly, tFLAMINGO models every single-cell chromatin contact map as a frame and the whole dataset as a video, which facilitates information sharing across single cells to complete the missing data. Apart from the low-rank property of the tensor, tFLAMINGO further utilizes the low-rank property of the individual single-cell chromatin contact maps, which guarantees that the underlying single-cell 3D chromosome structures can be recovered from a subset of the pairwise distances. These two algorithmic advantages distinguish tFLAMINGO from existing methods, giving it superior accuracy in reconstructing single-cell 3D chromatin structures and a strong ability to capture dynamic structural variability.

We applied tFLAMINGO to four single-cell chromatin conformation datasets (Dip-C, snHi-C, snm3C, and scHi-C) in three cell types and reconstructed the 3D chromatin structures for all single cells at 10kb and 30kb resolution. Based on extensive simulated datasets and experimental bulk tissue chromatin contact maps, tFLAMINGO demonstrates superior performance over existing methods in predicting 3D chromosome structures. Beyond the robust compartment and TAD structures across single cells, the 3D structures predicted by tFLAMINGO capture the dynamic single-cell chromatin interactions, which allows us to evaluate dynamic gene regulation in 3D space. Furthermore, the 3D structures predicted by tFLAMINGO provide new biological insights into the mechanistic interpretation of GWAS SNPs and high-order chromatin interactions.

3.2 RESULTS

3.2.1 tFLAMINGO reconstructs high-resolution single-cell 3D chromosome structures

Single-cell chromatin conformation capture (3C) experiments measure the 3D chromosome structures and generate the chromatin contact maps for tens to hundreds of cells simultaneously (Figure 3.1.a).
Limited by the low sequencing depth, the single-cell chromatin contact maps are highly sparse and contain a large fraction of missing data at high resolution (>99.9% at 10kb resolution). Unlike existing methods, which model every single cell separately, tFLAMINGO jointly models the whole single-cell 3C dataset as a tensor, where the frontal slices represent the single-cell chromatin contact maps (Figure 3.1.a). This formulation enables the imputation of missing contact frequencies in one cell to borrow information from other cells, thus mitigating the high missing rates of single-cell 3C datasets and accurately reconstructing single-cell 3D chromosome structures. Given a sparse tensor of the single-cell 3C dataset, tFLAMINGO constructs a low-rank dense tensor that optimally aligns with the observed entries of the input, thus completing the missing values (Figure 3.1.a). Computationally, the construction is achieved by minimizing the tensor tubal rank of the dense tensor while requiring the reconstructed values to equal the observed values on the measurement set. Based on the completed single-cell chromatin contact maps, the 3D chromosome structures are predicted for every single cell by our in-house 3D reconstruction algorithm, FLAMINGO.

Figure 3.1 Overview of tFLAMINGO. (a) Schematic figure of tFLAMINGO. Biologically, the scHi-C experiment generates highly sparse contact maps for N cells. For every single cell, the distance matrix derived from the scHi-C experiment is a low-rank matrix (rank ≤ 5). Thus, the tensor organizing the distance matrices of N cells is a low-rank tensor, and the missing values can be completed using the low-rank tensor completion method. To accurately reconstruct the 3D chromatin structure of single cells at high resolution, tFLAMINGO utilizes the tube-wise Fourier transform to borrow information across single cells while keeping the geometric character of every single cell. Based on the completed distance matrix, FLAMINGO is used to reconstruct the 3D chromatin structure for every single cell. (b) Reconstruction of the 3D chromatin structure for 14 single cells at 10kb-resolution by tFLAMINGO using GM12878 Dip-C data.

The key design of tFLAMINGO is to model the whole single-cell 3C dataset as a tensor with dimension 𝑀 × 𝑀 × 𝑁, where 𝑀 represents the number of genomic loci and 𝑁 represents the number of single cells. The low-rank tensor completion method has been widely used to represent large-scale high-dimensional datasets with low-dimensional features, and its applications include video compression and reconstruction (Figure 3.1.a). In the single-cell 3C dataset, the low-rank property holds in two respects. Firstly, the tensor summarizing all single-cell chromatin contact maps is a low-rank tensor. This is because the cell-type-specific chromosome structures are observed to be robust at low resolution (i.e., coarser than 1MB resolution), suggesting that single cells share a consensus backbone structure. In the single-cell 3C dataset, the information is redundant since the consensus structure is repeatedly measured in all single-cell chromatin contact maps. Therefore, the single-cell chromatin contact maps are complementary in characterizing the consensus structure, which implies the low-rank property of the single-cell 3C dataset. Based on this property, the missing values in one cell can be inferred by borrowing information from the measurements of the same contacts in other cells. In tFLAMINGO, the information integration is facilitated by the tube-wise Fourier transform across all cells. Secondly, single-cell chromatin contact maps are low-rank matrices. According to Euclidean geometry, the 𝑀 × 𝑀 chromatin contact map is induced by the 𝑀 × 3 coordinate matrix, thus having rank ≤ 5.
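The two low-rank properties described above can be checked numerically on toy data. The sketch below is an illustration under stated assumptions, not the tFLAMINGO implementation: it uses squared Euclidean distance matrices (for which the rank bound of 5 for points in 3D is a standard result), and it measures the tubal rank as the largest rank among the frontal slices after a Fourier transform along the cell axis. The function names and the tolerance `rel_tol` are hypothetical.

```python
import numpy as np

def squared_distance_matrix(coords):
    """Squared Euclidean distances between the rows of an (M, 3) coordinate matrix."""
    sq_norms = (coords ** 2).sum(axis=1)
    return sq_norms[:, None] + sq_norms[None, :] - 2.0 * coords @ coords.T

def numerical_rank(matrix, rel_tol=1e-8):
    """Number of singular values above rel_tol times the largest singular value."""
    s = np.linalg.svd(matrix, compute_uv=False)
    if s[0] == 0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))

def tubal_rank(tensor, rel_tol=1e-8):
    """Largest rank among frontal slices after an FFT along the third (cell) axis."""
    fourier = np.fft.fft(tensor, axis=2)
    spectra = [np.linalg.svd(fourier[:, :, k], compute_uv=False)
               for k in range(fourier.shape[2])]
    scale = max(s[0] for s in spectra)
    return max(int(np.sum(s > rel_tol * scale)) for s in spectra)

rng = np.random.default_rng(0)
num_loci, num_cells = 60, 8
consensus = rng.normal(size=(num_loci, 3))

# Property 2: one cell's squared distance matrix has rank at most 5.
single_cell_rank = numerical_rank(squared_distance_matrix(consensus))

# Property 1: if every cell shares the consensus structure, the M x M x N
# tensor has a small tubal rank even though M is much larger.
tensor = np.stack([squared_distance_matrix(consensus)] * num_cells, axis=2)
shared_tubal_rank = tubal_rank(tensor)
```

In real data the cells differ from one another, so the tubal rank grows with the number of distinct structures; the point of the sketch is only that redundancy across cells keeps it far below the number of loci.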
This property guarantees that the chromatin contact maps can be fully compressed and reconstructed using at most five singular values, which is far fewer than the number of genomic loci at high resolution. Thus, the chromatin contact maps can also be recovered from a small fraction of the observed entries. By taking advantage of these low-rank properties, tFLAMINGO facilitates large-scale information sharing within and across single cells and completes the missing values of the sparse single-cell chromatin contact maps.

Our previously developed algorithm, FLAMINGO, is used to predict the 3D chromosome structures from the completed single-cell chromatin contact maps. FLAMINGO demonstrates superior performance and scalability in reconstructing high-resolution 3D chromosome structures. These features are especially important for reconstructing high-resolution single-cell chromosome structures, which involves predicting the 3D location of several thousand genomic loci. Remarkably, FLAMINGO takes less than 25 hours to reconstruct the 3D structure of human chromosome 1 at 1kb resolution.

Figure 3.2 Simulation analyses of tFLAMINGO. (a) Schematic figure of simulations. Three consensus structures are generated from the same starting structure, with parameter 𝑊 controlling the similarity between consensus structures. Each consensus structure is repeated ten times to generate a tensor with 30 frontal slices. The resulting tensor is further mixed with different levels of noise (no noise, noise level 1, and noise level 2) and down-sampled. The highly noisy and incomplete tensor is used as the input of tFLAMINGO to reconstruct the consensus structures. (b) Performance of tFLAMINGO under different weights with 1,000 beads and a 0.5% down-sampling rate. (c) Examples of benchmark consensus structures and tFLAMINGO predictions (weight=0.6, correlation=0.783, RMSD=0.133).
We applied tFLAMINGO to four single-cell 3C datasets to predict single-cell 3D chromosome structures at 10kb and 30kb resolution, providing the largest cohort of high-resolution single-cell chromatin structures. As an example, tFLAMINGO predicted the 3D structures of chromosome 21 for 14 GM12878 cells based on the Dip-C data, whose missing rate is over 99.95% at 10kb resolution (Figure 3.1.b, Figure B.1-3).

3.2.2 Performance validation based on the simulation analyses

The performance of tFLAMINGO is first validated by reconstructing simulated benchmark structures. We simulated a sparse tensor with three consensus structures, whose similarity is controlled by the weight 𝑊 (Figure 3.2.a). tFLAMINGO is applied to the simulated dataset to reconstruct the underlying 3D structures. Across a wide range of weights 𝑊, the 3D structures predicted by tFLAMINGO are highly consistent with the benchmark consensus structures, verifying that tFLAMINGO can capture the structural variations across single cells. As a representative example, at weight 0.6, the predicted 3D structures accurately capture the unique layouts of the different consensus structures (Figure 3.2.c, Figure B.11). Moreover, the predicted 3D structures of the frontal slices are grouped into three clusters based on the pairwise RMSD, consistent with their original identities in the data generation process. In addition to different weights, tFLAMINGO demonstrates strong performance on simulated datasets with different numbers of beads and frontal slices, as well as different down-sampling rates (Figure B.7-10). Remarkably, tFLAMINGO can accurately reconstruct the 3D structure with 3000 beads based on only 0.1% of the pairwise distances (Figure B.7-8, correlation=0.647, RMSD=0.193), showing that tFLAMINGO is able to reconstruct high-resolution 3D chromosome structures from highly sparse chromatin contact maps. Furthermore, we compared the performance of tFLAMINGO with a baseline method, which completes the missing values by averaging all frontal slices. Strikingly, tFLAMINGO shows significantly higher accuracy than the baseline method, demonstrating that the algorithmic design of tFLAMINGO is necessary to reconstruct the single-cell 3D chromosome structures accurately (Figure B.10).

Figure 3.3 Performance validation based on the STORM dataset. (a) The number of STORM structures that are correctly captured by predicted single-cell structures for all methods. The center lines of boxplots show the median; the upper and lower box limits show the 75th and 25th percentiles, respectively. The whiskers extend up to 1.5 times the interquartile range away from the limits of the boxes. (b) Examples of predicted 3D chromatin structures and top 20 aligned STORM structures for tFLAMINGO and isdHi-C. (c-d) tFLAMINGO accurately reconstructs the underlying 3D structures from snHi-C data. For each snHi-C single cell, the correlations between the raw snHi-C distance matrix and STORM distance matrices are calculated. The top 20 correlated STORM structures are considered to represent the true underlying 3D structures of the snHi-C distance matrix. (c) tFLAMINGO predictions show the highest correlations with the top 20 STORM structures. The center lines of boxplots show the median; the upper and lower box limits show the 75th and 25th percentiles, respectively. The whiskers extend up to 1.5 times the interquartile range away from the limits of the boxes. (d) Example of the predicted 3D chromatin structure of snHi-C single cell 1. tFLAMINGO shows the highest correlation with the pooled STORM structures (correlation=0.676).
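The averaging baseline used in the simulation comparison can be sketched as follows. This is a minimal illustration assuming that missing entries are encoded as NaN and that each missing entry is filled with the mean of the values observed for the same locus pair in other frontal slices; the function name `average_slice_baseline` is hypothetical, and the exact baseline used in the benchmark may differ in detail.

```python
import numpy as np

def average_slice_baseline(tensor):
    """Fill NaN entries of each frontal slice with the across-slice mean of observed values.

    tensor: array of shape (M, M, N); entries never observed in any slice stay NaN.
    """
    tensor = np.asarray(tensor, dtype=float)
    observed = ~np.isnan(tensor)
    counts = observed.sum(axis=2)
    sums = np.where(observed, tensor, 0.0).sum(axis=2)
    mean = np.divide(sums, counts,
                     out=np.full(sums.shape, np.nan),
                     where=counts > 0)
    # Keep observed values; use the across-cell mean only where data are missing.
    return np.where(observed, tensor, mean[:, :, None])

# Toy tensor with 2 loci and 3 cells.
tensor = np.full((2, 2, 3), np.nan)
tensor[0, 1, 0] = 1.0
tensor[0, 1, 2] = 3.0
tensor[1, 0, :] = [2.0, 2.0, 2.0]
filled = average_slice_baseline(tensor)
```

Because every cell receives the same fill value for a given locus pair, a baseline of this kind cannot represent cell-specific structure in the missing entries, which is the limitation the comparison is designed to expose.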
3.2.3 Performance comparison based on the STORM dataset

The performance of tFLAMINGO in reconstructing the 3D structures of human chromosome 21 for 14 K562 cells is benchmarked against the STORM dataset and compared with four state-of-the-art algorithms: RPR, ShRec3D, Si-C, and isdHiC. The similarity between the predicted structures and the STORM structures is quantified by Spearman correlations, which have been widely used to quantify the accuracy of structure reconstructions. Since the STORM experiment measures the 3D structures of a 2MB region in thousands of cells, we calculated the Spearman correlations between all pairs of predicted structures and STORM structures. Such a comparison provides direct evidence of the model performance at the single-cell level.

Firstly, we evaluated the consistency between the predicted structures and the STORM structures. On average, tFLAMINGO predictions are supported by 73.4 STORM structures (Figure 3.3.a, correlation>0.8), while fewer than 50 STORM structures support the other methods. For example, the single-cell 3D structures predicted by tFLAMINGO align well with the average of the top 20 STORM structures ranked by correlation. In comparison, the predictions of isdHiC show structures distinct from the STORM data (Figure 3.3.b).

Secondly, the reconstruction accuracy of the different algorithms is evaluated using the STORM structures. Since the underlying single-cell 3D chromosome structures of the snHi-C data are unknown, the average of the top 20 STORM structures with the highest similarity to the raw snHi-C contact maps is used as the gold standard to evaluate the model performance. Across all methods, tFLAMINGO demonstrates the highest correlations (Figure 3.3.c, median correlation 0.56). Figure 3.3.d shows one example where tFLAMINGO accurately reconstructs the underlying structures (correlation = 0.853) while other methods demonstrate lower accuracy (correlation < 0.4).
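The support count used in this comparison can be sketched as below: for one predicted structure, count the reference structures whose pairwise distances have a Spearman correlation above a threshold with the predicted distances. The sketch is an illustration only; the helper names, the use of the upper triangle of the distance matrix, and the toy data are assumptions, while the 0.8 threshold is the one quoted above.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values receiving their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def pairwise_distances(coords):
    """Euclidean distance matrix for an (M, 3) coordinate array."""
    return np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

def upper_triangle(dist):
    """Distances for all locus pairs i < j, as a flat vector."""
    i, j = np.triu_indices(dist.shape[0], k=1)
    return dist[i, j]

def count_supporting(predicted_dist, reference_dists, threshold=0.8):
    """Number of reference distance matrices correlated above threshold with the prediction."""
    pred = upper_triangle(predicted_dist)
    return sum(spearman(pred, upper_triangle(ref)) > threshold
               for ref in reference_dists)

rng = np.random.default_rng(0)
coords = rng.normal(size=(30, 3))
predicted = pairwise_distances(coords)
references = [
    pairwise_distances(coords + 0.05 * rng.normal(size=coords.shape)),  # near-identical structure
    pairwise_distances(rng.normal(size=(30, 3))),                       # unrelated structure
]
num_supporting = count_supporting(predicted, references)
```

Rank-based correlation is insensitive to any monotone rescaling of the distances, which is convenient when predicted and imaged structures are expressed in different units.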
These results not only validate the accuracy of tFLAMINGO in predicting single-cell 3D chromosome structures, but also suggest that the structures predicted by tFLAMINGO are widely represented in the population of K562 cells, thus supporting the biological relevance of tFLAMINGO predictions.

Figure 3.4 Systematic performance comparison in reconstructing single-cell chromosome structures. (a) Performance comparison at 10kb-resolution across four single-cell datasets (GM12878 Dip-C dataset, K562 snHi-C dataset, mESC scHi-C dataset, and mESC snm3C dataset). The chromatin contact maps generated in the matching tissues are used as gold standards. (b) Performance comparison at 30kb-resolution across four single-cell datasets (GM12878 Dip-C dataset, K562 snHi-C dataset, mESC scHi-C dataset, and mESC snm3C dataset). The predictions of NucDynamics provided by Si-C are directly used. (c) Example of the accurately reconstructed distance matrix of tFLAMINGO at chr19:8,200,000-9,100,000. (d) UMAP visualization of the reconstructed distance matrices. tFLAMINGO accurately predicts the cell-type-specific chromatin structures.

3.2.4 Performance comparison based on the bulk tissue chromatin contact maps

The performance of tFLAMINGO is systematically evaluated and compared with existing algorithms based on bulk tissue chromatin contact maps. As orthogonal evidence, the bulk tissue chromatin contact maps from Hi-C, 3D ATAC-PALM, and GAM are collected to verify the pairwise distance matrices predicted by the different algorithms. For every single cell, Spearman correlations based on two sets of distances are calculated to quantify the model performance: (1) Spearman correlations based on the distances measured in the bulk tissue datasets (termed ‘all distance correlations’) and (2) Spearman correlations based on the distances measured in both the bulk tissue datasets and the corresponding single-cell contact map (termed ‘validated distance correlations’).
Comparing these two metrics, the all distance correlations mainly quantify the accuracy of the completed missing values, while the validated distance correlations evaluate the accuracy of recapitulating the observed values. Strikingly, tFLAMINGO demonstrates superior performance in reconstructing the single-cell 3D chromosome structures, especially at 10kb resolution (Figure 3.4.a-b). More importantly, tFLAMINGO shows an even larger improvement over existing methods based on all distance correlations (Figure 3.4.a-b, tFLAMINGO: 0.52, other methods < 0.3), suggesting a highly enhanced ability to impute the missing pairwise distances. The improved performance stems from the algorithmic design of tFLAMINGO. Unlike the simulation-based methods, where the missing values are completed based on the polymer simulation, tFLAMINGO uses the low-rank structures learned from the observed data to impute the missing values, thus showing better consistency with the biological ground truth. At 30kb resolution, tFLAMINGO still achieves consistently high accuracy. Figure 3.4.c shows a representative example of chromatin contact maps predicted by tFLAMINGO and Si-C. In this 1MB genomic region, the distance matrix predicted by tFLAMINGO accurately recapitulates the domain structures and long-range chromatin interactions of the GAM chromatin contact map (correlation 0.68), which are not observed in the prediction of Si-C (correlation 0.24). These results provide quantitative support for the superior performance of tFLAMINGO in reconstructing the single-cell 3D chromosome structures.

Figure 3.5 Systematic performance comparison in imputing high-resolution single-cell chromatin contact maps. (a) Evaluation of the accuracy of imputed single-cell contact maps on mESC snm3C datasets at 30kb-resolution. Bulk tissue chromatin contact maps generated by orthogonal experiments are used as gold standards. The error bars represent the standard deviations across 351 single cells. (b) Example of single-cell contact maps imputed by tFLAMINGO and Higashi. The TAD pattern predicted by tFLAMINGO aligns with the bulk-tissue CTCF ChIP-seq peaks. (c-d) Performance comparison in identifying cell types based on the imputed contact maps across different resolutions. (c) The quantitative accuracy is evaluated by the Adjusted Rand Index (ARI). (d) UMAP of distance matrices predicted by Higashi and tFLAMINGO. Dots represent single cells and are colored by cell type.

Furthermore, we evaluated the performance of different algorithms in capturing structural variations across cell types. All algorithms are applied to the snm3C dataset, including 351 mESC cells and 96 NMuMG cells, to reconstruct the single-cell 3D chromosome structures. The distance matrices induced by the predicted 3D chromosome structures are projected into two-dimensional space using UMAP to discover clusters of cells. Since the great majority of pairwise distances are missing from the raw single-cell snm3C chromatin contact maps at 30kb resolution (missing rate >99.9%), the observed values from the raw single-cell snm3C dataset cannot reflect the cell-type-specific structural variations, and only one cloud of cells is observed (Figure 3.4.d). By jointly modeling all cells within the same cell type, tFLAMINGO identifies the cell-type-specific structures and correctly projects cells into the matching clusters (Figure 3.4.d). In comparison, no clear clusters of cells are observed in the UMAP plots based on the predictions of other methods (Figure 3.4.d). These results suggest that, by jointly modeling all cells, tFLAMINGO is better able to complete the missing data and capture the cell-type-specific structural variations of single cells.

3.2.5 Performance comparison in imputing high-resolution chromatin contact maps

Currently, analyses of single-cell 3D chromatin structures at high resolution are significantly hindered by the sparsity of the single-cell datasets.
To enhance the usability of single-cell datasets at high resolution, computational methods have been developed to impute single-cell chromatin contact maps, of which Higashi is the latest and most powerful. In tFLAMINGO, the 3D distances between DNA fragments are naturally induced by the predicted single-cell 3D chromosome structures, thus leading to a complete distance matrix. By further converting the spatial distances to interaction frequencies using the observed negative exponent function, tFLAMINGO can impute high-resolution chromatin contact maps.

To evaluate the performance of tFLAMINGO in imputing chromatin contact maps, tFLAMINGO is applied to the snm3C dataset of 351 mESC cells to impute the single-cell chromatin contact maps at 30kb resolution and compared with Higashi. The performance is evaluated based on the correlations between the imputed chromatin contact maps and bulk tissue chromatin contact maps measured by 3D ATAC-PALM, GAM, and Hi-C in bulk mESC cells. Compared with Higashi, tFLAMINGO demonstrates higher correlations (Figure 3.5.a, correlations > 0.44) across all comparisons. As Figure 3.5.b shows, the chromatin contact map imputed by tFLAMINGO for single cell 1 accurately captures the TAD structures of the bulk Hi-C contact maps. Furthermore, the TAD boundaries in the imputed single-cell chromatin contact maps are supported by CTCF binding, which further illustrates the accuracy of tFLAMINGO. In comparison, no clear TAD structures are observed in the chromatin contact map imputed by Higashi, and the TAD boundaries are not consistent with the CTCF binding profiles.

The accuracy of imputation is also evaluated by the ability to identify cell-type-specific structures. tFLAMINGO and Higashi are used to impute the chromatin contact maps for 351 mESC cells and 96 NMuMG cells at 1MB, 250kb, and 30kb resolutions (Figure 3.5.c).
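The conversion from spatial distances back to interaction frequencies mentioned above can be sketched with a power-law relationship. The following is a hedged illustration, not the tFLAMINGO code: it assumes the relation distance = frequency^(-alpha), so that frequency = distance^(-1/alpha), and the exponent `alpha=0.25` and the function name are assumptions rather than values taken from this section.

```python
import numpy as np

def distances_to_frequencies(dist, alpha=0.25):
    """Map pairwise 3D distances to interaction frequencies via freq = dist ** (-1 / alpha).

    Assumption: distance and frequency follow dist = freq ** (-alpha).
    Non-positive distances (for example the diagonal) are returned as NaN.
    """
    dist = np.asarray(dist, dtype=float)
    freq = np.full(dist.shape, np.nan)
    positive = dist > 0
    freq[positive] = dist[positive] ** (-1.0 / alpha)
    return freq

# Toy distance matrix: closer loci receive higher imputed contact frequencies.
dist = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.5],
                 [2.0, 1.5, 0.0]])
freq = distances_to_frequencies(dist, alpha=0.5)
```

Any monotonically decreasing mapping preserves the rank order of locus pairs, so rank-based evaluations of the imputed maps do not depend on the particular exponent chosen.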
The cells are further clustered based on the imputed distance matrices in two-dimensional space to assign cell identities. The Adjusted Rand Index (ARI) is used to evaluate the similarity between the predicted cell-type identities and the ground truth. At 1MB and 250kb resolution, both Higashi and tFLAMINGO demonstrate high ARI. However, tFLAMINGO can still correctly predict the cell identity at 30kb resolution (Figure 3.5.c, ARI: 0.53), while Higashi fails to correctly identify clusters of different cell types (ARI: 0.13), suggesting that tFLAMINGO is better able to capture high-resolution cell-type-specific structural variations. As Figure 3.5.d shows, the chromatin contact maps imputed by tFLAMINGO can be robustly clustered into two cell clusters across all resolutions. In contrast, the cell clusters based on the imputation of Higashi gradually merge as resolution increases. These results clearly demonstrate the usability of tFLAMINGO in imputing single-cell chromatin contact maps at high resolution.

Figure 3.6 Compartment analyses and TAD analyses in single cells. (a) Compartment identification of chromosome 19 for 351 mESC single cells at 30kb-resolution. PC1 scores based on the bulk Hi-C contact maps, the pooled single-cell contact maps, and all 351 single cells are shown. (b) Example of TAD boundary identification at chr19:13,620,000-15,750,000. Seven out of eight predicted TAD boundaries are supported by the bulk tissue CTCF ChIP-seq dataset. The TAD boundary without a CTCF ChIP-seq peak contains a transposable B2 SINE element with a B-box and a TE-derived CTCF motif. (c-d) Regions with higher TAD boundary scores tend to have (c) higher CTCF binding strength and (d) a higher number of CTCF peaks (High: <20% quantile; Mid: 20%-80%; Low: >80%). (e-f) Functional regions tend to have higher structural stability across single cells. The RMSD is calculated between the 3D chromatin structure of every single cell and the averaged structure of all cells. Lower RMSD indicates that the 3D locations of DNA fragments are stable across single cells. (e) Compartment A shows lower RMSD compared with compartment B. (f) Genes specifically expressed in GM12878 cells show lower RMSD compared with other genes.

3.2.6 Single-cell compartment and TAD analyses of tFLAMINGO

Based on the bulk tissue Hi-C chromatin contact maps, the chromosomes are segregated into densely interconnected regions, i.e., compartments and topologically associating domains (TADs), which delineate the outlines of the chromosome structures in 3D space. Due to the high sparsity of the single-cell chromatin contact maps, discovering compartments and TADs for every cell is still challenging. Therefore, the completed high-resolution chromatin contact maps imputed by tFLAMINGO provide a foundation to study the single-cell compartment and TAD structures.

Based on the chromatin contact maps imputed by tFLAMINGO for 351 mESC single cells, single-cell compartments and TADs are called following the existing methodology. As Figure 3.6.a shows, chromosome 19 can be divided into two major compartments based on the bulk tissue Hi-C dataset in mESC. Interestingly, the same compartment structure is also observed in the pooled (average) single-cell chromatin contact maps, further verifying the accuracy of the predicted single-cell 3D chromosome structures. Moreover, all 351 mESC cells show consistent distributions of the PC1 scores, suggesting that the chromosome structures are highly stable across single cells at the compartment level. We also repeated the analyses in 14 GM12878 cells, and similar distributions of the PC1 scores along the genome are observed across single cells and bulk tissue Hi-C data (Figure B.12).

To gain insights into chromatin structures across single cells at the TAD level, we further calculated the TAD boundary scores from the single-cell chromatin contact maps imputed by tFLAMINGO.
As an example, Figure 3.6.b shows the distribution of the TAD boundary scores in a ~2MB genomic region. Eight regions with consistently high TAD boundary scores across all single cells are identified as TAD boundaries. Interestingly, seven out of eight TAD boundaries are intensively bound by CTCF, which is consistent with the loop extrusion model. For the TAD boundary without CTCF binding, a CTCF motif (p-value < 1.2×10^-5) and a B-box regulatory element are observed in a SINE/B2 transposable element, a class of elements that has been shown to shape chromatin structures by serving as TAD boundaries. Quantitatively, DNA fragments with high TAD boundary scores show significantly higher CTCF binding intensity (p-value = 2.48×10^-5) and more CTCF binding instances (p-value = 8.32×10^-4) compared with DNA fragments with medium and low TAD boundary scores (Figure 3.6.c and Figure 3.6.d). These results further highlight that the formation of persistent TAD boundaries across single cells is mediated by CTCF binding, which is consistent with the loop extrusion model.

We further explored the structural stability of 3D chromatin structures across single cells and its relationship with gene regulation in GM12878. The RMSD between the single-cell 3D chromatin structures and the average structure across all cells is calculated along the chromosome to quantify the structural stability at each genomic location. Interestingly, we found a differential distribution of the RMSD between compartments A and B. Compared with genomic regions in compartment B, compartment A shows significantly lower RMSD, implying more stable 3D chromosome structures across all single cells (Figure 3.6.e, p-value=5.33×10^-5). This result suggests that the open chromatin regions are more stable in 3D space, probably because of their essential role in transcriptional regulation. In addition, the genomic regions harboring the genes specifically expressed in GM12878 show lower RMSD compared with other genes (p-value=1.92×10^-3), further supporting the observation that the genomic regions with essential transcriptional and regulatory functions have more stable 3D structures across single cells. These results not only verify the accuracy of tFLAMINGO, but also provide new functional interpretations of the cell-to-cell structural variations.

Figure 3.7 Dynamic single-cell 3D chromosome structures reflect the distinct methylation landscape of genes. (a) In the snm3C dataset, the contact map and the coupled DNA methylation are provided. (b) Less methylated genes show closer 3D distances across single cells. For every single cell, genes are divided into three groups based on the DNA methylation scores (<30% quantile, 30%-70% quantile, and >70% quantile). The pairwise distances between pairs of genes within the same group are calculated across all single cells. The center lines of boxplots show the median; the upper and lower box limits show the 75th and 25th percentiles, respectively. The whiskers extend up to 1.5 times the interquartile range away from the limits of the boxes. (c) 351 single cells are divided into two clusters based on the similarity of the predicted 3D chromatin structures. The consensus structures of the two cell clusters are visualized. (d) Identification of the differentially methylated genes across two cell clusters. (e) Distribution of the pairwise distances between the cluster-specific differentially methylated genes across cell clusters. Genes show shorter 3D distances in cells with lower DNA methylation scores. (f) Example of 3D chromatin structures facilitating the densely organized differentially expressed genes.

3.2.7 Spatial analysis of gene activities in 3D space by tFLAMINGO

To further demonstrate the critical role of 3D chromosome structures in regulating gene expression, we analyzed the spatial organization of the differentially methylated genes using the snm3C dataset.
Beyond profiling the single-cell chromatin contact maps, the snm3C dataset simultaneously measures the single-cell DNA methylation signals, which overlays the epigenomic information with the 3D structural information (Figure 3.7.a). The gene activity is quantified by the average DNA methylation signals within the promoter region for every cell. We divided protein-coding genes from chromosome 19 into three groups based on the strength of the DNA methylation signals and calculated the spatial distances between genes within each group based on the pooled 3D chromosome structures of 351 mESC cells. As Figure 3.7.b shows, genes with the lowest DNA methylation scores have shorter spatial distances, while genes with medium and high DNA methylation scores show relatively longer spatial distances. Considering that the DNA methylation signal of the promoter region is inversely related to the gene activity, this result suggests that the highly expressed genes are densely organized into 3D neighborhoods, which potentially enables gene-gene regulation.

To further explain the variability of gene expressions across single cells in the 3D space, we clustered the 351 mESC cells based on the 3D chromosome structures and studied the spatial relationships between the differentially methylated genes. By projecting the distance matrices into the two-dimensional space using UMAP, two clusters of cells are identified with the maximized average silhouette width, where the first cluster contains 117 cells and the second cluster contains 234 cells. While these two clusters of cells show a similar backbone structure, local chromosome structures are extensively re-organized (Figure 3.7.c). By comparing the single-cell DNA methylation profiles across the two clusters, 116 genes are identified as the cluster-specific differentially methylated genes, suggesting the distinct transcriptional landscapes in two cell clusters.
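The quantile-based grouping described earlier in this subsection (Figure 3.7.b) can be illustrated with a minimal sketch. It assumes a hypothetical array `gene_coords` holding one 3D coordinate per gene and a hypothetical array `methylation` holding the matching promoter methylation scores; the function names and the use of the mean pairwise Euclidean distance as the summary statistic are illustrative choices, not the implementation used in this work.

```python
import numpy as np

def mean_pairwise_distance(coords):
    """Average Euclidean distance over all unordered pairs of rows."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(coords), k=1)  # each pair counted once
    return dist[upper].mean()

def distances_by_methylation_group(gene_coords, methylation, low_q=0.3, high_q=0.7):
    """Split genes at the 30% and 70% methylation quantiles (assumed cut-offs
    taken from the Figure 3.7 caption) and summarise each group."""
    lo, hi = np.quantile(methylation, [low_q, high_q])
    groups = {
        "low": methylation < lo,
        "medium": (methylation >= lo) & (methylation <= hi),
        "high": methylation > hi,
    }
    return {name: mean_pairwise_distance(gene_coords[mask])
            for name, mask in groups.items()}

# Toy example: ten genes on a line, the three least methylated ones packed closely.
toy_coords = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0],
                       [0, 0, 0], [10, 0, 0], [20, 0, 0], [30, 0, 0],
                       [0, 0, 0], [5, 0, 0], [10, 0, 0]], dtype=float)
toy_methylation = np.arange(10) / 10.0
toy_result = distances_by_methylation_group(toy_coords, toy_methylation)
```

In the toy data the low-methylation group has the smallest mean pairwise distance by construction, which mirrors the direction of the trend reported in Figure 3.7.b but says nothing about its magnitude in real data.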
As Figure 3.7.d shows, since 71 of the 116 genes show strong DNA methylation signals in the second cell cluster but weak signals in the first cluster, they are considered to be specifically methylated in the second cluster. For the same reason, the remaining 45 genes are considered to be specifically methylated in the first cluster. To investigate the relative spatial localities of the differentially methylated genes, we calculated the pairwise distances between genes in the two cell clusters based on the two pooled cluster-specific structures. Strikingly, the differentially methylated genes exhibit shorter pairwise distances on the matched 3D chromosome structures, while they are loosely scattered along the unmatched 3D chromatin structure (Figure 3.7.e). As a representative example, Figure 3.7.f shows a 3MB genomic region, which contains 9 genes specifically methylated in the first cell cluster and 14 genes for the second cell cluster. Interestingly, the two groups of genes are located in two distinct 3D neighborhoods and interact with each other through 3D chromatin loops. Apart from the 3D proximity of the differentially methylated genes, we further confirmed that the two sets of genes are enriched in different biological pathways, suggesting their unique roles in cell development at different stages. These results further demonstrate the biological utility of tFLAMINGO in interpreting the dynamic activity of genes across single cells.

Figure 3.8 Analyses of single-cell chromatin interactions. (a) Predicted single-cell chromatin interactions across 15 GM12878 cells at 30kb resolution. Altogether, the predicted single-cell chromatin interactions capture three TADs shown in the bulk Hi-C contact maps and Capture-C interactions. Across single cells, different 3D chromatin structures and associated single-cell chromatin interactions are observed, confirming the dynamicity of the 3D chromatin structures. Only statistically significant chromatin interactions are shown (p-value < 5×10^-5). (b) Predicted single-cell chromatin interactions are strongly supported by bulk tissue Capture-C interactions. (c) Predicted single-cell chromatin interactions have a lower p-value in the bulk tissue Hi-C dataset, suggesting strong interactions are less dynamic across single cells. (d) Consistency of single-cell chromatin interactions across 15 cells under different resolutions. At each resolution, the fraction of chromatin interactions observed in a given number of cells is calculated. (e) Genes linked by the single-cell chromatin interactions have higher expression values in GM12878. (f) Single-cell chromatin interactions are enriched with TCGA-LAML somatic mutations compared with distance-controlled random interactions. (g) Example of a GWAS SNP captured by single-cell chromatin interactions. rs1736135 is linked with the gene RBM11 through chromatin interactions in three single cells. Potentially, rs1736135 controls the expression of the RBM11 gene through chromatin interactions by creating a CTCF motif, and is thus associated with Crohn's disease. (h) Schematic of predicting the functional chromatin interactions based on the tFLAMINGO predicted contact maps and coupled DNA methylation scores using mESC snm3C data. LASSO regression is used to select the DNA fragments whose 3D distances to the gene promoter can best predict the DNA methylation scores of the target gene. The longest distances between DNA fragments and target gene promoters are limited to 3MB. (i) Enrichment of enhancers along effect sizes predicted by LASSO. As comparisons, the correlations between the DNA fragments and target gene promoters across single cells are used as the predictive score for enhancer enrichment analysis. The result of distance-controlled random chromatin interactions is also shown.
3.2.8 Dynamic single-cell chromatin interaction landscape identified by tFLAMINGO

As a direct contribution of tFLAMINGO, the predicted high-resolution 3D chromosome structures, as well as the chromatin contact maps, fully characterize the interaction landscape at 10kb resolution and facilitate the study of single-cell chromatin interactions. Therefore, we identified chromatin interactions based on the predicted chromatin contact maps of tFLAMINGO for 15 GM12878 cells at 30kb resolution (Figure B.15). As a representative example, Figure 3.8.a shows the single-cell chromatin interactions in a 5.5MB genomic region (chr21:15,500,000-21,000,000). In this region, the combined single-cell chromatin interactions form three TADs, consistent with the bulk Hi-C and Capture-C data. Surprisingly, chromatin interaction landscapes change drastically across single cells, and TADs are observed to be shifting, merging, and vanishing. For example, we observed three TADs in single-cell 1. The imputed single-cell chromatin contact maps accurately capture the bulk chromatin contact maps, and three compact domains are observed on the 3D structure. However, for single-cell 7, the second and third TADs from the bulk tissue Hi-C contact maps are merged into a larger compact domain, and only two TADs persist. In single-cell 12 and single-cell 15, the chromatin loops are untangled, resulting in the disappearance of two TADs.

As a functional validation of the predicted single-cell chromatin interactions, we evaluated whether the genes linked by the single-cell chromatin interactions are specifically and highly expressed in GM12878. We found that the linked genes have high Z-scores of gene expression in GM12878 compared with randomly selected genes (Figure 3.8.e). This result highlights the regulatory effect of the predicted single-cell chromatin interactions.
The systematic comparisons of TAD structures across single cells and bulk chromatin contact maps, along with the analysis of the gene expression specificity, unveil the dynamic chromosome structures at an unprecedented resolution, which cannot be observed in bulk tissue chromatin contact maps or the low-resolution single-cell chromatin contact maps.

3.2.9 Relationship between single-cell chromatin interactions and bulk Capture-C interactions

We further leveraged the orthogonal Capture-C dataset to evaluate the consistency between chromatin interactions in bulk tissue and single cells. By overlapping the predicted single-cell chromatin interactions and Capture-C interactions, we found that the Capture-C dataset captures 73% of the single-cell chromatin interactions (Figure 3.8.b). On the other hand, the Capture-C interactions that overlap with the single-cell chromatin interactions tend to have lower p-values (p-value = 3.47×10^-3), suggesting that stronger chromatin interactions are conserved across single cells (Figure 3.8.c). In fact, a large fraction of chromatin interactions are shared across different cells. As shown in Figure 3.8.d, the fraction of single-cell chromatin interactions shared by different numbers of cells is calculated at each resolution. At 250kb resolution, over 50% of the single-cell chromatin interactions are captured in two cells, and around 40% of single-cell chromatin interactions are shared by three cells (Figure 3.8.d). These analyses strongly suggest that the bulk tissue chromatin contact maps only reflect the average of millions of cells and do not capture cell-to-cell dynamics. Therefore, the development of tFLAMINGO can largely boost the understanding of the dynamic chromatin interactions at the single-cell level.
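The sharing statistic behind Figure 3.8.d can be sketched as follows. The sketch assumes each cell is represented by a set of interactions, each encoded as a hypothetical pair of bin indices, and it counts an interaction as shared when it appears in at least a given number of cells; whether the reported fractions use "at least" or "exactly" that many cells is not stated in the text, so this is an assumption.

```python
from collections import Counter

def shared_fraction(interactions_per_cell, min_cells):
    """Fraction of distinct interactions observed in at least `min_cells` cells."""
    counts = Counter()
    for interactions in interactions_per_cell:
        counts.update(set(interactions))  # one vote per cell per interaction
    if not counts:
        return 0.0
    shared = sum(1 for n_cells in counts.values() if n_cells >= min_cells)
    return shared / len(counts)

# Toy example with three cells; each tuple is a pair of interacting bins.
toy_cells = [
    {(1, 5), (2, 9), (3, 7)},
    {(1, 5), (2, 9)},
    {(1, 5), (4, 8)},
]
fraction_in_two = shared_fraction(toy_cells, 2)    # 2 of 4 distinct interactions
fraction_in_three = shared_fraction(toy_cells, 3)  # 1 of 4 distinct interactions
```

Coarser bins merge nearby interactions into the same pair of indices, which is one reason the shared fraction is expected to depend on the resolution.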
3.2.10 Interpreting genetic variants based on single cell chromatin interactions

The 3D chromatin structures and chromatin contact maps predicted by tFLAMINGO provide a structural basis for interpreting GWAS SNPs and disease-associated somatic mutations. Firstly, we used the single-cell chromatin interactions to interpret the LAML-associated somatic mutations. We overlapped the LAML somatic mutations with the interacting anchors of the chromatin interactions and calculated the enrichment of somatic mutations. Compared with the random chromatin interactions with genomic distances controlled, the single-cell chromatin interactions predicted by tFLAMINGO show a higher enrichment of somatic mutations (Figure 3.8.f), suggesting the disease-SNP associations are mediated by the chromatin interactions. Figure 3.8.g shows one representative example, where the SNP rs1736135 is associated with Crohn's disease. Based on the predicted single-cell chromatin interactions, this SNP is linked to the promoter of the oncogene RBM11, whose overexpression can significantly decrease the survival rate of patients. Interestingly, the SNP rs1736135 creates a CTCF motif through a T-to-C transition in the alternate genome (E-value: 0.89 to 0.103), which potentially establishes the chromatin interactions with the RBM11 promoter and ultimately contributes to Crohn's disease. This evidence further supports the regulatory function of the single-cell chromatin interactions and provides a new approach to understanding the disease-associated genetic variants mechanistically.

3.2.11 Predicting functional gene regulatory links in single cells

In addition to the single-cell chromatin interactions, we further predict the functional regulatory links for gene expression. In bulk tissue, the regulatory elements are computationally linked to the promoter of genes based on the 1D genomic distances and co-activity patterns.
However, these methods model the chromosomes as 1D strings and leave out the important 3D chromosome structures. According to the phase separation model, the gene expressions are controlled by the dynamic binding and unbinding events between genes and regulatory elements on the 3D chromosome structures. This transient binding process cannot be explained by the static bulk tissue datasets but can be captured by spatial distances between the DNA fragments across single cells. Therefore, single-cell 3D chromosome structures predicted by tFLAMINGO can help to predict the transcriptional regulations between genes and regulatory elements.

Figure 3.9 Identification of the single-cell multi-way chromatin interactions based on the predicted chromosome structures. (a) The three-way chromatin interactions are predicted from the 3D chromatin structures by evaluating the pairwise 3D distances. (b) Predicted three-way chromatin interactions are supported by scSPRITE data (n=3408, p-value < 1.4×10^-5). Higher scSPRITE scores indicate that the three-way interactions are observed in more single cells from the scSPRITE dataset. (c-d) Example of (c) the 3D chromatin structure of predicted three-way interactions and (d) scSPRITE scores. (e) Example of the 3D chromatin structure of cluster-specific three-way interactions.

Specifically, based on the predicted single-cell 3D chromosome structures, we linked the target genes to the DNA fragments whose close spatial distances to the target genes are associated with high gene activities across single cells. The DNA methylation signals of every gene are considered to have a linear relationship with the spatial distances of all DNA fragments with 1D distance smaller than 3MB across all single cells. A LASSO model is used to prioritize the most influential DNA fragments, i.e. master regulators, of the gene activities (Figure 3.8.h).
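A minimal sketch of the LASSO step may help make the setup concrete. It assumes a hypothetical matrix `distances` with one row per single cell and one column per candidate DNA fragment (the 3D distance to the promoter), and a hypothetical vector `methylation` with the promoter methylation score per cell. The solver below is a plain coordinate-descent implementation written for self-containment; the solver, penalty value, and standardisation actually used in this work are not stated, so all of these are assumptions.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink a scalar towards zero by `threshold`."""
    return np.sign(value) * np.maximum(np.abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 on standardised columns."""
    n, p = X.shape
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # each column: mean 0, variance 1
    y = y - y.mean()
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of fragment j added back.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n
            beta[j] = soft_threshold(rho, lam)  # column norm is 1 after scaling
    return beta

# Toy example: only fragment 3 truly drives the methylation signal.
rng = np.random.default_rng(0)
distances = rng.normal(size=(200, 10))
methylation = 2.0 * distances[:, 3] + 0.1 * rng.normal(size=200)
effect_sizes = lasso_coordinate_descent(distances, methylation, lam=0.1)
```

Fragments with non-zero coefficients are the candidates that would be carried forward; the L1 penalty is what drives the coefficients of uninformative fragments to exactly zero.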
DNA fragments with larger effect sizes in LASSO are considered to have more substantial regulatory effects on the target gene expressions, as the smaller 3D distances are associated with lower DNA methylation signals of promoters and higher gene activities. The orthogonal enhancer annotation dataset in mESC is used to evaluate the regulatory effect of the linked DNA fragments. We calculated the enhancer enrichment among the predicted DNA fragment-gene links at different effect size cut-offs and used the enrichment to quantify the accuracy of predicted links. As shown in Figure 3.8.i, DNA fragments with high effect sizes are enriched with enhancers, suggesting the regulatory effects of the highly ranked DNA fragments. As a baseline, the DNA fragments are also linked to the genes based on the co-activities and 1D genomic distances. In comparison, the predictions of tFLAMINGO demonstrate consistently higher enhancer enrichment across all cut-offs. These analyses clearly demonstrate the ability of tFLAMINGO, when overlaid with the gene activities, to accurately predict the functional chromatin interactions.

3.2.12 Analysis of single-cell multi-way interactions by tFLAMINGO

The high-order genome organization enables the multi-way interactions between DNA fragments, which play an important role in gene regulations. Beyond the 1D genomic distances and 2D chromatin contact maps, the single-cell 3D chromosome structures predicted by tFLAMINGO enable directly probing the spatial distances between multiple DNA fragments and predicting the multi-way chromatin interactions from the 3D space. We depicted the three-way interactions by identifying sets of three closely located DNA fragments on the pooled 3D chromosome structure of 351 mESC cells. For each set of three DNA fragments, we calculated the average pairwise distances to quantify the compactness of the three-way interactions.
To evaluate the statistical significance of the three-way interactions, 1000 sets of randomly selected DNA fragments were generated with the genomic distance controlled, and their average pairwise distances were used to calculate the empirical p-values. Overall, we predicted 973 statistically significant three-way interactions (p-value < 0.05). To evaluate the predicted three-way interactions, we overlapped the scSPRITE clusters with the predicted three-way interactions and used the normalized counts of the overlapping scSPRITE clusters to quantify the accuracy (termed the 'scSPRITE score' hereafter). We observed that the predicted three-way interactions show significantly higher scSPRITE scores than random three-way interactions with 1D genomic distances controlled. Figure 3.9.c shows an example of the predicted three-way interaction, where three anchors are brought to the same 3D neighborhood by a chromatin loop. Interestingly, the three-way interaction is also frequently observed in the scSPRITE clusters, validating the accuracy of the predicted three-way interactions.

To further understand the high-order organization of chromosomes across single cells, we divided 15 GM12878 cells into three clusters based on similar chromosome structures and predicted three-way interactions in each cluster. The three-way interactions are predicted based on the pooled structure of every cluster. We identified around 1000 three-way chromosome interactions in each cluster, suggesting that chromosomes display dynamic high-order structures across single cells. As shown in Figure 3.9.e, the anchors of the predicted cluster-specific three-way interactions show short spatial distances on the matched 3D chromosome structure. However, they are far from each other on the two unmatched 3D chromosome structures, suggesting the complex high-order structures across single cells.
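The empirical test for a single three-way interaction can be sketched as below. The sketch assumes a hypothetical coordinate array `coords` with one 3D point per DNA fragment in genomic order. Two details are assumptions rather than statements from the text: the distance-controlled null is built by sliding the triplet along the chromosome while keeping its index spacing, and the p-value uses the common add-one correction.

```python
import numpy as np

def mean_pairwise_distance(coords, triplet):
    """Average of the three pairwise Euclidean distances of a triplet."""
    a, b, c = coords[list(triplet)]
    return (np.linalg.norm(a - b) + np.linalg.norm(a - c) + np.linalg.norm(b - c)) / 3.0

def empirical_p_value(coords, triplet, n_random=1000, seed=0):
    """Share of distance-matched random triplets at least as compact as observed."""
    rng = np.random.default_rng(seed)
    i, j, k = sorted(triplet)
    observed = mean_pairwise_distance(coords, (i, j, k))
    span = k - i
    # Random start positions such that the shifted triplet stays on the chromosome.
    starts = rng.integers(0, len(coords) - span, size=n_random)
    null = np.array([mean_pairwise_distance(coords, (s, s + (j - i), s + span))
                     for s in starts])
    return (1 + np.sum(null <= observed)) / (1 + n_random)

# Toy structure: a straight fibre with fragments 50, 60 and 70 folded together.
toy_coords = np.column_stack([np.arange(200.0), np.zeros(200), np.zeros(200)])
toy_coords[50] = [60.0, 0.5, 0.0]
toy_coords[70] = [60.0, -0.5, 0.0]
p_loop = empirical_p_value(toy_coords, (50, 60, 70))
p_control = empirical_p_value(toy_coords, (100, 110, 120))
```

In the toy structure the folded triplet is far more compact than shifted triplets with the same spacing, so it receives a small p-value, while a triplet on the straight part of the fibre does not.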
Given these analyses, the single-cell 3D chromosome structures predicted by tFLAMINGO reveal the high-order chromosome conformation and advance the study of multi-way chromosome interactions.

3.3 DISCUSSION

In this work, we developed tFLAMINGO to reconstruct the single-cell 3D chromosome structures at high resolution from the sparse single-cell chromatin contact maps. Equipped with the low-rank tensor completion method, tFLAMINGO mitigates the high missing rates of the single-cell chromatin contact maps by borrowing information from all contacts in all cells. The application of tFLAMINGO to four single-cell chromatin conformation capture datasets provides a rich resource of single-cell 3D chromosome structures at 10kb and 30kb resolution. Based on the extensive performance evaluations, tFLAMINGO achieves superior accuracy over the existing state-of-the-art methods in reconstructing the single-cell 3D chromosome structures, imputing single-cell chromatin contact maps, and capturing the cell-type-specific structural variations. The high consistency between the experimental super-resolution imaging data and tFLAMINGO predictions further confirms the accuracy of tFLAMINGO in predicting the highly dynamic single-cell 3D chromosome structures. Biologically, tFLAMINGO confirms the robust compartment and TAD structures across single cells. Coupled with the DNA methylation dataset, tFLAMINGO unveils the interplay between dynamic gene regulations and 3D chromosome structures across single cells. The detailed delineation of the high-resolution single-cell 3D chromosome structures by tFLAMINGO facilitates prediction of the single-cell chromatin interactions and provides mechanistic interpretations of GWAS SNPs and somatic mutations. Beyond the 2D chromatin contact maps, the characterization of the chromosome structure in the 3D space further enables the predictions of the high-order chromatin organizations and multi-way chromatin interactions.
Compared with existing methods, tFLAMINGO enjoys three unique advantages by modeling the low-rank structure of the single-cell chromatin contact maps: (1) substantially improved ability in handling the high missing rate of the single-cell 3C datasets; (2) superior accuracy in predicting single-cell 3D chromosome structures; and (3) robust performance in imputing the cell-type-specific high-resolution chromatin contact maps. Equipped with all these advantages, tFLAMINGO is designed for the single-cell 3C datasets and can aid the biological identification and interpretation of the cell-to-cell structural variations, differential single-cell gene expression, single-cell chromatin interactions, and the genetic variants.

As a data-driven model, tFLAMINGO solely relies on the input single-cell chromatin interactions and does not introduce any bias into the reconstruction. This feature is crucial for reconstructing the high-resolution single-cell 3D chromosome structures, as the information from the highly sparse single-cell chromatin contact maps poses fewer constraints on the predicted structures compared with the prior assumptions. As another important class of methods, the constrained polymer simulation-based model relies on both pre-determined biophysical properties of the DNA sequences and the observed single-cell chromatin contact maps to predict the 3D chromosome structures. This strategy successfully reconstructs the low-resolution chromosome structures (e.g., 1MB) when most of the DNA fragments are constrained in the relatively dense low-resolution single-cell chromatin contact maps. However, the predictive accuracy at high resolution (e.g., 10kb) is drastically decreased for two reasons. Firstly, the high-resolution chromatin contact maps are extremely sparse, and the simulation process is dominated by the pre-defined biophysical properties, which are invariant across single cells and cell types and may contradict the observed chromatin contact maps.
In this case, the model cannot find an optimal structure to satisfy both constraints, thus deviating from the observed values. Therefore, the simulated 3D chromosome structures cannot accurately capture the high-resolution structural variations across single cells. Secondly, the simulation-based methods tend to predict the chromosome structures as extended smooth strings, which violates the observed long-range chromatin interactions. For example, isdHi-C demonstrates a high accuracy (Figure B.16, correlation: 0.73) on a genomic region with 22% of long-range chromatin interactions (>200kb) but fails on the adjacent genomic region with 42% of long-range interactions (Figure B.16, correlation: 0.18). Further simulation analyses confirm the limitation of isdHi-C in predicting the condensed ball-type structures with massive long-range interactions (Figure B.17). Therefore, the simulation-based methods cannot accurately reconstruct the 3D structures from chromatin contact maps with many long-range interactions. In comparison, since no prior assumptions of the chromosome structures are made, tFLAMINGO demonstrates robust performance in reconstructing 3D structures with different geometrical patterns across different resolutions, which implies a high accuracy in reconstructing the dynamic 3D spatial structures.

We envision two future developments of tFLAMINGO. First, the information-sharing mechanism of tFLAMINGO requires that all cells are from the same cell type and share a similar backbone structure. In the current framework of tFLAMINGO, all cells are equally important in the Fourier transformation, and the low-rank features shared by all cells will be extracted for the reconstructions. This assumption can be satisfied when the cell-type identities of single-cell chromatin contact maps are provided, or the dataset only contains cells from one cell type.
However, some highly complex tissues may contain cells from multiple cell types with unknown cell-type identities. For example, the human brain tissue consists of several highly differentiated cell types, and the cell-type deconvolution is challenging. In this case, the consensus structure of the dataset is essentially a mixture of multiple cell-type-specific chromosome structures, and the assumption of tFLAMINGO is not fulfilled. Although tFLAMINGO demonstrates superior performance on the simulated datasets with three similar consensus structures, it is still challenging to accommodate datasets with multiple fundamentally re-wired structures. Therefore, additional algorithmic improvement of tFLAMINGO is required to simultaneously de-convolve the single-cell 3D chromatin contact maps and reconstruct the single-cell 3D chromosome structures with the cell-type specificity retained.

Secondly, tFLAMINGO can be improved to reconstruct the time-dependent single-cell 3D chromosome structures. Based on the recent single-cell RNA-seq data analysis, the differential gene expression patterns are observed at the different stages of cell differentiation along the lineage trajectory, suggesting the changing gene regulations and 3D chromosome structures. Reconstructing the time-dependent single-cell 3D chromosome structures will open new avenues to understand the dynamic gene expression during the cell cycle, cell differentiation, and cellular activation from the 3D space. Currently, the experimental assessment of the lineage-specific single-cell 3D chromatin conformation is still challenging, and computational methods are highly desirable. tFLAMINGO has inherent algorithmic advantages in modeling the ordered single-cell chromatin contact maps in two aspects. Firstly, tFLAMINGO models all single-cell chromatin contact maps as the frontal slices of a low-rank tensor.
The order of the frontal slices can be easily extended to incorporate the lineage information of single cells by assigning cells based on the time order. Secondly, tFLAMINGO demonstrates superior ability in preserving the cell-type-specific structures based on the simulation analyses. This critical advantage of tFLAMINGO guarantees that the subtle changes in 3D chromosome structures can be accurately captured. Therefore, the further development of tFLAMINGO can not only capture the structural variations across single cells but also recapitulate the time-dependent structural properties.

3.4 METHODS

3.4.1 Model framework of tFLAMINGO

tFLAMINGO reconstructs single-cell 3D chromatin structures based on the low-rank tensor completion method. In the framework of tFLAMINGO, the missing values of the sparse tensor summarizing all single-cell chromatin contact maps are first completed, and the underlying 3D structures are then predicted for every cell. This framework brings two algorithmic advancements: (1) tFLAMINGO jointly models the chromatin contact maps of all single cells, which ensures that information can be borrowed across single cells; (2) tFLAMINGO makes full use of the low-rank property of the single-cell chromatin contact maps, which guarantees that the underlying 3D chromatin structures can be accurately recovered from the sparse chromatin contact maps under high missing rates. Computationally, we solved a tensor rank-minimization problem using the ADMM method to complete the missing values and used our in-house 3D reconstruction algorithm FLAMINGO to reconstruct the 3D structures.

3.4.2 Chromatin contact maps and data preprocessing

Chromatin contact maps from four single-cell 3C studies are collected, including the Dip-C experiment in GM12878, the snHi-C experiment in K562, the snm3C experiment in mESC, and the scHi-C experiment in mESC (see Data availability).
Although the interaction frequencies from single-cell 3C datasets show strong agreement with the bulk tissue dataset, the single-cell chromatin contact maps tend to have much smaller interaction frequencies at high resolution and thus cannot be directly converted to the 3D distances between DNA fragments using the transformation function observed from the bulk-tissue Hi-C data. More importantly, different linear relationships are observed for interaction frequencies with different 1D genomic distances, suggesting the potential confounding effect of the 1D distances. Based on this observation, tFLAMINGO maps the single-cell chromatin contact maps to the same scale as the bulk Hi-C contact maps using band-wise log-linear regression. Interaction frequencies between DNA fragments with similar 1D genomic distances, which can be represented as a diagonal band on the interaction frequency matrix, are jointly modeled. Apart from interaction frequencies, the 1D genomic distances between interacting DNA fragments, the missing rate of single cells, and the expected interaction frequencies between interacting DNA fragments are considered as covariates:

$$\log\left(IF^{bulk}_{i,j}\right) = \alpha_l \log\left(IF^{sc}_{k;i,j}\right) + \beta_l \log\left(Dist_{i,j}\right) + \theta_l \, MR_k + \gamma_l \log\left(IF^{sc}_{k;i,i} \cdot IF^{sc}_{k;j,j}\right), \tag{11}$$

where $IF^{bulk}_{i,j}$ represents the interaction frequency between the $i$-th DNA fragment and the $j$-th DNA fragment in the bulk-tissue Hi-C contact map, $IF^{sc}_{k;i,j}$ represents the interaction frequency between the $i$-th DNA fragment and the $j$-th DNA fragment in the contact map of the $k$-th single cell, $Dist_{i,j}$ represents the 1D genomic distance between the $i$-th and $j$-th DNA fragments, and $MR_k$ represents the missing rate of the $k$-th single-cell contact map. To account for the different log-linear relationships under different distance ranges, the regression parameters are estimated in every distance band as suggested by previous studies.
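For a single distance band, fitting Equation 11 reduces to an ordinary least-squares problem on log-transformed covariates. The sketch below assumes hypothetical one-dimensional arrays holding, for every contact in the band, the bulk frequency, the single-cell frequency, the 1D distance, the missing rate of the cell, and the product of the two diagonal frequencies. It follows Equation 11 literally and therefore fits no intercept; how zeros and missing entries are handled before taking logarithms is not specified here, so the sketch assumes strictly positive inputs.

```python
import numpy as np

def design_matrix(sc_if, dist, missing_rate, diag_product):
    """Covariates of Equation 11, one row per contact in the band."""
    return np.column_stack([
        np.log(sc_if),         # multiplies alpha_l
        np.log(dist),          # multiplies beta_l
        missing_rate,          # multiplies theta_l
        np.log(diag_product),  # multiplies gamma_l
    ])

def fit_band(bulk_if, sc_if, dist, missing_rate, diag_product):
    """Least-squares estimate of (alpha_l, beta_l, theta_l, gamma_l)."""
    design = design_matrix(sc_if, dist, missing_rate, diag_product)
    coef, *_ = np.linalg.lstsq(design, np.log(bulk_if), rcond=None)
    return coef

def to_bulk_scale(coef, sc_if, dist, missing_rate, diag_product):
    """Map single-cell frequencies to the bulk scale with fitted coefficients."""
    return np.exp(design_matrix(sc_if, dist, missing_rate, diag_product) @ coef)

# Toy band generated from known coefficients, without noise.
rng = np.random.default_rng(1)
sc_if = rng.uniform(1, 10, 50)
dist = rng.uniform(1e4, 1e5, 50)
missing_rate = rng.uniform(0.9, 0.999, 50)
diag_product = rng.uniform(1, 100, 50)
true_coef = np.array([0.8, -0.5, 0.3, 0.1])
bulk_if = np.exp(design_matrix(sc_if, dist, missing_rate, diag_product) @ true_coef)
fitted = fit_band(bulk_if, sc_if, dist, missing_rate, diag_product)
```

Repeating `fit_band` once per band yields one coefficient vector per distance range, which is what allows the log-linear relationship to vary with the 1D distance.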
We applied the estimated log-linear transformation functions to different single-cell 3C datasets across different resolutions and observed high correlations between the transformed single-cell interaction frequencies and observed bulk Hi-C interaction frequencies, validating the robustness and generalizability of the estimated transformation functions.

The bulk-tissue chromatin contact maps generated by four studies are collected from the GEO and 4DN databases, including bulk-tissue Hi-C experiments in GM12878 and K562 (GSE63525), the GAM experiment in mESC (GSE64881), the 3D ATAC-PALM experiment in mESC (GSE126112), and the bulk-tissue Hi-C experiment in mESC (4DNFI5IAH9H1). The GAM data only provide chromatin contact maps at 30kb resolution. Except for the GAM data, all bulk-tissue chromatin contact maps are used to validate the performance at 10kb and 30kb resolution. Whenever possible, the chromatin contact maps normalized by the Knight-Ruiz normalization are used. The scSPRITE data is collected from the GEO database (GSE154353) and preprocessed according to the instructions.

3.4.3 Complete single-cell chromatin contact maps based on the low-rank tensor completion

The missing rate of the single-cell contact map is high (>99.9% at 30kb resolution), making the reconstruction of the 3D chromatin structures at high resolution extremely challenging. tFLAMINGO mitigates the high missing rate by borrowing information in two directions: (1) the same contact of two DNA fragments across all contact maps and (2) all contacts between DNA fragments within the same contact map. Biologically, every sparse single-cell contact map represents a randomly down-sampled 'snapshot' of the consensus 3D chromatin structure with structural variations. Therefore, the missing entry in one single-cell contact map can be imputed by borrowing information from the same entry measured in other single-cell contact maps.
tFLAMINGO facilitates the information-sharing across all single cells using a Fourier transformation-based method. To borrow information from contacts within the same contact map, tFLAMINGO takes advantage of the low-rank property of the single-cell contact map. According to Euclidean geometry, the distance matrix derived from the single-cell contact map is induced by a 3D coordinate matrix, thus having the low-rank property (rank ≤ 5). The low-rank property guarantees that the missing values can be reconstructed from a small fraction of observed values. Equipped with an SVD-based method, tFLAMINGO borrows information across all contacts within the same single-cell contact map.

Computationally, the single-cell contact maps are summarized into a tensor, where each frontal slice represents a single-cell contact map and a tube perpendicular to the frontal slices represents a contact between a pair of DNA fragments across all single cells. tFLAMINGO aims to recover a dense tensor with minimum error compared with the sparse input tensor on the observed entries using a t-SVD-based method. The t-SVD method has been widely used to identify the low-rank structures of high-dimensional tensors. Similar to the matrix SVD, t-SVD decomposes the tensor into the product of three tensors: $T^{obs} = U * S * V^{T}$, where $*$ represents the circular convolution product (t-product) of tensors. According to the tensor-completion theory, the tensor completion problem can be solved by calculating the matrix SVD across all frontal slices of the tensor in the Fourier domain. The observed tensor $T^{obs}$ is transformed into the Fourier domain using a tube-wise Fourier transformation:

$$\hat{T}^{obs}_{i,j,k} = \sum_{n=1}^{N} T^{obs}_{i,j,n} \, e^{-2\pi \mathrm{i} k n / N}, \tag{12}$$

where $N$ is the number of single cells and $\mathrm{i}$ in the exponent denotes the imaginary unit. Intuitively, the contact between DNA fragment $i$ and DNA fragment $j$ for single cell $k$ in the Fourier domain ($\hat{T}^{obs}_{i,j,k}$) is calculated from the same contact across all single cells ($T^{obs}_{i,j,n}$ for all $n$).
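The tube-wise transformation of Equation 12 is a discrete Fourier transform applied along the cell axis of the tensor. The sketch below writes the sum out explicitly and checks it against `numpy.fft.fft` with `axis=2`. It uses zero-based indices for cells and frequencies, which is the NumPy convention and differs from the one-based indexing of Equation 12 only by a per-slice phase factor; the tensor layout (fragments × fragments × cells) is an assumption.

```python
import numpy as np

def tube_dft(tensor):
    """Explicit DFT along the third (cell) axis, with zero-based indices."""
    n_cells = tensor.shape[2]
    n = np.arange(n_cells)
    k = n.reshape(-1, 1)
    kernel = np.exp(-2j * np.pi * k * n / n_cells)  # kernel[k, n]
    return np.einsum("ijn,kn->ijk", tensor, kernel)

# A random dense tensor: 4 x 4 contacts observed in 6 cells.
rng = np.random.default_rng(0)
dense = rng.random((4, 4, 6))
matches_numpy = np.allclose(tube_dft(dense), np.fft.fft(dense, axis=2))

# A tube with a single observed cell: every Fourier coefficient of the tube is non-zero.
sparse = np.zeros((4, 4, 6))
sparse[0, 1, 2] = 5.0
tube_magnitudes = np.abs(tube_dft(sparse)[0, 1, :])
```

The second example illustrates why the transform shares information across cells: one observed value in a tube contributes to all coefficients of that tube in the Fourier domain.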
Therefore, if any single-cell chromatin contact map contains observed values for the contact $(i, j)$, all values in the tube $\hat{T}_{i,j,:}$ will be completed in the Fourier domain by aggregating the observations of all cells. Given the tensor $\hat{T}^{obs}$ in the Fourier domain, the SVD is applied on every frontal slice of $\hat{T}^{obs}$ ($\hat{T}^{obs}_{:,:,k}$):

$$\hat{T}^{dense}_{:,:,k} = U^{obs}_{k} \, S^{obs}_{k} \, (V^{obs}_{k})^{T}.$$

The SVD procedure captures the low-rank structures of the frontal slices and borrows information across all contacts within each cell. The recovered tensor is then transformed back into the original domain using the inverse Fourier transformation, and the resulting tensor approximates the input tensor as closely as possible.

For single-cell chromatin contact maps, the high missing rate of the observed tensor $T^{obs}$ requires the completion process to rely on only a few observed entries. In tFLAMINGO, the objective function of the low-rank tensor reconstruction is:

$$\min_{X} \lVert X \rVert_{TNN} \quad \text{s.t.} \quad \Omega(T^{obs}) = \Omega(X), \tag{13}$$

where $T^{obs}$ represents the sparse tensor summarizing all single-cell chromatin contact maps, $X$ represents the recovered dense tensor, $\Omega$ represents the set of observed entries in $T^{obs}$ and $TNN$ represents the Tensor Nuclear Norm. To achieve fast and accurate convergence, tFLAMINGO simplifies the optimization problem by solving the equivalent optimization problem in the Fourier domain:

$$\min_{\hat{X}} \lVert \mathrm{blkdiag}(\hat{X}) \rVert_{*} \quad \text{s.t.} \quad \Omega(\hat{X}) = \Omega(\hat{T}), \tag{14}$$

where $\hat{X}$ represents the transformed tensor in the Fourier domain, $\mathrm{blkdiag}$ represents the block diagonal matrix constructed by placing the frontal slices of the tensor $\hat{X}$ into diagonal submatrices of a large matrix, and $\lVert \cdot \rVert_{*}$ represents the matrix nuclear norm. tFLAMINGO uses the Alternating Direction Method of Multipliers (ADMM) algorithm to solve the optimization problem, and the original objective function can be re-written as:

$$\min \lVert \mathrm{blkdiag}(\hat{Z}) \rVert_{*} + \mathbb{1}_{\Omega(\hat{T}) = \Omega(\hat{X})} \quad \text{s.t.} \quad \hat{X} - \hat{Z} = 0, \tag{15}$$

where $Z$ is introduced as an intermediate variable and the indicator term $\mathbb{1}$ enforces the constraint on the observed entries.
The iterative updating scheme can be derived as:

$$
\begin{aligned}
X^{t+1} &= \underset{X:\, \Omega(X) = \Omega(T^{obs})}{\arg\min} \left\{ \lVert X - (Z^{t} - Q^{t}) \rVert_{F}^{2} \right\}, \\
Z^{t+1} &= \underset{Z}{\arg\min} \left\{ \frac{1}{\rho} \lVert \mathrm{blkdiag}(\hat{Z}) \rVert_{*} + \frac{1}{2} \lVert \hat{Z} - (X^{t} + Q^{t}) \rVert_{F}^{2} \right\}, \\
Q^{t} &= Q^{t-1} + X^{t} - Z^{t},
\end{aligned}
\tag{16}
$$

where $\rho$ is a free parameter. In tFLAMINGO, $\rho$ is set to 1 by default according to previous analysis. $X^{t+1}$ can be analytically solved as:

$$
X^{t+1}_{i,j,k} =
\begin{cases}
Z^{t}_{i,j,k} - Q^{t}_{i,j,k}, & (i,j,k) \notin \Omega \\
T^{obs}_{i,j,k}, & (i,j,k) \in \Omega
\end{cases}
$$

$Z^{t+1}$ can be solved by applying the soft-thresholded t-SVD method on $X^{t} + Q^{t}$. Through iterations, $Z$ borrows information across all contacts and all single cells using the t-SVD method, and $X$ guarantees that the imputed values are consistent with the observed values on the measurement set. Upon convergence, tFLAMINGO imputes a much denser contact map for every single cell, which maximally aligns with the observed single-cell chromatin contact map.

3.4.5 Reconstruct the single cell 3D chromatin structure based on low-rank matrix completion

Given the single-cell chromatin contact maps imputed by the low-rank tensor completion algorithm, tFLAMINGO reconstructs the 3D chromosome structures. The chromatin contact maps are converted to pairwise distance matrices using the conversion function $PD_{ij} = IF_{ij}^{-\alpha}$, where $\alpha$ is set to 0.25 based on previous studies. Our in-house 3D chromatin reconstruction algorithm, FLAMINGO, is used to reconstruct the high-resolution 3D chromatin structures for every single cell. Algorithmically, FLAMINGO reconstructs the 3D chromosome structures based on the low-rank matrix completion technique, which guarantees an accurate reconstruction of the 3D coordinate matrices from highly noisy and sparse chromatin contact maps. Compared with existing methods, FLAMINGO demonstrates superior accuracy and scalability in reconstructing high-resolution chromatin structures (up to 1kb) from extremely sparse chromatin contact maps (missing rate >99%).
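As an illustration of the ADMM updating scheme in Eq. 16 (Section 3.4.3), the following NumPy sketch completes a toy tensor whose frontal slices share one low-rank matrix. It is a minimal sketch under stated assumptions, not the tFLAMINGO implementation: the function names, the fixed iteration count, and the toy data are hypothetical, and thresholding at $1/\rho$ in the Fourier domain assumes a Tensor Nuclear Norm normalized by the number of cells.

```python
import numpy as np

def tsvd_soft_threshold(Y, tau):
    """Soft-threshold the singular values of every frontal slice of Y
    in the Fourier domain (FFT taken along the third, 'cell' axis)."""
    Y_hat = np.fft.fft(Y, axis=2)
    Z_hat = np.empty_like(Y_hat)
    for k in range(Y.shape[2]):
        U, s, Vh = np.linalg.svd(Y_hat[:, :, k], full_matrices=False)
        s = np.maximum(s - tau, 0.0)           # shrink singular values
        Z_hat[:, :, k] = (U * s) @ Vh
    # The input is real, so the inverse transform is real up to round-off.
    return np.real(np.fft.ifft(Z_hat, axis=2))

def complete_tensor(T_obs, mask, rho=1.0, n_iter=500):
    """ADMM iterations following Eq. 16 (hypothetical helper).
    T_obs: fragments x fragments x cells; mask: True where observed."""
    Z = np.zeros_like(T_obs, dtype=float)
    Q = np.zeros_like(T_obs, dtype=float)
    X = np.where(mask, T_obs, 0.0)
    for _ in range(n_iter):
        X = np.where(mask, T_obs, Z - Q)       # closed-form X update
        Z = tsvd_soft_threshold(X + Q, 1.0 / rho)
        Q = Q + X - Z                          # dual update
    return X

# Toy example (assumed data): 8 cells sharing one rank-2 matrix, 50% observed.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(20, 2)), rng.normal(size=(20, 2))
T_true = np.repeat((A @ B.T)[:, :, None], 8, axis=2)
mask = rng.random(T_true.shape) < 0.5
X = complete_tensor(np.where(mask, T_true, 0.0), mask)
```

Because the X update copies the observed entries, `X` agrees with the input on the measurement set by construction, while the unobserved entries are filled in by the low-rank step.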
3.4.6 Performance evaluation based on simulated chromatin structures

The performance of tFLAMINGO is extensively evaluated by reconstructing the simulated benchmark structures. In the simulation, a benchmark structure with $l$ beads is generated. The $l$ by $l$ benchmark distance matrix induced by the benchmark structure is down-sampled $n$ times with the down-sampling rate $\gamma$ and mixed with one of three levels of noise: (1) no noise, (2) noise level one, which is generated by the normal distribution $N(\delta, \delta)$, where $\delta$ is the minimum value of the down-sampled matrix, and (3) noise level two, which is generated by the normal distribution $N(2\delta, \delta)$. Thus, a sparse tensor is constructed to simulate the single-cell 3C dataset. tFLAMINGO is applied on the sparse tensor to reconstruct the benchmark 3D structures. The model performance is quantified by two metrics: (1) Spearman correlations between the pairwise distance matrices predicted by tFLAMINGO and the benchmark pairwise distance matrices and (2) the RMSD between the predicted 3D coordinates ($C^{pred}$) and the benchmark 3D coordinates ($C^{benchmark}$) over the $l$ beads:

$$RMSD = \sqrt{\frac{1}{l} \sum_{i=1}^{l} \lVert C_{i}^{pred} - C_{i}^{benchmark} \rVert^{2}}.$$

To demonstrate the robustness of tFLAMINGO, the simulated datasets are generated under different combinations of $l$, $n$ and $\gamma$.

Apart from the white noise, we further evaluated the performance of tFLAMINGO on inputs with noise generated from random structures. Given a benchmark structure, $n$ random structures are generated and the corresponding pairwise distance matrices are mixed with the benchmark pairwise distance matrix with weight $W$: $D = D^{benchmark} \cdot (1 - W) + D^{random} \cdot W$. The resulting noisy pairwise distance matrices are used as the input of tFLAMINGO to recover the benchmark structure. Compared with the white noise, the structured noise follows a similar pattern to the benchmark distances, i.e. consecutive points show lower distances, thus making the reconstruction of the benchmark structure more challenging.
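The white-noise simulation and the RMSD metric described above can be sketched with the Python standard library. This is a hedged illustration only: the random-walk benchmark generator, the function names, and the reading of the second parameter of $N(\delta, \delta)$ as a standard deviation are assumptions, and the RMSD helper assumes the two structures have already been superimposed.

```python
import math
import random

def random_walk(l, rng):
    """Hypothetical benchmark generator: a 3D random walk with l beads."""
    coords = [(0.0, 0.0, 0.0)]
    for _ in range(l - 1):
        x, y, z = coords[-1]
        coords.append((x + rng.gauss(0, 1), y + rng.gauss(0, 1), z + rng.gauss(0, 1)))
    return coords

def distance_matrix(coords):
    """Pairwise Euclidean distances induced by a 3D structure."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def downsample_with_noise(dist, gamma, noise_level, rng):
    """Keep each bead pair with probability gamma and add noise drawn from
    N(noise_level * delta, delta); unobserved entries are None."""
    l = len(dist)
    kept = [(i, j) for i in range(l) for j in range(i + 1, l) if rng.random() < gamma]
    sparse = [[None] * l for _ in range(l)]
    if not kept:
        return sparse
    delta = min(dist[i][j] for i, j in kept)   # minimum observed value
    for i, j in kept:
        noise = rng.gauss(noise_level * delta, delta) if noise_level > 0 else 0.0
        sparse[i][j] = sparse[j][i] = dist[i][j] + noise
    return sparse

def rmsd(pred, bench):
    """Root-mean-square deviation between two aligned coordinate lists."""
    sq = sum(math.dist(p, b) ** 2 for p, b in zip(pred, bench))
    return math.sqrt(sq / len(bench))

# One simulated dataset: l = 50 beads, n = 20 cells, gamma = 0.1, noise level one.
rng = random.Random(0)
bench = random_walk(50, rng)
D = distance_matrix(bench)
tensor = [downsample_with_noise(D, 0.1, 1, rng) for _ in range(20)]
```

Each element of `tensor` plays the role of one sparse frontal slice; varying `l`, the number of slices, and `gamma` reproduces the kind of parameter sweep described in the text.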
Further, tFLAMINGO demonstrates excellent performance on datasets containing multiple consensus structures. Biologically, single-cell 3C data may contain cells in different developmental stages, which share similar backbone structures with structural variations. In the simulation, three consensus structures are generated with the weight $W$ to illustrate the heterogeneity of the single-cell 3D chromosome structures: $S_{i} = S^{starting} \cdot W + S^{random} \cdot (1 - W)$, where the weight $W$ controls the similarity of the consensus structures. A sparse tensor with $3N$ frontal slices is then constructed by down-sampling the distance matrix induced by each consensus structure $N$ times and mixing it with noise. tFLAMINGO is applied on the sparse tensor to recover the underlying 3D structures. A wide range of weights is tested in the simulation, and the performance of tFLAMINGO is evaluated by the correlations of pairwise distances and the RMSD of 3D coordinates.

3.4.7 Performance comparison based on the STORM 3D genome imaging data

We applied tFLAMINGO on K562 snHi-C data to reconstruct the 3D chromatin structures of chromosome 21 for 16 single cells. To evaluate the model performance, we benchmarked the predicted single-cell 3D chromatin structures with the STORM 3D genome imaging data. For quality control, single-cell structures measured by STORM with missing rates greater than 0.5 are excluded from the analysis. For each pair of predicted and STORM structures, the Spearman correlation of spatial distances is calculated to evaluate the consistency. To approximate the underlying 3D structures of the snHi-C dataset, we calculated the Spearman correlations between the raw single-cell chromatin contact maps and the STORM structures. The top 20 STORM structures with the highest Spearman correlations with each single-cell chromatin contact map are considered to be the true underlying structures.
Therefore, the Spearman correlations between the predicted structures and the approximated ground truth measure the reconstruction accuracy. Note that these two metrics are complementary, since the first Spearman correlation evaluates the consistency between the predictions and the STORM data, and the second Spearman correlation validates the accuracy of reconstructing 3D structures of the snHi-C data. Therefore, a better algorithm is expected to achieve high values in both correlations.

3.4.8 Performance comparison in reconstructing 3D chromatin structures based on experimental single cell Hi-C data

We applied tFLAMINGO on four single-cell Hi-C datasets (human GM12878 Dip-C, human K562 snHi-C, mESC scHi-C and mESC snm3C) to reconstruct the single-cell 3D chromatin structures (chr21 in human GM12878 and K562, chr19 in mESC) at 10kb and 30kb resolution. To evaluate the accuracy of the completed missing data, we calculated the Spearman correlations based on all available entries from the bulk chromatin contact maps (termed 'all distance correlations'). The ability to recover the observed values of the single-cell 3C datasets is further quantified by the Spearman correlations based on the observed distances from the single-cell 3C dataset (termed 'validated distance correlations'). Therefore, an accurately reconstructed single-cell 3D chromosome structure should demonstrate a high all distance correlation as well as a high validated distance correlation.

The performance of tFLAMINGO is compared with six existing algorithms in reconstructing 3D chromatin structures: ShRec3D, NucDynamics, RPR, isdHi-C and Si-C. Another existing algorithm, MBO, is not included in the comparison because its code is not available. The predicted chromatin structures of NucDynamics based on mESC scHi-C data are directly used. ShRec3D, RPR, isdHi-C and Si-C are applied on the same datasets as tFLAMINGO at 10kb and 30kb resolution to predict the single-cell 3D chromatin structures.
To systematically compare the performance of tFLAMINGO with existing methods, we used two sets of experimental datasets as gold standards: (1) the single-cell 3D chromatin structures provided by the multiplexed STORM 3D genome imaging data in human K562 and (2) bulk-tissue chromatin contact maps in human GM12878, human K562 and mESC. We further evaluated the cell-type specificity of the 3D chromatin structures predicted by different algorithms. All algorithms are applied on the snm3C data with 351 mESC single cells and 96 NMuMG cells to reconstruct the single-cell 3D chromatin structures. UMAP plots based on the distance matrices predicted by different algorithms are used to visualize the clusters of single cells and evaluate the performance.

3.4.9 Performance comparison with Higashi in imputing high-resolution single cell chromatin contact maps

We benchmarked tFLAMINGO and Higashi on the snm3C dataset at 1Mb, 250kb and 30kb resolution. For both algorithms, the original cell types of the single cells are provided as inputs. The model performance is evaluated by: (1) the correlations with the observed distances in bulk-tissue chromatin contact maps and (2) the ability to recover cell-type identity based on the imputed chromatin contact maps. The Adjusted Rand Index (ARI) is used to quantify the consistency between the predicted single-cell clusters and the original cell-type identity.

3.4.10 Identification of the single-cell compartment A/B and TAD boundaries

The single-cell interaction frequency matrices are derived from the single-cell distance matrices completed by tFLAMINGO using the conversion function $IF_{ij} = PD_{ij}^{-4}$ (corresponding to the conversion from interaction frequencies to distances with the conversion factor -0.25). The observed interaction frequency matrices are then normalized by the expected interaction frequency matrices, following the standard procedure in previous studies.
PC1 scores calculated from the normalized interaction frequency matrices are used to represent the compartment A/B. The single-cell TAD boundaries are called based on the single-cell interaction frequency matrices using the TADCompare software. The structural stability of the 3D chromatin structures is quantified by calculating the RMSD between the average structure from tFLAMINGO predictions and the single-cell 3D chromatin structures along the genome: $RMSD_{i}^{k} = \left| Coord_{i}^{k} - Coord_{i}^{pooled} \right|$, where $RMSD_{i}^{k}$ represents the RMSD in single cell $k$ at the genomic location $i$.

3.4.11 Differential methylated gene analysis across clusters of single cells

To demonstrate the relationships between 3D chromatin structures and gene regulation, 351 mESC cells from the snm3C data are clustered based on the pairwise distance matrices, and two clusters are selected based on the highest average silhouette score. The coupled single-cell DNA methylation signals of the snm3C data are overlapped with gene promoters to quantify the single-cell gene activities. The differential methylation analysis of genes across the two clusters of single cells is performed using the DEGseq2 package with default settings.

3.4.12 Analyses of single-cell chromatin interactions and genetic variants

To predict the single-cell chromatin interactions, the distance matrices induced by the predicted single-cell 3D chromosome structures are converted to interaction frequency matrices using the same conversion function as above. FitHi-C is used to predict the statistically significant chromatin interactions from the single-cell interaction frequency matrices. To control the false positive rate, the p-value threshold is set to $1 \times 10^{-20}$, which is a stringent criterion compared with previous analyses. As validation, we overlapped the single-cell chromatin interactions with the bulk Capture-C dataset and calculated the fraction of overlap.
To provide a mechanistic interpretation of GWAS SNPs and somatic mutations, we overlapped the SNPs with the single-cell chromatin interactions and calculated the enrichment of the SNPs. As a comparison, links between randomly selected DNA fragments with genomic distances controlled are generated. The motif matching score of CTCF is calculated using the TOMTOM software.

Furthermore, we predicted the three-way chromatin interactions from the 3D structures predicted by tFLAMINGO. For every set of three DNA fragments, the average pairwise spatial distance is calculated to quantify the compactness. Specifically, for a set of DNA fragments $i, j, k$ ($i < j < k$), the averaged 3D pairwise distance is calculated as:

$$D_{i,j,k} = \frac{1}{3} (D_{ij} + D_{ik} + D_{jk}),$$

where $D_{ij}$ represents the 3D spatial distance between DNA fragments $i$ and $j$. As a comparison, the average spatial distances of 1,000 sets of DNA fragments with the same 1D genomic distances are calculated:

$$D^{bg}_{m,n,p} = \frac{1}{3} (D_{mn} + D_{mp} + D_{np}),$$

where $n - m = j - i$ and $p - n = k - j$. The empirical p-values are calculated as:

$$P_{ijk} = \frac{1}{1000} \left( 1 + \#\{ D_{i,j,k} > D^{bg} \} \right).$$

Similar to the identification of the two-way chromatin interactions, the adjacent three-way interactions are pruned and the most significant ones are selected:

$$SI_{i,j,k} = \underset{m,n,p}{\arg\min} \; P_{mnp} \quad \text{for all } |m - i| = |n - j| = |p - k| < 5.$$

CHAPTER 4 DECIPHER THE COMBINATORIAL GRAMMAR OF TRANSCRIPTION FACTORS IN LONG-RANGE MULTI-ENHANCER REGULATION

4.1 INTRODUCTION

The comprehensive profiling of the epigenomic landscapes has discovered millions of putative enhancer regions across hundreds of cell lines. In the 3D space, enhancers are brought into the proximity of genes through DNA loops and regulate gene expression. Such enhancer-gene links exhibit strong cell-type specificity and are ubiquitous across the human genome, highlighting the crucial role of enhancer regulation in cell differentiation and development.
Beyond the one-on-one interaction between enhancers and genes, recent case studies demonstrate that multiple enhancers can synergistically control the expression of a single gene [87-89]. For example, the kni gene in the Drosophila embryo is regulated by the orchestration of an intronic enhancer and a distal enhancer that is ~35kb away from the promoter [87]. Another experimental analysis further shows that multi-enhancer regulation is crucial for phenotypic robustness [88]. These important biological discoveries demonstrate the complex landscape of transcriptional regulation and highlight the vital role of multi-enhancer regulation.

To characterize the interaction landscape of the genome, several experimental techniques have been developed, including Hi-C, ChIA-PET, and Capture-C, to measure the interaction frequencies between pairs of genomic loci and demonstrate the cell-type-specific genome organization. However, these methods rely on the ligation between the cross-linked DNA anchors to detect the chromatin interactions and are thus less able to capture long-range and multi-way interactions. To further characterize the high-order chromosome conformations, three techniques that do not require proximity ligation have been developed: SPRITE, GAM, and ChIA-Drop. The successful applications of these methods in the human and mouse genomes not only generate high-resolution chromatin contact maps but also unravel the multi-way enhancer-promoter interactions and the functional relationships between the co-binding transcription factors (TFs). For example, IRF, STAT, AP1, and SMAT family motifs are frequently observed within the interacting anchors, suggesting the cooperative action of the TFs in shaping the 3D chromosome structures and regulating gene expression.
Although these experimental techniques have revealed the multi-way chromatin interactions and largely expanded the understanding of the interplay between genes, enhancers, and TFs, they are only available in human GM12878 cells and mouse ESC cells, thus limiting the studies of other important tissues and cell lines. In addition, these techniques can only capture the interactions between relatively large DNA fragments (SPRITE: 5kb; GAM: 30kb; ChIA-Drop: 10kb), which is not sufficient to pinpoint the actual functional enhancers.

Given these limitations, computational methods have been developed to predict the cell-type-specific enhancer-promoter interactions by integrating multi-omics datasets generated by large consortia, e.g. the ENCODE and Roadmap Epigenomics projects. By evaluating the gene expression, enhancer activities, genomic distances, and other DNA sequence features, these methods link the enhancers to the promoters and predict the cell-type-specific enhancer-promoter interactions. In terms of the machine learning techniques used, these methods can be classified into supervised learning methods and unsupervised learning methods. For the unsupervised methods, the 1D genomic distances and the correlations between the enhancer activities and gene expression across diverse cell types are used as the predictive score to prioritize the enhancer-gene links. Compared with the supervised learning methods, the unsupervised learning methods do not require experimental chromatin interactions to train the model but demonstrate lower accuracy based on comprehensive benchmarking analyses. For the supervised learning methods, the experimentally verified chromatin interactions are used to annotate the interacting enhancer-gene links. By learning the differential distributions of the multi-omics features between the interacting and non-interacting enhancer-gene links, these methods can make genome-wide predictions.
Currently, most computational methods for predicting long-range enhancer-gene interactions are supervised learning methods, including TargetFinder, JEME, IM-PET, ProTECT, RIPPLE, FOCS, and EAGLE. Besides the epigenomic signals, transcription factor (TF) binding sites are also used to improve the predictive accuracy. For example, TargetFinder uses the peaks of the TF ChIP-seq datasets within the enhancers, promoters, and intervening genomic windows as features to predict the enhancer-gene links. ProTECT further incorporates the Protein-Protein Interactions (PPIs) between the enhancer-binding and promoter-binding TFs to predict the TF-mediated enhancer-gene links. The expanded feature set largely improves the model performance.

However, these methods can only be applied to cell types with experimental chromatin interaction datasets, which are unavailable in a significant fraction of cell lines. Although the trained models can be directly applied to another cell type to make cross-cell-type predictions, a recent analysis suggests that the cross-cell-type predictions are less accurate. In addition, a severe overfitting problem has been observed for some methods, suggesting high false-positive rates in the genome-wide predictions. Furthermore, the existing methods consider potential enhancer-gene links as independent samples and thus cannot capture the synergistic effect of multiple enhancers in regulating the same target gene. Finally, while TargetFinder and ProTECT use TF bindings as features, how TFs cooperate with each other to regulate distal target genes is still unclear, so these methods provide little mechanistic insight into the complex gene regulation.

According to recent experimental studies, apart from the co-binding TFs in the same locus, TFs that bind to different loci can also synergistically regulate the expression of genes through multi-enhancer regulation.
For example, GFI1b, RUNX1, and MYB are observed to regulate the expression of the Myc gene by binding to a cluster of Myc enhancers. The deletion of a distal enhancer, which is ~1.7Mb away from the Myc gene, can significantly downregulate the expression of the Myc gene, suggesting the critical role of the synergistic effect of TFs and enhancers in maintaining gene expression. In another example, the expression of -globin genes is regulated by five enhancers spanning a continuous 24kb genomic region. Interestingly, instead of binding to the same enhancer, four master erythroid transcription factors, NF-E2, GATA1, SCL, and KLF1, show differential binding profiles across the five enhancers, highlighting the cross-enhancer cooperation between these TFs. Globally, the integrative analysis of the long-range chromatin interactions and TF ChIP-seq datasets leads to the discovery of TF clusters enriched in the interacting DNA anchors. These global analyses and case studies strongly support the need to model the multi-TF and multi-enhancer regulation in understanding the complex gene regulatory networks.

As stated above, existing methods can only model the one-on-one enhancer-gene links, except JEME. To quantify the additive effect of nearby enhancers, JEME jointly models the linear relationships between the activities of all nearby enhancers (±1Mb of the TSS) and gene expression using a LASSO model. Together with the epigenomic signals and 1D genomic distances, the LASSO coefficients are used as a feature in the random forest model to predict the enhancer-gene links. By modeling the additive effect of multiple enhancers, JEME shows that the co-regulating enhancers tend to reside in the same TAD and super-enhancer, contain similar TF motifs and have correlated epigenomic signals across cell lines. However, JEME does not incorporate the information of TFs into the algorithm and thus cannot provide a mechanistic interpretation of the multi-enhancer regulation.
In this study, we developed a new unsupervised learning model, ComMUTE, to predict the long-range multi-way enhancer-gene links by mechanistically modeling the TF regulatory grammar of gene expression. As a scalable Bayesian graphical model, ComMUTE integrates gene regulatory grammars by modeling the combinatorial TF modules of the co-regulating enhancers and thus mechanistically links multiple enhancers to the target genes simultaneously. In the framework of ComMUTE, genes are clustered into gene groups, and the enriched TF combinations across all genes are considered to be the group-specific TF grammars. Based on the gene group-specific TF combinations, a subset of enhancers that can synergistically provide the required TFs are prioritized as the co-regulating enhancers. Since ComMUTE is an unsupervised model, no experimental chromatin interaction dataset is required, which greatly expands the usability of ComMUTE in the many cell types without Hi-C datasets. We applied ComMUTE in 127 cell types/tissues to predict the cell-type-specific multi-way enhancer-gene links. Compared with existing algorithms, ComMUTE demonstrates consistently improved performance when benchmarked with 19 cell-type-specific Hi-C and Capture-C datasets.

Figure 4.1 Bayesian framework of ComMUTE in predicting multi-enhancer regulations. (a) Genes are regulated by different combinations of TFs (TF modules), which bind to multiple interacting enhancers with the same target gene. (b) Examples of Hi-C interactions linking enhancers (orange) and genes (red), showing that the linked enhancers cooperatively provide the TF combination STAT1-CREB to regulate the target genes. (c) Plate diagram of ComMUTE. (d) Iterative scheme of ComMUTE. First, given the TF regulatory grammar, ComMUTE searches for a set of enhancers that cooperatively provides the required TFs (combinatorial TF profile) by comparing the similarity (KL). Combined with epigenomics data, ComMUTE predicts enhancer-gene interactions. Second, based on the predicted enhancer-gene interactions, the genes are assigned to different TF regulatory groups based on enhancer-binding TF profiles. ComMUTE repeats the two steps until convergence. (e) Example of predicted cell-type-specific enhancer-gene interactions in GM12878, K562, HUVEC and H1. The ComMUTE predictions are consistent with cell-type-specific DNase-seq data.

By randomly shuffling the TF bindings across enhancers in the inputs, we demonstrated that incorporating the TF module can significantly improve the accuracy of the predicted enhancer-gene links. Interestingly, the co-regulating enhancers demonstrate high partial correlations of activities conditioned on the target gene expression, multi-correlations, and enrichment of Hi-C interactions, confirming that the enhancers are directly associated and not mediated by the joint interacting promoters. Furthermore, the SPRITE multi-way interactions strongly support the multi-way enhancer-gene links. In addition to enhancer-gene links, we also evaluated the predicted TF regulatory grammars of gene groups. We observed a high PPI enrichment and clear co-expression patterns of the predicted combinatorial TFs. Moreover, the predicted enhancer-gene links are enriched with QTLs and GWAS SNPs. Strikingly, the epistasis eQTLs are precisely captured by the predicted multi-way enhancer-gene links, providing new biological insights for mechanistically interpreting the high-order functional associations between SNPs.

4.2 RESULTS

4.2.1 ComMUTE predicts long-range multi-enhancer regulations based on TF regulatory grammars

In the framework of ComMUTE, the relationships between TFs, enhancers and genes are summarized as a three-layer network, where TFs bind to enhancers and enhancers interact with genes. The gene regulatory grammars are represented by the combinatorial TF profiles across all linked enhancers.
Based on this model, the synergistic regulatory effect of multiple TFs is aggregated by the co-regulating enhancers (Figure 4.1.a). To verify the three-layer gene regulatory networks, we analyzed the Capture-C dataset and observed common TF combinations at different genomic loci, suggesting that different genes share similar regulatory grammars. As a representative example, we identified that the combination of STAT1 and CREB is shared by the IFNGR1 gene and the IRF1P2 gene (Figure 4.1.b). At the IFNGR1 gene locus, two distal enhancers (~120 kb) are simultaneously linked to the gene promoter by the Capture-C interactions. While both enhancers show strong CREB binding signals, STAT1 exclusively binds to the second enhancer. Similarly, two enhancers are linked to the promoter of the IRF1P2 gene (Figure 4.1.b). Interestingly, STAT1 and CREB show differential binding signals within the two enhancer regions, suggesting that the cross-enhancer cooperation is necessary for the activation of the target gene.

Based on the three-layer gene regulatory network, ComMUTE is specifically designed to model the multi-enhancer regulation and complex regulatory grammars. To predict the interacting probabilities of candidate enhancer-gene links, ComMUTE evaluates whether linking the enhancers to the target genes can improve the TF profiles of all co-regulating enhancers towards the required gene regulatory grammar. This unique design enables ComMUTE to model the joint regulatory effect of multiple enhancers and TF grammars, which distinguishes ComMUTE from existing algorithms.
The interacting probability between enhancer $i$ ($e_{i}$) and gene $j$ ($G_{j}$) is calculated as $P(e_{i} \sim G_{j} \mid A, TF_{e_{i}}, TF_{G_{j}}^{-e_{i}}, M_{I})$, where $A$ represents the activity-based features (enhancer activity, gene expression and their correlation across 127 cell types), $D$ represents the 1D genomic distance between enhancers and the TSS of genes, $TF_{e_{i}}$ represents the binarized occurrence vector of TF motifs for the candidate enhancer and $TF_{G_{j}}^{-e_{i}}$ represents the TF profile of all enhancers linked to $G_{j}$ except $e_{i}$. These features are selected because they demonstrate strong predictive power in distinguishing Capture-C interactions from random enhancer-gene links (Figure C.1). To model the complex regulatory grammar, we introduced two latent variables to represent the gene group membership ($I$) and the group-specific TF regulatory grammar ($M$). Since both $I$ and $M$ are unknown, ComMUTE further predicts the probabilistic membership of genes by comparing the gene-specific TF profiles to the group-specific regulatory grammars: $P(I = k \mid TF_{G_{j}}, M)$, where $TF_{G_{j}}$ represents the TF profile of all linked enhancers and $M$ represents the group-specific regulatory grammars. After the gene group memberships are updated, $M$ is updated accordingly.

To efficiently infer the underlying distributions of features and latent variables, ComMUTE utilized an iterative Gibbs sampling framework (Figure 4.1.c). In each iteration, genes are assigned to different groups based on the predicted enhancer-gene links from the last iteration, and the group-specific TF profiles are calculated. The predictions of enhancer-gene links are then calculated by evaluating the KL divergence between the gene regulatory grammar $M_{I}$ and the TF profile of $G_{j}$ if the enhancer is linked to the gene: $\exp(-R \cdot KL(TF_{G_{j}} \,\|\, M_{I}))$, where $R$ is a free scaling parameter. The predicted enhancer-gene links are then used to update the gene group membership.
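To make the KL-based term concrete, the sketch below scores a candidate enhancer by how closely the combined TF profile of a gene matches a group grammar. It is a minimal illustration under assumptions that are not specified in the text: TF profiles are combined by summing occurrence vectors, both profiles are turned into probability distributions with a small pseudocount, and the activity-based features are left out. All names and values are hypothetical.

```python
import math

def normalize(counts, eps=1e-3):
    """Turn a TF occurrence vector into a probability distribution,
    using a small pseudocount (assumed) to avoid log(0)."""
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def tf_link_score(linked_profile, candidate_tf, grammar, R=1.0):
    """exp(-R * KL(TF profile if the candidate is linked || group grammar))."""
    combined = [a + b for a, b in zip(linked_profile, candidate_tf)]
    return math.exp(-R * kl_divergence(normalize(combined), normalize(grammar)))

# Toy example with four TFs; the group grammar requires TF1 and TF2.
grammar = [1, 1, 0, 0]
linked = [1, 0, 0, 0]            # enhancers already linked provide TF1 only
complementary = [0, 1, 0, 0]     # candidate enhancer providing the missing TF2
unrelated = [0, 0, 1, 0]         # candidate enhancer providing an off-grammar TF

score_complementary = tf_link_score(linked, complementary, grammar)
score_unrelated = tf_link_score(linked, unrelated, grammar)
```

In this toy setting the complementary enhancer completes the required TF combination, so its score is higher than that of the unrelated enhancer; a larger `R` widens the gap between the two.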
To avoid the searching for optimal enhancer combinations stuck at certain states through iterations, ComMUTE adopted a Simulated Annealing-based searching strategy and tested different combinations of enhancers and TFs before moving to the next iteration. To tune the unknown model parameters, we tested a wide range of values for the scaling parameter 𝑅 and number of gene groups, and selected the optimal values based on the highest AUROC when benchmarking predicted enhancer-gene links with Capture-C interactions in GM12878 (Figure C.2). Upon convergence, three sets of predictions are made by ComMUTE: (1) probabilistic score of enhancer-gene links, (2) mixture memberships of genes and (3) TF 115 regulatory grammar across gene groups. These outputs systematically delineate how multiple TF synergistically regulate the gene expression through multi-way enhancer regulations. To expand the generalizability of ComMUTE, we predicted enhancer-gene links based on the imputed and non-imputed multi-omics dataset in diverse cell types. In total, four Figure 4.2 Performance comparisons with JEME across 35 gold-standards support the superior performance of ComMUTE. (a) Performance of ComMUTE with shuffled TFs (blue) and only one gene group (purple). For the shuffled TF version, the TF binding profiles are shuffled across all enhancers to disable the TF features of ComMUTE. For the one-gene-group version, all genes are assigned to the same gene groups, aiming to only capture master TF regulators for all genes. Compared with the shuffled TF and one- gene-group version, ComMUTE achieves higher AUROC (y-axis) in both K562 (upper) and GM12878 (upper). (b-c) Examples of ComMUTE predicted enhancer-gene links (blue) based on the combination of NF-κB-CREB (b) and SMAD4-ZBTB33-NF-κB (c). The predicted enhancer-gene links are supported by Hi-C (brown), ChIA-PET (purple) and Capture-C (red). 
versions of predictions are generated: (1) imputed DNase-seq and RNA-seq across 127 cell-types/tissues; (2) imputed H3K27ac and RNA-seq across 127 cell-types/tissues; (3) non-imputed DNase-seq and RNA-seq across 29 cell-types/tissues; and (4) non-imputed H3K27ac and RNA-seq across 29 cell-types/tissues. These large-scale predictions of enhancer-gene links provide a rich resource for understanding the complex multi-enhancer regulations. In this chapter, we focused on the first version because it has the broadest cell-type coverage. Across 127 cell-types/tissues, ComMUTE predicted ~80,000 cell-type-specific enhancer-gene links for each context. The predicted enhancer-gene links follow a 1D genomic distance distribution similar to that of Capture-C interactions. On average, over 20% of enhancers and over 85% of genes have a degree of more than one, suggesting that multi-enhancer regulations are prevalent (Figure C.3). As expected, the predicted enhancer-gene links show significantly higher correlations than random chromatin interactions (Figure C.3), supporting the functional interactions between enhancers and genes. As an example, distinct cell-type-specific regulatory landscapes are predicted at the DHRS3 gene locus across GM12878, K562, H1 and HUVEC (Figure 4.1.e). The predicted enhancer-gene links precisely capture the cell-type-specific DNase peaks. Interestingly, a distal enhancer is linked to the DHRS3 gene (~40kb away) by skipping the nearest gene, suggesting the ability of ComMUTE to capture long-range enhancer-gene links.

4.2.2 Robust performance in predicting enhancer-gene links

To systematically evaluate the performance of ComMUTE, we compared the genome-wide predictions of ComMUTE with 19 experimental chromatin interaction datasets, i.e. Capture-C, Hi-C and ChIA-PET, and 16 tissue-specific eQTL annotations. AUROCs calculated from cross-validation are commonly used to evaluate the performance of supervised learning models.
However, significant concerns have been raised about inflated performance caused by inappropriate partitioning of the training/testing sets and the selection of negative samples. Unlike existing methods, ComMUTE is an unsupervised algorithm that does not need experimental datasets for training and does not need cross-validation for performance evaluation. Given this algorithmic advancement, ComMUTE avoids the risk of overfitting to training labels, and the genome-wide predictions can be directly evaluated against orthogonal experimental chromatin interactions. As a representative example, the predicted enhancer-gene links achieve high AUROC (>0.93) in GM12878 when benchmarked against ChIA-PET interactions and four eQTL datasets, supporting the superior performance of ComMUTE (Figure 4.2.a).

The performance of ComMUTE is compared with six state-of-the-art algorithms: JEME, TargetFinder, IM-PET, RIPPLE, Ernst et al., and FOCS. Since the predictive probabilities of JEME and IM-PET are available across 127 cell types, we compared the performance of ComMUTE with these two methods based on 19 chromatin interaction datasets. Based on the observation that 1D genomic distances can significantly inflate AUROCs, we limited the comparison to the enhancer-gene links evaluated in common by all three methods. This strategy guarantees that these methods are benchmarked on exactly the same set of enhancer-gene links, thus excluding the confounding effects induced by the differential distributions of the input features. Under this rigorous evaluation strategy, ComMUTE demonstrated consistently improved AUROC over JEME and IM-PET (Figure 4.2.b). As shown in Figure 4.2.c, ComMUTE achieved an AUROC of 0.73 in the CD4+ T cell, which is higher than JEME (AUROC: 0.68) and IM-PET (AUROC: 0.64). In addition, the AUPR based on the predictions of ComMUTE (AUPR: 0.04) is also higher than that of JEME (AUPR: 0.03) and IM-PET (AUPR: 0.02).
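The shared-pair evaluation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary layout (enhancer-gene pair mapped to score), the function names, and the toy scores are hypothetical, and the AUROC is computed with a simple pairwise (Mann-Whitney) estimator rather than with whatever tooling was used for the reported results.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    Ties count as one half. The double loop is quadratic, which is fine
    for a small illustration but not for genome-wide data.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_on_shared_pairs(method_scores, gold_standard):
    """Benchmark every method on the pairs scored by all methods.

    method_scores: {method name: {(enhancer, gene): score}} (assumed layout).
    gold_standard: set of (enhancer, gene) pairs supported by experiments.
    """
    shared = set.intersection(*(set(s) for s in method_scores.values()))
    pairs = sorted(shared)
    labels = [1 if pair in gold_standard else 0 for pair in pairs]
    return {
        name: auroc([scores[pair] for pair in pairs], labels)
        for name, scores in method_scores.items()
    }


# Toy example with made-up scores; only the three shared pairs are evaluated.
toy_scores = {
    "method_A": {("e1", "G1"): 0.9, ("e2", "G1"): 0.2,
                 ("e3", "G2"): 0.7, ("e4", "G2"): 0.1},
    "method_B": {("e1", "G1"): 0.3, ("e2", "G1"): 0.6,
                 ("e3", "G2"): 0.8, ("e5", "G3"): 0.9},
}
toy_gold = {("e1", "G1"), ("e3", "G2")}
toy_result = auroc_on_shared_pairs(toy_scores, toy_gold)
```

Restricting every method to the intersection of scored pairs means that differences in AUROC reflect the ranking of the same candidates, not differences in which candidates each method chose to score.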
Globally, ComMUTE outperforms JEME and IM-PET across all 19 comparisons, suggesting a strong agreement between the ComMUTE predictions and experimental chromatin interactions. In addition to the AUROC, we further calculated the enrichment of experimental chromatin interactions among the ranked enhancer-gene links. In almost all scenarios, ComMUTE achieves higher enrichments of true positives than JEME, especially for the top-ranked links (Figure C.6). In addition to JEME and IM-PET, we also compared with RIPPLE, Ernst et al., and FOCS (Figure C.7). The enrichment of ten experimental chromatin interaction datasets among the predicted enhancer-gene links is calculated to quantify the model performance. Because all of these existing methods are trained on experimental data, they naturally tend to prioritize the experimentally validated enhancer-gene links, leading to an inflated enrichment. Strikingly, ComMUTE still demonstrated the highest enrichment in eight of ten comparisons, the exceptions being the comparisons based on Hi-C datasets in HeLa cells and K562 cells (Figure C.8). The predictions based on different epigenomic features also show improved performance compared with the background pairs with 1D genomic distances controlled (Figure C.9).

To demonstrate the usability of ComMUTE in expanding the understanding of gene regulations to cell types without experimental chromatin interaction datasets, we compared the performance of ComMUTE with TargetFinder in predicting cell-type-specific enhancer-gene links. TargetFinder is selected because of its improved performance in making cross-cell-type predictions compared with other methods. We trained TargetFinder in each of six cell types in turn using Hi-C interactions and used the trained model to predict enhancer-gene links in the other five cell types. We compared the resulting 30 sets of cross-cell-type predictions of TargetFinder with the cell-type-specific predictions of ComMUTE.
As shown in Figure C.8, ComMUTE shows consistently high AUPR in all cell types, which further supports the superior performance of ComMUTE in predicting high-quality enhancer-gene links in understudied cell types.

We further studied the cell-type-specificity of the predicted enhancer-gene links in 127 cell types/tissues based on the enrichment of 35 functional interaction datasets. As shown in Figure 4.2.d, different clusters of cell types are identified, suggesting that the predicted enhancer-gene links within each cluster are highly similar and supported by the same sets of gold standards. For example, distinct clusters for B-cells, T-cells, epithelial-related cells, and ES cells are observed, suggesting fundamentally rewired regulatory links during cell differentiation and development.

Figure 4.3 Integration of TF modules improves the predictive accuracy. (a) Performance of ComMUTE with shuffled TFs (blue) and only one gene group (purple). For the shuffled TF version, the TF binding profiles are shuffled across all enhancers to disable the TF features of ComMUTE. For the one-gene-group version, all genes are assigned to the same gene group, aiming to capture only master TF regulators for all genes. Compared with the shuffled TF and one-gene-group versions, ComMUTE achieves higher AUROC (y-axis) in both K562 (upper) and GM12878 (upper). (b-c) Examples of ComMUTE predicted enhancer-gene links (blue) based on the combination of NF-κB-CREB (b) and SMAD4-ZBTB33-NF-κB (c). The predicted enhancer-gene links are supported by Hi-C (brown), ChIA-PET (purple) and Capture-C (red).

4.2.3 Integration of the TF regulatory grammar boosts the predictive accuracy

As a significant algorithmic advancement of ComMUTE, the combinatorial TF modules of gene regulations are integrated to boost the prediction of enhancer-gene links. To demonstrate the contribution of TF-related features in improving the performance of ComMUTE, we specifically tested two cases.
Firstly, we set the number of gene groups to one (termed ‘one gene group’ hereafter). In this case, only the TF regulatory grammar shared by all genes is identified and used to predict enhancer-gene links. In the second case, we shuffled the TF binding profile within enhancers to disable the TF feature (termed ‘shuffled TF’ hereafter). In this case, the prediction of enhancer-gene links is solely based on the epigenomic features and 1D genomic distances. Strikingly, the original configuration of ComMUTE achieves a median AUROC of ~0.7 in GM12878, while the one gene group setting and the shuffled TF setting achieve AUROCs around 0.68 (Figure 4.3.a). Similar decreases in performance are also observed in K562 (Figure 4.3.a). These permutation analyses demonstrate that the improved accuracy of ComMUTE is due to the integration of the TF modules and indicate that the TF modules predicted by ComMUTE can accurately capture the underlying gene regulatory grammar. For example, NF-κB and CREB are well-known co-factors in regulating gene expression. These two TFs are also predicted to synergistically regulate gene expression in our predictions. Given the discovered regulatory grammar of NF-κB and CREB, the predicted multi-enhancer regulations accurately captured the cross-enhancer cooperation of these two TFs. At the NOG gene locus, ComMUTE linked two distal enhancers to the gene promoter across a 122kb genomic window. The predicted enhancer-gene links are extensively supported by Hi-C and ChIA-PET interactions. Interestingly, a strong NF-κB binding peak is observed in the first enhancer but not in the second enhancer, and conversely, CREB only shows high signals in the second enhancer (Figure 4.3.b). The multi-enhancer regulations at the SMAD12 gene locus also capture the mutually exclusive NF-κB and CREB binding signals across co-regulating enhancers (Figure C.10.a).
A more complex TF module of SMAD4, ZBTB33, and NF-κB is captured at the SNRPD1 locus, where three enhancers are linked to the gene promoter and supported by Capture-C interactions (Figure 4.3.c). Although none of the linked enhancers contains binding sites of all three TFs, the multi-enhancer regulation facilitates the formation of the TF module and precisely controls the gene expression. A similar example of the PARS2 gene also shows that three linked enhancers collectively provide the TF module of SMAD4-ZBTB33-NF-κB (Figure C.10.b). Together with the permutation tests, these examples strongly support the superior ability of ComMUTE in decoding the regulatory grammar of long-range gene regulations.

Figure 4.4 Direct functional and physical interactions between predicted co-regulating enhancers. (a) The enhancers regulating the same genes interact directly, which yields higher partial activity correlations conditioned on the common target genes (top). Compared with JEME, ComMUTE predictions show significantly higher partial correlations between co-regulating enhancers. (b) The co-regulating enhancers synergistically regulate the target gene, which yields higher multi-correlation between all enhancers and genes. Compared with JEME, ComMUTE predictions show significantly higher multi-correlations. (c) Co-regulating enhancers predicted by ComMUTE are enriched with direct Hi-C interactions (y-axis) compared with JEME and random enhancer-enhancer interactions (x-axis). (d) Example of multi-enhancer regulations predicted by ComMUTE. ComMUTE predicts the enhancer-gene interactions (blue curve) between seven enhancers (orange) and the WSB1 gene (red) based on the TF combination CHD2-PAX5-IRF3, and five of them are supported by Capture-C interactions (red curve). The direct interactions between enhancers are supported by Hi-C (brown curves) and ChIA-PET (purple curves).
4.2.4 ComMUTE captures direct enhancer-enhancer interactions

In ComMUTE, multiple enhancers are linked to the same gene based on their functional cooperation in regulating gene expression. Here, we demonstrate that the functional and physical interactions between co-regulating enhancers are direct and not mediated by the common target genes. Firstly, we calculated the partial correlations of the enhancer activities between the co-regulating enhancers conditioned on the target gene expression (Figure 4.4.a). The partial correlations evaluate the direct associations of enhancer activities between enhancers by removing the indirect associations mediated by the target gene. Compared with the enhancer modules predicted by JEME, the multi-enhancer regulations predicted by ComMUTE show significantly higher partial correlations, suggesting that the co-regulating enhancers have strong direct functional associations. Secondly, the multi-correlations are calculated to measure the association between the target genes and all co-regulating enhancers, which evaluates the combinatorial regulatory effect of all co-regulating enhancers on gene expression (Figure 4.4.b). ComMUTE achieved a median multi-correlation of 0.6, while JEME achieved only ~0.48, suggesting stronger co-activation patterns between the co-regulating enhancers and target genes. The results of the comparison based on partial correlations and multi-correlations strongly support the accuracy of the multi-enhancer regulations predicted by ComMUTE.

To further verify the direct physical interactions between multiple enhancers, we calculated the enrichment of the Hi-C interactions among all possible pairwise interactions between co-regulating enhancers. Compared with JEME and randomly linked enhancers, ComMUTE demonstrated the highest enrichment (Figure 4.4.c), supporting that the enhancers directly interact with each other in the 3D space.
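The two statistics used in this section can be illustrated with a short Python sketch. It assumes that the partial correlation is obtained by regressing out the target gene expression with ordinary least squares, and that the "multi-correlation" corresponds to the coefficient of multiple correlation (the correlation between the gene expression and its least-squares fit from all co-regulating enhancers). Both choices, the function names, and the simulated data are assumptions made for illustration, not the exact procedure behind the reported values.

```python
import numpy as np


def partial_correlation(enh_a, enh_b, gene):
    """Correlation of two enhancer activity vectors given gene expression.

    Each enhancer is regressed on the gene (with intercept); the residuals
    carry the association that is not mediated by the gene.
    """
    enh_a, enh_b, gene = (np.asarray(v, dtype=float) for v in (enh_a, enh_b, gene))
    design = np.column_stack([np.ones_like(gene), gene])
    res_a = enh_a - design @ np.linalg.lstsq(design, enh_a, rcond=None)[0]
    res_b = enh_b - design @ np.linalg.lstsq(design, enh_b, rcond=None)[0]
    return float(np.corrcoef(res_a, res_b)[0, 1])


def multiple_correlation(gene, enhancers):
    """Coefficient of multiple correlation between a gene and its enhancers.

    enhancers: array of shape (n_cell_types, n_enhancers).
    """
    gene = np.asarray(gene, dtype=float)
    enhancers = np.asarray(enhancers, dtype=float)
    design = np.column_stack([np.ones(len(gene)), enhancers])
    fitted = design @ np.linalg.lstsq(design, gene, rcond=None)[0]
    return float(np.corrcoef(gene, fitted)[0, 1])


# Simulated activities: two enhancers that are related only through the gene.
rng = np.random.default_rng(0)
n_samples = 2000  # far more than 127 cell types; chosen to keep the demo stable
gene_expr = rng.normal(size=n_samples)
enh_1 = gene_expr + 0.5 * rng.normal(size=n_samples)
enh_2 = gene_expr + 0.5 * rng.normal(size=n_samples)

raw_corr = float(np.corrcoef(enh_1, enh_2)[0, 1])            # high, gene-mediated
partial_corr = partial_correlation(enh_1, enh_2, gene_expr)  # close to zero
multi_corr = multiple_correlation(gene_expr, np.column_stack([enh_1, enh_2]))
```

In this simulation the raw correlation between the two enhancers is high while the partial correlation is close to zero, which is the signature of an association mediated entirely by the shared target gene; a high partial correlation, as reported for ComMUTE, points to a direct enhancer-enhancer association instead.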
As a representative example, seven enhancers are linked to the WSB1 gene promoter based on the TF combination of CHD2-PAX5-IRF3 (Figure 4.4.d). Compared with Capture-C interactions, four out of seven predictions are supported, including the interaction of an enhancer that is ~307kb away from the gene promoter. Interestingly, three enhancer-enhancer links are supported by Hi-C and Capture-C interactions. The left-most enhancer is linked to the right-most enhancer across a 464kb genomic window, suggesting the existence of a DNA loop at this locus. In another two examples, the LMTK2 gene and the RALY gene, the predictions of ComMUTE are not only extensively supported by Capture-C but also capture the long-range enhancer-enhancer interactions observed in Hi-C and ChIA-PET data (Figure C.13). These results demonstrate that the co-regulating enhancers directly interact in the 3D space and establish functional cooperation in regulating gene expression.

Figure 4.5 Validation of the predicted multi-enhancer regulations based on SPRITE. (a) The multiple interacting enhancers and genes are densely organized in 3D space and form high-order chromatin interactions, e.g. three-way enhancer-promoter-enhancer interactions. (b) ComMUTE predictions (red) are enriched with SPRITE three-way chromatin interactions (y-axis) under all resolutions (x-axis), compared with JEME (blue) and IM-PET (grey). (c) Example of ComMUTE predicted multi-enhancer regulations (blue curves). The three-way enhancer-promoter-enhancer interactions formed by the co-regulating enhancers and the promoter are extensively supported by SPRITE.

4.2.5 Superior accuracy in predicting multi-enhancer regulations

In addition to evaluating the accuracy of the enhancer-gene links and enhancer-enhancer links, we further evaluated the multi-enhancer regulations based on the SPRITE dataset.
Unlike the evaluations based on pairwise interactions, the integration of the SPRITE dataset facilitates the evaluation of multi-way interactions, i.e. co-regulating enhancers and the target gene. We specifically focused on the three-way interactions, where two enhancers are simultaneously linked to one gene (Figure 4.5.a). By overlapping the predicted three-way enhancer-gene links with SPRITE three-way interactions under different resolutions, the enrichment of SPRITE interactions is calculated to quantify the accuracy of the predicted three-way enhancer-gene links. Strikingly, ComMUTE demonstrates a significantly higher enrichment than JEME and IM-PET across all resolutions (Figure 4.5.b), supporting the high accuracy of the predicted three-way enhancer-gene links. As a representative example (Figure 4.5.c), 13 enhancers are linked to the PDE4DIP gene promoter in a ~114kb genomic window. The possible three-way interactions between two enhancers and the PDE4DIP gene promoter are extensively supported by 15 SPRITE three-way interactions in 1kb resolution. By combining proximal enhancers within the 1kb genomic window into 18 1kb genomic bins, 83% of possible three-way interactions are supported by SPRITE interactions.

4.2.6 ComMUTE decodes the TF regulatory grammars of gene expression

One of the significant contributions of ComMUTE is discovering TF modules that can synergistically regulate target gene expression, which represent the underlying gene regulatory grammars. Based on the predictions of ComMUTE, diverse TF combinations are captured across different gene groups (Figure 4.6.a), suggesting that the gene regulatory grammars are highly complex. Such TF combinations cannot be observed from traditional co-binding analyses for three reasons. First, the flexible Bayesian framework of ComMUTE allows the TFs to bind to some, but not all, of the co-regulating enhancers.
Therefore, instead of co-binding in the 1D space, these TFs interact in the 3D space, which cannot be captured based on the correlations of TF ChIP-seq signals. Second, the prioritized TF modules are predicted for clusters of genes. For each gene cluster, the members are not required to be spatially proximal and could be far away from each other or even located on different chromosomes. Thus, the long-range functional gene-gene relationships cannot be captured by co-binding analyses based on the sequential TF ChIP-seq signals. Third, ComMUTE only prioritizes the functional TF combinations. Although TF motifs are frequently observed within co-regulating enhancers, only a few of them are functional and contribute to gene regulation. By modeling the TF grammar of gene regulations, only master TF regulators are used for predicting enhancer-gene links. Compared with the TF groups captured by hierarchical clustering analysis (Figure C.14), the TF profiles predicted by ComMUTE show clear TF enrichments.

Figure 4.6 Accurate predictions of cooperative TF modules. (a) Gene-group-specific TF combinations predicted by ComMUTE. The heatmap shows the TF enrichments (columns) across all gene groups (rows). Twenty clusters of TFs are observed. (b) TF modules predicted by ComMUTE are significantly co-expressed across cell-types compared with random TF combinations (purple). (c) The TF modules predicted by ComMUTE are enriched with PPIs (x-axis) compared with controls (blue and orange curves). (d) Example of ComMUTE predicted enhancer-gene links with CTCF binding sites in one enhancer and YY1 binding sites in another enhancer. The regulatory function of the PPI between CTCF and YY1 is well-studied.

To evaluate the predicted TF regulatory grammar, we calculated the pairwise correlations of the TF gene expression within TF modules predicted by ComMUTE (Figure 4.6.b). As a comparison, the correlations based on the randomly paired TFs are calculated.
To further control the potential bias caused by the different occurrence of motifs across TFs, we generated a more stringent control by randomly pairing TFs that are captured by at least one TF module. Compared with both controls, the TF modules prioritized by ComMUTE show the highest correlations of activities, supporting the functional interactions of the TF grammar. We further validated the predicted TF modules using the PPI datasets. Compared with controls, the predicted TF modules are highly enriched with PPIs, suggesting that these TFs are not only functionally associated but also physically interacting (Figure 4.6.c). As a representative example, Figure 4.6.d shows two co-regulating enhancers that are predicted from the YY1-CTCF TF combination. Interestingly, YY1 only binds to the right-most enhancer and CTCF only binds to the left-most enhancer, suggesting that YY1 and CTCF are co-localized in the 3D space rather than in the 1D genome. The PPI between YY1 and CTCF is well known and plays an important role in establishing chromatin interactions and gene regulations, which is consistent with the observed long-range multi-enhancer regulation of the SLC39A6 gene (Figure 4.6.d).

Figure 4.7 Predicted enhancer-gene interactions are enriched with eQTLs. (a) Schematic figure of eQTL SNPs located in the enhancers, whose functional interactions with the target genes are mediated by enhancer-gene interactions. (b) Example of the ComMUTE predicted enhancer-gene link (blue curve) mediating the SNP-gene interaction of an eQTL (purple curve, rs1403222, p-value: 3.147 × 10^-49). The predicted enhancer-gene interaction is supported by the Capture-C interaction (red curve). (c) Global enrichment of eQTLs from different resources (x-axis) in predicted enhancer-gene links of ComMUTE, JEME and random controls.
In all groups, ComMUTE predictions show significantly higher enrichment compared with JEME predictions (p-value < 2.2 × 10^-16, Binomial test). (d) Enrichment of GWAS SNPs in predicted enhancer-gene links. (e) Multi-enhancer regulations capture epistasis eQTLs. The interactions between SNPs in regulating the common gene expression are mediated by three-way enhancer-promoter-enhancer interactions. (f) Enrichment of epistasis eQTLs (y-axis) in the multi-enhancer regulatory networks. Multi-enhancer regulatory networks predicted by ComMUTE are significantly enriched with epistasis eQTLs compared with JEME (p-value = 1.73 × 10^-5, Binomial test). (g) Example of predicted multi-enhancer regulatory networks (orange curves) mediating epistasis eQTLs. The SNPs of two pairs of epistasis eQTLs of the TUBB6 gene, i.e. the rs12966726-rs7229921 pair and the rs12966726-rs8092506 pair, are located in the interacting enhancers and regulate the gene through enhancer-gene links.

4.2.7 Predicted enhancer-gene links are enriched with QTLs and GWAS SNPs

To further support the superior accuracy of ComMUTE, we utilized the functional genomic datasets of eQTLs and hQTLs by calculating the enrichment of QTLs in predicted long-range enhancer-gene links. The enhancer-gene links are supported by the QTL datasets if the enhancers harbor the SNPs and are linked to the same target genes or histone peaks (Figure 4.7.a). As a representative example, ComMUTE predicted one enhancer-gene link at the GIMAP4 gene locus, which is supported by the Capture-C interactions (Figure 4.7.b). Interestingly, the linked enhancer harbors a significant eQTL of the GIMAP4 gene (rs1403222, p-value = 3.15 × 10^-49). A similar example at the IL6R gene locus is also shown in Figure C.15, where four eQTLs of the IL6R gene are precisely captured by four predicted enhancer-gene links.
Globally, we compared the QTL enrichment of ComMUTE with JEME and randomly linked enhancer-gene pairs with 1D genomic distance controlled and observed significantly higher enrichments for ComMUTE (p-value < 2.2 × 10^-16) across five QTL datasets (Figure 4.7.c). These results not only support the superior performance of ComMUTE in discovering the functional interactions between enhancers and genes but also suggest that the SNP-gene associations are mediated by the physical enhancer-gene links.

We further interpreted the GWAS SNPs based on the predicted enhancer-gene links. Compared with JEME and distance-controlled random links, ComMUTE predictions are significantly enriched with GWAS SNPs, suggesting that the interacting enhancers are also strongly associated with disease phenotypes (Figure 4.7.d). Figure 4.7.e shows an example of the leukemia-associated SNP rs3024505. Based on the predictions of ComMUTE, the enhancer that contains the SNP rs3024505 is linked to the MAPKAPK4 gene, which is a well-known leukemia-related gene, based on the TF grammar of STAT1 and RUNX3. By overlapping the SNP location with the TF ChIP-seq signals, we found that the SNP is precisely located within the summits of the TF ChIP-seq peaks, which further supports the predicted regulatory effect of the combination of STAT1 and RUNX3. These examples, together with the global enrichment of QTLs and GWAS SNPs, provide a mechanistic interpretation of the disease association of the SNP, where the SNP disrupts the TF binding sites within enhancers and dysregulates the disease-related genes.

4.2.8 Multi-enhancer regulations unravel the regulatory basis of epistasis-QTLs

The key feature that distinguishes ComMUTE from existing methods is the prediction of multi-enhancer regulations with close spatial proximity and strong functional associations, which delineates the high-order chromatin interaction landscape. In the eQTL analyses, the high-order interactions between SNPs, i.e.
epistasis eQTLs, are predicted to be associated with the disease phenotypes. However, the traditional analyses require a large number of tests to evaluate all possible combinations of SNPs, which hampers their genome-wide application. Here, we show that the multi-way enhancer-gene interactions predicted by ComMUTE can help to discover the high-order interactions of SNPs within co-regulating enhancers. To demonstrate that the predicted multi-enhancer regulations can precisely capture the epistasis eQTLs, we overlapped the SNPs of the epistasis eQTLs with the co-regulating enhancers that share the same target genes and calculated the enrichment of epistasis eQTLs. Compared with the predictions of JEME and random links, ComMUTE achieves a significantly higher enrichment, suggesting that the predicted multi-enhancer regulations are not only physically interacting but also functionally associated (Figure 4.7.f). Taking the HLA-DQA1 gene as an example, the interaction of two SNPs (rs9270911 and rs9271589) is predicted to be associated with the gene expression. Strikingly, the two epistasis eQTLs are captured by the multi-enhancer regulations predicted by ComMUTE, where the co-regulating enhancers harbor the SNPs. Another example at the TUBB6 gene locus is also shown in Figure 4.7.h. Two pairs of SNPs, i.e. rs12966726-rs7229921 and rs12966726-rs8092506, were identified as statistically significant epistasis eQTLs of the TUBB6 gene in the previous analysis. Interestingly, both pairs of epistasis eQTLs are captured by the co-regulating module of six enhancers based on the predictions of ComMUTE.

4.3 DISCUSSION

In this study, we developed a Bayesian graphical model, ComMUTE, to predict the multi-enhancer regulations across 127 cell types/tissues.
By jointly modeling the TF binding of all co-regulating enhancers, ComMUTE captures the synergistic effect of multiple enhancers in regulating target gene expression and unravels the complex high-order gene regulatory landscape. As an unsupervised learning algorithm, ComMUTE does not require experimental chromatin interaction datasets for training, thus showing strong generalizability compared with existing supervised learning algorithms. Furthermore, the unsupervised framework of ComMUTE addresses the overfitting risks of the existing algorithms and facilitates rigorous evaluation of the model performance. By extensively comparing the performance with existing cutting-edge algorithms based on 19 experimental chromatin interaction datasets, ComMUTE demonstrates consistently improved performance in predicting long-range enhancer-gene interactions. Based on the permutation analyses, we show that the integration of the TF binding sites and gene regulatory grammars can significantly improve the performance of ComMUTE, suggesting that the cooperation of TFs is important in decoding the complex enhancer-gene regulatory networks.

We applied ComMUTE genome-wide in 127 cell-types/tissues based on four sets of imputed and non-imputed epigenomic and transcriptomic datasets to delineate the multi-enhancer regulations (Figure C.4). We show that the co-regulating enhancers predicted by ComMUTE demonstrate strong partial correlations, multi-correlations, and enrichment of Hi-C-supported enhancer-enhancer interactions, supporting the direct functional and physical interactions between these enhancers. We highlighted several examples where the long-range co-regulating enhancers are directly linked by Hi-C and ChIA-PET interactions. These results strongly support the utility of ComMUTE in predicting the large-scale cell-type-specific multi-enhancer regulatory landscape.
Beyond predicting enhancer-gene links, the predicted TF grammars are also supported by PPIs and co-activation patterns, which suggest new biological insights into gene regulation.

The accurate prediction of the multi-enhancer regulatory landscape provides new avenues to mechanistically interpret QTLs, GWAS SNPs, and epistasis eQTLs. The high consistency between the functional genomic datasets and predicted enhancer-gene links further supports the accuracy of the predicted enhancer-gene links. More importantly, these results suggest that the SNP-gene associations and SNP-disease associations are mediated by the enhancer-gene links, which bring the SNPs into proximity with the gene promoters and indirectly control the phenotypes. As a unique contribution of ComMUTE, the predicted multi-enhancer regulations facilitate the interpretation of the epistasis eQTLs, where the co-regulating enhancers bring SNPs into the proximal 3D neighborhoods of the target gene promoters. Together with the global enrichment analyses, we highlighted several examples where the SNPs of epistasis eQTLs are accurately captured by the co-regulating enhancers. Therefore, the predicted multi-enhancer regulations can help explain the discovered QTLs and GWAS SNPs and provide a mechanistic approach to significantly reduce the required number of tests in predicting epistasis eQTLs.

CHAPTER 5

PREDICT LONG-RANGE ENHANCER REGULATION BASED ON PROTEIN-PROTEIN INTERACTIONS BETWEEN TRANSCRIPTION FACTORS

A modified version of this chapter was previously published (Wang H. et al., 2021): Wang H.*, Huang B.*, and Wang J. (2021) Predict long-range enhancer regulation based on protein-protein interactions between transcription factors. Nucleic Acids Research.

5.1 INTRODUCTION

Cell-type specific transcriptional regulation plays important roles in differentiation and development [90-102]. In addition to proximal regulatory elements, e.g.
promoters, which are located around transcriptional start sites (TSS) of genes, distal enhancers provide complex and precise controls on gene expression through long-range regulation [103, 104]. Based on recent genome-wide enhancer annotations from the ENCODE and Roadmap Epigenomics projects [50, 86], hundreds of thousands of putative enhancers across the whole human genome have been identified, especially in non-coding regions, highlighting the biological impacts of enhancer regulation. Although a series of computational algorithms have been developed to predict the genomic locations of cell-type specific enhancers [105, 106], it remains challenging to identify the specific target genes regulated by enhancers in different cell-types or tissues. Unlike promoters, enhancers are usually located far away from their target genes along the genome [107], and the nearest genes may not be regulated by a proximal enhancer [108]. In three-dimensional (3D) space, an enhancer and its target genes are placed close to each other through long-range chromatin interactions, i.e. enhancer-promoter interactions [109].

The discoveries of tissue-specific long-range enhancer regulation have the potential to enable novel insights in a wide range of different biological studies. As one of the canonical examples, long-range regulation by distal enhancers plays pivotal roles in controlling the tissue- and condition-specific expression of the mouse β-globin (Hbb) gene [90, 94, 95]. As another well-known example, the expression of the Shh gene in mouse limb bud is precisely regulated by a distal enhancer located 850kb away, which is critical for proper limb development [96-98, 110]. In addition to normal tissue development, the annotation of long-range enhancer regulation has also facilitated the interpretation of genetic variants underlying complex diseases.
A non-coding genetic variant associated with obesity is located in an intron of the FTO gene but regulates the IRX3 and IRX5 genes that are located >400kb away 91, 99, 111. Similar examples of long-range interactions linking disease-associated genetic variants to distal genes have also been found in studies of autoimmune diseases 92, 93, 100-102. Given the functional importance of long-range enhancer regulation, experimental techniques have been developed to identify chromatin interactions linking distal enhancers to promoters of their target genes. Based on the pioneering chromosome conformation capture (3C) technology 112, along with its derivatives of 4C and 5C 113, 114, the genome-wide version, i.e. Hi-C 7, has been applied to several human cell-types and tissues 10, 45, 50. Furthermore, the promoter-enriched genome conformation assay, Capture Hi-C 115, improves the resolution and cell-type specificity of the identified chromatin interactions for gene promoters 116. On the other hand, the method of chromatin interaction analysis with paired-end-tag sequencing (ChIA-PET) 117 was developed to capture long-range chromatin interactions associated with a protein of interest, such as a specific transcription factor (TF), with high resolution and cell-type specificity 118. These cutting-edge technologies have generated large-scale chromatin contact maps for a number of cell-types or tissues in the human genome and other model species 10, 45, 50, 118. Although experimental techniques have substantially expanded the catalog of annotations for long-range chromatin interactions, there are several limitations that hinder in-depth analysis of cell-type specific enhancer-promoter interactions. First, the resolution of interacting genomic anchors profiled by Hi-C and Capture Hi-C is relatively low (~5-10kb genomic fragments) 10, 115, which makes it difficult to pinpoint the specific enhancers involved in long-range regulation.
Second, while Capture Hi-C and ChIA-PET experiments can discover cell-type or tissue-specific enhancer regulation, data generated by Hi-C experiments have been found to be largely invariant across different cell-types or tissues 119. Third, the background noise levels of Hi-C and Capture Hi-C datasets are high, leading to many false positive discoveries 71. Fourth, due to the dependency on specific protein antibodies, such as CTCF or RNA Pol II 118, each ChIA-PET experiment can only profile a subset of long-range interactions, resulting in large numbers of false negative interactions that are not identified 120. Because of these limitations, computational models are needed to predict cell-type specific long-range enhancer regulation, based on integration of multi-omics signatures, e.g. genomics, transcriptomics, and epigenomics. Large-scale multi-omics data resources collected by the ENCODE and Roadmap Epigenomics projects contain the multi-view information of gene regulation 50, including gene expression, transcription factor binding and histone modifications. They can help to overcome the limitations of experimental techniques because they are cell-type or tissue specific 121, provide a high-resolution signal landscape along the genome 85, 122, have a high signal-to-noise ratio 122, and cover the genomic binding sites for diverse transcription factors 50. The existing computational models of long-range enhancer-promoter interaction prediction can be grouped into two classes. For the first class, i.e. supervised algorithms, 3D chromatin interactions profiled by experimental techniques are used as labels for enhancer-promoter pairs.
The commonly used features include: 1) cell-type specific gene expression based on RNA-seq data; 2) enhancer activity based on specific epigenetic signals, such as H3K4me1, H3K27ac or DNase hypersensitivity; 3) genomic separation distance between enhancers and gene promoters; and 4) correlations between gene expression and enhancer activity. Supervised methods incorporating some or all of these features include RIPPLE 123, FOCS 124, EAGLE 125 and JEME 126. As one of the most recently developed supervised methods, JEME 126 employs a combined approach of regression and random forest to predict long-range regulatory links between enhancers and genes. However, it requires multi-omics datasets from a large panel of diverse cell-types and tissues as inputs, which are usually not available to users. The other two top-performing methods are IM-PET 127 and TargetFinder 128. These two algorithms not only integrate the features described above but also leverage additional features of transcription factor binding in promoters, enhancers, or genomic windows between enhancers and promoters. With respect to machine learning techniques, IM-PET employs a random forest model, and TargetFinder implements a boosting tree approach. For the second class, i.e. unsupervised algorithms, every enhancer-promoter pair is assigned a score and then ranked based on the scores. Top-ranking enhancer-promoter pairs are predicted to interact with each other. The scores are generally based on genomic separation distance and co-activity patterns, e.g. correlations, between enhancers and genes 129-131. Based on a systematic performance evaluation analysis 132, supervised

Figure 5.1 Schema of ProTECT in predicting PPI mediated enhancer-gene links. (a) The enhancer-promoter interactions are regulated by PPIs between enhancer-binding TFs (brown) and promoter-binding TFs (blue), which link distal enhancers (orange) to the proximity of promoters (red) in 3D chromatin structure.
(b) Enrichment of TF-TF pairs in Hi-C interactions (y-axis) compared to background (x-axis). Points represent TF-TF pairs. Frequency is calculated as the fraction of enhancer-gene pairs containing the specific TF-TF pairs. Fold-change (FC) is the ratio of the frequency in Hi-C interactions over the frequency in background. TF-TF pairs are colored by the FC (red: FC>2; orange: 1 0). By controlling these three key sets of confounding factors, we thus construct the rigorous balanced training dataset for robust model training and performance evaluation. In total, the balanced training dataset contains 5,348 enhancer-promoter pairs in GM12878 and 8,650 enhancer-promoter pairs in K562. Based on the cell-type specific multi-omics datasets, the matrix of features is then constructed for enhancer-promoter pairs in the training dataset (Figure 5.1.e). There are three types of features incorporated into the model: 1) activity-based features; 2) genomic distance; and 3) TF PPI features. Activity-based features include (i) cell-type specific enhancer activity measured by DNase-seq signals as described above 86; (ii) cell-type specific gene expression measured by RNA-seq 86; and (iii) the activity correlations between enhancers and their paired genes calculated from diverse cell-types profiled in the ENCODE and Roadmap Epigenomics projects 50, 86. All these activity-based features are differentially distributed across positive and negative training sets, suggesting they are informative for making predictions (Figure D.2.A-C). For each enhancer-gene pair, the genomic distance is calculated as the distance between the center of the enhancer and the gene’s TSS. Although genomic distances have been controlled in the positive and negative training sets based on genomic bins, there might be residual distance bias within bins.
Therefore, the inclusion of genomic distances into the feature matrix captures the residual effects of genomic distances, leading to robust feature prioritization in subsequent analyses. TF PPIs are the most important set of features for the model because of both the mechanistic relationship with long-range regulation 140, 141, 155 and their significant enrichment in enhancer-promoter interactions (Figure 5.1.b, 5.1.c and Figure D.2.D). In each specific cell-type (i.e. GM12878 or K562 cells), all TFs with available ChIP-seq datasets are collected as described above and compared with the PPI database 148. From the pool of all candidate pairs, the TF-TF pairs that are capable of forming direct PPIs are considered as TF PPIs. Considering the differences of binding sites in enhancers or promoters, each TF PPI pair is assigned two directional features. For example, TFa-TFb represents the PPI between enhancer-binding TFa and promoter-binding TFb, while TFb-TFa represents the PPI between enhancer-binding TFb and promoter-binding TFa. Thus, a set of directional TF PPI features is generated. Because the features are generated only for TFs with cell-type specific ChIP-seq signals, PPIs between TFs that are not active in the specific cell-type do not participate in the predictions. Enhancer-promoter pairs are scanned for TF binding peaks in enhancers and promoters. For each enhancer-promoter pair, if TFa binds to the enhancer and TFb binds to the promoter, then the directional PPI feature TFa-TFb is labeled as 1. Therefore, a matrix of TF PPI features is constructed for all enhancer-promoter pairs. Combining with the activity-based features and genomic distances, the full matrix of features is then built (Figure 5.1.e).

5.2.3 Hierarchical TF community detection on the PPI network
Due to the large number of TF PPI features, dimension reduction is fundamentally important for the construction of robust predictive models.
Without dimension reduction, there are 1,888 TF PPI features in GM12878 and 7,066 TF PPI features in K562 cells. Although a number of TF PPIs are enriched in enhancer-promoter interactions (Figure 5.1.b and 5.1.c), direct incorporation of these TF PPI features makes the model over-complicated, leading to poor generalization of predictions. To illustrate the significant overfitting issues of direct incorporation of high-dimensional TF PPI features, a basic random forest model is used to test the performance in GM12878 10. The features include the activity correlations between enhancers and genes, genomic distances, and 1,888 active TF PPI features. Although the regular 5-fold cross-validation shows an AUC of 0.89, a rigorous genomic-bin split cross-validation (see subsequent sections on cross-validation) shows the unbiased AUC as 0.55, suggesting strong overfitting problems without advanced feature dimension reductions (Figure D.3). Thus, a novel predictive model is needed for predicting long-range enhancer-promoter interactions based on PPI features among transcription factors. To address the over-fitting problem, we substantially reduce the feature dimensions by hierarchically grouping individual TF PPIs into TF PPI modules based on the topology of the PPI network, while maintaining the predictability of the model (Figure 5.1.e). TF PPI modules represent densely connected groups of TFs in the PPI network, and they are hierarchically organized where smaller PPI modules merge together to form larger modules (Figure D.4). Biologically, using TF PPI modules as features is consistent with the regulatory mechanisms of long-range chromatin loops, because multiple TFs usually interact with each other as protein complexes. Empirically, the biological relevance of TF PPI modules is also supported by the data.
As can be seen in Figure D.5, similar to individual TF-TF pairs, a specific subset of TF modules are strongly enriched in enhancer-promoter Hi-C interactions and are strongly supported by PPI connections (p-value = $1.39 \times 10^{-2}$, permutation test). TF PPI modules are computationally identified from the PPI network 148 using a random-walk based network-community detection approach. The PPI network, including non-TF protein nodes, is modeled as an undirected weighted graph, where the weights on edges are the ‘Experiment’ PPI scores from the STRING database 148. Define $W$ as the adjacency matrix of the PPI network, and define the diagonal degree matrix $D$ as $D_{ii} = \sum_j W_{ij}$. Hence, based on the stochastic model of random-walks on graphs 156, the 1-step transition probability from node $i$ to node $j$ is $W_{ij}/D_{ii}$, and the p-step transition matrix $\mathrm{Trans}_p$ can be calculated as $\mathrm{Trans}_p = (D^{-1} W)^p$. Based on the p-step transition matrix, the pairwise distance matrix between TFs (denoted as $R$) can be further calculated as: $R = \mathrm{diag}(G)^t \mathbf{1} + \mathbf{1}^t \mathrm{diag}(G) - 2G$, where $G = \mathrm{Trans}_p \, \mathrm{Trans}_p^t$. Each entry in the matrix $R$ quantifies the distance between a pair of TFs based on the PPI network structure. Hierarchical clustering is then applied to the pairwise distance matrix $R$ to identify hierarchical PPI modules of TFs (Figure 5.1.e). Ward's method is used in the hierarchical clustering as suggested by previous studies of network-community detections 157. By testing multiple values (Figure D.4.A and 5.4.b), $p$ is set to be 20 in order to balance the detection of both local (i.e. small-size) and global (i.e. large-size) modules (Supplementary Methods). In the constructed hierarchical clustering tree, the leaf nodes are individual TF PPIs. By applying the bottom-up merging strategy on the tree, individual TF PPIs are first grouped into small-size PPI modules, i.e. S-modules, with the maximum size of $S_{max}$.
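The random-walk distance and the hierarchical clustering step described above can be sketched as follows. This is a minimal illustration and not the ProTECT implementation: the function names `random_walk_distance` and `ward_tree` are hypothetical, the network is assumed to have no isolated nodes (so every $D_{ii} > 0$), and taking the square root of $R$ before clustering is an assumption made here because SciPy's Ward linkage expects Euclidean distances, while $R$ as defined above holds squared Euclidean distances between rows of the p-step transition matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def random_walk_distance(W, p=20):
    """Return R with R_ij = G_ii + G_jj - 2 * G_ij, where G = Trans_p @ Trans_p.T."""
    W = np.asarray(W, dtype=float)
    degree = W.sum(axis=1)                        # D_ii = sum_j W_ij
    trans_1 = W / degree[:, None]                 # 1-step transitions: W_ij / D_ii
    trans_p = np.linalg.matrix_power(trans_1, p)  # (D^-1 W)^p
    G = trans_p @ trans_p.T
    g = np.diag(G)
    R = g[:, None] + g[None, :] - 2.0 * G
    R = np.maximum(R, 0.0)                        # clip tiny negatives from round-off
    np.fill_diagonal(R, 0.0)
    return R


def ward_tree(R):
    """Ward hierarchical clustering on the random-walk distances (assumption: sqrt of R)."""
    condensed = squareform(np.sqrt(R), checks=False)
    return linkage(condensed, method="ward")


# Toy network (hypothetical): two triangles joined by one weak edge.
W_toy = np.array([
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
])
R_toy = random_walk_distance(W_toy, p=5)
toy_labels = fcluster(ward_tree(R_toy), t=2, criterion="maxclust")
```

In this toy graph the two triangles fall into separate clusters when the tree is cut into two groups; cutting the same tree at different heights yields the nested small and large modules.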
S-modules represent densely connected TFs in the PPI network, corresponding to candidate protein complexes. S-modules are further merged to form large-size PPI modules, i.e. L-modules, with the maximum size of $L_{max}$. L-modules represent larger PPI network components that cover multiple densely connected S-modules. Biologically, L-modules represent candidate groups of highly interacting protein complexes. The maximum sizes for S-modules ($S_{max}$) and L-modules ($L_{max}$) are selected based on the modularity score of the clustering 158 (Figure D.4, Supplementary Methods). The modularity score $Q$ is defined as $Q = \frac{1}{2m} \sum_{ij} \left( W_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$, where $W$ is the adjacency matrix, $k_i$ is the degree of node $i$, $m$ is the total number of edges in the PPI network ($m = \frac{1}{2} \sum_i k_i$), $c_i$ is the membership assignment to modules for node $i$, and $\delta(c_i, c_j)$ equals 1 if nodes $i$ and $j$ belong to the same module and 0 otherwise. Modularity scores are extensively calculated for different choices of maximum module sizes (Figure D.4.C and 5.4.d), because the choice of specific maximum module sizes automatically determines the total number of modules and results in the final module membership assignments. The optimal size of S-modules is selected as the one yielding the maximum modularity score, which guarantees that the generated S-modules represent densely connected TF groups. The optimal size of L-modules is selected as the one corresponding to the elbow point of modularity score curves, leading to the delineation of large-scale PPI components without significant loss of modularity. Compared to the Markov Cluster Algorithm, the PPI modules from our approach demonstrate higher modularity scores and larger module sizes (Figure D.6), which is desirable for feature dimension reduction. Using this procedure, a two-layer hierarchical modular structure is finally built and each individual TF PPI is assigned membership in a specific S-module and a specific L-module. Based on the TF PPI module assignments, individual TF PPI features (i.e.
direct TF-TF PPIs) are merged into module-level PPI features, and, therefore, the feature matrix of TF PPIs is restructured accordingly (Figure 5.1.e). There are two types of module-level PPI features: (i) intra-module features, which include all S-modules and L-modules and cover PPIs between TFs within the same modules; and (ii) inter-module features, which include inter S-module features and inter L-module features and cover PPIs linking TFs from two different modules. Given a pair of S-modules, e.g. S-module 𝑎 and S-module 𝑏, if there exists a TF member from S-module 𝑎 that has a PPI with a TF member from S-module 𝑏, then the pair of S-modules 𝑎 and 𝑏 is included into the feature matrix as one inter S-module PPI feature. The inter L-module PPI features are defined in the same way by checking PPIs of TF members from two L-modules. Each inter-module feature is further split into two directional features, depending on the binding sites of TF members in enhancers and promoters. Using this approach, the PPI features are substantially reduced. For example, the 1,888 individual TF PPI features are reduced to only 78 module-level PPI features in GM12878 and the 7,066 individual TF PPI features are reduced to only 238 module-level PPI features in K562 cells. The training set of enhancer-promoter pairs is then scanned for module-level PPI features. For each specific enhancer-promoter pair, based on the counts of individual TF PPI features calculated in the previous step, the counts of module-level PPI features are generated depending on the module memberships of TFs (Figure 5.1.e). For each module-level PPI feature, if multiple TF PPI features are found for an enhancer-promoter pair, the maximum count is used for the module-level feature. Although the number of features is substantially reduced after using module-level PPIs, the specific PPI information is still maintained in this procedure, as shown in Figure D.5.
This suggests that the module-based dimension reduction does not cause the loss of information, while substantially reducing the risk of over-fitting.

5.2.4 Predictive model of long-range enhancer-promoter interactions
A random forest model is used to predict cell-type specific long-range enhancer-promoter interactions based on the feature matrix constructed above, after module-based dimension reduction (Figure 5.1.e). The random forest model is selected due to its superior performance in handling non-linear feature dependency and its capability of prioritizing the key set of important features for subsequent biological interpretations. As a free model parameter, the number of decision trees in the model is extensively tested with different values, and the accuracy of predictions is found to be robust (Figure D.7). Additionally, to quantitatively demonstrate the contributions from TF PPIs, we train random forest models based on two versions of input features: 1) only activity-based features and genomic distances; and 2) the full set of features including module-level TF PPI features. The Area Under Curve (AUC) values of cross-validations are calculated for the two versions. The increased AUC from version 2 is the quantitative measurement of the additional information contributed from TF PPIs that is not encoded in activity-based or genomic distance features.

5.2.5 Feature selection
In the random forest model, the backward feature elimination approach is used to select useful module-level TF PPI features, where the features with the minimum importance are recursively eliminated from the model. Furthermore, the statistical significance of the directions of TF PPI features is evaluated. As described in the previous section, every module-level PPI feature is split into a pair of two directional features, based on the binding sites of TFs in enhancers or promoters.
For example, the feature module 𝑎 - module 𝑏 represents the PPI between an enhancer-binding TF member from module 𝑎 and a promoter-binding TF member from module 𝑏. Conversely, the feature module 𝑏 - module 𝑎 represents the PPI between an enhancer-binding TF member from module 𝑏 and a promoter-binding TF member from module 𝑎. Based on the statistical evaluation of the feature directions, insignificant directional features are merged into un-directional features. This feature merging procedure not only reduces the number of features but also reveals the biological roles of TF bindings in the context of different binding orientations. Determining whether a pair of directional TF PPI features should be merged into an un-directional feature is a model selection problem. While the Akaike Information Criterion (AIC) has been a widely used metric for parametric models, it cannot be applied to random forest models, which are non-parametric. Instead, we use the Generalized Degrees of Freedom (GDF) method to calculate a relaxed AIC 159 for the random forest model. GDF is a metric to evaluate the degrees of freedom for Bernoulli distributed data, e.g. the binary labels for enhancer-promoter interactions. It is defined as $GDF \approx \sum_i (\hat{y}'_i - \hat{y}_i)/(y'_i - y_i)$, where $y_i$ is the observed label for data point $i$, $y'_i$ is the perturbed label obtained by inverting $y_i$, i.e. $y'_i = 1 - y_i$, $\hat{y}_i$ is the predicted label from the model using the unperturbed $y_i$, and $\hat{y}'_i$ is the predicted label from the model using the perturbed $y'_i$. As suggested by previous studies 159, to calculate GDF, 20% of the samples are simultaneously perturbed. The relaxed AIC of the random forest model is then estimated as $AIC = -2 l_m + 2\,GDF + GDF(GDF + 1)/(N - GDF - 1)$, where $N$ represents the total number of data points and $l_m$ represents the goodness-of-fit of the random forest model. As suggested by previous analyses 159, $l_m$ is calculated as the averaged $R^2$ value from 5-fold cross-validations.
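The two formulas above can be written as small helper functions. This is a sketch under stated assumptions rather than the code used in this work: the names `gdf_from_perturbation` and `relaxed_aic` are hypothetical, the GDF sum is taken only over the data points whose labels were actually inverted (where the denominator is nonzero), and the predictions are assumed to be numeric scores such as predicted probabilities.

```python
def gdf_from_perturbation(y, y_perturbed, pred, pred_perturbed):
    """Approximate GDF as sum_i (yhat'_i - yhat_i) / (y'_i - y_i).

    Assumption: only points whose label was inverted contribute,
    because y'_i - y_i is zero for every unperturbed point.
    """
    total = 0.0
    for y_i, yp_i, f_i, fp_i in zip(y, y_perturbed, pred, pred_perturbed):
        if yp_i != y_i:
            total += (fp_i - f_i) / (yp_i - y_i)
    return total


def relaxed_aic(l_m, gdf, n):
    """Relaxed AIC = -2*l_m + 2*GDF + GDF*(GDF + 1) / (N - GDF - 1)."""
    if n - gdf - 1 <= 0:
        raise ValueError("N must exceed GDF + 1")
    return -2.0 * l_m + 2.0 * gdf + gdf * (gdf + 1.0) / (n - gdf - 1.0)


# Made-up numbers for illustration only.
example_gdf = gdf_from_perturbation(
    y=[1, 0, 1, 0],
    y_perturbed=[0, 0, 1, 1],          # labels of points 0 and 3 are inverted
    pred=[0.9, 0.2, 0.8, 0.1],
    pred_perturbed=[0.6, 0.2, 0.8, 0.3],
)
example_aic = relaxed_aic(l_m=0.5, gdf=10.0, n=111)
```

With such helpers, a directional pair would be merged when the relaxed AIC of the model with the merged feature is smaller than that of the model with the two directional features.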
For each pair of directional TF PPI features, the relaxed AIC metrics are calculated before and after they are merged into an un-directional feature. If a smaller AIC is observed by merging the two directional features, the model with the merged un-directional feature is then selected, because the reduced AIC suggests the directions of the pair are not statistically important. This procedure is conducted for all pairs of directional TF PPI features, and a final random forest model with the selected features is built. In GM12878 cells, the number of module-level TF PPI features is reduced to 53 from 78. In K562 cells, the number is reduced to 139 from 238. This feature selection process further boosts the generalizability of our model and improves the biological interpretations of the learned TF PPI features (i.e. directional or un-directional).

5.2.6 Cross-validation and performance comparison
To evaluate the performance of our model, measured by the Area Under Curve (AUC), we designed a stringent strategy of 5-fold cross-validation. As highlighted by previous studies 132, 133, multiple factors have been found to substantially inflate the performance evaluations and cause overfitting problems. First, the confounding factors (i.e. TAD domain structures, genomic distances between enhancers and promoters, and gene expression levels) need to be controlled. Otherwise, the performance will be biased and dominated by confounding factors. We addressed this issue in the step of data generation as described in previous sections. Negative samples are randomly generated with the confounding factors controlled to have the same distributions as seen from the positive samples. Second, inflated cross-validation AUC can be found due to the spatially proximal enhancer-promoter pairs across the training and testing datasets 132, 133.
Because TF binding profiles are highly correlated among enhancers and promoters in neighboring genomic regions, proximal enhancer-promoter interactions that are allocated in the testing set will substantially inflate the accuracy. Therefore, random splits of samples based on typical cross-validation may suffer from the dependency of spatially proximal samples allocated in both training and testing sets, as has been noted in previous studies 132, 133. To address this issue, we developed a genomic bin-split cross-validation approach (Figure 5.1.e). In this approach, the human genome is first divided into consecutive 1Mb bins. In each of the 5-fold cross-validation steps, 80% of the genomic bins are selected as training bins, and the balanced and confounding factor controlled samples of enhancer-promoter pairs from the training bins are used to train the random forest model. The remaining 20% of the bins are selected as testing bins, and the samples of enhancer-promoter pairs from the testing bins are used to test the model. Using this genomic bin-split cross-validation method, the dependency between training and testing samples is broken and the model performance can be rigorously quantified. The performance of our model, ProTECT, is compared with the two most recent supervised methods that also leverage TF information: IM-PET 127 and TargetFinder 128. In addition to activity-based features and genomic distances, IM-PET and TargetFinder also include the TF binding features in enhancers and promoters, while TargetFinder further incorporates TF binding information in the genomic windows between enhancers and promoters. By comparing with these two algorithms, we can further demonstrate that the improved accuracy is obtained purely from the unique features of our model, i.e. the PPIs between TFs. The stand-alone package of IM-PET (https://github.com/tanlabcode/IM-PET) is applied to the same dataset.
Since IM-PET automatically makes predictions for all enhancer-gene pairs with distances <2Mb, only the enhancer-gene pairs overlapping with the dataset are used for performance evaluation, leading to a fair comparison for IM-PET. The TargetFinder software (https://github.com/shwhalen/targetfinder) is also applied to the same training and testing dataset. The same set of TF ChIP-seq peaks is used to generate the window related features for TargetFinder. 5-fold cross-validation with the same genomic bin-split strategy is applied to remove the potential issues of inflated performance evaluations. In addition, to quantitatively demonstrate that the improved accuracy of ProTECT is indeed contributed by TF PPI features, we randomly permute the PPIs between TFs, with the degree of each TF in the PPI network unchanged. Furthermore, for every TF, the specific binding sites in enhancers and promoters are also maintained. Therefore, only the TF PPI features are shuffled across enhancer-promoter pairs. The same model training and evaluation procedure is then applied to the permuted dataset. The resulting AUC is then compared to that of the model trained on the original dataset. This comparison provides direct evidence on the contributions of TF PPIs to chromatin interaction regulation.

5.2.7 Genome-wide prediction of long-range enhancer-promoter interactions
The trained ProTECT algorithm is applied to all enhancer-promoter pairs with genomic distances <2Mb across the whole human genome to make genome-wide predictions of cell-type specific enhancer-promoter interactions (Figure 5.1.e). The features for each candidate enhancer-promoter pair are generated in the same way as described in previous sections. By applying the trained random forest classifier, every candidate enhancer-promoter pair is assigned a predicted interaction score. To derive unbiased estimates of the statistical significance for the scores, i.e.
p-values, a null distribution of the scores is generated by permuting the feature matrix across enhancer-promoter pairs. This permutation approach effectively maintains the overall abundances of different features in the shuffled dataset. Based on the null distribution, the p-value for each enhancer-promoter pair is then calculated. Unlike the phase of model training, where the genomic distances are controlled in order to learn specific TF PPI signatures, the phase of genome-wide predictions requires the incorporation of genomic distance information. As shown by chromatin contact maps, e.g. Hi-C datasets, enhancer-promoter pairs with shorter genomic separation distances have a higher probability to interact and the probabilities decay as the distances increase (Figure D.1.C). To statistically incorporate the genomic distances based on this prior knowledge, we use the pFDR algorithm 160 to transform p-values into distance-aware q-values. In pFDR, the distribution of distances between Hi-C linked enhancers and promoters is treated as prior probabilities of interactions for enhancer-promoter pairs. Based on Hi-C data, ProTECT divides the range of distances into consecutive 20kb bins, and the prior probability of interactions for each distance bin is calculated as: $\pi_i = 5\% \times (\text{number of significant Hi-C interactions in bin}_i)/(\text{number of significant Hi-C interactions in bin}_1)$, where $\pi_i$ is the prior probability for distance-bin $i$. The prior probability for bin 1 (i.e. the shortest distance bin) is set to be the default 0.05. The pFDR under rejection region $[0, \gamma]$ in distance-bin $i$ is then calculated as $pFDR(\gamma) = \pi_i \Pr(P \le \gamma \mid H = 0)/\Pr(P \le \gamma) = \pi_i \gamma/\Pr(P \le \gamma)$, where $P$ represents the p-value for each enhancer-promoter interaction. $P$ follows the uniform distribution under the null hypothesis, i.e. $H = 0$, so that $\Pr(P \le \gamma \mid H = 0) = \gamma$.
$\Pr(P \le \gamma)$ can be estimated by $\widehat{\Pr}(P \le \gamma) = \left( \sum_{j=1}^{N} \delta(P_j \le \gamma) \right)/N$, where $P_j$ is the p-value for the enhancer-promoter interaction $j$, $N$ represents the total number of p-values, and $\delta(x)$ equals 1 if $x$ is true and 0 otherwise. Therefore, the q-values can be calculated as $Q(P) = \inf_{\gamma > P} \left( \pi_i \gamma/\widehat{\Pr}(P \le \gamma) \right)$, which combines the information from both the distance-aware prior probabilities ($\pi_i$) and the p-values from the random forest model ($P$). Based on the q-value threshold of 0.05, the final genome-wide predictions of significant enhancer-promoter interactions are obtained.

5.2.8 Feature interpretation for mechanistic insights
Using the trained random forest model of ProTECT, we evaluate and rank the importance of features, i.e. the module-level PPI features in the model. The top-ranking module-level PPIs are considered as important features, which represent putative protein complexes that may regulate chromatin interactions. Furthermore, in order to obtain detailed mechanistic understandings of important PPIs between specific TFs, we decode the module-level PPI feature importance into TF-level PPI feature importance. For each prioritized module-level PPI feature, we decompose it into individual TF-TF PPI features, i.e. specific PPIs between an individual enhancer-binding TF and an individual promoter-binding TF. Then the genome-wide predictions of enhancer-promoter interactions are scanned, and the fractions of predictions that contain the specific TF-level PPI features are calculated. The fractions scanned from genome-wide predictions are highly correlated with the fractions calculated from the cross-validation samples in model training, and are more robust given the larger pool of genome-wide enhancer-promoter pairs (see Results). Using the fractions, the top-ranking TF-level PPI features are thus identified for each important module-level PPI feature.
The prioritized features, both module-level and TF-level, shed light on new biological insights on long-range enhancer regulation.

5.2.9 Pathway enrichment analysis for genes regulated by specific TF PPIs
To investigate whether chromatin interactions mediated by different TF PPIs may participate in distinct biological pathways, we classify genes based on the specific TF PPI features involved in their interactions with enhancers. For each top-ranking module-level PPI feature, we first identify the top five TF-level PPI features using the method described above. Then, we scan the genome-wide predictions of enhancer-promoter interactions and collect the subset of interactions that contain at least one of the top five TF-level PPI features. Finally, the subset of interactions is ranked by q-value, and the top 1,000 genes regulated by these interactions are selected. In this way, the prioritized subset of genes represents strong targets of long-range enhancer regulation mediated by the important TF PPIs. Gene Ontology enrichment analyses are performed on different gene sets using DAVID 161 to check whether they are enriched with specific biological pathways.

5.2.10 cis-eQTL enrichment analysis for predicted long-range enhancer-promoter interactions
As orthogonal evidence to validate the accuracy of genome-wide predictions made by ProTECT, cis-eQTL datasets from the matched human tissues and cell-types are compared with the predicted enhancer-promoter interactions. Because our genome-wide predictions are made in human GM12878 and K562 cells, we selected four eQTL datasets 58-60, 162 which were profiled from either whole blood tissues or lymphoblastoid cells. A predicted enhancer-promoter interaction is considered to be supported by a cis-eQTL (i.e. a significantly associated SNP-gene pair), if the enhancer contains the SNP and the promoter matches with the gene.
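The support criterion above can be illustrated with a short sketch. The record types `Interaction` and `EQTL`, the function names, and the half-open coordinate convention (start inclusive, end exclusive) are assumptions introduced for illustration; the promoter-gene match is represented here simply by comparing gene identifiers.

```python
from collections import defaultdict
from typing import NamedTuple


class Interaction(NamedTuple):
    chrom: str
    enh_start: int  # enhancer start, inclusive (assumed convention)
    enh_end: int    # enhancer end, exclusive (assumed convention)
    gene: str       # gene whose promoter is linked to the enhancer


class EQTL(NamedTuple):
    chrom: str
    snp_pos: int
    gene: str


def supported_fraction(interactions, eqtls):
    """Fraction of predicted interactions supported by at least one cis-eQTL."""
    if not interactions:
        return 0.0
    # Index SNP positions by (chromosome, gene) so each lookup stays small.
    snps = defaultdict(list)
    for q in eqtls:
        snps[(q.chrom, q.gene)].append(q.snp_pos)
    n_supported = 0
    for it in interactions:
        # Supported if any SNP linked to the same gene lies inside the enhancer.
        if any(it.enh_start <= pos < it.enh_end for pos in snps[(it.chrom, it.gene)]):
            n_supported += 1
    return n_supported / len(interactions)


# Made-up records for illustration only.
toy_interactions = [
    Interaction("chr1", 100, 200, "G1"),  # SNP inside enhancer, same gene
    Interaction("chr1", 500, 600, "G2"),  # SNP inside enhancer, different gene
    Interaction("chr2", 100, 200, "G1"),  # no SNP on this chromosome
]
toy_eqtls = [EQTL("chr1", 150, "G1"), EQTL("chr1", 550, "G3")]
toy_fraction = supported_fraction(toy_interactions, toy_eqtls)
```

The same function can be applied unchanged to randomly paired enhancer-promoter sets to obtain the overlap fractions of negative controls.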
For each eQTL dataset, the fraction of predicted enhancer-promoter interactions that are supported by cis-eQTLs is calculated and compared to two versions of negative controls. The first version of negative control is based on randomly pairing enhancers with promoters that are within a 2Mb distance. The second version of negative control further requires that the genomic distances of the random enhancer-promoter pairs follow the same distribution as our predicted enhancer-promoter interactions. Therefore, the second version is a more stringent control. For each version, 1,000 random samples are generated. The statistical significance, i.e. the p-value, of the observed overlapping fraction from our predictions is calculated as the proportion of random samples showing a higher overlapping fraction than the observed one. In addition to cis-eQTLs, we also use cis-hQTLs, i.e. histone QTLs, to evaluate the accuracy of our predictions. The hQTL dataset was also profiled from human GM12878 cells 62. Similarly, a predicted enhancer-promoter interaction is considered to be supported by a cis-hQTL (i.e. a significantly associated SNP-histone pair) if the enhancer contains the SNP and the promoter overlaps with the histone modification peak. The overlapping fraction is also compared with the two versions of negative controls to justify the enrichment of cis-hQTLs in support of our predictions.

5.2.11 cis-eQTL enrichment around TF binding sites

For cis-eQTLs that overlap with predicted enhancer-promoter interactions, the genomic locations of the cis-eQTL SNPs are further compared with TF binding sites within enhancers. Here, the TF binding sites are defined as the ChIP-seq peak summits. For each enhancer included in this analysis, the TFs involved in the important PPI features prioritized in the previous steps are selected. The genomic distances between the SNPs and the binding sites of these TFs are calculated.
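Two small helpers illustrate the calculations described in Sections 5.2.10 and 5.2.11 above: the empirical p-value obtained from the random controls, and the distance from a SNP to the nearest peak summit. This is a minimal sketch; the function names are hypothetical, and the summits are assumed to be on the same chromosome as the SNP and sorted by position.

```python
from bisect import bisect_left


def empirical_p_value(observed, random_values):
    """Proportion of random samples whose value exceeds the observed one."""
    exceed = sum(1 for r in random_values if r > observed)
    return exceed / len(random_values)


def nearest_summit_distance(snp_pos, sorted_summits):
    """Absolute distance from a SNP to the closest ChIP-seq peak summit.

    sorted_summits must be a non-empty, ascending list of summit positions
    on the same chromosome as the SNP (an assumption of this sketch).
    """
    i = bisect_left(sorted_summits, snp_pos)
    # Only the summits immediately left and right of the SNP can be nearest.
    candidates = sorted_summits[max(0, i - 1): i + 1]
    return min(abs(snp_pos - s) for s in candidates)
```

With 1,000 random samples, the smallest non-zero value `empirical_p_value` can return is 0.001; a return value of 0 means that no random sample exceeded the observation.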
To statistically test whether the SNPs are closer to the binding sites of these important PPI-related TFs, two versions of random controls are generated. The first version is generated by randomly sampling binding sites of any TFs within the same set of enhancers. The second version is generated by randomly sampling binding sites of TFs that are members of bottom-ranking PPI features, based on the feature importance calculations from the previous sections. For each version of negative controls, p-values are calculated using Kolmogorov-Smirnov tests by comparing the cumulative distributions of distances.

5.2.12 trans-eQTL enrichment analysis for enhancer-mediated TF-gene pairs

Compared to cis-eQTLs, trans-eQTLs can provide additional evidence to support the functional associations between the prioritized TFs and specific genes, where the TF's PPIs are predicted to mediate enhancer-promoter interactions of the target genes. For enhancer-binding TFs that are members of the important PPI features, we first collect the predicted enhancer-promoter interactions mediated by the corresponding PPI features. Genes regulated by these predicted interactions are thus considered the downstream target genes of the specific enhancer-binding TFs. We define this relationship as enhancer-mediated TF-gene pairs. To exclude the possibility of promoter-mediated effects, we remove the genes whose promoters are also bound by the specific TF. Using the trans-eQTLs from the published database 163, we identify a subset of trans-eQTLs whose SNPs are located within the TFs' gene bodies (extended 10kb upstream of the TSS) and whose target genes are covered in our input dataset. For this specific subset of trans-eQTLs, the SNPs are likely to disrupt the transcription of the TF genes, which in turn affects the TFs' regulation of the downstream target genes' expression (Supplementary Methods).
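The selection of this trans-eQTL subset can be sketched as below. The sketch rests on one reading of the criterion above, namely that the gene body is extended by 10kb on the upstream side of the TSS in a strand-aware manner; that reading, the half-open coordinates, and the names `extended_gene_interval` and `select_trans_eqtls` are assumptions made for illustration.

```python
def extended_gene_interval(start, end, strand, upstream=10_000):
    """Gene body extended upstream of the TSS (strand-aware assumption)."""
    if strand == "+":
        # TSS is at `start`; upstream means smaller coordinates.
        return max(0, start - upstream), end
    # Minus strand: TSS is at `end`; upstream means larger coordinates.
    return start, end + upstream


def select_trans_eqtls(trans_eqtls, tf_genes, covered_genes, upstream=10_000):
    """Return (tf, target_gene) pairs for trans-eQTLs whose SNP falls in a
    TF gene (plus upstream extension) and whose target gene is covered.

    trans_eqtls: list of (snp_chrom, snp_pos, target_gene).
    tf_genes: dict mapping TF name -> (chrom, start, end, strand).
    covered_genes: set of genes present in the input dataset.
    """
    selected = []
    for snp_chrom, snp_pos, target in trans_eqtls:
        if target not in covered_genes:
            continue
        for tf, (chrom, start, end, strand) in tf_genes.items():
            lo, hi = extended_gene_interval(start, end, strand, upstream)
            if chrom == snp_chrom and lo <= snp_pos < hi:
                selected.append((tf, target))
    return selected
```

The returned TF-gene pairs are in the same form as the enhancer-mediated TF-gene pairs, so the two sets can be intersected directly.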
A hypergeometric test is used to assess whether the enhancer-mediated TF-gene pairs significantly overlap with the subset of trans-eQTLs described above. A TF-gene pair is considered to overlap with a trans-eQTL if the SNP is located within the TF's gene body and the gene is the same as the trans-eQTL's target gene. As comparisons, two versions of controls are generated based on the same set of TFs and enhancers. The first version uses the nearest genes to the enhancers as target genes, instead of using ProTECT's predictions. The second version randomly selects genes within 2Mb distances as target genes. In each version, every sample contains the same number of enhancer-promoter interactions as the foreground, and 1,000 random samples are created in total, each with its own hypergeometric p-value.

5.3 RESULTS

5.3.1 Long-range enhancer-promoter interaction prediction based on PPIs among TFs

Recent experimental studies 93-95, 97-102, 140, 141 have found that protein-protein interactions between specific transcription factors participate in the regulation of long-range chromatin loops, where the TFs bind to enhancers and promoters respectively (Figure 5.1.a). The PPIs between the enhancer-binding TFs and promoter-binding TFs facilitate the 3D proximity of enhancers and the target genes' promoters. By analyzing the Hi-C interactions between enhancers and promoters in human GM12878 cells, a specific set of TF-TF pairs is found to be enriched in enhancer-promoter interactions (Figure 5.1.b), compared to their frequencies in distance-controlled random enhancer-promoter pairs. Interestingly, these TF-TF pairs are also enriched with known PPIs (Figure 5.1.c, p-value=10^-3), suggesting that the TFs within each pair can establish interactions at the protein level. Figure 5.1.d shows two examples, where both enhancer-promoter Hi-C interactions contain enhancer-binding CTCF peaks and promoter-binding RUNX3 peaks.
The physical interaction between RUNX3 and CTCF is supported by the PPI database STRING 148, suggesting the RUNX3-CTCF interaction as a putative mechanism linking the enhancers with specific promoters. These observed enrichments strongly indicate the functional importance of TF PPIs in long-range chromatin loops and the possibility of predicting cell-type specific enhancer-promoter interactions using TF PPI features.

Due to the large number of TF PPI features, i.e. PPIs between enhancer-binding TFs and promoter-binding TFs, basic predictive models suffer substantially from overfitting problems, as shown in Figure D.3. Therefore, to efficiently leverage the information of TF PPIs from the high-dimensional feature space and overcome the overfitting risks, we developed a new machine learning classifier, ProTECT, to predict cell-type specific long-range enhancer-promoter interactions (Figure 5.1.e). Detailed algorithmic designs have been described in Materials and Methods. Overall, there are four main steps to achieve the final predictions: 1) Generation of the balanced Hi-C based training dataset, along with cell-type specific TF PPI features; 2) Dimension reduction of features based on hierarchical network community detection; 3) Predictive model construction using random forest; and 4) Genome-wide predictions of cell-type specific enhancer-promoter interactions.

As a new predictive model, here we highlight a series of key novelties of ProTECT (see Materials and Methods for details). First, a rigorous method of controlling confounding factors, such as TADs, genomic separation distances and gene expression levels, is designed in the steps of data and feature generation. This method efficiently removes the impacts of confounding factors, which are fundamentally important to control as discussed by recent benchmark analyses 132, 133.
Second, the graph-based dimension reduction approach not only addresses the potential risk of overfitting but also facilitates the prioritization of functionally important TF PPIs and TF complexes. Third, a generalized degree of freedom (GDF) technique 159 is incorporated to improve feature selection, leading to new biological understanding of specific TFs. Fourth, a stringent genomic bin-split cross-validation strategy is developed for unbiased and robust performance evaluation. This stringent strategy thoroughly breaks the dependency between the training and testing datasets and avoids the inflated performance estimates that have been commonly found in existing methods 132, 133. Fifth, a genomic distance-aware pFDR procedure 160 is implemented to identify statistically significant enhancer-promoter interactions along the whole human genome.

We trained ProTECT using the high-resolution Hi-C datasets from the human GM12878 and K562 cell-lines separately 10. The balanced and confounding factor-controlled training dataset contains 5,348 long-range enhancer-promoter interactions in GM12878 and 8,650 interactions in K562 cells. The trained classifiers were further applied to make genome-wide cell-type specific predictions of enhancer-promoter interactions. As shown in subsequent sections, the ProTECT algorithm not only improves the prediction accuracy substantially, but also reveals novel mechanistic insights into the functional roles of TF PPIs in the regulation of long-range chromatin loops. The prioritized TFs and their specific PPIs provide a new platform to understand the complex interplay among TFs, enhancers and genes, and open a new avenue to systematically interpret both cis- and trans-eQTLs in human genetics analyses.

Figure 5.2 Performance comparison in GM12878 and K562. ProTECT, TargetFinder, and IM-PET are applied on the same input datasets and are evaluated based on the averaged performance of 5-fold genomic-bin split cross-validation. As a baseline comparison, a random forest model using only enhancer-gene activity correlations is also included in the analysis. (a-b) ROC curves in GM12878 (A) and K562 (B). (c-d) The enrichment of Hi-C interactions in top-ranking predictions. Cumulative odds ratios of true positives (y-axis), i.e. overlapping Hi-C interactions, are calculated across the ranked lists of predictions where predictions with stronger scores are ranked at the top (x-axis), in GM12878 (C) and K562 (D). (e-f) Examples of enhancer-promoter interactions predicted by ProTECT (pink paired lines) in GM12878 (E) and K562 (F). In each example, the highlighted enhancer (orange) is predicted to interact with the highlighted promoter (red) by ProTECT. Both predictions are supported by cell-type specific Hi-C interactions (black paired lines). The prioritized TF PPIs mediating the interactions are CTCF-RUNX3 (E) and CTCF-ELF1 (F) respectively, both of which are top-ranking PPI features from the random forest model.

5.3.2 Boosted performance based on features of TF PPIs

Using the genomic bin-split cross-validation strategy (see Materials and Methods), we rigorously tested the accuracy of ProTECT and compared it with two other supervised methods, i.e. IM-PET 127 and TargetFinder 128. In both GM12878 and K562 cell-lines, ProTECT achieves the highest performance (Figure 5.2.a and 5.2.b): AUC=0.82 in GM12878 and AUC=0.78 in K562 cells. The accuracy of ProTECT is robust with respect to the number of trees used in the random forest models (Figure D.7). In comparison, TargetFinder ranks second with AUC values below 0.74, while the AUC of IM-PET is around 0.6. As a baseline comparison, a random forest model using only activity correlations between enhancers and genes, without using TF PPI features, shows AUC values around 0.57.
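For readers less familiar with the AUC values reported above, the following minimal sketch computes the area under the ROC curve through its rank interpretation (the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, counting ties as one half) and averages it over cross-validation folds. The function names are hypothetical, and the sketch does not reproduce the genomic bin-split fold construction itself.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def mean_cv_auc(folds):
    """Average AUC over folds; each fold is a (scores, labels) pair
    computed on that fold's held-out test set."""
    aucs = [roc_auc(scores, labels) for scores, labels in folds]
    return sum(aucs) / len(aucs)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for illustration; a rank-based formulation would be preferable for large test sets.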
Because we systematically controlled confounding factors in the training dataset, the AUC estimates are not dominated or biased by those factors, especially the genomic separation distances. Therefore, these comparisons strongly support that the ProTECT model substantially boosts the prediction accuracy over existing algorithms.

In addition to the overall AUC metrics, to demonstrate that ProTECT is better able to pinpoint true enhancer-promoter interactions in top-ranking predictions, we calculated the cumulative Odds Ratio (OR) of true positives along the ranked list of predictions. As shown in Figure 5.2.c and 5.2.d, ProTECT achieves much higher OR curves than other algorithms, especially in the zone of top-ranking predictions. Because top-ranking predictions are the main de novo discoveries used for experimental studies in practice, this observation further demonstrates the superior precision of ProTECT.

Moreover, we evaluated the robustness of ProTECT's superior performance with respect to different settings of input features and data. As shown in Figure D.8, with different confidence score cutoffs on the PPIs to be included as input features (i.e. 100, 200 and 300), ProTECT robustly achieves the highest accuracy (AUC>0.78) compared to other methods. In addition, using different epigenetic signals to represent cell-type specific enhancer activity levels, such as DNase-seq, H3K27ac and H3K4me1, ProTECT demonstrates highly similar accuracy, with the DNase-seq and H3K27ac based versions slightly better than the H3K4me1 based version (Figure D.8). Furthermore, we also tested the performance on an imbalanced dataset, where the ratio of positive-to-negative samples is 0.1, as suggested by previous studies 127, 128. ProTECT consistently shows the best ROC and Precision-Recall curves (Figure D.9).
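The cumulative odds ratio mentioned above can be illustrated with the following sketch. The exact definition is not restated in this section, so the code adopts one plausible reading as an explicit assumption: at each rank cutoff k, the odds of true positives among the top k predictions are divided by the odds among the remaining predictions, with a 0.5 pseudocount added to every cell (the Haldane-Anscombe correction) to avoid division by zero. The function name and the pseudocount are choices made for this sketch only.

```python
def cumulative_odds_ratios(ranked_labels, pseudocount=0.5):
    """Odds ratio of true positives in the top k versus the rest, for
    every cutoff k = 1 .. n-1.

    ranked_labels: list of 0/1 labels, ordered from the strongest to the
                   weakest prediction score (1 = overlaps a Hi-C interaction).
    """
    n = len(ranked_labels)
    total_pos = sum(ranked_labels)
    ratios = []
    tp = 0
    for k in range(1, n):
        tp += ranked_labels[k - 1]      # true positives in the top k
        fp = k - tp                     # false positives in the top k
        fn = total_pos - tp             # positives below the cutoff
        tn = (n - k) - fn               # negatives below the cutoff
        a, b, c, d = (x + pseudocount for x in (tp, fp, fn, tn))
        ratios.append((a * d) / (b * c))
    return ratios
```

Under this definition a curve that stays well above 1 near the top of the ranking indicates that true interactions are concentrated among the strongest predictions.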
To obtain orthogonal evidence on ProTECT's accuracy, we also used a diverse panel of Hi-ChIP 108, 164, 165 and ChIA-PET 50 datasets from the matched cell-types as gold-standards for enhancer-promoter interactions. ProTECT maintains the highest accuracy across all comparisons based on different gold-standard datasets (Figure D.10 and D.11). Across the five Hi-ChIP evaluations, ProTECT achieves AUC>0.78, while TargetFinder and IM-PET only show AUC<0.66. Using ChIA-PET datasets as gold-standards, ProTECT achieves AUC>0.84 while other methods demonstrate AUC<0.76. These tests systematically support the robustness of ProTECT's performance advantages.

Figure 5.2.e shows one example predicted by ProTECT in human GM12878 cells. The distal enhancer is located 99.4kb from the predicted target gene's promoter, and this long-range prediction is supported by a cell-type specific Hi-C interaction 10. Based on the trained random forest model, this enhancer-promoter interaction is mediated by the PPI between the enhancer-binding CTCF and the promoter-binding RUNX3 (Figure 5.2.e). Interestingly, the correlation between the enhancer's activity and the target gene's expression across different cell-types is only 0.28, which strongly suggests the importance of incorporating TF PPI features in predicting enhancer-promoter interactions. A similar example from K562 is shown in Figure 5.2.f, where the distal enhancer is located 46kb from the predicted target gene's promoter, and is also supported by a cell-type specific Hi-C interaction (Figure 5.2.f). This enhancer-promoter interaction, which only shows an activity correlation of 0.261, is successfully predicted based on the PPI between enhancer-binding CTCF and promoter-binding ELF1. Overall, these results demonstrate that TF PPI features can improve the delineation of specific interacting enhancer-promoter pairs from neighboring non-interacting pairs, beyond the information of activity-related features. In addition, specific hypotheses of the mechanisms mediating chromatin interactions, i.e. the functional TF PPIs linking enhancers and promoters, are derived from the model simultaneously.

Figure 5.3 TF PPI features provide additional information beyond TF bindings and activity-based features. (a) Schematic figure of the permutation test on TF PPI features. The shuffled PPIs are generated by randomly pairing two interacting TFs from the original pool of TF PPIs, while the degrees of PPI partners and TF binding sites in enhancers and promoters are maintained. Based on the shuffled PPI features, a new random forest model is trained and then evaluated by the same cross-validation procedure. (b) ROC plots for the models based on the original TF PPI features (red), the models based on the shuffled TF PPI features (salmon), and the baseline models based on activity-correlation features alone (blue), in GM12878 and K562 cells.

To further justify that the superior performance of ProTECT is indeed due to the information from TF PPI features, we randomly shuffled the TF-TF connections in the PPI network (Figure 5.3.a). In this way, the specific TF binding sites in enhancers and promoters are strictly maintained (see Materials and Methods), while the PPI features across enhancer-promoter pairs are randomized. This shuffling strategy also controls the degree of PPI partners for each TF, i.e. the number of protein neighbors in the PPI network. By training the ProTECT model on the shuffled data, we found that the accuracy is substantially reduced. The AUC based on PPI-shuffled data is only 0.68, while the original AUC of ProTECT is 0.82 in human GM12878 cells (Figure 5.3.b). A similar decrease in performance is also observed in human K562 cells (Figure 5.3.b).
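A degree-preserving shuffle of this kind is commonly implemented with double-edge swaps, and the sketch below shows that standard technique. It is offered as an illustration under assumptions, not as the exact shuffling code used here: the PPI network is treated as an undirected simple graph given as a list of TF pairs, and the function name `degree_preserving_shuffle` is hypothetical.

```python
import random


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Randomize an undirected PPI network while keeping every TF's degree.

    edges: list of (tf_a, tf_b) pairs without self-loops or duplicates.
    n_swaps: number of successful double-edge swaps to attempt.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    if len(edge_list) < 2:
        return edge_list
    edge_set = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # guard against networks with no valid swap
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        # Proposed rewiring: (a, b), (c, d) -> (a, d), (c, b)
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set or new1 == new2:
            continue  # would create a duplicate edge
        edge_set.discard(frozenset((a, b)))
        edge_set.discard(frozenset((c, d)))
        edge_set.update((new1, new2))
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list
```

Each swap removes one edge at each of the four endpoints and adds one back, so every TF keeps its number of PPI partners, while the TF binding sites in enhancers and promoters are left untouched because only the network is rewired.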
The striking differences in prediction accuracy suggest that the performance improvement of ProTECT is mainly driven by TF PPI features, instead of TF binding information, consistent with previous biological studies of the functional roles of PPIs in chromatin loop regulation 146. To evaluate the model's dependence on the cell-type specificity of TF bindings, we swapped the TF ChIP-seq data across GM12878 and K562, and ran ProTECT based on the swapped data. As expected, the prediction accuracy decreased in both cell-types (Figure D.12.A and D.12.B), suggesting the necessity of using TF datasets from the matched cell-types. Interestingly, ProTECT still maintains the highest prediction accuracy when other algorithms are also trained on the swapped TF data. In addition, to test the model's dependence on the number of TFs included as features, we obtained the intersection subset of TFs whose ChIP-seq data are available in both GM12878 and K562, and trained ProTECT based on features derived from this subset. The cell-type specific predictions in GM12878 and K562 demonstrate similar accuracy (AUC=0.74 and 0.70, Figure D.12.C), suggesting that additional TFs are needed in each cell-type beyond the intersection subset.

Figure 5.4 Genome-wide prediction of enhancer-promoter interactions reveals functional roles of TF PPIs in gene regulation. (a) Summary of genome-wide predictions in GM12878 and K562. The Venn diagram shows the overlap between predicted enhancer-promoter interactions in GM12878 (yellow) and K562 (salmon). (b-c) Feature importance (y-axis) of top 10 module-level TF PPI features based on the random forest models in GM12878 (B) and K562 (C). Each module-level PPI feature is named by the most abundant TF-level PPIs between the modules as axis-labels (x-axis). (d) Schematic figure of ranking specific TF-level PPIs in each PPI module.
For each module-level PPI feature, all TF-level PPIs linking two TFs from the pair of two modules (the pair of modules can be the same to represent intra-module TF-level PPIs) are ranked by their occurrences in the predicted long-range enhancer-promoter interactions (abundance scores). (e-f) Examples of top 5 TF-level PPIs for three representative module-level features in GM12878 (E) and K562 (F). (g) Examples of predicted enhancer-promoter interactions regulated by RELB-YY1 in the ISCU locus. Predicted enhancer-promoter interactions for the ISCU gene are shown as the pink paired lines. In total, 11 enhancers are predicted to interact with the promoter of ISCU, and 5 predictions are supported by Hi-C (purple paired lines) or Capture Hi-C (grey paired lines). ChIP-seq signal tracks of RELB and YY1 (brown signal peaks) are consistent with predictions. (h) Schematic figure of ranking enhancer-promoter interactions regulated by specific TF PPIs. For each prioritized TF PPI feature, enhancer-promoter interactions are ranked based on the q-values inferred by ProTECT. Top 1,000 genes are then selected by following the ranked list of interactions for pathway enrichment analysis. (i) Pathway enrichments of genes regulated by five different TF PPIs in GM12878. The top 10 most enriched pathways for each TF PPI feature are shown. The heatmap is colored based on the -log10(p-value) of pathway enrichments.

5.3.3 Genome-wide prediction of long-range enhancer-promoter interactions

The trained random forest model is then applied to the genome-wide dataset in GM12878 and K562 cell-lines separately to predict novel enhancer-promoter interactions (Figure D.13.A-D). All enhancer-promoter pairs within 2Mb distance windows are included in the genome-wide predictions (see Materials and Methods), as suggested by observations from experimental Hi-C datasets 10.
For each enhancer-promoter pair, a p-value from the permutation test is generated, which is further used to derive a q-value based on the pFDR approach 160 (see Materials and Methods). Using the q-value threshold of 0.05, a total of 60,016 significant enhancer-promoter interactions are predicted in GM12878, and 80,591 significant enhancer-promoter interactions are predicted in K562 (Figure 5.4.a). The median genomic separation distance between linked enhancers and promoters is 243kb in GM12878 (Figure D.13.E), consistent with the enhancers' function of long-range regulation. In the predicted GM12878 enhancer-promoter network, >37% of enhancers regulate multiple genes (Figure D.13.F). The accuracy of these multi-gene enhancer links is consistent with the overall performance (Figure D.14), and 24% of them are supported by experimental chromatin interactions. On average, every gene is regulated by 6.9 enhancers (Figure D.13.G), suggesting that combinations of multiple enhancers are recruited for precise transcriptional regulation. Similar patterns are also observed in the predicted K562 enhancer-promoter network (Figure D.13.H-J). Furthermore, the predicted enhancer-promoter interactions are highly cell-type specific. By comparing the predictions in GM12878 and K562, only 5,815 (~4.2%) enhancer-promoter interactions are shared by the two cell-types (Figure 5.4.a). Compared to the recent activity-by-contact (ABC) model 166, our genome-wide predictions demonstrate higher accuracy, as quantified by both ROC and Precision-Recall curves, using Hi-ChIP data as gold-standards (Figure D.15).

5.3.4 Important protein-protein interactions regulating chromatin interactions

To gain insights into the underlying mechanisms linking distal enhancers to target genes' promoters, we analyzed the feature importance of module-level PPI features inferred by the random forest model and further prioritized the representative TF-level PPI features.
We first identified the top-ranking module-level PPI features, which represent the protein complexes of interacting TFs involved in chromatin loops (Figure 5.4.b and 5.4.c). For example, in GM12878 cells, module(CTCF)-module(POLR2A) is the third-ranked feature (here the module-level features are named by the most abundant TF-level PPIs linking the modules). Interestingly, this is consistent with a recent experimental study 167, which also found that the enhancer-binding CTCF interacts with the promoter-binding Pol II and participates in the regulation of long-range chromatin loops. As another interesting example, the module-level PPI feature module(IKZF1)-module(RB1) is one of the top-ranking features in K562, consistent with their critical functions in leukemia cells and their impacts on chromatin structure 168, 169. Additional examples of the prioritized module-level TF PPIs are visualized as PPI networks in Figure D.16, showing the complex PPI connectivity between TF modules binding to enhancers and promoters.

To characterize the key PPI features between individual TFs, instead of TF modules, we further decode the module-level PPI features into ranked TF-level PPI features (Figure 5.4.d), based on their occurrences across genome-wide predictions of enhancer-promoter interactions (see Materials and Methods). Genome-wide predictions are used to calculate the abundance scores for TF-level PPIs because they provide a large pool of enhancer-promoter links, and the abundance scores are found to be highly correlated with the observations from cross-validation samples (Figure D.17, Spearman correlation=0.95). For each module-level feature, the top 5 most abundant PPI features between specific enhancer-binding and promoter-binding TFs are identified. For example (Figure 5.4.e), RELB-YY1 is predicted to be a key TF-level PPI feature in long-range enhancer regulation.
In support of this new discovery, RELB has recently been found to promote gene expression by interacting with YY1 170. As another example, SMC3-HDAC1 is one of the top-ranking features in K562 (Figure 5.4.f), consistent with the reported regulatory roles of HDAC1 on chromatin structure by interacting with SMC3 171. The discoveries of these key TFs and their PPIs as candidate functional factors in chromatin loop formation may lead to new biological hypotheses of enhancer regulation for in-depth experimental investigations.

As a demonstration of the potential importance of TF PPIs in linking distal enhancers to promoters, Figure 5.4.g shows the predicted long-range enhancer-promoter interactions for the gene ISCU. In total, 11 enhancers are predicted by ProTECT to interact with ISCU's promoter, and 5 of them are supported by experimental data of chromatin interactions based on Hi-C or Capture Hi-C (Figure 5.4.g), indicating the high accuracy of the predictive model. The inferred top-ranking feature is the PPI between enhancer-binding RELB and promoter-binding YY1. Consistent with this prediction, YY1 has a strong ChIP-seq binding site at the promoter of ISCU, and almost all linked enhancers have ChIP-seq signals of RELB binding. Importantly, 4 out of the 5 validated enhancers show the strongest RELB ChIP-seq binding signals (Figure 5.4.g), indicating the shared mechanism of these enhancer-promoter interactions for the gene ISCU. In this region, the longest interaction predicted by ProTECT is from a distal enhancer located >547kb from ISCU's promoter. Although not captured by chromatin contact map experiments, this specific enhancer contains a sharp ChIP-seq peak of RELB binding (Figure 5.4.g), suggesting this novel prediction as a strong candidate enhancer-promoter interaction. It also implies the capability of ProTECT to discover long-range enhancer regulation that might be missed by experimental approaches.
To investigate whether the orientations of PPI features between enhancer-binding and promoter-binding TFs have impacts on chromatin interactions, we designed a systematic model selection strategy to test whether a pair of two TF PPI features with opposite directions can be merged into one un-directional PPI feature without reducing the predictive accuracy (see Materials and Methods). Using this approach, 32 directional PPI features in GM12878 are merged into 16 un-directional features, suggesting there is no statistical preference of binding sites (i.e. enhancers vs. promoters) between interacting TFs involved in these PPIs. For example, the features ATF2-SMARCA5 and SMARCA5-ATF2 are merged into an un-directional feature by the model, consistent with the observation that the two directional PPI features have similar abundance in enhancer-promoter interactions (Figure D.18.A). A similar example involves the merge of the IKZF1-CREM and CREM-IKZF1 features (Figure D.18.A). Apart from these un-directional PPI features, 37 features remain directional in GM12878. For example, there is a significant preference for the SMC3-MXI1 feature over the MXI1-SMC3 feature (fold-enrichment=7.80, Figure D.18.B). This is an interesting observation considering the function of SMC3 (a subunit of cohesin 172) in chromatin structural maintenance, and the reported regulatory function of MXI1 binding in promoter regions 173. Another example corresponds to the preference for EP300-POLR2A over POLR2A-EP300 (fold-enrichment=9.19, Figure D.18.B), consistent with the well-known enhancer binding activities of EP300 174 and the transcriptional initiation function of POLR2A 175. Similarly, 184 directional PPI features in K562 are merged into 92 un-directional features, while 47 PPI features remain directional.
5.3.5 Genes regulated by different TF PPIs are enriched in distinct pathways

To evaluate the downstream impacts of chromatin interactions mediated by different TF PPIs, we focused on the top 5 module-level PPI features (Figure 5.4.b and 5.4.c). We identified the strongest enhancer-promoter interactions mediated by each feature separately based on the ranked q-values of predictions (see Materials and Methods). Genes that are regulated by the top-ranking enhancer-promoter interactions are then collected for pathway enrichment analysis (Figure 5.4.h). Overall, these prioritized genes are enriched with immune-related or B-cell-related pathways (Figure D.19.A-B), which is expected since the predictions are inferred from GM12878 and K562 cell-lines. Strikingly, for each specific PPI feature, the gene sets are strongly enriched with distinct groups of pathways (Figure D.19.A-B). Figure 5.4.i shows the most enriched pathways for each TF PPI feature discovered in the GM12878 cell-line. Clearly, the enhancer-promoter interactions mediated by different TF PPIs are enriched with diverse biological processes. For example, the CTCF-YY1 feature is found to be associated with long-range regulation of genes in the B cell receptor signaling pathway, while the SMC3-POLR2A feature is associated with genes of the innate immune response pathway (Figure 5.4.i). To exclude the potential bias caused by gene background, we carried out pathway enrichment analysis based on two additional gene backgrounds: 1) genes with the same set of promoter-binding TFs; and 2) genes with the same set of enhancer-binding TFs (Figure D.19.C-D). Based on these two rigorous gene backgrounds, the majority (>67%) of enriched pathways are still discovered. These differentially enriched pathways further highlight the functional roles of TF PPIs in regulating gene expression and maintaining the specific cellular states.
Figure 5.5 Predicted enhancer-promoter interactions are enriched with cis-QTLs and trans-eQTLs. (a) cis-eQTLs and cis-hQTLs from multiple datasets (x-axis) are significantly enriched in predicted enhancer-promoter interactions in GM12878 (red). The fractions of enhancer-promoter interactions overlapping with cis-QTLs (y-axis) are compared with other methods and two versions of controls: (1) random enhancer-promoter pairs (brown) and (2) distance-controlled random enhancer-promoter pairs (blue). 1,000 samples are generated for both versions to calculate p-values (***: p-value<1.04x10^-4). Error bars represent sd. (b) Schematic figure of cis-eQTL SNPs located in the binding sites of functionally important TFs (blue) of chromatin interactions, compared to general enhancer-binding TFs (grey), as a mechanistic hypothesis of cis-regulatory effects on target gene expression. (c) Distributions of relative distances between cis-eQTL SNPs and binding sites of different enhancer-binding TFs. Relative distances (x-axis) are genomic distances between SNPs and TF ChIP-seq peak summits normalized by the sizes of TF peaks. Binding sites of top-ranking TFs inferred by ProTECT (red) significantly overlap with cis-eQTL SNPs, compared with bottom-ranking TFs (grey, p-value=3.02x10^-4) and random enhancer-binding TFs (blue, p-value=4.17x10^-18). (d) Example of a cis-eQTL, i.e. the rs2488088-ADK pair, overlapping with a predicted enhancer-promoter interaction (pink paired lines). The predicted interaction is supported by Hi-C (black paired lines). The prioritized PPI feature is RUNX3-SMAD, consistent with the ChIP-seq signal tracks (brown signals). Zoom-in view of the distal enhancer (orange) shows the cis-eQTL SNP rs2488088 is located at the peak summit of RUNX3 binding site.
(e) Schematic figure of trans-eQTL SNPs located in specific TF genes, whose binding to enhancers is predicted to mediate long-range enhancer-promoter interactions of trans-eQTL target genes. (f) Hypergeometric test on the overlaps between trans-eQTLs (i.e. trans-SNP-gene pairs) and enhancer-mediated TF-gene pairs, if the SNP is located in the TF's gene body and the trans-eQTL's target gene is the same as the TF's target gene (red, p-value=0.014). The -log10(p-value) (y-axis) from the hypergeometric test is compared to two versions of controls: 1) nearest genes to the enhancers (brown); and 2) random target genes (blue). Each control is generated 1,000 times and the error bars show the sd. The black dashed line corresponds to -log10(0.05). (g) Venn diagram comparing genes affected by weakened Hi-C interactions in PAX5 KO pro-B cells and genes regulated by PAX5 in ProTECT predictions (Hypergeometric test, p-value=5.64x10^-165). (h) Example of a trans-eQTL, i.e. rs10973104-NOL6 pair, supported by the predicted enhancer-mediated PAX5-NOL6 pair. The predicted enhancer-promoter interaction for NOL6 (black paired lines) is based on the prioritized TF PPI feature PAX5-CTCF. ChIP-seq signals (brown signal tracks) show a strong CTCF peak in the NOL6 promoter (red) and strong PAX5 peaks in the linked enhancer (orange). The trans-eQTL SNP rs10973104 is located in the gene body of PAX5, which is 3.6Mb away from this locus.

5.3.6 Predicted enhancer-promoter interactions are enriched with cis-eQTLs

Because the predictive model is trained on Hi-C datasets, we use cis-eQTLs as orthogonal evidence to quantitatively evaluate the accuracy of the genome-wide predictions of enhancer-promoter interactions. By comparing the predictions with the SNP-gene pairs of significant eQTLs, we calculated the overlapping enrichment scores (see Materials and Methods). Using four eQTL datasets generated from matched cell-types or tissues (e.g.
whole blood tissues or lymphoblastoid cell-lines) 58-60, 162, the predicted enhancer-promoter interactions in the GM12878 cell-line show significantly higher fractions overlapping with eQTLs, compared to stringent distance-controlled random interactions and other algorithms (p-value<1.04x10^-4, Figure 5.5.a). A similar, but relatively weaker, enrichment with eQTLs is found for predictions in the K562 cell-line (Figure D.20.A). In addition to cis-eQTLs, we compared our predictions in GM12878 with histone-QTLs from the same cell-line 62 and also observed strong enrichment (p-value=3.27x10^-5) compared to distance-controlled random samples and other algorithms (Figure 5.5.a). These observations not only support the high accuracy of genome-wide predictions but also suggest the putative mechanisms of cis-eQTLs mediated by chromatin interactions between regulatory elements and target genes.

5.3.7 cis-eQTLs are enriched in binding sites of prioritized TFs

The TF PPI features prioritized by the ProTECT model provide a new metric for delineating functionally important TFs for enhancer regulation against general enhancer-binding TFs, a task that is complicated by the large array of TFs binding to enhancers. A typical enhancer contains 10 different TF binding sites on average, based on the counts of TF ChIP-seq peaks in GM12878 from the ENCODE project 50. However, binding itself is not sufficient to assign functional importance for TFs. As found by previous studies, TFs binding in enhancer regions are not equally important for the function of enhancers, with many enhancer-binding TFs lacking evidence of regulatory impacts on gene expression 176. This ambiguity hinders the understanding of enhancer activation and downstream effects. We hypothesized that the TFs involved with top prioritized PPI features are more likely to be functional for enhancers.
We tested this hypothesis by checking the enrichment of cis-eQTL SNPs within the binding sites of the prioritized TFs in enhancers (Figure 5.5.b, see Materials and Methods). The cis-eQTLs are called in whole blood tissues from the GTEx project 162. Interestingly, the SNPs of cis-eQTLs are located significantly closer to the binding sites of prioritized TFs in GM12878 (p-value=4.17x10^-18, Kolmogorov-Smirnov test), compared to the binding sites of other adjacent enhancer-binding TFs (Figure 5.5.c). To control for the potential bias caused by data availability, we also generated a more stringent background only using TFs included in the model but inferred with low feature importance (see Materials and Methods). Compared with this new background, the prioritized TFs are still significantly enriched with cis-eQTL SNPs (p-value=3.02x10^-4, Kolmogorov-Smirnov test, Figure 5.5.c). In the K562 cell-line, cis-eQTL SNPs are also closer to the binding sites of the prioritized TFs, but the difference is not statistically significant (Figure D.20.B). Overall, this analysis supports the stronger regulatory effects of prioritized TFs whose PPIs may mediate long-range enhancer-promoter interactions. Additionally, the prioritized TF binding sites provide a new layer of information to pinpoint regulatory SNPs at a higher resolution, by dissecting the ambiguity of numerous TF bindings within enhancers. As a representative example, a distal enhancer located >589kb away is predicted by ProTECT to interact with the promoter of the ADK gene in GM12878 (Figure 5.5.d), which is supported by experimental Hi-C data 10. This long-range interaction is also supported by a significant eQTL, i.e. rs2488088-ADK (p-value=3.29x10^-19) 162. The prioritized TF PPI feature for this interaction is RUNX3-SMAD, where RUNX3 binds to the enhancer and SMAD binds to the promoter.
By zooming into the enhancer element, which is 1.2kb long and contains binding sites of 5 different TFs, the SNP rs2488088 is found to be precisely located at the ChIP-seq peak summit of RUNX3 (Figure 5.5.d), consistent with our prioritization of RUNX3 as the important TF for this enhancer. This observation also suggests a mechanistic interpretation of this non-coding SNP, whose disruptive effect on RUNX3 binding may cause the loss of the RUNX3-SMAD mediated long-range interaction to ADK.

5.3.8 trans-eQTLs are enriched in enhancer-mediated TF-gene pairs

As one of the advantages of the ProTECT algorithm, both cis-regulatory elements (i.e. enhancers) and trans-regulatory factors (i.e. TFs) are jointly modeled in long-range chromatin interactions. In traditional studies of trans-regulation of gene expression, analyses have been mainly limited to promoter-binding TFs as candidate trans-regulatory factors 177, 178. Based on the functional impacts of the predicted important TF PPI features (Figure 5.4.b-i) and the observed enrichment of cis-eQTL SNPs in prioritized enhancer-binding TFs (Figure 5.5.b-d), we hypothesized that there is an enhancer-mediated pathway of trans-regulation, i.e. the enhancer-binding TFs associated with top-ranking PPI features for long-range chromatin interactions are trans-regulatory factors for the expression of distal target genes (Figure 5.5.e). To quantitatively validate this hypothesis, we compared the enhancer-mediated TF-gene pairs with significant trans-eQTLs 163, and the significance of the overlaps is tested using hypergeometric tests (see Materials and Methods).
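The overlap test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure from Materials and Methods: the universe of possible pairs is taken to be every TF-gene combination, which is an assumption, and the function names, the second gene body mapping, and the identifiers prefixed with `hyp_` are hypothetical. Only the rs10973104, PAX5, and NOL6 relationship is taken from the text.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """Upper tail P(X >= k) of a hypergeometric distribution."""
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, min(n, N) + 1)) / total


def trans_eqtl_overlap(trans_eqtls, snp_to_tf, tf_gene_pairs, tfs, genes):
    """Test whether trans-eQTLs overlap enhancer-mediated TF-gene pairs.

    trans_eqtls   : iterable of (snp, target_gene)
    snp_to_tf     : dict mapping a SNP to the TF whose gene body contains it
    tf_gene_pairs : iterable of predicted (tf, target_gene)
    tfs, genes    : define the assumed universe of len(tfs) * len(genes) pairs
    """
    tf_set, gene_set = set(tfs), set(genes)
    # Convert each trans-eQTL to a (TF, gene) pair when its SNP lies in a TF gene body.
    eqtl_pairs = {
        (snp_to_tf[snp], gene)
        for snp, gene in trans_eqtls
        if snp in snp_to_tf and snp_to_tf[snp] in tf_set and gene in gene_set
    }
    predicted = {(tf, g) for tf, g in tf_gene_pairs if tf in tf_set and g in gene_set}
    k = len(eqtl_pairs & predicted)
    universe = len(tf_set) * len(gene_set)
    return k, hypergeom_sf(k, universe, len(predicted), len(eqtl_pairs))


# Toy inputs; everything except rs10973104 / PAX5 / NOL6 is hypothetical.
tfs = ["PAX5", "RUNX3"]
genes = ["NOL6", "ADK", "hyp_G3", "hyp_G4", "hyp_G5"]
predicted_pairs = [("PAX5", "NOL6"), ("RUNX3", "ADK")]
trans_eqtls = [("rs10973104", "NOL6"), ("hyp_rs1", "hyp_G3"), ("hyp_rs2", "hyp_G4")]
snp_to_tf = {"rs10973104": "PAX5", "hyp_rs1": "RUNX3"}  # hyp_rs2 is not in any TF gene

k, p = trans_eqtl_overlap(trans_eqtls, snp_to_tf, predicted_pairs, tfs, genes)
print(k, p)
```

The choice of universe strongly affects the resulting p-value, so any real analysis should define it to match the set of pairs that could actually have been tested.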
Interestingly, the enhancer-mediated TF-gene pairs are found to be strongly supported by trans-eQTLs (p-value=0.014, Figure 5.5.f, Figure D.20.C), suggesting that the SNPs of trans-eQTLs are associated with the target genes' expression via the disruption of the TF gene's activity (Figure 5.5.e), although the SNPs may be located far away from the target genes or even on different chromosomes. The observed statistical significance is also stronger than two versions of controls, excluding the potential confounding effects of biased enhancer activity and genomic distances (Figure 5.5.f, see Materials and Methods). To obtain additional experimental evidence on the predicted enhancer-mediated TF-gene regulation, we leveraged a differential Hi-C interaction dataset in mouse pro-B cells where 7,810 weakened Hi-C interactions were identified following PAX5 knock-out 179. The top-ranking PAX5-related PPI feature predicted by ProTECT is PAX5-CTCF, consistent with their collaborative roles in B cells 180, 181. Based on our genome-wide predictions in GM12878, we identified the subset of PAX5-CTCF mediated enhancer-promoter interactions (see Materials and Methods), and thus collected the enhancer-mediated target genes of PAX5. To purify the subsequent analysis, genes whose promoters are also bound by PAX5 are removed from the list. If PAX5 is a true trans-regulatory factor for these genes, the genes are expected to be targeted by the weakened long-range interactions following PAX5 knock-out. By mapping the genes to their homologs in the mouse genome 182, 6,744 enhancer-mediated target genes of PAX5 are conserved. Strikingly, these genes are found to significantly overlap with the genes of weakened Hi-C interactions in PAX5-/- pro-B cells 179 (hypergeometric p-value=5.64x10^-165, Figure 5.5.g). To control for the potentially biased enhancer activity and TF bindings, we generated two versions of controls.
The first version randomly selects genes as enhancer-mediated target genes of PAX5. The second version randomly chooses target genes of other TFs. 1,000 random samples are generated for each version and the same number of genes is selected for each sample. Both versions of negative controls show decreased overlap with genes of weakened Hi-C interactions in PAX5-/- pro-B cells (p-value=10^-3), supporting the predicted trans-regulatory links between PAX5 and target genes by ProTECT. Figure 5.5.h shows one representative example of a PAX5-CTCF mediated long-range enhancer-promoter interaction (~600kb), where the enhancer contains multiple PAX5 binding sites and the promoter of the target gene, i.e. NOL6, contains a strong CTCF binding site. Interestingly, NOL6 is linked with weakened Hi-C interactions in PAX5-/- pro-B cells. These strong experimental validations, along with the enrichment of trans-eQTLs, suggest the biological validity of the predicted enhancer-mediated TF-gene pairs, and provide a new regulatory mechanism to discover and interpret trans-regulatory genetic variants.

5.4 DISCUSSION

In this study, we have developed a novel supervised algorithm, ProTECT (https://github.com/wangjr03/PPI-based_prediction_enh_gene_links), to predict long-range enhancer-promoter interactions. By incorporating new features of protein-protein interactions among transcription factors, the algorithm achieves superior performance compared to other methods, based on a rigorously designed genomic bin-split cross-validation procedure. Considering the overfitting risk of high-dimensional inter-dependent TF PPI features, a novel network-community based dimension reduction strategy is used to hierarchically organize TF PPIs into module-level features.
This approach efficiently improves the generalizability of the predictive model to make robust predictions based on complex TF PPI patterns, while maintaining the detailed ranking of TF-level PPI features for specific mechanistic understandings of long-range enhancer regulation. With the impacts of confounding factors strictly controlled, the relative contributions of different features are systematically evaluated, which shows that TF PPIs contain substantial additional information beyond activity-based features of enhancers and genes. The genome-wide implementation of ProTECT in GM12878 and K562 cell-lines generated 60,016 and 80,591 new predictions of significant enhancer-promoter interactions, respectively, which will be useful resources of cell-type specific enhancer regulation for biologists. In addition, a set of prioritized TF PPIs, at both the module level and the TF level, are identified as the key PPIs mediating long-range chromatin loops. Different TF PPIs are found to mediate enhancer regulation for genes in distinct biological pathways, implying specific functional roles of complex TF cooperation. The TF members participating in these prioritized PPI features can be used as candidate targets for knock-out to investigate the changes of specific enhancer-promoter interactions, which will expand the insights on the underlying mechanisms of chromatin loop formation and long-range gene regulation. To gain orthogonal evidence of the validity of genome-wide predictions, cis- and trans-eQTLs are compared with the predicted enhancer-promoter interactions in three ways, each of which supports one aspect of the interplay among TFs, enhancers and genes. First, the enrichment of overlaps between cis-eQTLs and enhancer-promoter interactions suggests the accuracy of predicted long-range cis-regulation by distal enhancers.
Second, the enrichment of cis-eQTL SNPs located within the binding sites of prioritized TFs underscores the precise delineation of functionally important TFs for enhancer activities against other general enhancer-binding TFs. Third, the enrichment of overlaps between trans-eQTLs and enhancer-mediated TF-gene pairs highlights the novel identification of trans-regulatory pathways from upstream TFs to downstream genes via distal enhancers. The promising enrichment analyses further indicate that the predictions from ProTECT can be used as a platform to interpret cis- and trans-eQTLs, i.e. to characterize the non-coding SNP's disruptive effects propagated through long-range enhancer regulation on gene expression. Therefore, combined with eQTL datasets, the ProTECT model can also be a useful tool to generate testable hypotheses in statistical genetics studies. To control the model complexity, only direct PPIs between TFs are included as features, while indirect PPIs between TFs may also participate in the regulation of chromatin loops. For example, an enhancer-binding TF and a promoter-binding TF may not be able to interact with each other but they both can interact with a third protein. The incorporation of module-level TF PPI features helps to capture the potential indirect PPIs to some degree, but does not explicitly address this problem. Due to the large number of indirect PPI features and the limited number of labeled samples for model training, more advanced designs of feature selection will be needed to achieve a balance between predictive accuracy and model generalizability. As a major novelty of the ProTECT model, the efficient inclusion of TF PPIs as features not only improves the predictions but also reveals mechanistic insights on long-range enhancer regulation. In the meantime, the algorithm requires the availability of large panels of TF ChIP-seq data for the specific cell-types under study, which may be a practical challenge for users.
As one of the directions to extend the ProTECT model, it is possible to leverage the combined information of chromatin accessibility data, e.g. DNase-seq or ATAC-seq data, and TF binding motif annotation datasets as approximations for cell-type specific TF bindings. Several recent studies have demonstrated the reasonable accuracy of this approximation 50, 86. Furthermore, multiple imputation algorithms have been recently developed for ENCODE cell-types or tissues to impute cell-type specific TF binding ChIP-seq signals 183, 184. The imputed TF binding signals can be used as alternative inputs for the model to make cell-type specific predictions of enhancer-promoter interactions, for cell-types lacking ChIP-seq datasets. As an evaluation of this possibility, we generated the imputed TF bindings by overlapping TF motifs with cell-type specific DNase-seq peaks, and then derived TF PPI features based on the imputed data. Remarkably, applied on the imputation-based input features, ProTECT is able to achieve high accuracy (Figure D.21). This evaluation strongly supports the wide applicability of ProTECT on diverse cell-types even if TF ChIP-seq data is not directly available.

CHAPTER 6 DISCUSSION

Characterizing the high-order chromatin conformation and complex interplays between TFs, enhancers, and genes in the 3D space plays an important role in understanding the complex gene regulations. In this dissertation, we showed two directions to fully delineate the interaction landscape based on the multi-omics datasets. We reconstructed the 3D chromosome structures from the chromatin contact maps and single-cell chromatin conformation capture datasets, which provide the structural basis of the long-range chromatin interactions. We also developed two computational algorithms to predict the long-range enhancer-gene regulations based on the TF bindings.
Notably, we predicted the multi-enhancer regulations with high accuracy, which expanded the analyses of gene regulations from one enhancer to the cooperation of multiple enhancers. This chapter summarizes the results, biological innovations, and future directions of our work.

6.1 SUMMARY

We first reconstructed the 3D chromosome structures from the chromatin contact maps based on low-rank matrix completion. Our developed algorithm, FLAMINGO, demonstrated high accuracy and scalability in reconstructing high-resolution 3D structures from sparse chromatin contact maps. Using FLAMINGO, we successfully predicted the 3D structures of all 23 chromosomes in 5kb and 1kb resolution, which, to our knowledge, is the highest resolution to date. Based on the extensive evaluation of the simulated data and orthogonal biological evidence, FLAMINGO demonstrated superior performance over existing algorithms. The 3D chromosome structures predicted by FLAMINGO enable new interpretations of the long-range QTLs and multi-way interactions, where chromatin loops bring anchors into proximal 3D neighborhoods and facilitate long-range functional interactions. An integrative variant of FLAMINGO, iFLAMINGO, is further developed to facilitate the cross-cell-type prediction of the 3D structures and refine the resolution. The development of FLAMINGO provides a powerful tool to delineate the interaction landscape in high resolution. We further developed tFLAMINGO to predict the single-cell 3D chromosome structures. To mitigate the high missing rate of single-cell datasets, tFLAMINGO utilized a low-rank tensor completion method. Compared with existing algorithms, tFLAMINGO demonstrated superior performance in reconstructing single-cell 3D structures and imputing the chromatin contact maps. Given the complete single-cell 3D chromosome structures, we showed that the 3D chromosome structures are robust at low resolution but highly dynamic in terms of single-cell chromatin interactions.
For example, TADs are overall robust but can shift, merge, and vanish across single cells. We showed that the genomic loci with critical biological functions, e.g. open chromatin and active transcription, tend to be densely organized in the 3D space and less dynamic across single cells. The delineation of the single-cell 3D chromosome structures also provides a new approach to interpreting the somatic mutations and GWAS SNPs, and to predicting the dynamic multi-way chromatin interactions. To computationally predict the long-range enhancer-gene interactions, we developed an unsupervised learning method, ComMUTE, to integrate the gene regulatory grammar and enhancer-binding TF profiles. The unsupervised framework of ComMUTE largely expands its usability in cell types without experimental chromatin interactions and avoids the overfitting risks. Compared with existing algorithms, ComMUTE simultaneously links multiple enhancers with synergistic regulatory functions to the same target gene, which captures the multi-enhancer regulations. Through extensive benchmarking against existing algorithms, ComMUTE demonstrated consistently improved performance in predicting enhancer-gene links and multi-enhancer regulations. The decoded high-order regulatory landscape sheds light on understanding the eQTLs and GWAS SNPs. Strikingly, the multi-enhancer regulations predicted by ComMUTE can help predict the epistatic eQTLs, whose discovery is important for gene regulations but highly challenging due to the unrealistically large search space. We proposed that the SNPs within the co-regulating enhancers should have a higher probability of being epistatic eQTLs, which significantly reduces the number of tests. In addition to ComMUTE, we also developed a supervised learning method, ProTECT, to predict the PPI-mediated enhancer-gene links.
In addition to standard features used by other methods, we included a new set of features: the PPIs between enhancer-binding TFs and promoter-binding TFs. Based on the permutation test, we showed that the new features can boost the accuracy in predicting the enhancer-gene links. We also developed a graph-based dimension reduction method and feature selection approach to avoid overfitting risks. Besides predicting enhancer-gene links, ProTECT also prioritized the TF-TF interactions that are important for establishing long-range regulatory interactions. Such predictions can help interpret the trans-eQTLs, where the SNPs and target genes are far apart or even on different chromosomes. Based on the global evaluations and case studies, SNPs may disrupt the important TF regulators and block the enhancer-gene links, thus indirectly controlling the distal target gene expressions. In summary, the four algorithms characterize the high-resolution 3D chromosome structures in bulk tissue and single cells and depict more detailed enhancer-gene regulatory interactions across diverse cell types. The rich predictions and algorithmic advancements of these methods provide a solid foundation for future studies of the complex biological events in the 3D space.

6.2 FUTURE DIRECTION

Although we captured the dynamic chromosome structures across single cells, investigating the cell-type-specific structures in single cells is still challenging since the single-cell chromatin conformation capture datasets are few. Therefore, an important feature of the desired algorithm is building a connection between the cell-type-specific epigenomic signals and 3D spatial distances between genomic loci. We will continue the study of the underlying driving force in shaping the 3D chromosome structures.
Another important direction is utilizing the predictions of these methods to predict downstream biological events, for example, gene expressions, TF binding sites, and disease-associated genes. We will further integrate the 3D chromosome structures and enhancer-gene regulatory interactions into subsequent algorithms to improve the model performance and gain better mechanistic insights.

APPENDICES

APPENDIX A SUPPLEMENTARY FIGURES FOR CHAPTER 2

Figure A.1 5kb-resolution 3D structures for 23 chromosomes predicted by FLAMINGO.
Figure A.2 1kb-resolution 3D structures for 23 chromosomes predicted by FLAMINGO.
Figure A.3 Overview of the assembly algorithm of FLAMINGO.
Figure A.4 High similarity of predicted structures using different conversion factors.
Figure A.5 Convergence and model performance under different down-sampling rates based on simulated structures.
Figure A.6 Model performance under different number of loci and down sampling rates based on simulated structures.
Figure A.7 Validation of the assembly algorithm based on simulations.
Figure A.8 Performance validation using low-resolution Hi-C data and FISH data.
Figure A.9 Predicted 3D structures of chr1 by FLAMINGO in six cell-types at 5-kb resolution.
Figure A.10 The observed long-range chromatin interactions are supported by TF ChIP-seq and Capture-C interactions.
Figure A.11 Performance comparison in GM12878 based on off-diagonal distances.
Figure A.12 Performance comparison in the additional five cell-types.
Figure A.13 Example of 3D chromatin loops reconstructed by FLAMINGO.
Figure A.14 High scalability of FLAMINGO over existing algorithms.
Figure A.15 FLAMINGO leads to the discovery of multi-way chromatin interactions.
Figure A.16 FLAMINGO provides structural basis of long-range QTLs.
Figure A.17 Comparison between the predicted structures with single-cell chromosome structures.
Figure A.18 FLAMINGO robustly reconstructs the high-resolution 3D structures using a small fraction of observed Hi-C data.
Figure A.19 The imputation of 3D distances based on 1D epigenomics data in iFLAMINGO.
Figure A.20 Performance of cross cell-type predictions using iFLAMINGO.
Figure A.21 Convergence and parameter tuning of FLAMINGO.

APPENDIX B SUPPLEMENTARY FIGURES FOR CHAPTER 3

Figure B.1 3D structures of chromosome 19 in 10kb-resolution for 351 mESC cells predicted by tFLAMINGO based on snm3C data.
Figure B.2 3D structures of chromosome 19 in 10kb-resolution for 7 mESC cells predicted by tFLAMINGO based on scHi-C data.
Figure B.3 3D structures of chromosome 21 in 10kb-resolution for 16 K562 cells predicted by tFLAMINGO based on scHi-C data.
Figure B.4 Differential linear relationships between single-cell 3C datasets and bulk Hi-C datasets.
Figure B.5 Schema of the band wise log-regression method to rescale the single-cell interaction frequencies.
Figure B.6 3D Validation of the transformed single-cell interaction frequencies based on three additional datasets.
Figure B.7 Robust performance of tFLAMINGO under different settings based on simulations.
Figure B.8 Accurate reconstruction of a simulated structure with 3000 loci under the 0.5% down sampling rate.
Figure B.9 Systematic performance evaluation based on simulations.
Figure B.10 Convergence of tFLAMINGO.
Figure B.11 tFLAMINGO identifies underlying structural variations.
Figure B.12 Single-cell compartment and TAD analyses in GM12878.
Figure B.13 Justification of the optimal number of clusters.
Figure B.14 Pathway enrichments of differential methylated genes.
Figure B.15 Dynamic 3D structures across 15 single cells.
Figure B.16 Simulation-based methods fail to handle long-range interactions.
Figure B.17 Simulation analyses confirm the limitation of simulation-based models.

APPENDIX C SUPPLEMENTARY FIGURES FOR CHAPTER 4

Figure C.1 Predictive power of the features used in ComMUTE.
Figure C.2 Parameter selection based on the optimal AUROC.
Figure C.3 Summary statistics of the predicted enhancer-gene links.
Figure C.4 Summary of the input epigenomic datasets.
Figure C.5 Convergence of ComMUTE.
Figure C.6 Performance comparison with JEME based on the enrichment analyses.
Figure C.7 Performance comparison with existing methods based on the enrichment of experimental chromatin interactions.
Figure C.8 Cross-cell-type comparison with TargetFinder.
Figure C.9 Evaluating the accuracy of predicted enhancer-gene links based on different epigenomic datasets.
Figure C.10 Example of predicted multi-enhancer regulations.
Figure C.11 Example of predicted multi-enhancer regulations.
Figure C.12 Co-binding analysis based on TF motif occurrence.
Figure C.13 Example of direct chromatin interactions between co-regulating enhancers.
Figure C.14 ComMUTE discovers clear TF grammars for gene regulations.
Figure C.15 Convergence of ComMUTE.

APPENDIX D SUPPLEMENTARY FIGURES FOR CHAPTER 5

Figure D.1 Summary of training dataset generation and confounding factor controls.
Figure D.2 Predictive power of features is supported by the differential distributions of features.
Figure D.3 Advanced feature dimension reduction is needed due to the risk of overfitting.
Figure D.4 Hierarchical network-community detection based on the PPI network to construct module-level TF PPI features.
Figure D.5 PPI community detection based on the MCL.
Figure D.6 Enrichment analysis and PPI support analysis for TF module pairs.
Figure D.7 Model performance as a function of the number of decision trees.
Figure D.8 Performance of ProTECT using different epigenomic signals.
Figure D.9 Performance comparison based on the imbalanced training data and the genomic bin-split cross-validation.
Figure D.10 Performance comparison using five Hi-ChIP datasets.
Figure D.11 Performance comparison using four different ChIA-PET datasets.
Figure D.12 Performance comparison based on different combinations of Hi-C data and TF ChIP-seq data.
Figure D.13 Summary of genome-wide predictions by ProTECT in GM12878 and K562.
Figure D.14 Validation of ProTECT predicted enhancer-gene links with enhancer degree greater than one.
Figure D.15 Performance comparison with the ABC model in the whole genome.
Figure D.16 Comparing the TF PPI abundance score in the Hi-C supported enhancer-gene links and the ProTECT predictions.
Figure D.17 Examples of prioritized module-level TF PPI features.
Figure D.18 Identification of the directions of TF PPI features.
Figure D.19 Differential pathway enrichments of genes regulated by different module-level TF PPIs based on the ProTECT predictions.
Figure D.20 QTL enrichment analysis in K562.
Figure D.21 ProTECT predicts enhancer-gene links based on the imputed TF binding sites.

BIBLIOGRAPHY

1. Bickmore, W.A. The spatial organization of the human genome. Annu Rev Genomics Hum Genet 14, 67-84 (2013).
2. Cremer, T. & Cremer, C. Chromosome territories, nuclear architecture and gene regulation in mammalian cells. Nat Rev Genet 2, 292-301 (2001).
3.
Sexton, T., Schober, H., Fraser, P. & Gasser, S.M. Gene regulation through nuclear organization. Nat Struct Mol Biol 14, 1049-1055 (2007).
4. Liu, M. et al. Multiplexed imaging of nucleome architectures in single cells of mammalian tissue. Nat Commun 11, 2907 (2020).
5. Gorkin, D.U., Leung, D. & Ren, B. The 3D genome in transcriptional regulation and pluripotency. Cell Stem Cell 14, 762-775 (2014).
6. Zhou, X. et al. The Human Epigenome Browser at Washington University. Nat Methods 8, 989-990 (2011).
7. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289-293 (2009).
8. Zheng, Y. & Keles, S. FreeHi-C simulates high-fidelity Hi-C data for benchmarking and data augmentation. Nat Methods 17, 37-40 (2020).
9. Dixon, J.R. et al. Chromatin architecture reorganization during stem cell differentiation. Nature 518, 331-336 (2015).
10. Rao, S.S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665-1680 (2014).
11. Wang, Y. et al. The 3D Genome Browser: a web-based browser for visualizing 3D genome organization and long-range chromatin interactions. Genome Biol 19, 151 (2018).
12. Schmitt, A.D. et al. A Compendium of Chromatin Contact Maps Reveals Spatially Active Regions in the Human Genome. Cell Rep 17, 2042-2059 (2016).
13. Duan, Z. et al. A three-dimensional model of the yeast genome. Nature 465, 363-367 (2010).
14. Wang, S. et al. Spatial organization of chromatin domains and compartments in single chromosomes. Science 353, 598-602 (2016).
15. Fudenberg, G. et al. Formation of Chromosomal Domains by Loop Extrusion. Cell Rep 15, 2038-2049 (2016).
16. Sanborn, A.L. et al. Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proc Natl Acad Sci U S A 112, E6456-6465 (2015).
17. Oluwadare, O., Highsmith, M. & Cheng, J.
An Overview of Methods for Reconstructing 3-D Chromosome and Genome Structures from Hi-C Data. Biol Proced Online 21, 7 (2019). 18. Rieber, L. & Mahony, S. miniMDS: 3D structural inference from high-resolution Hi- C data. Bioinformatics 33, i261-i266 (2017). 19. Szalaj, P. et al. An integrated 3-Dimensional Genome Modeling Engine for data- driven simulation of spatial genome organization. Genome Res 26, 1697-1709 (2016). 20. Paulsen, J. et al. Chrom3D: three-dimensional genome modeling from Hi-C and nuclear lamin-genome contacts. Genome Biol 18, 21 (2017). 21. Mishra, B., Meyer, G. & Sepulchre, R. in 2011 50th IEEE Conference on Decision and Control and European Control Conference 4455-4460 (2011). 22. Zhang, Z., Li, G., Toh, K.C. & Sung, W.K. 3D chromosome modeling with semi- definite programming and Hi-C data. J Comput Biol 20, 831-846 (2013). 23. Adhikari, B., Trieu, T. & Cheng, J.L. Chromosome3D: reconstructing three- dimensional chromosomal structures from Hi-C interaction frequency data using distance geometry simulated annealing. Bmc Genomics 17 (2016). 24. Trieu, T. & Cheng, J.L. MOGEN: a tool for reconstructing 3D models of genomes from chromosomal conformation capturing data. Bioinformatics 32, 1286-1292 (2016). 25. Wang, S., Xu, J. & Zeng, J. Inferential modeling of 3D chromatin structure. Nucleic Acids Res 43, e54 (2015). 26. Peng, C. et al. The sequencing bias relaxed characteristics of Hi-C derived data and implications for chromatin 3D modeling. Nucleic Acids Res 41, e183 (2013). 27. Kapilevich, V., Seno, S., Matsuda, H. & Takenaka, Y. Chromatin 3D Reconstruction from Chromosomal Contacts Using a Genetic Algorithm. IEEE/ACM Trans Comput Biol Bioinform 16, 1620-1626 (2019). 28. Varoquaux, N., Ay, F., Noble, W.S. & Vert, J.P. A statistical approach for inferring the 3D structure of the genome. Bioinformatics 30, i26-33 (2014). 279 29. Hu, M. et al. Bayesian inference of spatial organizations of chromosomes. PLoS Comput Biol 9, e1002893 (2013). 30. 
Zou, C., Zhang, Y. & Ouyang, Z. HSA: integrating multi-track Hi-C data for genome-scale reconstruction of 3D chromatin structure. Genome Biol 17, 40 (2016). 31. Rousseau, M., Fraser, J., Ferraiuolo, M.A., Dostie, J. & Blanchette, M. Three- dimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling. Bmc Bioinformatics 12 (2011). 32. Carstens, S., Nilges, M. & Habeck, M. Inferential Structure Determination of Chromosomes from Single-Cell Hi-C Data. PLoS Comput Biol 12, e1005292 (2016). 33. Lesne, A., Riposo, J., Roger, P., Cournac, A. & Mozziconacci, J. 3D genome reconstruction from chromosomal contacts. Nat Methods 11, 1141-1143 (2014). 34. Abbas, A. et al. Integrating Hi-C and FISH data for modeling of the 3D organization of chromosomes. Nat Commun 10, 2049 (2019). 35. Trieu, T., Oluwadare, O. & Cheng, J. Hierarchical Reconstruction of High- Resolution 3D Models of Large Chromosomes. Sci Rep 9, 4971 (2019). 36. Hirata, Y., Oda, A., Ohta, K. & Aihara, K. Three-dimensional reconstruction of single-cell chromosome structure using recurrence plots. Sci Rep-Uk 6 (2016). 37. Zhang, Y.L., Liu, W.W., Lin, Y., Ng, Y.K. & Li, S.C. Large-scale 3D chromatin reconstruction from chromosomal contacts. Bmc Genomics 20 (2019). 38. Li, F.Z. et al. Chromatin 3D structure reconstruction with consideration of adjacency relationship among genomic loci. Bmc Bioinformatics 21 (2020). 39. DeVience, S.J. & Mayer, D. Speeding up dynamic spiral chemical shift imaging with incoherent sampling and low-rank matrix completion. Magn Reson Med 77, 951-960 (2017). 40. Shin, P.J. et al. Calibrationless parallel imaging reconstruction based on structured low-rank matrix completion. Magn Reson Med 72, 959-970 (2014). 41. Kim, J.H., Sim, J.Y. & Kim, C.S. Video deraining and desnowing using temporal correlation and low-rank matrix completion. IEEE Trans Image Process 24, 2658- 2670 (2015). 42. Gower, J.C. 
Properties of Euclidean and non-Euclidean distance matrices. Linear Algebra and its Applications 67, 81-97 (1985). 280 43. Fullwood, M.J. et al. An oestrogen-receptor-alpha-bound human chromatin interactome. Nature 462, 58-64 (2009). 44. Hughes, J.R. et al. Analysis of hundreds of cis-regulatory landscapes at high resolution in a single, high-throughput experiment. Nat Genet 46, 205-212 (2014). 45. Jung, I. et al. A compendium of promoter-centered long-range chromatin interactions in the human genome. Nat Genet 51, 1442-1449 (2019). 46. Quinodoz, S.A. et al. Higher-Order Inter-chromosomal Hubs Shape 3D Genome Organization in the Nucleus. Cell 174, 744-757 e724 (2018). 47. Baldi, S., Korber, P. & Becker, P.B. Beads on a string-nucleosome array arrangements and folding of the chromatin fiber. Nat Struct Mol Biol 27, 109-118 (2020). 48. Eckstein, J. & Bertsekas, D.P. On the Douglas—Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming 55, 293-318 (1992). 49. Tasissa, A. & Lai, R. Exact Reconstruction of Euclidean Distance Geometry Problem Using Low-Rank Matrix Completion. IEEE Transactions on Information Theory 65, 3124-3144 (2019). 50. Consortium, E.P. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012). 51. Dekker, J. et al. The 4D nucleome project. Nature 549, 219-226 (2017). 52. Zheng, H. & Xie, W. The role of 3D genome organization in development and cell differentiation. Nat Rev Mol Cell Biol 20, 535-550 (2019). 53. Zheng, M. et al. Multiplex chromatin interactions with single-molecule precision. Nature 566, 558-562 (2019). 54. Beagrie, R.A. et al. Complex multi-enhancer contacts captured by genome architecture mapping. Nature 543, 519-524 (2017). 55. Hsieh, T.S. et al. Resolving the 3D Landscape of Transcription-Linked Mammalian Chromatin Folding. Mol Cell 78, 539-553 e538 (2020). 56. Delaneau, O. et al. 
Chromatin three-dimensional interactions mediate genetic effects on gene expression. Science 364 (2019). 57. Koch, L. Adding another dimension to gene regulation. Nature Reviews Genetics 16, 563-563 (2015). 281 58. Battle, A. et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res 24, 14-24 (2014). 59. Grundberg, E. et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat Genet 44, 1084-1089 (2012). 60. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506-511 (2013). 61. Consortium, G.T. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648-660 (2015). 62. Grubert, F. et al. Genetic Control of Chromatin States in Humans Involves Local and Distal Chromosomal Interactions. Cell 162, 1051-1065 (2015). 63. Dekker, J. GC- and AT-rich chromatin domains differ in conformation and histone modification status and are differentially modulated by Rpd3p. Genome Biol 8, R116 (2007). 64. Jabbari, K., Chakraborty, M. & Wiehe, T. DNA sequence-dependent chromatin architecture and nuclear hubs formation. Sci Rep 9, 14646 (2019). 65. Sekelja, M., Paulsen, J. & Collas, P. 4D nucleomes in single cells: what can computational modeling reveal about spatial chromatin conformation? Genome Biol 17, 54 (2016). 66. Nagano, T. et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 502, 59-64 (2013). 67. Su, J.H., Zheng, P., Kinrot, S.S., Bintu, B. & Zhuang, X. Genome-Scale Imaging of the 3D Organization and Transcriptional Activity of Chromatin. Cell 182, 1641- 1659 e1626 (2020). 68. Bintu, B. et al. Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells. Science 362 (2018). 69. Tjong, H. et al. Population-based 3D genome structure analysis reveals driving forces in spatial genome organization. 
Proc Natl Acad Sci U S A 113, E1663-1672 (2016). 70. Dai, C. et al. Mining 3D genome structure populations identifies major factors governing the stability of regulatory communities. Nat Commun 7, 11549 (2016). 71. Yardimci, G.G. et al. Measuring the reproducibility and quality of Hi-C data. Genome Biol 20, 57 (2019). 72. Li, G. et al. Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell 148, 84-98 (2012). 282 73. Zhang, S., Chasman, D., Knaack, S. & Roy, S. In silico prediction of high-resolution Hi-C interaction matrices. Nat Commun 10, 5449 (2019). 74. Zhang, Y. et al. Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus. Nat Commun 9, 750 (2018). 75. Fudenberg, G., Kelley, D.R. & Pollard, K.S. Predicting 3D genome folding from DNA sequence with Akita. Nat Methods 17, 1111-1117 (2020). 76. Schwessinger, R. et al. DeepC: predicting 3D genome folding using megabase- scale transfer learning. Nat Methods 17, 1118-1124 (2020). 77. Giorgetti, L. et al. Predictive polymer modeling reveals coupled fluctuations in chromosome conformation and transcription. Cell 157, 950-963 (2014). 78. Qi, Y. & Zhang, B. Predicting three-dimensional genome organization with chromatin states. PLoS Comput Biol 15, e1007024 (2019). 79. Brackley, C.A. et al. Predicting the three-dimensional folding of cis-regulatory regions in mammalian genomes using bioinformatic data and polymer models. Genome Biol 17, 59 (2016). 80. Qi, Y. et al. Data-Driven Polymer Model for Mechanistic Exploration of Diploid Genome Organization. Biophys J 119, 1905-1916 (2020). 81. Meluzzi, D. & Arya, G. Computational approaches for inferring 3D conformations of chromatin from chromosome conformation capture data. Methods 181-182, 24- 34 (2020). 82. Lin, X., Qi, Y., Latham, A.P. & Zhang, B. Multiscale modeling of genome organization with maximum entropy optimization. J Chem Phys 155, 010901 (2021). 83. Moller, J. & de Pablo, J.J. 
Bottom-Up Meets Top-Down: The Crossroads of Multiscale Chromatin Modeling. Biophys J 118, 2057-2065 (2020). 84. Di Stefano, M., Paulsen, J., Jost, D. & Marti-Renom, M.A. 4D nucleome modeling. Curr Opin Genet Dev 67, 25-32 (2021). 85. Consortium, E.P. et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699-710 (2020). 86. Roadmap Epigenomics, C. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317-330 (2015). 87. Perry, M.W., Boettiger, A.N. & Levine, M. Multiple enhancers ensure precision of gap gene-expression patterns in the Drosophila embryo. Proc Natl Acad Sci U S A 108, 13570-13575 (2011). 283 88. Tsai, A., Alves, M.R. & Crocker, J. Multi-enhancer transcriptional hubs confer phenotypic robustness. Elife 8 (2019). 89. Choi, J. et al. Evidence for additive and synergistic action of mammalian enhancers during cell fate determination. Elife 10 (2021). 90. Nord, A.S. et al. Rapid and pervasive changes in genome-wide enhancer usage during mammalian development. Cell 155, 1521-1531 (2013). 91. Schoenfelder, S. & Fraser, P. Long-range enhancer-promoter contacts in gene expression control. Nat Rev Genet 20, 437-455 (2019). 92. Vicente, C.T. et al. Long-Range Modulation of PAG1 Expression by 8q21 Allergy Risk Variants. Am J Hum Genet 97, 329-336 (2015). 93. Martin, P. et al. Capture Hi-C reveals novel candidate genes and complex long- range interactions with related autoimmune risk loci. Nat Commun 6, 10069 (2015). 94. Deng, W. et al. Controlling long-range genomic interactions at a native locus by targeted tethering of a looping factor. Cell 149, 1233-1244 (2012). 95. Ragoczy, T., Bender, M.A., Telling, A., Byron, R. & Groudine, M. The locus control region is required for association of the murine beta-globin locus with engaged transcription factories during erythroid maturation. Genes Dev 20, 1447-1457 (2006). 96. Lettice, L.A. et al. 
A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Hum Mol Genet 12, 1725-1735 (2003). 97. Jeong, Y., El-Jaick, K., Roessler, E., Muenke, M. & Epstein, D.J. A functional screen for sonic hedgehog regulatory elements across a 1 Mb interval identifies long-range ventral forebrain enhancers. Development 133, 761-772 (2006). 98. Sagai, T. et al. A cluster of three long-range enhancers directs regional Shh expression in the epithelial linings. Development 136, 1665-1674 (2009). 99. Smemo, S. et al. Obesity-associated variants within FTO form long-range functional connections with IRX3. Nature 507, 371-375 (2014). 100. Dryden, N.H. et al. Unbiased analysis of potential targets of breast cancer susceptibility loci by Capture Hi-C. Genome Res 24, 1854-1868 (2014). 101. McGovern, A. et al. Capture Hi-C identifies a novel causal gene, IL20RA, in the pan-autoimmune genetic susceptibility region 6q23. Genome Biol 17, 212 (2016). 102. Jager, R. et al. Capture Hi-C identifies the chromatin interactome of colorectal cancer risk loci. Nat Commun 6, 6178 (2015). 284 103. Shlyueva, D., Stampfel, G. & Stark, A. Transcriptional enhancers: from properties to genome-wide predictions. Nat Rev Genet 15, 272-286 (2014). 104. Buecker, C. & Wysocka, J. Enhancers as information integration hubs in development: lessons from genomics. Trends Genet 28, 276-284 (2012). 105. Hoffman, M.M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods 9, 473-476 (2012). 106. Ernst, J. & Kellis, M. Chromatin-state discovery and genome annotation with ChromHMM. Nat Protoc 12, 2478-2492 (2017). 107. Pennacchio, L.A., Bickmore, W., Dean, A., Nobrega, M.A. & Bejerano, G. Enhancers: five essential questions. Nat Rev Genet 14, 288-295 (2013). 108. Mumbach, M.R. et al. Enhancer connectome in primary human cells identifies target genes of disease-associated DNA elements. 
Nat Genet 49, 1602-1612 (2017). 109. Gondor, A. & Ohlsson, R. Chromosome crosstalk in three dimensions. Nature 461, 212-217 (2009). 110. Kvon, E.Z. et al. Progressive Loss of Function in a Limb Enhancer during Snake Evolution. Cell 167, 633-642 e611 (2016). 111. Claussnitzer, M. et al. FTO Obesity Variant Circuitry and Adipocyte Browning in Humans. N Engl J Med 373, 895-907 (2015). 112. Dekker, J., Rippe, K., Dekker, M. & Kleckner, N. Capturing chromosome conformation. Science 295, 1306-1311 (2002). 113. Zhao, Z. et al. Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat Genet 38, 1341-1347 (2006). 114. Dostie, J. et al. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res 16, 1299-1309 (2006). 115. Mifsud, B. et al. Mapping long-range promoter contacts in human cells with high- resolution capture Hi-C. Nat Genet 47, 598-606 (2015). 116. Schoenfelder, S., Javierre, B.M., Furlan-Magaril, M., Wingett, S.W. & Fraser, P. Promoter Capture Hi-C: High-resolution, Genome-wide Profiling of Promoter Interactions. J Vis Exp (2018). 117. Fullwood, M.J. & Ruan, Y. ChIP-based methods for the identification of long-range chromatin interactions. J Cell Biochem 107, 30-39 (2009). 285 118. Li, X. et al. Long-read ChIA-PET for base-pair-resolution mapping of haplotype- specific chromatin interactions. Nat Protoc 12, 899-915 (2017). 119. Smith, E.M., Lajoie, B.R., Jain, G. & Dekker, J. Invariant TAD Boundaries Constrain Cell-Type-Specific Looping Interactions between Promoters and Distal Elements around the CFTR Locus. Am J Hum Genet 98, 185-201 (2016). 120. Li, G. et al. ChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing. Genome Biol 11, R22 (2010). 121. Meuleman, W. et al. Index and biological spectrum of human DNase I hypersensitive sites. 
Nature 584, 244-251 (2020). 122. Yen, A. & Kellis, M. Systematic chromatin state comparison of epigenomes associated with diverse properties including sex and tissue type. Nat Commun 6, 7973 (2015). 123. Roy, S. et al. A predictive modeling approach for cell line-specific long-range regulatory interactions. Nucleic Acids Res 43, 8694-8712 (2015). 124. Hait, T.A., Amar, D., Shamir, R. & Elkon, R. FOCS: a novel method for analyzing enhancer and gene activity patterns infers an extensive enhancer-promoter map. Genome Biol 19, 56 (2018). 125. Gao, T. & Qian, J. EAGLE: An algorithm that utilizes a small number of genomic features to predict tissue/cell type-specific enhancer-gene interactions. PLoS Comput Biol 15, e1007436 (2019). 126. Cao, Q. et al. Reconstruction of enhancer-target networks in 935 samples of human primary cells, tissues and cell lines. Nat Genet 49, 1428-1436 (2017). 127. He, B., Chen, C., Teng, L. & Tan, K. Global view of enhancer-promoter interactome in human cells. Proc Natl Acad Sci U S A 111, E2191-2199 (2014). 128. Whalen, S., Truty, R.M. & Pollard, K.S. Enhancer-promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat Genet 48, 488- 496 (2016). 129. Fishilevich, S. et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database (Oxford) 2017 (2017). 130. Thurman, R.E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75-82 (2012). 131. Corradin, O. et al. Combinatorial effects of multiple enhancer variants in linkage disequilibrium dictate levels of gene expression to confer susceptibility to common traits. Genome Res 24, 1-13 (2014). 286 132. Moore, J.E., Pratt, H.E., Purcaro, M.J. & Weng, Z. A curated benchmark of enhancer-gene interactions for evaluating enhancer-target gene prediction methods. Genome Biol 21, 17 (2020). 133. Cao, F. & Fullwood, M.J. Inflated performance measures in enhancer-promoter interaction-prediction methods. 
Nat Genet 51, 1196-1198 (2019). 134. Whitaker, J.W., Nguyen, T.T., Zhu, Y., Wildberg, A. & Wang, W. Computational schemes for the prediction and annotation of enhancers from epigenomic assays. Methods 72, 86-94 (2015). 135. Nolis, I.K. et al. Transcription factors mediate long-range enhancer-promoter interactions. Proc Natl Acad Sci U S A 106, 20222-20227 (2009). 136. Hnisz, D., Shrinivas, K., Young, R.A., Chakraborty, A.K. & Sharp, P.A. A Phase Separation Model for Transcriptional Control. Cell 169, 13-23 (2017). 137. Quevedo, M. et al. Mediator complex interaction partners organize the transcriptional network that defines neural stem cells. Nat Commun 10, 2669 (2019). 138. Maksimenko, O. & Georgiev, P. Mechanisms and proteins involved in long- distance interactions. Front Genet 5, 28 (2014). 139. Li, Y. et al. The structural basis for cohesin-CTCF-anchored loops. Nature 578, 472-476 (2020). 140. Beagan, J.A. et al. YY1 and CTCF orchestrate a 3D chromatin looping switch during early neural lineage commitment. Genome Res 27, 1139-1152 (2017). 141. Weintraub, A.S. et al. YY1 Is a Structural Regulator of Enhancer-Promoter Loops. Cell 171, 1573-1588 e1528 (2017). 142. Morgan, S.L. et al. Manipulation of nuclear architecture through CRISPR-mediated chromosomal looping. Nat Commun 8, 15993 (2017). 143. Zhang, K., Li, N., Ainsworth, R.I. & Wang, W. Systematic identification of protein combinations mediating chromatin looping. Nat Commun 7, 12249 (2016). 144. Wang, R. et al. Hierarchical cooperation of transcription factors from integration analysis of DNA sequences, ChIP-Seq and ChIA-PET data. BMC Genomics 20, 296 (2019). 145. Kato, M., Hata, N., Banerjee, N., Futcher, B. & Zhang, M.Q. Identifying combinatorial regulation of transcription factors and binding motifs. Genome Biol 5, R56 (2004). 287 146. Michaelis, C., Ciosk, R. & Nasmyth, K. Cohesins: chromosomal proteins that prevent premature separation of sister chromatids. Cell 91, 35-45 (1997). 147. 
Tan, K., Shlomi, T., Feizi, H., Ideker, T. & Sharan, R. Transcriptional regulation of protein complexes within and across species. Proc Natl Acad Sci U S A 104, 1283- 1288 (2007). 148. Szklarczyk, D. et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res 47, D607-D613 (2019). 149. Harrow, J. et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol 7 Suppl 1, S4 1-9 (2006). 150. Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol 9, R137 (2008). 151. Amoutzias, G.D., Robertson, D.L., Van de Peer, Y. & Oliver, S.G. Choose your partners: dimerization in eukaryotic transcription factors. Trends Biochem Sci 33, 220-229 (2008). 152. Dixon, J.R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376-380 (2012). 153. Akdemir, K.C. et al. Disruption of chromatin folding domains by somatic genomic rearrangements in human cancer. Nat Genet 52, 294-305 (2020). 154. Chesi, A. et al. Genome-scale Capture C promoter interactions implicate effector genes at GWAS loci for bone mineral density. Nat Commun 10, 1260 (2019). 155. Pugacheva, E.M. et al. CTCF mediates chromatin looping via N-terminal domain- dependent cohesin retention. Proc Natl Acad Sci U S A 117, 2020-2031 (2020). 156. Vishwanathan, S.V.N., Borgwardt, K.M., Risi Kondor, I. & Schraudolph, N.N. arXiv:0807.0093 (2008). 157. Pons, P. & Latapy, M. physics/0512106 (2005). 158. Newman, M.E. Modularity and community structure in networks. Proc Natl Acad Sci U S A 103, 8577-8582 (2006). 159. Hauenstein, S., Dormann, C.F. & Wood, S.N. arXiv:1603.02743 (2016). 160. Storey, J.D. A Direct Approach to False Discovery Rates. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 64, 479-498 (2002). 288 161. Huang da, W., Sherman, B.T. & Lempicki, R.A. 
Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4, 44-57 (2009). 162. Consortium, G.T. et al. Genetic effects on gene expression across human tissues. Nature 550, 204-213 (2017). 163. Gong, J. et al. PancanQTL: systematic identification of cis-eQTLs and trans-eQTLs in 33 cancer types. Nucleic Acids Res 46, D971-D976 (2018). 164. Mumbach, M.R. et al. HiChIRP reveals RNA-associated chromosome conformation. Nat Methods 16, 489-492 (2019). 165. Mumbach, M.R. et al. HiChIP: efficient and sensitive analysis of protein-directed genome architecture. Nat Methods 13, 919-922 (2016). 166. Fulco, C.P. et al. Activity-by-contact model of enhancer-promoter regulation from thousands of CRISPR perturbations. Nat Genet 51, 1664-1669 (2019). 167. Jiang, Y. et al. Genome-wide analyses of chromatin interactions after the loss of Pol I, Pol II, and Pol III. Genome Biol 21, 158 (2020). 168. Dyson, N.J. RB1: a prototype tumor suppressor and an enigma. Genes Dev 30, 1492-1502 (2016). 169. Marke, R., van Leeuwen, F.N. & Scheijen, B. The many faces of IKZF1 in B-cell precursor acute lymphoblastic leukemia. Haematologica 103, 565-574 (2018). 170. Sarvagalla, S., Kolapalli, S.P. & Vallabhapurapu, S. The Two Sides of YY1 in Cancer: A Friend and a Foe. Front Oncol 9, 1230 (2019). 171. Stengel, K.R. & Hiebert, S.W. Class I HDACs Affect DNA Replication, Repair, and Chromatin Structure: Implications for Cancer Therapy. Antioxid Redox Signal 23, 51-65 (2015). 172. Losada, A., Hirano, M. & Hirano, T. Identification of Xenopus SMC protein complexes required for sister chromatid cohesion. Genes Dev 12, 1986-1997 (1998). 173. Lee, T.C. & Ziff, E.B. Mxi1 is a repressor of the c-Myc promoter and reverses activation by USF. J Biol Chem 274, 595-606 (1999). 174. Visel, A. et al. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 457, 854-858 (2009). 175. Lynch, C.J. et al. 
The RNA Polymerase II Factor RPAP1 Is Critical for Mediator- Driven Transcription and Cell Identity. Cell Rep 22, 396-410 (2018). 289 176. Hu, Z., Killion, P.J. & Iyer, V.R. Genetic reconstruction of a functional transcriptional regulatory network. Nat Genet 39, 683-687 (2007). 177. Albert, F.W., Bloom, J.S., Siegel, J., Day, L. & Kruglyak, L. Genetics of trans- regulatory variation in gene expression. Elife 7 (2018). 178. Brynedal, B. et al. Large-Scale trans-eQTLs Affect Hundreds of Transcripts and Mediate Patterns of Transcriptional Co-regulation. Am J Hum Genet 100, 581-591 (2017). 179. Johanson, T.M. et al. Transcription-factor-mediated supervision of global genome architecture maintains B cell identity. Nat Immunol 19, 1257-1264 (2018). 180. Ebert, A. et al. The distal V(H) gene cluster of the Igh locus contains distinct regulatory elements with Pax5 transcription factor-dependent activity in pro-B cells. Immunity 34, 175-187 (2011). 181. Arvey, A. et al. An atlas of the Epstein-Barr virus transcriptome and epigenome reveals host-virus regulatory interactions. Cell Host Microbe 12, 233-245 (2012). 182. Bult, C.J. et al. Mouse Genome Database (MGD) 2019. Nucleic Acids Res 47, D801-D806 (2019). 183. Li, H., Quang, D. & Guan, Y. Anchor: trans-cell type prediction of transcription factor binding sites. Genome Res 29, 281-292 (2019). 184. Keilwagen, J., Posch, S. & Grau, J. Accurate prediction of cell type-specific transcription factor binding. Genome Biol 20, 9 (2019). 290