COMPUTATIONAL INTERROGATION OF CELL-TYPE SPECIFIC IMPACTS FOR NON-CODING GENETIC VARIANTS BASED ON 3D GENOME ORGANIZATION By Jiaxin Yang A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science, and Engineering – Doctor of Philosophy 2024 ABSTRACT Genetic variations play a crucial role in the development of severe human diseases, but deciphering their genetic basis remains challenging due to the complex multi-scale genome structure and intricate molecular mechanisms. Genome-wide Association Studies (GWAS) have identified genotype-phenotype associations, but face limitations such as low statistical power and lack of causality. With advancements in Next Generation Sequencing techniques, researchers have gained insights into the vast non-coding genome, discovering millions of functional DNA elements that regulate cell type-specific gene expression, including promoters and enhancers. These cis-regulatory elements form gene regulatory networks (GRNs) that provide pathways for understanding the effects of genetic variants on diseases. However, leveraging GRNs in non-coding genetic variant analysis poses challenges, such as the vast number of potential enhancer-gene links, the influence of multilevel variabilities on chromatin interactions and 3D structures, and the need for comprehensive data integration methods. This thesis aims to address these challenges by developing machine learning, deep learning, and optimization-based models to discover novel disease-associated genes, enhance eQTL fine-mapping predictions, and investigate the multi-level variabilities of multi-scale 3D chromatin organization. By leveraging regulatory networks of long-range chromatin interactions, incorporating 3D chromatin organizations, and modeling the 3D structures, this work contributes to deep understanding on cell type-specific non-coding genetic variations and advancing precision medicine and clinical care. Copyright by JIAXIN YANG 2024 This thesis is dedicated to my grandpa Thank you for the unconditional love across time iv ACKNOWLEDGEMENTS I would like to express my deepest gratitude to my advisor and committee chair, Dr. Jianrong Wang. His expert guidance was instrumental in helping me navigate the various stages of my projects, from setting specific goals to developing computational algorithms. I am profoundly grateful for his mentorship. My appreciation also extends to my committee members, Dr. Jianliang Qian, Dr. Yuehua Cui, and Dr. Rongrong Wang, for their invaluable feedback on my research and their guidance in developing my professional skills. I am thankful for the support and camaraderie of my lab colleagues, Dr. Hao Wang, Dr. Wenjie Qi, Sikta Das Adhikari, Dr. Binbin Huang, Dr. Jiwoong Kim, and Dr. Pronoy Kanti Mondal. The days we spent working together will always be cherished. I owe a debt of gratitude to my friends, Dr. Wei Jin, Lei Yang, Yiming Zhou, Yiqian Wang, and Dr. Zhongjie Ji, for their understanding, support, and companionship throughout my graduate studies. Their presence made the challenging moments more bearable. Most importantly, I wish to thank my family for their unwavering support and encouragement. Their belief in my abilities has been a constant source of strength. I would like to give a special mention to my grandfather, an incredible source of love, guidance, and inspiration. I am forever grateful for the foundation you laid in my life, and I love you more than words can express. As a tiny planet in this vast universe. you are the unseen dark matter embracing me. How I long to behold you again, yet I know our paths cannot cross anew. Still, your gravity lingers, a silent force. I cherish the moment our light cones intertwined, forever altering my celestial course. Though we may never meet again, you are the reason my galaxy remains whole, an eternal strand in the cosmic web to which I am bound. v TABLE OF CONTENTS INTRODUCTION .......................................................................................... 1 DISCOVER NOVEL DISEASE-ASSOCIATED GENES BASED ON REGULATORY NETWORKS OF LONG-RANGE CHROMATIN INTERACTIONS ......... 6 3DVariantVision: MULTIMODAL FRAMEWORK TO DECODE GENETIC VARIATION BASED ON 3D CHROMATIN ..................................................................... 27 SYSTEMATIC DELINEATION OF MULTI-LEVEL STRUCTURAL VARIABILITIES OF SPATIAL GENOME CONFORMATIONS ....................................... 68 DISCUSSION AND FUTURE DIRECTIONS ............................................. 98 BIBLIOGRAPHY ............................................................................................................. 100 APPENDIX A SUPPLEMENTARY FIGURES FOR CHAPTER 2................................ 115 APPENDIX B SUPPLEMENTARY FIGURES FOR CHAPTER 3................................ 122 APPENDIX C SUPPLEMENTARY FIGURES FOR CHAPTER 4 ............................... 132 vi INTRODUCTION Genetic variations play a crucial role in the development of various severe human diseases, including leukemia, Alzheimer's disease, and cancers1. Understanding the relationship between genetic variants and human diseases is essential for advancing precision medicine and clinical care. Genome-wide Association Studies (GWAS) have revolutionized complex disease genetics by identifying genotype-phenotype associations through the analysis of millions of single-nucleotide polymorphisms (SNPs)2,3. However, GWAS faces limitations such as low statistical power4 and lack of causality due to Linkage Disequilibrium (LD)5. Deciphering the genetic basis of human diseases remains challenging, considering the multi-scale genome structure and intricate underlying molecular mechanisms6,7. Traditionally, research focused on SNPs located in the coding regions of genes, which only account for 1.2% of the human genome2. With the advancements in Next Generation Sequencing (NGS) techniques8, such as RNA-seq, DNase-seq, and ChIP- seq9, scientists have gained insights into the functions of the vast non-coding genome10. Recent studies have discovered millions of functional DNA elements that regulate cell type-specific gene expression in cis, with promoters and enhancers being the two major cis-regulatory elements (CREs)11. Enhancers, bound by various transcription factors (TFs), activate target gene expression by forming chromatin loops with promoters. Gene regulatory networks (GRNs), formed by enhancers, genes, and TFs, provide clear pathways for understanding the effects of genetic variants on diseases6. However, leveraging GRNs in the analysis of non-coding genetic variants poses several challenges, including the vast number of potential enhancer-gene links, the need for comprehensive data integration methods, the unclear relationship between non-coding genetic variant- gene associations and enhancer-promoter associations, and the influence of multilevel variabilities on chromatin interactions and 3D structures. To address these challenges, this thesis aims to develop machine learning, deep learning, and optimization-based models to (1) discover novel disease-associated genes by leveraging regulatory networks of long-range chromatin interactions, (2) enhance eQTL fine-mapping predictions by incorporating 3D chromatin knowledge, and (3) 1 investigate the multi-level variabilities of 3D chromatin organization and model the 3D structures. Accurate prediction of cell type-specific enhancer-promoter interactions (EPIs) is crucial for constructing GRNs12. Computational algorithms for predicting chromatin interactions can be classified into unsupervised and supervised models12. Unsupervised models, such as the Activity-by-contact (ABC) model13 and correlation-based models, have demonstrated superior performance compared to distance-based approaches. However, their performance is sensitive to feature selection and correlation calculation algorithms. With the development of chromosome conformation capture (3C)-based techniques, such as Hi-C14, HiChIP15, ChIA-PET16, and Capture Hi-C17, supervised models have been designed to identify potential EPIs by incorporating various 1D genomic/epigenomic features18,19. Deep learning-based models, such as SPEID20 and DeepTACT21, have further expanded the predictive capabilities by extracting information from reference DNA sequences using representation learning techniques. Understanding the spatial organization and folding of the human genome within the cell nucleus is crucial for deciphering the complex systems of spatially coordinated transcriptional and epigenetic activities7. Hi-C14 has been a driving force in studying 3D genome structures, revealing structural components such as chromatin loops, topologically associated domains (TADs), and chromatin compartments. However, accurately reconstructing high-resolution spatial conformations for all chromosomes is computationally challenging due to the large missing rate and high noise level in Hi-C data. Two categories of computational models have been developed to address this problem: (1) Hi-C simulation and prediction22–24 and (2) 3D genome structure reconstruction25–28. Simulation-based models rely on physical mechanism assumptions, such as polymer modeling and phase separation, while data-driven deep learning-based models, like DeepC23 and Akita24, predict Hi-C contact maps using DNA sequences. In 3D structure reconstruction, observed Hi-C contact frequencies are converted into spatial distances29, and consensus or ensemble structures are inferred by maximizing the similarity between predicted and observed Hi-C distances. The emerging single-cell Hi-C (scHi-C)30–40 technologies have enabled the mapping of 3D chromatin structures in individual cells, revealing the fundamental genome structure and function connections at 2 single-cell resolution. However, the extremely low sequencing depth of single-cell chromatin contact maps poses challenges for studying high-resolution 3D chromatin structures. Methods like Higashi41 have been developed to impute contact maps based on latent correlations among single cells, but there is still a lack of comprehensive methods to reconstruct single-cell 3D chromatin structures. Identifying functionally relevant variants is a significant challenge in human genetics, particularly for non-coding variants10,42. Large-scale efforts, such as the ENCODE consortium9 and the US National Institutes of Health Roadmap Epigenomics project11, have provided data from various assays across the genome to help interpret non-coding variants. Tools like GWAVA43 and FunSeq244 have been developed to annotate potential regulatory variants and predict their effects genome-wide. However, simply considering the overlap of a variant with annotations is insufficient due to the low resolution of publicly available data and the high false-negative rate of rare variants. Traditional machine learning-based model like Kmer-SVM45 and gkm-SVM46 emerged as pioneering models for predicting regulatory elements directly from DNA sequences, bypassing the need for existing annotated motifs. These models employ support vector machines (SVMs) trained on k-mer features to assess the likelihood of a sequence being a functional genomic regulatory element or a tissue-specific enhancer. Delta-SVM47 incorporates gkm-SVM predictions to assess the disruptive impacts of genetic variants. However, the complexity and non-linearity of the underlying regulatory grammar in DNA sequences require further improvements in model performance. In recent years, advanced machine learning, deep learning, and optimization- based models have achieved extraordinary performance across various scientific fields48– 68, including the prediction of genetic variant effects from DNA sequences. DeepSEA69, Basset70, and DanQ71, have demonstrated the potential of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for predicting genetic variant effects based on DNA sequences. These models are trained on large-scale multi-omics datasets across different cell types from the reference genome and can predict the effects of genetic variants on transcription factor binding, chromatin accessibility, and gene expression. Further advancements, such as Basenji72 and ExPecto73, have expanded the scope of predictions to include a wider range of genomic signals and target gene expressions. 3 The Transformer architecture74, originally developed for natural language processing, has shown remarkable success in capturing long-range dependencies and has been adapted for genomic applications. Models like Enformer75, which combines CNNs and Transformers, excel in predicting functional genome profiles and offer improved interpretability through attention weights. Enformer's ability to capture distal regulatory elements up to 100kb away enables more accurate predictions of genetic variant effects. Foundation Models, such as DNABERT76–78 and the Nucleotide Transformer79, leverage self-supervised pre-training on unlabeled genomic sequences to capture the fundamental grammatical structures of DNA. These models have demonstrated remarkable efficacy across various downstream applications, including the detection of functional genetic variants. The increased model size and ability to process longer sequences have further enhanced their performance. Despite the advancements in deep learning models for predicting genetic variant effects, challenges remain. The reliance on labeled data at the cell type level limits their capability to discern functional effects at the single-cell level. Quantifying the impact of genetic variation at the single-cell level has also been studied. The development of single- cell RNA-sequencing (scRNA-seq)80 has enabled the simultaneous and unbiased estimation of cellular composition and cell type-specific gene expression, creating opportunities for mapping eQTLs across different cell types and in dynamic processes81. The single-cell eQTLGen consortium (sc-eQTLGen)81 has been established as a large- scale international collaborative effort to identify the upstream interactors and downstream consequences of disease-related genetic variants in individual immune cell types. The integration of single-cell sequencing data, such as scRNA-seq80,82, scATAC- seq82,83, and scHi-C30–40, presents opportunities for fine-tuning models with minimal data. Additionally, the training of current models based on the reference genome neglects the diversity and frequency of genetic variations across different genotypes. CRISPR84–87 technology, which elucidates the causal and real effects of genetic variants, offers valuable insights beyond the reference genomic context and can help bridge the gap between model predictions and biological reality. In conclusion, this thesis aims to address these challenges by developing machine learning, deep learning, and optimization-based models to discover novel disease- 4 associated genes, enhance eQTL fine-mapping predictions, and investigate the multi- level variabilities of multi-scale 3D chromatin organization. By leveraging regulatory networks of long-range chromatin interactions, incorporating 3D chromatin organizations, and modeling the 3D structures, this work contributes to deep understanding on cell type- specific non-coding genetic variations and advancing precision medicine and clinical care. 5 DISCOVER NOVEL DISEASE-ASSOCIATED GENES BASED ON REGULATORY NETWORKS OF LONG-RANGE CHROMATIN INTERACTIONS 2.1 INTRODUCTION Genome-wide association studies (GWAS) has been one of the major approaches to identify genetic variants, e.g. SNPs, that are associated with specific phenotypes, such as diabetes, neurodegenerative diseases, autoimmune diseases and cancer3,88–92. The statistically significant SNPs from GWAS can indicate genomic loci containing genes whose expression levels are functionally related with the disease status93. Although it has been successful in many studies, there are two major limitations of traditional GWAS analysis: 1) limited statistical power due to sample sizes and minor allele frequencies, and 2) the lack of mechanistic understandings of statistical associations. These limitations are especially challenging for some complex diseases, where multiple functionally coordinated genes and non-coding SNPs contribute to the observed phenotypic changes with mild effect sizes94. A key to address these two challenges is to systematically delineate the regulatory effects of non-coding SNPs95 on gene expression. Since the vast majority of the human genome are non-coding, most of the significant SNPs from GWAS are located in non- coding regions instead of in genes92. As has been shown, non-coding GWAS SNP hits are enriched in regulatory elements, such as enhancers96. Algorithms are therefore needed to leverage the regulatory information from multi-omics dataset to discover more disease-associated genes that are regulated by the non-coding SNPs97. It also provides benefits to identify cis- and trans-factors of co-regulation to further understand the functional relationships among multiple disease-associated genes and SNPs97,98, as well as mechanistic insights of genetic associations mediated by dysregulation of genes99–101. Based on the recognition of transcriptional dysregulation as one of the major mechanisms underlying disease-associated genetics102, gene regulatory networks have been proposed to improve and interpret GWAS results94,103–105. But most network-based GWAS analysis algorithms are based on basic gene co-expression networks, where nodes are genes and edges represent correlated expressions94,103. Co-expression 6 networks are limited in improving GWAS analysis because only genes are modeled as nodes and non-coding SNPs are largely missing in the networks. Furthermore, the co- expression based edges treat the underlying regulatory mechanisms as black boxes without considering the specific cis- or trans- regulatory elements, such as enhancers and transcription factors (TFs), that mediate the observed correlated activities103,104. Due to the recent biotechnology developments, such as Hi-C14,106, ChIA-PET107 and Capture-C108, high-throughput chromatin contact maps are being generated14,109,110. These chromatin contact maps provide information of three-dimensional (3D) chromatin structure14,106–110 and reveal specific long-range chromatin interactions linking distal non- coding enhancers to target genes14,106–110. Unlike cis-regulatory links identified by eQTL calling111, the chromatin interactions are based on evidence of physical interactions between enhancers and promoters and can capture longer cis-regulatory interactions (e.g. ~1Mb)14,106–110. Therefore, the incorporation of 3D chromatin contact maps into regulatory network construction is expected to extend the capability of analyzing distal non-coding GWAS SNPs and their regulatory impacts on specific target genes, which may participate in critical biological pathways associated with the disease. There have been several successful case studies of using long-range chromatin interactions to decode the underlying disease mechanisms, such as SNPs associated with prostate cancer, erythrocyte and triglyceride were found to be enriched in regulatory DNA regions and may disrupt TF binding96,112,113. Statistical methods have been built to jointly model SNPs and gene expression with respect to the impacts on disease risks114,115. Although the methods show promising results, further incorporation of chromatin interactions to link SNP with specific genes will substantially enlarge the SNP-sets associated with genes and improve the statistical power. A couple of algorithms104,116 have been developed to utilize chromatin interactions to combine cis-regulatory elements with gene-gene networks for disease-association analysis, which can prioritize disease-associated non-coding SNPs. But these methods cannot aggregate the SNP-level signals to discover new genes that might be functionally associated with diseases. Another layer of complexity of gene regulation comes from combinations of TF bindings to not only gene promoters but also linked distal enhancers117,118. Knowing the key TFs as trans-regulatory factors for specific genes can shed light on common 7 regulatory mechanisms shared by multiple genes and can also indicate how multiple enhancers coordinate together to regulate target genes119,120. In this regard, long-range enhancer-promoter interactions provide important information to aggregate TF binding patterns across promoters and enhancers. Combined with GWAS data, candidate master TF regulators of disease-associated genes can be inferred based on enrichment analysis from both promoters and enhancers121–123. Given the benefits of integrating 3D chromatin interactions into the study of gene regulation, we have developed a computational infrastructure to construct regulatory networks and make network-based predictions of novel disease-associated genes. The software, named as APRIL, combines cell-type specific epigenomics and transcriptomics datasets with public available chromatin contact maps to build expanded regulatory networks, which include long-range cis-regulation between non-coding enhancers and genes and trans-regulatory links of TFs. APRIL also provides functions to analyze co- occurring signatures of GWAS SNPs in the constructed networks, in order to obtain insights on functional coordination of disease-associated genetic variants. Furthermore, APRIL contains both unsupervised and supervised machine learning algorithms to predict new disease-associated genes, by leveraging the information from non-coding regulatory SNPs and the regulatory network structures. The application of APRIL on GWAS studies of leukemia demonstrates that not only new disease-associated genes can be reliably predicted but also regulatory mechanisms underlying diseases can be indicated. 2.2 MATERIALS AND METHODS 2.2.1 Dataset and annotations/definitions For chromatin interaction datasets used in this study, the ChIA-PET dataset of two replicates are collected from GSE39495124 for K562 cell-line. The two replicates are merged together, leading to ~100k interactions in total. Additional public-available high quality chromatin interaction datasets, including ChIA-PET107, Hi-C106, Capture-C108 and computationally predicted enhancer-gene interactions, i.e. JEME125 and IM-PET18, can also be used as inputs. GENCODE version 19126 is used as the gene model annotations. Only protein coding genes are considered. The promoters for each gene are defined as the +/- 1kb 8 region centered at the transcription start sites (TSS). In total it includes ~20k protein- coding genes. The consensus enhancer annotation is collected from the Roadmap Epigenome and ENCODE consortia96,124, using the version filtered by DNase-seq signals (-log10(p) >= 10). The enhancers are further annotated as cell-type specific enhancers based on chromatin segmentations learnt from chromHMM in all 127 cell lines/tissues based on histone modifications96. For epigenomics dataset, the imputed DNase-seq and RNA-seq data of 127 cell lines are collected from the Roadmap Epigenome and ENCODE consortia to quantify cell- type specific enhancer activities and gene expression96,124. For TF binding site annotations, TF motif hits predicted by Kheradpour et al127 are collected. The version filtered by conservation scores across multiple species (>0.3) is used. This resulted in ~13.6M motif hits for ~500 TFs. Known disease-associated genes are collected from DisGeNET128. This curated dataset includes gene-disease associations from multiple resources, including UNIPROT, CTD, ORPHANET, CLINGEN, GENOMICS ENGLAND, CGI and PSYGENET129–132. Immune-associated genes are identified based on keywords matching summarized in Supplementary Table (Figure A. 1). GWAS SNPs associated with immune are collected from two sources: Biobank133 and EMBL EBI134. Significant eQTLs in whole blood are collected from GTEx V7111. The nominal p-values of eQTLs are provided by GTEx and are used in this paper. 2.2.2 Regulatory network construction based on long-range interaction Construct 3D chromatin modules of long-range cis-regulation Groups of genomic fragments that are inter-connected with each other by chromatin interactions, i.e. 3D chromatin modules, are first identified in the APRIL algorithm (Figure 2.1 A). Within each 3D chromatin module, nodes represent genomic fragments and edges represent long-range chromatin interactions. For each pair of nodes in a module, there exists at least one path connecting them. Every chromatin interaction has two linked genomic loci, i.e. interaction anchors, and different chromatin interactions may involve overlapping interaction anchors. To create a catalogue of unique indexes for interacting 9 genomic locations, consecutively overlapping anchors along the genome are merged into single fragments and are represented as single nodes in the network. Genomic fragments (i.e. nodes) in 3D chromatin modules are further annotated based on their potential functions in gene regulation. The nodes are classified into three types: gene nodes, enhancer nodes and other-element nodes. A genomic fragment is classified as a gene node if it overlaps with a gene’s promoter (i.e. +/-1kb from TSS) based on the provided gene model annotations and if the gene is expressed in the specific tissue or cell-type (log2(RPKM)>0) based on the provided transcriptome data. A genomic fragment is classified as an enhancer node if it contains enhancer-specific epigenetic signatures provided by the user. Enhancer-specific epigenetic signatures can be: a) enhancer chromatin state called by chromHMM or Segway135,136, b) high chromatin accessibility such as DNase-seq peaks or ATAC-seq peaks, c) enhancer-specific histone modification (e.g. H3K4me1 or H3K27ac) ChIP-seq peaks, d) enhancer RNA (eRNA) signal peaks. Users have the flexibility to select the specific enhancer epigenetic signatures to run APRIL. Genomic fragments that are not classified as genes or enhancers will be considered as other-elements. Based on these annotations, nodes will be color coded in the final network visualizations (Figure 2.1 A). Overall, the annotated 3D chromatin modules represent connected regulatory units of multiple genes and multiple enhancers. The chromatin interactions within each module represent long-range cis-regulation among pairs of enhancers, other-elements and gene promoters, which are usually co- located in chromatin domains137. Expand regulatory sub-networks with trans-regulation The regulatory networks will be further expanded from 3D chromatin modules, by incorporating connectivities of trans-regulation. Although long-range cis-regulation within 3D chromatin modules are mainly generated by intra-domain interactions14,137, trans- regulation by transcription factors (TFs) can coordinate expression of genes located in different chromatin domains. The APRIL algorithm builds expanded regulatory sub- networks by merging 3D chromatin modules that share common enriched TFs (Figure 2.1 A, Figure A. 2). Based on the genome-wide TF motif annotations provided by the 10 user, APRIL will first scan genomic fragments in every 3D chromatin module by counting the motif occurrences for each specific TF. The motif occurrence counts across all genomic fragments are then organized into a node-level TF matrix, where each row corresponds to a genomic fragment (i.e. nodes in the network) and each column corresponds to a specific TF. To reduce false positives of TF motif analyses, only TFs that are expressed (log2(RPKM) > 0) in the specific tissue or cell-type under study are included. Based on the node-level TF matrix, a module-level TF matrix is then constructed, where each row corresponds to a 3D chromatin module and each column corresponds to the averaged TF occurrence counts across nodes within the specific module. Hierarchical clustering is then applied on the module-level TF matrices to identify clusters of 3D chromatin modules whose TF occurrence profiles are significantly similar (Figure A. 3). Considering the large variances across different TFs, ‘complete’ mode of hierarchical clustering is used. Pearson correlation is used as the similarity metric. The default number of clusters is determined based on analyses of Within-cluster Sum of Squares and Averaged diameters of the resulting sub-networks (Figure A. 3). 3D chromatin modules belonging to the same clusters are linked together by adding 1) nodes to represent the common TFs shared by the chromatin modules, and 2) edges between TF nodes and genomic fragments in the 3D modules which contain motifs of the specific TFs. The resulting graph of multiple 3D chromatin modules linked by common TF nodes are used as the expanded regulatory sub-networks, which serve as the foundation for the subsequent disease-associated gene predictions. In the meantime, based on the hierarchical clustering, clusters of TFs that are enriched within specific regulatory sub-networks are identified as candidate master TF regulators, and a clustered heatmap is generated to visualize the associations of candidate master TF regulators and specific regulatory sub-networks. 11 Figure 2.1. The workflow of APRIL and its application on K562 ChIA-PET dataset. (A) APRIL takes 3D chromatin contact maps and cell-type specific multi-omics dataset as inputs to construct 3D chromatin modules including gene nodes (red), enhancers nodes (orange) and other-elements nodes (blue). Enriched TFs based on motif analysis are added as TF nodes (purple) and are used to merge 3D chromatin modules into sub- networks. Based on the GWAS SNPs and the constructed regulatory sub-networks, novel disease-associated genes are predicted using different machine learning techniques. (B) One example of the constructed regulatory sub-network based on K562 ChIA-PET data. (C)-(D) Statistical distributions of network properties. Global graph properties (C) include Node count (upper panel) and Diameter (lower panel). Node centrality metrics (D) include Betweenness centrality (upper left panel), Closeness centrality (lower left panel), Degree centrality (upper right panel) and Page-rank centrality (lower right panel). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) 2.2.3 Network-based enrichment analysis for genetic associations Before predicting novel disease-associated genes, APRIL provides a series of functions to statistically test whether GWAS SNP hits are over-represented within specific regulatory sub-networks. Based on the provided GWAS summary statistics (e.g. association p-values of SNPs) from the user, APRIL first identifies genomic fragments (i.e. 12 nodes in regulatory sub-networks) that harbor those SNPs. Geodesic distances, which are defined as the lengths of shortest paths between two nodes harboring GWAS hits, are calculated to quantitatively demonstrate how close GWAS hits are located across the topology of the regulatory sub-networks. 1,000 random controls are generated by shuffling the labels of nodes (i.e. GWAS-hit labels and non-GWAS-hit labels) across all regulatory sub-networks. The observed distribution of geodesic distances based on real GWAS hits are then compared to the distributions from the 1,000 random control sets using Kolmogorov-Smirnov tests to statistically evaluate whether significant GWAS hits prefer to be located closer in the network. For each regulatory sub-network, APRIL further statistically tests whether the specific sub-network has significantly higher frequency of observing GWAS hits. 1,000 random control sets are generated by shuffling the labels of nodes across all sub- networks (i.e. GWAS-hit labels and non-GWAS-hit labels), with the total number of nodes and edges in each sub-network controlled. For each sub-network, using the counts of GWAS-hit nodes in the 1,000 random sets as the null distribution, we calculated the empirical p-value using the formula: 1+No. of (GWAS-hit in random networks > observed GWAS-hit in APRIL sub-network)) / 1001. The regulatory sub-networks are then sorted based on their GWAS enrichment significance. 2.2.4 Prediction of novel disease-associated genes APRIL employs both unsupervised and supervised machine learning algorithms to predict novel disease-associated genes, based on the constructed networks of long-range gene regulation. Users have the flexibility to choose the prediction methods and do comparative analysis on the results. The unsupervised prediction is based on label propagation138 and the supervised prediction is based on random forest models. Unsupervised label propagation to predict disease-associated genes APRIL utilizes HotNet138 algorithm to predict novel disease-associated genes. Different from traditional applications, APRIL incorporates three different types of nodes in the network for label propagation, i.e. gene nodes, enhancer nodes and TF nodes. As in the HotNet138 diffusion framework, every regulatory sub-network is considered as a dynamic fluid system, where each node (except TF nodes) serves as a source of fluid with a 13 constant rate and fluid can diffuse from node to node within the sub-network. APRIL uses the GWAS association effect sizes of each node as the corresponding node-specific source fluid rates. Therefore, nodes harboring significant GWAS SNP hits are assigned with higher source fluid rates. If a node contains multiple SNPs, the maximum effect size is used for the node. Sub-networks with less than five gene nodes are removed from this analysis. After HotNet label propagation converges to the steady state, a predicted disease-association score is assigned for every node, especially for gene nodes. Top- ranked genes based on the association scores are highlighted as candidate disease- associated genes. Supervised prediction of disease-associated genes using random forest models To construct predictive models of disease-associated genes and identify key network features, APRIL provides a supervised mode of predictions based on random forests. For every gene node in the sub-networks, the binary label to predict is disease-associated or not. The features used in random forest to predict the disease-association labels for a specific gene node include five sets: 1) regulatory sub-network related features, such as node degree, betweenness, closeness, page rank centrality; 2) GWAS related features, such as the maximum and summation of effect sizes of neighboring nodes; 3) TF related features, such as TF motif occurrences in the gene node and neighboring nodes. For each specific TF, the motif occurrence counts from neighboring nodes are aggregated into a weighted summation score, where the weights are GWAS effect sizes assigned to each neighboring node; 4) eQTL features in neighboring nodes, including the count of eQTLs and the p-value of the most significant neighboring eQTL; and 5) cell-type specific activity features of genes and enhancers, such as the z-scores of gene expression and enhancer chromatin accessibility across multiple cell-types. Since different gene nodes may be connected to different numbers of enhancer nodes, APRIL uses quantile statistics of enhancer activities to uniformly describe activity distributions of neighboring enhancer nodes for every gene node. To train the random forest model, balanced negative training sets are generated from the pool of genes in regulatory sub-networks that are not identified as disease-associated based on traditional GWAS and do not share common neighboring nodes in the sub-networks with genes in the positive training sets. Random forest classifier is trained using the R package ‘randomForest’ with 50 trees. Ten-fold 14 cross-validation is used to evaluate the performance. Feature importance is calculated and provided to users. 2.2.5 Cell-type specific activity analysis Based on user provided cell-type specific activity datasets for genes and enhancers, APRIL also provides functions to understand how cell-type specific activities are coordinated in regulatory sub-networks. To quantitatively evaluate whether genes, enhancers and TFs are significantly correlated, APRIL provides different versions of random backgrounds to carry out statistical tests. The first version of background is created by randomly pairing nodes in the same networks, with all other factors controlled. The second version is based on all potential pairs. For each version, 1,000 random samples are generated. Pearson correlations of observed node pairs are calculated based on quantile normalized cell-type specific activities. Similar correlation calculations are carried out for node pairs from random samples. The observed correlations are then compared with random background and are statistically tested (one-sided Student’s t- test). Three types of node pairs are considered: 1) gene-gene pairs, 2) enhancer-gene pairs, and 3) TF-gene pairs. Since the regulatory sub-networks are constructed based on information of 3D chromatin interactions and TF motif occurrences, the cell-type specific activity correlations and the significance tests provide orthogonal information on functional coordination of gene regulation at systems-level. Furthermore, APRIL employs public available protein-protein interaction (PPI) databases139 to evaluate whether TF-TF pairs in regulatory sub-networks are statistically enriched with PPIs based on empirical permutation tests. For each sub-network, APRIL generates 1,000 random sets of TF-TF pairs by shuffling TF names across all sub-networks. This procedure controls all other factors and can efficiently eliminate bias. The fractions of PPI-supported TF-TF pairs in the observed sub-networks are then compared with random samples to quantitatively test whether TFs co-regulating genes are more likely to physically interact with each other. 2.2.6 Network and prediction visualizations The constructed regulatory sub-networks are visualized using the R package ‘igraph’. In the network visualization, gene nodes are marked as red, enhancer nodes are marked as orange, other-element nodes are marked as blue, and TF nodes are marked as purple. 15 Edges of gene-gene pairs, enhancer-gene pairs and enhancer-enhancer pairs represent long-range chromatin interactions. Edges of TF-gene pairs and TF-enhancer pairs represent predicted trans-regulation. A plotting function with default setting above is provided to users. Along with the network visualization, additional plots can be generated, including 1) activities of genes, enhancers and TFs, 2) correlations of gene-gene pairs, enhancer-gene pairs, and TF-gene pairs, 3) PPI enrichment, and 4) clustered heatmap of enriched TFs across 3D modules. The plots are generated based on the same method as stated in the methods 2.2.4. For predictions of novel disease-associated genes, results of the two machine learning methods are provided as a series of plots and tables. The predictions based on label propagation are provided as a table with three columns: gene names, predicted scores and sub-network indexes. Since the raw predicted scores from different regulatory sub-networks are not directly comparable, the mini-max normalized scores are provided for each gene to make it comparable across sub-networks. Along with the table, a boxplot of predicted score ranks between disease-associated genes and control gene sets is also provided to the user. The result can also be visualized in the regulatory sub-networks, where nodes are colored by the predicted scores of label propagation and candidate disease-associated genes are represented as stars. The predictions based on the random forest are provided as a table with two columns: gene names and the predicted disease-association probabilities. The ROC plot of ten-fold cross-validation is provided to the users. The predicted disease-associated genes can be visualized in the sub-networks and the newly discovered candidate genes are highlighted as red stars. 2.3 RESULTS 2.3.1 Expanded regulatory networks constructed based on 3D chromatin interactions. To demonstrate the performance, APRIL is applied on a ChIA-PET chromatin interaction dataset of K562 cell-line124. The distances of chromatin interactions linking regulatory elements and genes can be longer than 1Mb, with a median distance of 25kb (Figure A. 2). The regulatory elements mainly include enhancers as characterized by enhancer- specific epigenetic signatures, including enhancer chromatin states and DNase-seq 16 signal peaks96,124, along with other potential non-coding regulatory elements. They are termed as enhancer nodes hereafter in this paper. The interacting genomic locations that overlap with gene promoters126 are annotated as gene nodes. In addition, APRIL identified ~402k DNA motif occurrences for 221 TFs located in interacting genomic locations from this ChIA-PET dataset. By integrating the long-range cis-regulation of chromatin interactions and trans-regulation mediated via TFs, APRIL constructed an expanded K562-specific regulatory network, including edges linking different types of nodes (i.e. gene, enhancer and TF nodes). Overall, there are 10,458 gene nodes, 73,494 enhancer nodes (including enhancer elements and other regulatory elements) and 14,667 TF nodes. There are 741 sub- networks containing at least one gene node. In each sub-network, multiple 3D chromatin modules (i.e. groups of long-range cis-regulation) are connected by common enriched TFs (i.e. combinatorial trans-regulation). Figure 2.1 B shows one example of the constructed sub-network, where different types of color-coded nodes connected by cis- or trans-regulatory links. More examples are shown in Figure A. 2. The sub-networks have different sizes, as demonstrated by the number of nodes and the network diameters (Figure 2.1 C). In addition, indicated by a variety of centrality metric analyses (Figure 2.1 D), substantial fractions of sub-networks are organized in a hierarchical structure with node ‘hubs’ linking many neighbors in the graphs, emphasizing the importance of understanding gene regulation from a systems-level. 2.3.2 Cell-type specific regulatory signatures encoded in networks The constructed regulatory sub-networks provide a platform to analyze K562 specific gene regulation, as the networks contain 83,952 genomic locations in total, such as gene promoters, enhancers and other regulatory elements, which participate in long-range gene regulation (Figure 2.2 A). In addition, the TF nodes in sub-networks represent abundant trans-regulation mediated by TF combinations. Compared to genome-wide background, genes included in the expanded regulatory networks have significantly higher expression levels in K562 cells (Figure 2.2 B). Similarly, enhancers and TFs included in the networks also show higher activities in K562 cells compared to other enhancers or TFs not in the networks (Figure 2.2 C, D). These results suggest the 17 functional importance of nodes contained in the constructed regulatory networks for cell- type specific activities. To characterize the coordinated activities of genes, enhancers and TFs in specific sub-networks, correlations across different cell-types are calculated for gene-gene, enhancer-gene and TF-gene node pairs, based on DNase-seq and RNA-seq data from 127 cell-types/tissues in ENCODE and Roadmap Epigenomics consortia96,124. All three types of node pairs are found to have significantly higher activity correlations than random samples, where basic network topology properties are strictly controlled (Figure 2.2 E- G). These observed high correlations indicate that the constructed regulatory networks, which are built from long-range chromatin interactions and TF motif enrichments, can accurately capture the cell-type specific regulatory modules. To obtain mechanistic understandings of potential multi-TF collaborations involved in gene regulation, APRIL further incorporates protein-protein interaction (PPI) data139. Compared to randomly shuffled TF node annotations across the regulatory networks, TF- TF node pairs within the same sub-networks are found to be highly enriched with PPIs (Figure 2.2 H), suggesting combinations of TFs involved in co-regulation of genes may have the potential to physically interact with each other. As the constructed 3D chromatin modules represent cis-regulatory units with multiple enhancers and genes, they provide a better way to characterize combinatorial TF groups in gene regulation than analyzing single enhancers or promoters. For every 3D chromatin module, the TF motif counts from enhancer or gene nodes are aggregated together to create module-level TF enrichment profiles. Interestingly, clusters of TF combinations are found to be shared across specific subsets of 3D chromatin modules (Figure 2.3). For example, there are 14 3D chromatin modules enriched with TFs of the FOXO family, 8 3D chromatin modules enriched with TFs of the MAF family and 10 3D chromatin modules enriched with TFs of the ETS family. This observation suggests that the clusters of TFs are potential trans-regulators working together to co-regulate multiple genes and enhancers across different 3D chromatin interacting hubs. 18 Figure 2.2. Genomic fragments in regulatory sub-networks tend to share similar cell-type specific regulatory activities. (A) Pie-chart summary of node annotations. (B-D) Regulatory sub-networks constructed by APRIL capture highly active (B) genes (based on RNA-seq), (C) enhancers (based on DNase-seq) and (D) TFs (based on RNA-seq, P = 0.007302) in K562 cells. The controls are genes, enhancers and TFs that are not covered by the regulatory sub-networks. P-values are calculated by the one-sided Student’s t-test. (E)-(G) The constructed sub-networks capture highly co-active (E) gene- gene pairs, (F) enhancer-gene pairs and (G) TF-gene pairs across cell-types (red) based on quantile normalized activities, compared with controls of shuffled networks (orange) and random pairs (purple). P-values are based on the one-sided Student’s t-test. (H) The TF-TF node pairs within the same sub-networks are enriched with PPIs. Controls are generated by randomly shuffled TF names on the same networks. Error bars are the standard errors based on 1,000 random controls. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) 19 Figure 2.3. 3D chromatin modules form clusters with common TF combinations. Rows represent different TFs and columns represent 3D chromatin modules. TF enrichment within a specific 3D chromatin module is calculated based on motif occurrences in promoters and enhancers. All five highlighted TF combinations enriched in different clusters of 3D chromatin modules are known leukemia or immune related TF families. 2.3.3 Long-range regulatory networks can characterize the relationship among disease-associated genes. To show how disease-associations are encoded in the constructed K562-specific regulatory sub-networks, the co-occurring patterns of GWAS SNP hits are statistically tested with respect to both topological closeness and enrichment of over-representation. Considering that K562 is a blood cancer cell-line, immune-associated GWAS datasets from the UK BioBank and EMBI databases133,134 are used for the analysis (see Methods). After overlaying the immune-associated GWAS SNPs to the regulatory networks based on genomic location overlaps, the pairwise geodesic distances for each pair of gene nodes containing significant GWAS SNPs, termed as disease-gene nodes hereafter, are calculated (Figure 2.4 A). Compared to randomly shuffled networks, the geodesic distances between disease-gene nodes are significantly shorter (p-value< 2.2x10-16, Kolmogorov-Smirnov test), suggesting that disease-associated genes are closer to each other based on the topology of the constructed networks. This observation also indicates the functional relationship of GWAS SNP hits in the process of gene regulation, which 20 may mediate the association between the genetic variants and traits through regulatory paths in the networks. Consistent with this hypothesis, the immune-associated GWAS SNPs contained by nodes of the regulatory network demonstrate more stringent GWAS p-values than SNPs that are not contained by the network (Figure 2.4 B, C). In addition, disease-associated genes in regulatory networks tend to have relatively higher fractions of neighboring nodes containing eQTLs (Figure A. 4). Furthermore, a subset of regulatory sub-networks are found to be statistically enriched with GWAS genes (Figure 2.4 D), compared to randomly shuffled controls using empirical permutation tests with the network topologies maintained. As one of the examples, Figure 2.4 E shows a K562 regulatory sub-network containing five genes (red nodes) linked together by a number of long-range cis-regulation and six TFs. Three out of the five genes are found to be significantly associated with immunity based on the GWAS analysis (Figure 2.4 F), which is significantly more than randomly shuffled controls (p-value=0.008, empirical permutation tests). These sub-networks enriched with GWAS genes suggest that disease-associated genes and the corresponding regulatory elements are functionally inter-connected with each other. 21 Figure 2.4. Connectivity and enrichment of disease-associated genes and GWAS SNPs in regulatory sub-networks. (A) Geodesic distances between disease-gene nodes in sub- networks are significantly shorter compared to random networks (P < 2.2e-16, Kolmogorov-Smirnov test). (B-C) GWAS SNPs covered by the regulatory sub-networks are more significantly associated with disease based on data from (B) the UK Biobank (P = 0.00096) and (C) EMBL database (P = 0.05). The P-value is calculated using one-sided Student’s t-test). (D) A subset of sub-networks are enriched with disease-associated genes. P-values are based on the permutation tests against random graphs. (E) One example of disease-associated genes enriched in a regulatory sub-network in K562 cells. Disease-associated genes are represented as stars. (F) Distribution of disease- associated gene counts based on random samples of shuffled GWAS labels (disease- gene and non-disease-gene nodes) across the networks. 22 2.3.4 Unsupervised disease-associated gene discovery using long-range regulatory networks As supported by the observed co-occurring GWAS SNP hits in regulatory sub-networks, predictive machine learning algorithms are applied to predict novel disease-associated genes based on network features. As every node in the networks is assigned with a score of association with immunity, based on the GWAS SNPs from UK BioBank133,134 that overlap with the nodes, HotNet label propagation algorithm138 is employed to aggregate the scores of connected nodes (Figure 2.5 A). Using the GWAS scores from gene nodes alone to do the label propagation, the disease-associated genes from GWAS are significantly top-ranked compared to control gene sets (Figure 2.5 B), which is expected. Strikingly, without using any GWAS scores from gene nodes, the disease-associated genes are also significantly top-ranked, based on GWAS scores from enhancer nodes that are linked to the regulatory networks by long-range chromatin interactions with genes (Figure 2.5 B). This finding suggests that distal genetic variants in linked regulatory elements are predictive for disease-association of target genes. By combining GWAS scores from both genes and distal enhancers, APRIL can achieve a better separation of disease-associated genes and control genes (Figure 2.5 B). As an interesting example, the gene CTPS1 is predicted to be immune-associated based on label propagation on K562 regulatory networks (Figure 2.5 C). The association of CTPS1 with immunity is found to be supported by DisGeNet128. As shown in the regulatory sub-network where nodes are color-coded by predicted association scores, CTPS1 is predicted mainly due to a neighboring enhancer node (Figure 2.5 C), which has long-range chromatin interaction with CTPS1’s promoter based on K562 ChIA-PET dataset. Therefore, the high association score of this particular enhancer with immunity accurately predicts CTPS1’s functional relevance in immune systems. Furthermore, this specific enhancer contains multiple SNPs that are also eQTLs of CTPS1 in whole blood tissues (Figure 2.5 C), providing cellular level evidence of the genetic association. In addition, CTPS1 has a neighboring TF node: NRF1, which is supported by the ChIP-seq signal peak of NRF1 in the promoter region of CTPS1 (Figure 2.5 C). Considering the known functional roles of NRF1 in cancer140, this observation further indicates potential mechanistic insights of CTPS1’s association with immunity. 23 Figure 2.5. Predict disease-associated genes using regulatory sub-networks based on label propagation. (A) Schematic figure of label propagation. Disease association scores are propagated along the sub-networks so that every gene node can borrow disease- association information from neighboring nodes. (B) Disease-associated genes have higher percentile ranks of predicted scores than control genes (one-sided Student’s t- test), suggesting that effect sizes of genetic associations from other nodes in the network can help to improve disease-associated gene prediction. (C) Example of newly discovered disease-associated gene CTPS1, which is validated by DisGeNET. The color of nodes corresponds to the predicted disease-association score. The neighboring TF node (NRF1) is inferred based on NRF1 motif in the promoter region of CTPS1, which is supported by the ChIP-seq data of NRF1 binding. Also, the neighboring enhancer node in the network is linked to CTPS1 by long-range K562 ChIA-PET data. The enhancer element contains three significant GTEx eQTLs in whole blood tissues. 2.3.5 Supervised prediction based on aggregated SNP information from neighboring nodes The random forest based supervised prediction of disease-associated genes is also applied on the constructed K562 regulatory networks, which provides complementary advantages compared to unsupervised predictions. As explained in the methods section, multiple sets of features in gene promoters, linked enhancers and network topology are included (Figure 2.6 A). Using the database of DisGeNet128, genes are labeled as 24 immune-associated or not associated to train the random forest classifiers. Based on ten- fold cross validation, the random forest models achieve high accuracy (average AUC=0.87) in predicting immune-associated genes (Figure 2.6 A). By analyzing the calculated feature importance, the top-ranked predictive features include network- topology related features of genes (e.g. centrality metrics), gene expression levels, specific TFs, and most interestingly, the GWAS effect sizes of linked enhancers in the regulatory networks (Figure 2.6 B, Figure A. 5, Figure A. 6). Since the distal enhancers are linked to genes by long-range chromatin interactions, this finding strongly highlights the importance of non-coding genetic variants that may induce gene expression dysregulation in complex diseases. As an example (Figure 2.6 C), the gene ABCA7 is predicted to be associated with immunity, consistent with its high expression in whole blood tissues. This gene is located in a large regulatory sub-network with multiple genes, enhancers and TFs densely connected, and it has been found to be associated with autoimmune diseases141,142. One of the contributing features for this prediction is a non- coding regulatory enhancer element linked to ABCA7 promoter through chromatin interactions. In support of dysregulation of ABCA7’s expression from SNPs in this enhancer, there is an eQTL discovered in whole blood tissue which is correlated with the expression changes of ABCA7 (p<7.8x10-28). Other contributing features for this prediction include neighboring TF nodes of KLF16 and SP1, both have been demonstrated to be related with immunity, especially leukemia143–146. A similar example of predicted immune-associated genes can be found in Figure A. 7, which also leads to insights on the predictive features of cis- and trans-factors underlying disease- associations. 25 Figure 2.6. Supervised prediction of disease-associated genes based on regulatory sub- networks. (A) Overview of features used in the random forest model, including network features, effect sizes, TF features and activity features. The supervised prediction of disease-associated genes achieves an averaged AUC = 0.87 based on ten-fold cross validation. (B) Boxplots of top ranked features in predicted disease-associated genes compared to control genes (one-sided Student’s t-test). (C) Example of a newly discovered disease-associated gene ABCA7. ABCA7 has high expression in whole blood tissues based on GTEx dataset. ABCA7 has two neighboring TF nodes: SP1 and KLF16, based on their motif occurrences. They are validated by ChIP-seq dataset of SP1 and KLF16 bindings. Both TFs are highly ranked with respect to the feature importance calculated by the random forest model. Also, a neighboring enhancer node linked to ABCA7 contains a highly significant eQTL in whole blood tissue based on GTEx dataset (p-value = 7.8e-28). 26 3DVariantVision: MULTIMODAL FRAMEWORK TO DECODE GENETIC VARIATION BASED ON 3D CHROMATIN 3.1 INTRODUCTION Decoding the functional impacts of genetic variants plays pivotal roles in revealing the underlying mechanisms of complex human diseases, such as Alzheimer’s disease1,147,148, autoimmune diseases121,149 and cancer150–152. Although conventional genome-wide association studies (GWAS) have enabled the identifications of specific single-nucleotide polymorphisms (SNPs) associated with different diseases, its analytical framework imposes significant limitations, such as the low statistical power, the ambiguity among neighboring SNPs in LD, the restricted capability of pinpointing distal non-coding SNPs, and the lack of mechanistic interpretations4,5. These challenges are collectively caused by the limited sample sizes in disease genetics, moderate SNP effect sizes, the burden of genome-wide multiple hypothesis testing, the LD blocks, and, particularly, the simplified black-box treatment of molecular-level SNP effects, such as the functional dysregulation of cell-type specific transcription4,5. To address these challenges, one promising strategy is to integrate the multi-omics information under specific cellular contexts into the analysis and leverage the molecular- level phenotypes as mediators to decipher the disruptive impacts of SNPs. This strategy, depending on the specific modeling architectures, has led to a series of important discoveries, including molecular-level genetic associations (e.g. eQTLs, histone QTLs, and meQTLs)153–156, transcriptome-wide association studies (TWAS)157, and de novo machine learning predictions of cell-type specific SNP effects (e.g. delta-SVM, GWAWA, FunSeq)43,44,47. More recently, given the fast advancements in deep learning, a variety of natural language processing (NLP) models have been successfully adapted to predict non-coding SNP effects at single-nucleotide resolution. Based on Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BLSTM), DeepSEA69, Basenji72, DanQ71 and BPNet158 can predict cell-type specific transcription factor (TF) binding sites, chromatin accessibility, histone marks and nearby gene expression, which are further used to predict the disruptive effects induced by specific SNPs. Also based on 27 CNN architecture, SPEID20 and 21 can predict enhancer-promoter links, while Akita24 and DeepC23 characterize Hi-C chromatin interactions. As the most recent development beyond the CNN-centric framework, Enformer75 employs the attention mechanisms74 and extends the genomic window of input features to be 100kb centered around gene promoters, in order to learn the long-range regulatory effects of non-coding SNPs, which yields promising results. Overall, these models substantially expand the functional annotation from coding regions to non-coding genetic variants and provide mechanistic hypotheses on potentially functional SNPs. However, existing approaches face challenges: 1) Integrating long-range and local effects remains unaddressed, and 2) the gap between chromatin and gene expression effects persists. 3) lacks cell-type specificity in chromatin interactions and fails to illustrate the relationship between chromatin and gene expression effects. As a result, a holistic comprehension of genetic variant impacts on local profiles, long-range chromatin, downstream target genes, and associated diseases remains incomplete. Here we introduce 3DVariantVision, a cutting-edge deep learning-based multimodal framework, designed to offer a comprehensive view of genetic variant effects, spanning from genotype to phenotype. This innovative approach integrates diverse genome-wide datasets, including ChIP-seq9, Hi-C14, and eQTL summary statistics159–161. Initially pre-trained on predicting cell-type-specific 1D genomic and epigenomic profiles, 3DVariantVision crafts informative embeddings solely from DNA sequences, employing cross-attention in communicative learning to discern intricate enhancer-promoter relationships. Fine-tuning eQTL prediction tasks, it bridges the knowledge gap between 3D chromatin structure and gene expression. Remarkably, relying solely on DNA sequences, 3DVariantVision excels in predicting functional genetic variants, deciphering their disruptive effects on TF bindings, histone modifications, enhancer-promoter interactions, and gene expression. This enhanced capability leads to improved eQTL discoveries. By capturing the intricate, nonlinear regulatory grammar across multiple feature levels and genomic distances, 3DVariantVision accurately identifies the distal target genes of non-coding variants, unveils novel TF combinations, and provides fresh insights into trait-associated variants, transcending conventional GWAS and TWAS studies. 28 3.2 MATERIALS AND METHODS 3.2.1 Model framework of 3DVariantVision We introduce 3DVariantVision, an innovative multimodal framework tailored to deciphering the effects of genetic variations using a 3D chromatin context. The framework operates across three distinct stages to unravel the effects of these variants: 1) Disruption of upstream TF binding effects; 2) Influence of chromatin interaction effects; 3) Manifestation of downstream gene expression effects, encompassing eQTLs. Through these three intricately connected stages, 3DVariantVision achieves a comprehensive and nuanced understanding of the complex interplay between genetic variations and their functional consequences within the context of 3D chromatin architecture. Epigenomics prediction based on representation learning The 3DVariantVision approach employs a configuration consisting of three CNN blocks, succeeded by two fully connected blocks. This architecture is applied to predict 139 distinct ChIP-seq peaks within the human genome, encompassing TF binding sites, histone modifications, and DNase peaks. As input data, the one-hot-encoded DNA sequence is utilized, where nucleotides A, C, G, and T are represented as [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1], respectively, while N represents an unresolved base as [0,0,0,0]. The input sequences are of length 2,000 base pairs. The employment of CNNs in this context derives from their capability to effectively capture local sequence patterns. The CNN-based architecture serves to generate a representative embedding of the DNA sequence. Each individual CNN block comprises a convolutional layer, which is subsequented by Batch Normalization, Rectified Linear Unit (ReLU) activation, and Max Pooling operations. This orchestrated sequence of layers within each CNN block contributes to the network's capacity to discern and encode essential features from the DNA sequence data. Batch Normalization is a fundamental technique in machine learning that enhances training stability by normalizing activations within each mini batch, mitigating issues related to internal covariate shifts and accelerating convergence. ReLU is a widely used non-linear activation function in neural networks, facilitating the incorporation of essential non-linearity by outputting the input directly if it's positive, while transforming negative inputs to zero, enhancing the network's 29 ability to capture complex relationships within data. Max pooling is a pooling operation, where the input data is partitioned into non-overlapping regions and the maximum value from each region is selected to form a downsampled representation, aiding in feature extraction and spatial hierarchy preservation. The second and third CNN blocks utilized dilated CNN that introduces gaps or "dilation" between filter elements, effectively expanding the receptive field while preserving computational efficiency. This allows dilated CNNs to capture both local and global features from input data, making them particularly useful for tasks involving hierarchical or multiscale information. The embedding resulting from the CNN blocks is transformed into a 1D vector through a flattening process. Subsequently, this vector is fed into a sequence of two fully connected blocks, which together facilitate the prediction of binary ChIP-seq peaks. The initial block encompasses a Linear layer, accompanied by Batch Normalization, a ReLU activation function, and Dropout operations. This composite arrangement serves to project the embedded features into a 200-dimensional space, enabling enhanced discriminative capabilities. The subsequent block, which follows, comprises a Linear layer designed to produce output vectors of dimensionality 139. These vectors undergo a final transformation using the Sigmoid activation function. This transformation generates probabilities indicative of peak bindings for the 139 distinct categories. The utilization of the Sigmoid function allows for the effective modeling of these binding probabilities, encapsulating the predictive capacity of the network regarding the presence of ChIP-seq peaks across the target genomic regions. In training the model, we employed the Binary Cross-Entropy as the objective function to gauge the loss. The derivatives of this objective function with respect to the model parameters were computed through a standard backpropagation algorithm. To optimize the objective function and facilitate efficient convergence, we harnessed the power of the Adam optimizer. For the input preparation in the representation learning stage of 3DVariantVision, we divided the entire human genome into non-overlapping 2kb regions and computed labels for all 139 epigenomic peaks. Training labels were derived from uniformly processed data releases from ENCODE124, encompassing 127 TF ChIP-seq, 11 Histone ChIP-seq, and 1 DNase-seq datasets specifically from the GM12878 cell type. Each 30 epigenomic peak was labeled as 1 if the corresponding 2kb bin overlapped with the peak region and 0 otherwise. Our focus was on the set of 2kb bins that contained at least one peak, resulting in a total of 382,742 sequences. To facilitate model training, we partitioned the data into training, validation, and testing sets, following an 8:1:1 ratio. In summary, each sample within the dataset comprises a 2kb sequence extracted from the human hg19 reference genome, paired with a corresponding label vector encompassing information for all 139 epigenomic peaks. This meticulously curated dataset forms the foundation for training and evaluating the performance of 3DVariantVision in decoding genetic variations within the context of 3D chromatin architecture. Chromatin interaction prediction based on communicative learning Through its training for epigenomic peak prediction, 3DVariantVision acquires the capability to generate embeddings that effectively encapsulate the sequence information of any given DNA sequence. These embeddings are derived from the CNN blocks and can be directly leveraged for the prediction of chromatin interactions between two DNA fragments. It's noteworthy that existing methods for chromatin interaction prediction commonly adopt a simplistic approach of concatenating the features or embeddings of two DNA fragments, thereby neglecting to model the joint features that emerge from their interaction. To address this limitation, 3DVariantVision introduces a novel approach known as communicative learning. This approach is designed to foster a deeper understanding of the relationship between enhancer-promoter pairs, leveraging the power of cross-attention mechanisms. In doing so, 3DVariantVision goes beyond the conventional concatenation method by actively learning and capturing the intricate interplay and dependencies between enhancer-promoter pairs, thereby enhancing the accuracy and interpretability of chromatin interaction predictions. Cross-attention mechanisms, also known as inter-attention, are a crucial component in transformer-based models74, differentiating themselves from self-attention mechanisms. While self-attention focuses on capturing relationships within a single sequence, cross-attention extends its scope to establish connections between elements in different sequences or entities. This means cross-attention allows the model to weigh and attend to information not just within one sequence but across multiple sequences, 31 enabling a more comprehensive understanding of relationships and dependencies in broader contexts, making it particularly valuable in tasks involving multiple sources of information or different modalities, such as translation, summarization, or multi-modal understanding. Derived from the CNN, the one-hot encoding representations of enhancers 𝑆𝑒 and promoters 𝑆𝑝 undergo a transformation into richer embeddings, denoted as 𝐸𝑒 and 𝐸𝑝, respectively. These embeddings are structured as 𝐸 ∈ ℝ𝐾×𝐿 , where 𝐾 signifies the embedding dimension, which corresponds to the number of CNN channels, while 𝐿 stands for the sequence length of the enhancer or promoter after pooling processes. In this arrangement, each column within the matrix symbolizes the embedding of a specific minuscule region. Positional encoding is strategically employed to incorporate positional information into the embeddings of both enhancers and promoters. In the context of implementing cross-attention, we have three fundamental components at play: query (𝑋𝑞), key (𝑋𝑘), and value (𝑋𝑣). The query represents the information we are actively seeking, the key provides the context or reference, and the value encapsulates the content we are searching through. The query and the key are multiplied together to produce the attention scores, which are then used to compute the weighted sum of the values. In our communicative learning framework, the promoter embedding takes on the role of the query 𝑋𝑄 = 𝐸𝑝, while the enhancer embedding assumes the roles of both key and value 𝑋𝐾 = 𝑋𝑉 = 𝐸𝑒. Each of these components possesses its unique set of weights for linear transformations within distinct representation subspaces. Queries 𝑄 = 𝑊𝑄𝑋𝑄, in essence, represent the promoter-specific information at each minuscule region, whereas keys 𝐾 = 𝑊𝐾𝑋𝐾 encapsulate enhancer-specific details, enabling each minuscule region to focus on relevant information during the attention process. The scaled dot-product of these queries and keys gives rise to the attention matrix 𝐴, where 𝑎𝑖𝑗 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥( 𝑇 𝑞𝑖𝑘𝑗 √𝐾 ). Here 𝑞𝑖 and 𝑘𝑗 represent the 𝑖th and 𝑗th minuscule region of query and key, respectively. Values 𝑉 = 𝑊𝑉𝑋𝑉 , on the other hand, encapsulate the enhancer-derived information. Each individual attention head computes its output as a weighted summation across all input positions by 𝐸𝑝 𝑟𝑒𝑤𝑒𝑖𝑔ℎ𝑡𝑒𝑑 = 𝐴𝑉. This enables each promoter minuscule 32 region to glean insights from the entirety of its corresponding enhancer minuscule region. Crucially, the multiple attention heads operate with independent parameters. The outputs from these multiple heads are concatenated to form the final layer output, which is subsequently subjected to a linear layer for aggregation and combination. This comprehensive approach effectively integrates positional and contextual information, enhancing the model's ability to capture intricate relationships between enhancers and promoters in chromatin interactions. In our cross-attention mechanism, we employed four heads, each with a value size of 200 and a key/query size of 200. To effectively capture the intricate, non-linear relationships between enhancers and promoters, we integrated three layers of cross-attention modules. In this architecture, the input to the current module consists of the embeddings of enhancers 𝐸𝑒 and the reweighted promoters 𝑟𝑒𝑤𝑒𝑖𝑔ℎ𝑡𝑒𝑑 obtained from the preceding module. 𝐸𝑝 Following the establishment of the intricate relationship between enhancers and promoters, the cross-attention mechanism facilitates the creation of a reweighted embedding of promoters, influenced by their paired enhancers. The attention matrix 𝐴 generated in the final layer of the cross-attention mechanism is referred to as the "communication map." This communication map serves as a representation of the detailed relationships between enhancer-promoter pairs and is then transformed into a 1D vector through a flattening process. Subsequently, this vector is channeled through two fully connected blocks. These blocks serve to process the information and generate binary predictions pertaining to chromatin interactions. In essence, this series of steps consolidates the knowledge acquired from enhancer-promoter associations and leverages it to make informed predictions about the presence or absence of chromatin interactions. To construct the enhancer-promoter dataset, we sourced chromatin interactions from Capture-C data within the GM12878 cell type108. These interactions were generated by overlapping fragment pairs with active enhancer and promoter regions associated with genes, thus forming the positive enhancer-promoter interaction dataset. To maintain a balanced dataset, we created a negative set by randomly pairing enhancers and promoters while preserving the same distance distribution observed in the positive set. For partitioning the data into training, validation, and testing sets, a careful approach was 33 undertaken. We divided the data by chromosomes, ensuring strict non-overlapping partitions to mitigate overfitting concerns. Chromosome 9 was reserved for validation, whereas chromosomes 10 and 11 were allocated for testing purposes. The remaining chromosomes constituted the training dataset. Similar to the representation learning stage in the first phase, we employed the Adam optimizer to optimize the target function, which in this case was Binary Cross- Entropy. eQTL prediction based on transfer learning The final stage of our model focuses on predicting eQTLs by leveraging information gleaned from both 1D epigenomic signals and 3D chromatin interactions. These insights are captured through the preceding stages of representation learning and communicative learning. To accomplish this, each SNP-gene pair is examined, and the genomic region spanning 2kb centered around the SNP is designated as the "enhancer." Consequently, we establish the corresponding enhancer-promoter pairs derived from these SNP-gene associations. Both reference enhancers and alternative enhancers are constructed based on the underlying genetic variants present in these pairs. These enhancer-promoter pairs, representing both reference and alternative scenarios, are individually fed into 3DVariantVision. After undergoing the cross-attention process, two reweighted promoter embeddings are obtained—one for reference enhancer-promoter pairs and the other for alternative enhancer-promoter pairs. These embeddings collectively encode the features characterizing the SNP-gene pairs. In the final stage of 3DVariantVision, we employ a Random Forest Classifier. This model is used to predict whether the SNP-gene pairs exhibit an eQTL effect or not, drawing on the information encapsulated in the embeddings. This approach allows us to make informed predictions about the regulatory impact of genetic variants on gene expression, contributing valuable insights to the study of eQTLs. 3.2.2 eQTL dataset preparation To compile the SNP-gene pair dataset, we aggregated eQTL data from an extensive ensemble of 31,684 blood samples sourced from 37 eQTLGen Consortium cohorts. Given our primary focus on chromatin-mediated eQTLs, we first filtered out SNP-gene 34 pairs with distances smaller than 5kb and retained only those SNPs situated within enhancer regions. To enhance the statistical robustness of our analysis, we concentrated on SNP-gene pairs with a sample size exceeding 20,000. To distinguish significant SNP-gene pairs from non-significant ones, we employed a Bonferroni correction threshold of P-value < 0.05 for significance, and P-value > 0.1 for non-significance. It's important to note that one gene may be associated with multiple SNPs based on this criteria. Creating the positive dataset of eQTLs from the significant SNP-gene pairs involved two key steps: 1) Addressing linkage disequilibrium (LD) concerns, we retained only the most significant SNP within each SNP island. SNP islands were defined by grouping significant SNPs with adjacent distances of less than 2kb. 2) To balance the genetic influence of each gene, we retained a maximum of five of the most significant SNPs for each gene. This ensures a balanced representation of SNP-gene associations. To establish a balanced negative set for our dataset, we followed the same distance distribution as the positive set. This meticulous approach to dataset construction ensures that our analysis is rooted in rigorous statistical criteria and represents a comprehensive understanding of chromatin-mediated eQTLs. 3.2.3 Performance comparison In the first stage of 3DVariantVision, we conducted a comparative analysis of epigenomic predictions with DeepSEA69, a cutting-edge CNN-based model renowned for its performance in this domain. To ensure an equitable evaluation, both 3DVariantVision and DeepSEA were trained and tested on identical datasets. We evaluated the predictive performance by calculating the Area Under the Receiver Operating Characteristic Curve (AUROC) for each epigenomic track in the testing set, separately for each model. This rigorous approach allowed us to objectively assess and compare the effectiveness of both models in predicting epigenomic features. In the second stage of 3DVariantVision, we conducted a comprehensive comparison of enhancer-promoter interaction predictions with previous state-of-the-art models. These prior models included tree-based models utilizing engineered features, as well as deep learning-based models that based on DNA sequences. However, it's essential to note that many of these models merely concatenated enhancer and promoter 35 information or features without explicitly modeling their joint relationships. One notable model in this context is ProTECT162, a tree-based model that excels at capturing joint features between enhancers and promoters, specifically through Protein-Protein Interaction (PPI) bindings. Both ProTECT and 3DVariantVision possess the ability to model these joint features, prompting a direct comparison. To ensure fairness in our evaluation, ProTECT was trained and tested on the same datasets as 3DVariantVision, focusing on subsets with PPI occurrences. It's worth mentioning that ProTECT utilizes only PPIs as features, aligning with the approach adopted by 3DVariantVision, which exclusively uses joint features for predictions. Unlike the original ProTECT, 3DVariantVision does not leverage additional information such as genomic distance, activity levels, or epigenomic correlations across different cell types. To gauge prediction performance, we employed ROC curves and quantified performance metrics on the testing set. Additionally, we assessed the model's capacity to learn joint features by utilizing Capture-C supported enhancer-promoter interactions. To create background data, we randomly shuffled the enhancer-promoter pairs while preserving the same enhancer and promoter information but altering the joint features. We then calculated the odds ratio for both ProTECT and 3DVariantVision based on different percentages of enhancer- promoter pairs predicted as interactions by each model. A higher odds ratio signifies more accurate predictions, highlighting the effectiveness in learning joint features. In the third and final stage of eQTL prediction, we conducted a comprehensive evaluation of our model's performance by comparing it with the state-of-the-art model known as Enformer75. Enformer, trained on DNA sequences, is adept at predicting multi- chromosome gene expression by integrating long-range interactions, thereby demonstrating competence in eQTL prediction. To facilitate this comparison, we retrieved the predicted features generated by Enformer and subsequently trained a Random Forest model on the same datasets used for 3DVariantVision. In assessing the performance, we leveraged ROC curves and compared the performance on the same testing set. Furthermore, to further substantiate our findings, we amassed three orthogonal fine-mapped eQTL datasets as gold-standards161,163,164. We then meticulously constructed balanced negative sets while controlling for genomic distances separately. 36 Once again, we employed ROC curves to evaluate model performance and conducted a comparative analysis to ascertain the effectiveness of our approach in comparison to existing state-of-the-art methods. We compared 3DVariantVision predicted eQTLs with the traditional statistical eQTL calling method (GTEx)160. We assembled a gold-standard eQTL dataset by aggregating eQTLs from Muther et al163, Battle et al161, and Geuvadis et al164, resulting in a dataset comprising 11,210 entries. To facilitate a rigorous comparison, we generated a background dataset ten times larger by randomly selecting SNP-gene pairs while controlling for distance. Both 3DVariantVision and GTEx were tasked with predicting the same number (11,210) of positive eQTLs. Our analysis focuses on three distinct subsets: 1) 'Recalled_by_3DVariantVision': This subset comprises eQTLs that were successfully discovered by 3DVariantVision but missed by GTEx, representing the intersection between GTEx's False Negatives and 3DVariantVision's True Positives. 2) 'TP_by_GTEx': In this subset, GTEx effectively identified eQTLs, representing True Positives according to GTEx's predictions. 3) 'All_eQTL': This subset encompasses the entirety of gold- standard eQTLs used as the benchmark. We conducted a comparative analysis involving genomic distance and Minor Allele Frequency (MAF) among these three groups. 3.2.4 Comprehensive genetic variant insights 3DVariantVision offers a multifaceted and comprehensive perspective on genetic variants, encompassing a wide array of insights. These include understanding the impacts of genetic variants on TF binding, elucidating chromatin effects, predicting eQTLs, uncovering novel TF combinations, identifying target genes, and explaining associations with various diseases. Genetic impacts on TF binding In the case of a SNP, we gather the 2kb DNA fragment centered around the SNP. From this fragment, we construct both a reference sequence and an alternative sequence, reflecting the genetic variants present. Within 3DVariantVision, predictions for epigenomic peaks are made separately for these two sequences. The disparities between these predictions are then quantified as "TF disrupting values," signifying the influence of 37 the genetic variant on different TF binding patterns. These values serve as indicators of how the variant affects the binding of various TFs. Chromatin effects In the context of SNP-gene pairs, 3DVariantVision enables the quantification of chromatin interaction effects by assessing the distinctions between the reference genome and the alternative genome. To elaborate, consider a specific SNP-gene pair: the genomic region spanning 2kb centered around the SNP is designated as the "enhancer." From this, we establish enhancer-promoter pairs that correspond to these SNP-gene associations. For both reference and alternative scenarios, we construct enhancer-promoter pairs based on the underlying genetic variants inherent in these associations. Subsequently, we individually input these enhancer-promoter pairs into the second stage of 3DVariantVision to obtain separate predictions for chromatin interaction scores. To quantify the chromatin effects, we calculate chromatin changing ratios, which reflect the extent of change and are computed as the ratio between the difference in alternative chromatin interaction and reference chromatin interaction and the reference chromatin interaction itself. Furthermore, the sign of the chromatin changing ratio indicates whether the SNP enhances or diminishes the chromatin interaction. These ratios provide a meaningful representation of the impact of genetic variants on chromatin interactions. Functional TF combinations for chromatin 3DVariantVision employs cross-attention mechanisms to effectively model the intricate relationships among minuscule regions situated between enhancers and promoters. This enables the identification and prioritization of vital pairs of these minuscule regions, thereby unveiling crucial functional TF combinations that play a role in chromatin interactions. For each enhancer-promoter pair, we select the top-priority minuscule region pair based on the cross-attention weights. Subsequently, we cross-reference these selected pairs with TF ChIP-seq datasets to determine how many TF combinations are bound to these prioritized minuscule region pairs. To establish a rigorous baseline, we generate background data by shuffling the top-priority minuscule region pairs 1,000 times. We 38 calculate Z-scores for each TF combination relative to this background. TF combinations are then prioritized based on a p-value threshold of 0.05. To ascertain the biological significance of these prioritized TF combinations, we evaluate Protein-Protein Interaction (PPI)139 enrichments. Specifically, we examine the number of TF combinations supported by PPI data, comparing the counts between the prioritized TF combinations identified by 3DVariantVision and a set of randomly shuffled pairs. This analysis serves to validate the relevance of the identified TF combinations in the context of chromatin interactions. Downstream gene discovery 3DVariantVision exhibits remarkable proficiency in eQTL prediction, a capability that can be harnessed to unveil prospective downstream target genes linked to genetic variants. The eQTL prediction scores, generated by 3DVariantVision, can be computed for any candidate SNP-gene pairs. By identifying the gene within each pair with the highest prediction score, we can effectively prioritize target genes. This targeted gene prioritization approach goes beyond conventional Genome- Wide Association Studies (GWAS)3 and Transcriptome-Wide Association Studies (TWAS)157, offering a pathway to uncover novel mechanistic insights into variants associated with specific traits. It provides a powerful means of understanding the functional implications of genetic variants and their influence on gene regulation, shedding new light on the genetic basis of complex traits and diseases. 3.2.5 Genetic Variant and target gene prioritization Prioritize the SNP within CRISPRi-perturbed enhancer We devised a strategy to prioritize the causal SNP among CRISPR-QTLs13,165, specifically those originating from CRISPRi-perturbed enhancers. For each CRISPR-QTL, we collect all SNPs with a MAF greater than 0.05 located within perturbed enhancers. These SNPs are then paired with their corresponding target genes. Next, we calculate eQTL prediction scores based on models like 3DVariantVision. The SNP-gene pair with the highest score is prioritized, and its associated score is deemed representative of the CRISPR-QTL. 39 We utilize CRISPR-QTL data to corroborate the eQTL predictions made by 3DVariantVision. To establish a baseline, we generate a background dataset for CRISPR- QTLs by assembling random enhancer-gene pairs with matching enhancer lengths and genomic distance distributions. Employing the aforementioned strategy, we calculate predicted scores for each pair independently using both 3DVariantVision and Enformer models. To visually compare the score disparities between CRISPR-QTLs and the background, we construct boxplots for each model, thus facilitating a thorough analysis. Additionally, a similar analysis is performed based on the ABC model for comprehensive insights. Prioritize target gene from neighbors based on fine-mapped eQTL We compiled a dataset of four fine-mapped eQTLs specific to blood cell types, utilizing four distinct fine-mapping methodologies: SuSIE166, DAP-G167, CAVIAR168, and CaVEMaN169. For each eQTL, we meticulously generated corresponding control datasets by matching the same SNP to the nearest gene, ensuring that these genes had a False Discovery Rate (FDR) greater than 0.1. Moreover, we imposed an additional criterion for these control genes, requiring that the distance between them and the actual target genes exceeded 1kb. This step was taken to eliminate any potential overlaps in sequence features. Subsequently, we computed prediction scores using 3DVariantVision, and to assess the performance of our model, we employed ROC curves. These ROC curves provided a robust means of quantifying model performance for each fine-mapping dataset individually, thereby allowing for a comprehensive evaluation of 3DVariantVision's predictive capabilities in the context of fine-mapped eQTLs. 3.2.6 Local effect predictions of genetic variant Variant effect predictions on Saturation Mutagenesis data In 3DVariantVision, we employ TF disturbing scores as a metric to quantitatively assess the genetic impact on epigenomic signals. This metric serves as a valuable resource for identifying high-effect genetic variants, a task facilitated by integrating insights from MPRA (Massively Parallel Reporter Assay) experiments170. The MPRA experiments offer detailed insights into genetic variation effects within ten enhancer and ten promoter regions, providing data at a single-base pair resolution. Building on this data, we crafted 40 a classification task aimed at identifying high-effect variants amid a broader background. To define positive variants, we employed the following criteria: 1) a p-value threshold of less than 1e-5, and 2) selection of the top N-ranked variants based on their effect sizes. Conversely, for negative variants, we applied these criteria: 1) a p-value greater than 0.05, and 2) a log2(effect size) threshold below 0.05. To ensure the creation of a balanced dataset, we sampled the data accordingly. Next, we employed a RandomForest model and executed a rigorous 5-fold cross- validation procedure using the TF disturbing scores derived from 3DVariantVision for each variant. Our model underwent training and evaluation separately for the top 200, 500, and 1000 effects on both promoter and enhancer regions. Evaluation of model performance was carried out using ROC curves, providing a robust assessment of the model's ability to discriminate high-effect genetic variants from the background. Decipher TFs’ role as activators and repressors in chromatin 3DVariantVision offers insights into chromatin effects and their directions by computing both the ratio and sign of chromatin changes. These metrics provide a valuable foundation for investigating the intricate relationship between chromatin effects and expression effects. To facilitate this exploration, we partitioned the consolidated eQTL dataset into groups defined by distinct TF ChIP-seq peak bounds on the corresponding enhancers. Within each group, we computed the Spearman correlation between the absolute values of the chromatin changing ratio and the absolute values of eQTL expression effects. For a deeper understanding of a TF's role as an activator or repressor in chromatin, we devised a sorting mechanism based on the absolute values of expression effects. We introduced a metric to gauge the consistency of effect direction between chromatin and expression, measured as the fraction of concordant signs among the top N eQTLs with the most pronounced expression effects: 𝐶𝑜𝑛𝑠𝑖𝑠𝑡𝑒𝑛𝑐𝑦(𝑛) = 1 𝑛 𝑛 ∑ 𝐼(𝑐𝑖 = 𝑒𝑖) 𝑖=1 where 𝑐𝑖, 𝑒𝑖 represent the sign of the chromatin changing ratio and expression effects, respectively. Subsequently, we constructed a plot illustrating the evolving 41 consistency as N varies. A pivotal point of reference is the consistency at N=100, which is instrumental in determining whether the TF functions as an activator or repressor. This determination hinges on whether the consistency surpasses 0.5, signifying activation, or falls below 0.5, indicating repression. 3.2.7 Disease association evaluation 3DVariantVision delivers a comprehensive understanding of genetic variants and their implications for disease associations. In our analysis of eQTL predictions using 3DVariantVision, we bifurcated the dataset based on whether the SNPs were substantiated by GWAS. To elucidate the distinctions between these groups, we compared the eQTL prediction scores through the use of boxplots. Furthermore, we curated TWAS data171 related to lipid traits, identifying 28 statistically significant SNP-gene associations in whole blood cell types with p-values less than 0.05. For each SNP-gene pair, we gathered all coding genes within a 500kb radius from the TSS of the TWAS-targeted gene, constituting the background. Subsequently, we leveraged 3DVariantVision to prioritize genes from this background. To assess performance, we computed the rank of the TWAS target gene within the background based on the 3DVariantVision prediction score. We created a control scenario by randomly selecting one gene from the background and determining the corresponding ranks, thus allowing for robust quantification of 3DVariantVision's prioritization capabilities in the context of TWAS data. 3.3 RESULTS 3.3.1 Comprehensive decoding of genetic variation through 3D chromatin with 3DVariantVision Recent advances in genome-wide chromatin conformation capture methods have unveiled the crucial role of long-range chromatin interactions in bringing distant cis- regulatory elements and promoters into close proximity, thereby exerting influence over gene expression160. When a genetic variant resides within an enhancer and disrupts TF bindings within that enhancer, it often emerges as the causal eQTL for the target gene associated with that enhancer (Figure 3.1 A). This assumption is substantiated by the 42 significant enrichment of chromatin interactions observed in fine-mapped eQTL datasets. Specifically, the three eQTL datasets, CaVEMaN169, DAP-G167, and CAVIAR168, exhibit markedly higher proportions of chromatin interaction support in comparison to the genomic distance-controlled background, showing fold changes of 2.75, 2.28, and 1.73, respectively (Figure 3.1 B). As an illustrative case on chromosome 13, rs9517725 emerges as an eQTL for the gene ENSG00000125304 in whole blood tissues. Remarkably, this SNP is situated within an enhancer that interacts with the promoter of the same target gene, as supported by Capture-C data108. Both the SNP region and the promoter region are bound by critical TFs, which may play a mediating role in gene expression (Figure 3.1 C). With the aim of harnessing 3D chromatin information for comprehensive long-range prediction of genetic variant effects, we introduce 3DVariantVision, offering an in-depth understanding of genetic variants' impact. Figure 3.1. Genetic variants impact gene expression through chromatin regulation. (A) Chromatin interaction is a key mechanism for genetic variants to regulate gene expression. (B) Fine-mapped eQTLs (CaVEMaN, DAP-G, CAVIAR) are enriched in Capture-C supported chromatin interactions comparing the genomic distance controlled background. (C) An example of chromatin interaction supported eQTL in the blood cell line. 3DVariantVision is a multimodal framework designed for the comprehensive assessment of genetic variant effects on both local and distant regions within the cis- regulatory domain of target genes (Figure 3.2 A). This encompassing understanding of 43 genetic variants necessitates the integration of three distinct types of multi-omics datasets (Figure 3.2 A Left): 1) Genome-wide cell-type-specific genomic and epigenomic peaks, comprising 127 TF ChIP-seq, 1 DNase-seq, and 11 Histone ChIP-seq datasets in GM128789. 2) Chromatin interactions, derived from enhancer-promoter interactions sourced from Capture-C data in GM12878108. 3) An expansive eQTL dataset in whole blood tissue, meticulously curated through meta-analysis involving 31,684 blood samples across 37 eQTLGen Consortium cohorts159, thus maximizing statistical power and enabling robust model training. The 3DVariantVision framework serves to quantify variant impacts through multiple lenses by (Figure 3.2 A Right): 1) Predicting genetic influences on TF binding sites within the local vicinity. 2) Revelating chromatin effects on long-range chromatin interactions in cis, along with insights into their connection with expression effects. 3) Forecasting whether a given SNP-gene pair qualifies as an eQTL or not. 4) Unraveling the roles played by TF combinations within the context of 3D chromatin. 5) Identifying potential distal target genes affected by genetic variants. 6) Shedding light on associations between genetic variants and diseases. The architecture of 3DVariantVision unfolds through three distinct stages, each contributing to a comprehensive understanding of genetic variants (Figure 3.2 A Middle, Figure B. 1): 1) Genomic Track Peak Prediction based on Representation Learning (Figure B. 1 Left): In the initial stage, we embark on the task of pre-training an embedding model. This model's purpose is to craft informative representations for any given DNA sequence. Here, 1D Convolutional Neural Networks (CNNs) come into play, extracting local features from DNA sequences and predicting genomic and epigenomic profiles within the GM12878 cell line. The input consists of one-hot encoded 2kb DNA sequences, while the output manifests as a binary vector signifying peak occurrences, including ChIP-seq, DNase-seq, and Histone ChIP- seq. Through rigorous training for peak predictions in this first step, the CNN module is subsequently frozen, empowering 3DVariantVision to adeptly generate embeddings that effectively encapsulate sequence information from any given DNA sequence. 44 2) Chromatin Interaction Prediction based on Communicative Learning (Figure B. 1 Middle): The second stage delves into modeling 3D chromatin information by predicting enhancer-promoter interactions. When presented with enhancer- promoter pairs, 3DVariantVision leverages the embeddings of enhancers and promoters, which are generated based on the now-frozen CNNs trained in the initial step. Unlike many existing approaches that simply concatenate features of enhancers and promoters, 3DVariantVision introduces communicative learning. This innovative approach capitalizes on cross-attention mechanisms, inspired by self-attention in Transformer-based models. Cross-attention establishes intricate connections between elements across different DNA fragments, enabling the model to weigh and prioritize information spanning enhancers and promoters. This facilitates a more profound comprehension of relationships and dependencies among enhancer-promoter pairs. After traversing through three layers of cross- attention, a communication map emerges, depicting the intricate relationships between enhancers and promoters. This map is subsequently deployed to predict the existence or absence of interactions. 3) eQTL Prediction based on Fine-Tuning (Figure B. 1 Right): In the final stage, the focus shifts to eQTL prediction based on large-scale eQTL datasets. For every SNP-gene pair under consideration, two corresponding enhancer-promoter pairs are constructed based on the reference genome and alternative genome separately, contingent on whether the genetic variant occurs or not. Utilizing the cross-attention block pre-trained in the second stage, two communication maps are generated. These maps serve as the foundation for predicting whether the SNP-gene pair qualifies as an eQTL. In summation, the 3DVariantVision architecture unfolds across three stages, amalgamating both local features and 3D chromatin information. This holistic approach enables accurate eQTL predictions and a profound comprehension of genetic variants. Taking the representative long-range eQTL instance of rs12413588, 3DVariantVision predicts it as an eQTL for the gene ENSG00000170525, validated by the fine-mapped eQTL dataset from Battle et al161 (p-value=9.54e-14, distance=258kb) (Figure 3.2 B). Furthermore, this predicted eQTL is substantiated by experimentally 45 verified chromatin interactions identified through Capture-C. Intriguingly, 3DVariantVision reveals that this SNP significantly disrupts the binding sites of crucial TFs, including SMC3, RAD21, CTCF, RUNX3, and PAX5. This disruption leads to a noteworthy 0.02% reduction in chromatin interaction with the target gene, offering insightful explanations for the underlying eQTL mechanisms. In addition, rs79259450 emerges as an eQTL prediction for the gene ENSG00000185112 (Figure B. 2 A). This prediction aligns with validation from the fine-mapped eQTL dataset provided by Geuvadis et al164 (p- value=3.70e-11, distance=460kb) and corroborated by Capture-C chromatin interactions. Notably, this genetic variant is anticipated to disrupt the binding sites of TBP, NBN, SPI1, among others, resulting in a 0.06% decrease in chromatin interactions. Similarly, rs813000 surfaces as an eQTL prediction for the gene ENSG00000257923, with validation stemming from Battle et al.'s161 fine-mapped eQTL dataset (p-value=5.03e-37, distance=458kb) and Capture-C chromatin interactions (Figure B. 2 B). This SNP is predicted to disturb the binding sites of EBF1, RUNX3, and others, resulting in a substantial 0.17% reduction in chromatin interactions. Collectively, these examples underscore 3DVariantVision's remarkable capacity to predict eQTLs and elucidate the intricate effects of genetic variants spanning both local and distal genomic regions. 46 Figure 3.2. Overview of the 3DVariantVision framework. (A) 3DVariantVision mainly utilized 3 multi-omic datasets in training, including epigenomic peaks (TF binding sites, open chromatin, histone modifications), chromatin interaction pairs, and ensembled eQTLs (left). Representation Learning was used to generate embedding for DNA sequences by predicting epigenomic peaks (middle). (B) One example of predicted eQTLs based on 3DVariantVision in the whole blood cell line. rs12413588 is validated by Battle et al. eQTL dataset with p-value = 9.54e-14 and also supported by Capture-C chromatin interactions. The right panel shows the predicted effects of chromatin interaction effects and represents TF bindings of the corresponding SNPs. 47 3.3.2 Superior performance on 1D genome profile and long-range chromatin prediction 3DVariantVision significantly outperforms the previous state-of-the-art model, DeepSEA69, in the prediction of 1D genome profiles, underscoring its proficiency in generating DNA sequence embeddings during the initial phase. To ensure a rigorous comparison, both 3DVariantVision and DeepSEA were meticulously trained and evaluated using the same balanced dataset. 3DVariantVision consistently exhibits higher AUROC and AUPR scores in comparison to DeepSEA across most predictions for TF ChIP-seq peaks and Histone ChIP-seq peaks (Figure 3.3 A-B, Figure B. 3). This heightened prediction accuracy is further affirmed through qualitative assessments when visualizing the predicted peaks alongside observed ones within the genome. For instance, consider a representative case on 1Mb region in chromosome 17 (Figure 3.3 C). The peaks predicted by 3DVariantVision not only align with peaks identified by traditional peak calling methods at significant TF binding sites and histone modifications but also unearth previously undiscovered peaks. These novel peaks, although overlooked by traditional peak calling methods, find support in raw experimental signals, underscoring the enhanced predictive capabilities of 3DVariantVision. Beyond its proficiency in predicting local genome profiles, 3DVariantVision excels in accurately forecasting enhancer-promoter interactions, shedding light on the intricate distal 3D chromatin mechanisms in stage 2. We gathered enhancer-promoter interactions from Capture-C data specific to the GM12878 cell type and created a balanced negative dataset while controlling for genomic distance. To mitigate the risk of overfitting, 3DVariantVision underwent training and testing on distinct chromosomes. Anchored by the cross-attention module, a backbone of communicative learning within 3DVariantVision, this module meticulously models the nuanced relationships between minuscule regions within enhancers and promoters. This approach equips the model with the capability to glean joint features shared between these two genomic fragments. To evaluate this aspect, 3DVariantVision's performance was benchmarked against ProTECT162, another model harnessing protein-protein interaction features to encode joint features within genomic fragments. Both ProTECT and 3DVariantVision exhibit the ability to model these joint features, warranting a head-to-head comparison. In the interest 48 of fairness and accuracy, ProTECT was subjected to the same dataset for training and testing as 3DVariantVision, with a specific focus on subsets featuring protein-protein interactions (PPIs). The superior AUROC scores achieved by 3DVariantVision in chromatin interaction prediction, compared to ProTECT, underscore its enhanced performance (Figure 3.3 D). To further assess 3DVariantVision's prowess in learning joint features within the communicative learning framework, we turned to enhancer-promoter interactions supported by Capture-C. To ensure consistency in enhancer and promoter information while modifying joint features, we generated background pairs by shuffling the corresponding enhancer-promoter pairs. We computed predictions using both ProTECT and 3DVariantVision, followed by the calculation of odds ratios at various thresholds. 3DVariantVision consistently exhibited higher odds ratios, affirming its superior ability to capture joint information between enhancers and promoters (Figure 3.3 E, Figure B. 4). This robust performance ensures the model's capacity to decipher intricate chromatin mechanisms within minuscule regions, such as those proximal to SNPs, and their influence on target genes. In a compelling example from the testing set, 3DVariantVision adeptly predicted a long-range enhancer-promoter interaction within chromosome 4, a prediction that was subsequently validated by Capture-C experiments. This interaction was found to be associated with crucial TF binding sites, including HCFC1, SIX5, EED, YY1, and RAD21 (Figure 3.3 F). Impressively, 3DVariantVision went beyond validation and unearthed novel interactions that had eluded detection by Capture-C. For instance, in chromosome 8, our model effectively uncovered an enhancer-promoter interaction corroborated by Hi- C, an independent experimental dataset within the same cell type. This interaction featured the binding of BATF, SPI1, RUNX3, YY1, TAF1, and ELF1(Figure 3.3 G). Additionally, 3DVariantVision successfully unveiled enhancer-promoter interactions, supported by Hi-C data, in chromosome 5 and chromosome 21, all of which had been previously missed by Capture-C. These newly discovered interactions were also found to be associated with crucial TF binding sites (Figure B. 5). These remarkable findings underscore 3DVariantVision's capacity to not only validate existing chromatin interactions 49 but also to uncover previously unrecognized interactions, thereby enhancing our understanding of the complex 3D chromatin architecture. 50 Figure 3.3. 3DVariantVision achieved superior performance in TF binding sites (A) and histone modification (B) predictions compared with the previous state-of-the-art model DeepSEA. Each point represents the AUROC of 3DVariantVision and DeepSEA in one 51 Figure 3.3 (cont’d) epigenomic peak prediction. (C) Examples of epigenomic peak predictions in chr17:45,314,322-46,234,870. 3DVariantVision successfully predicted the annotated peaks and discovered the novel peaks supported by the signals. (D) ROC comparison of chromatin interaction prediction between 3DVariantVision and ProTECT. (E) Odds ratio comparison at top 10% predictions when shuffling the pairs of enhancer-promoter interactions. (F-G) Examples of predicted chromatin interaction by 3DVariantVision. (F) is validated by Capture-C, while (G) is a newly discovered interaction supported by Hi-C but missed by Capture-C. 3.3.3 3DVariantVision prioritizes casual SNP on eQTL prediction In the final phase of our model, we shift our focus towards eQTL prediction, harnessing insights accumulated from the earlier stages of representation learning and communicative learning. These critical insights are the result of a meticulous process that incorporates 1D epigenomic signals and 3D chromatin interactions (Figure 3.4 A). For each SNP-gene pair, two corresponding enhancer-promoter pairs based on the reference genome and alternative genome are generated. Concurrently, communicative maps, hinging on these two enhancer-promoter pairs, are meticulously calculated. These maps, crafted through communicative learning trained during stage 2, encapsulate the distinctive features of each SNP-gene pair, serving as a foundation for our eQTL prediction framework. The construction of balanced training, validation, and testing sets is achieved through the large-scale eQTL dataset sourced from blood tissues. To quantify our eQTL prediction performance, we rely on the AUROC metric in the testing dataset. In a rigorous comparative analysis, we pitted the performance of 3DVariantVision against Enformer75, a cutting-edge model adept at gene expression prediction, leveraging long- range interactions through self-attention mechanisms. Notably, 3DVariantVision achieved a commendable AUROC compared to Enformer, thereby underscoring its superiority in eQTL prediction (Figure 3.4 A). Furthermore, we conducted a comprehensive performance evaluation by comparing 3DVariantVision and Enformer using four independent fine-mapped eQTL datasets as gold standards. This orthogonal assessment aimed to provide additional insights into the predictive capabilities of both models across various scenarios. In each of these datasets – SuSIE166, DAP-G167, CAVIAR168, and CaVEMaN169 - 3DVariantVision consistently outperformed Enformer in terms of AUROC 52 (Figure 3.4 B). These consistently higher AUROC values for 3DVariantVision across all datasets underscore its robust and superior performance in eQTL prediction, regardless of the dataset characteristics and complexities. These results further solidify the efficacy of 3DVariantVision as a state-of-the-art tool for eQTL prediction. While chromatin effects and expression effects share a related connection, it's imperative to distinguish between these two distinct molecular phenomena. This distinction underscores the necessity of the fine-tuning procedure in Stage 3 of 3DVariantVision, where we transition from capturing chromatin interaction information to making accurate eQTL predictions. Notably, some existing predictive models are trained solely on reference genomes to forecast chromatin interactions and subsequently derive eQTL effects by contrasting alternative genomes with reference genomes. However, this direct approach may struggle to bridge the gap between chromatin-level information and expression-level outcomes. To substantiate this notion, we embarked on an experiment to assess the feasibility of employing chromatin interaction changes from Stage 2 of 3DVariantVision as prediction scores for eQTLs. Utilizing SuSIE fine-mapped eQTLs as gold standards, we observed an AUROC of 0.5606, which falls short of the AUROC achieved in Stage 3 of 3DVariantVision (0.7605) (Figure B. 6). This observation underscores that understanding the intricate dynamics of genetic variants' effects on gene expression necessitates a dedicated model fine-tuned for eQTL prediction. Such a model can meticulously consider the complex interplay between genetic variations, chromatin structures, and gene expression patterns. This approach allows us to navigate the intricate relationships between these factors, ultimately culminating in heightened accuracy for eQTL predictions. Traditional statistical eQTL calling methods, such as GTEx160, face inherent limitations stemming from their restricted sample sizes and incomplete understanding of underlying mechanisms. These constraints render the identification of long-distance and rare eQTLs challenging. To underscore the advantages of 3DVariantVision in eQTL prediction, we conducted a comparative analysis between 3DVariantVision and GTEx. Our comparision focuses on three distinct subsets: 1) 'Recalled_by_3DVariantVision' representing eQTLs successfully discovered by 3DVariantVision but missed by GTEx. 2) 'TP_by_GTEx' representing true positives according to GTEx's predictions. 3) 'All_eQTL 53 representing the entirety of gold-standard eQTLs used as the benchmark. Our findings reveal several compelling insights. Firstly, 3DVariantVision excels at recalling long-range eQTLs that eluded GTEx (Figure 3.4 C). Secondly, 3DVariantVision demonstrates a knack for pinpointing rare eQTLs missed by GTEx (Figure 3.4 D). These observations underscore the superior capacity of 3DVariantVision to unveil elusive eQTLs, particularly those located at extended genomic distances and characterized by low allele frequencies, thereby enhancing our understanding of the regulatory landscape underlying gene expression. Illustratively, consider rs17824742, predicted as an eQTL affecting gene ENSG00000100577, an instance successfully recalled by 3DVariantVision but overlooked by GTEx (Figure 3.4 E). Notably, this serves as a paradigmatic long-range eQTL, characterized by a Minor Allele Frequency (MAF) of 0.09 and an extensive genomic separation spanning 233kb. What lends further credence to this prediction is its substantiation through chromatin interactions gleaned from both Hi-C and Capture-C data, shedding light on the potential regulatory mechanisms underpinning this eQTL. Figure 3.4. 3DVariantVision prioritizes casual SNP on eQTL prediction. (A) 3DVariantVision achieved better performance in eQTL prediction at the testing set 54 Figure 3.4 (cont’d) compared with Enformer. (B) Consistently better performance in fine-mapped eQTL predictions based on three different datasets. (C-D) 3DVaraintVision also recalled eQTLs which were missed by GTEx, the statistical method in a smaller population. Recalled eQTLs have longer genomic distances (C) and smaller MAF (D). (E) One example of recalled long-range eQTL supported by chromatin interactions with MAF < 0.1. 3.3.4 3DVariantVision identifies TF effects and TF grammar in chromatin The multifaceted 3DVariantVision framework adeptly encompasses local genomic and epigenomic profiles, long-range chromatin interactions, and eQTL predictions, providing a comprehensive outlook on genetic variants that spans a wide spectrum of insights. To gauge the local impacts, we utilize the TF Disturbing Score, which quantitatively measures the perturbation of genetic signals by computing the difference in TF profile prediction scores between the reference genome and the alternative genome. To validate the efficacy of the TF Disturbing Score, we devise an unsupervised algorithm aimed at genetic variant effect prediction, leveraging data from Massively Parallel Reporter Assays (MPRA)170. MPRA experiments grant us insights into the effects of genetic variations at a single-base pair resolution across ten enhancer and ten promoter regions. Employing the Principal Component 1 (PC1) of TF Disturbing Score predictions derived directly from 3DVariantVision, without further training, yielded commendable results in the prediction of genetic variant effects. To illustrate this effectiveness, consider an enhancer region in IRF4. 3DVariantVision exhibited a notably high Pearson correlation (r=0.295) with MPRA experimental effects (Figure 3.5 A). Impressively, 3DVariantVision adeptly pinpointed two loci with significant scores, which were subsequently validated as disrupting the binding motifs for two crucial TFs, namely, RUNX1 and NR2F1. These observations underscore the capacity of 3DVariantVision to faithfully capture the effects of genetic variants without the need for additional training or refinement, further demonstrating its utility in genetic variant effect prediction. Furthermore, we devise a classification task aimed at identifying high-impact genetic variants. We select the topN-effect variants as the positive set based on MPRA- derived p-values and effect scores. To create a balanced background set for each positive set, we employ random sampling. We experiment with various values of N, specifically 55 200, 500, and 1000, constructing multiple datasets for enhancer and promoter variants separately. We employ a RandomForest classifier, utilizing TF Disturbing Scores generated by 3DVariantVision for each variant. Remarkably, our observations reveal high AUROC scores during testing when identifying top-effect variants in both promoters and enhancers. Specifically, we attain AUROC scores of 0.7199, 0.6743, and 0.6484 when considering the top 200, 500, and 1000 affected promoter variants (Figure 3.5 B), and AUROC scores of 0.8393, 0.7974, and 0.7499 when evaluating the top 200, 500, and 1000 affected enhancer variants (Figure 3.5 C). These results underscore 3DVariantVision's proficiency in discerning high-impact genetic variants with precision and confidence. 56 Figure 3.5. 3DVariantVision identified the high-effect variants based on MPRA. (A) One example showed the accurate prediction of variant effects based on 3DVariantVision using unsupervised learning. The top panel represents the variant effects in IRF4 loci based on MPRA, while the bottom panel represents the predicted effects based on the PC1 of TF disrupting values of 3DVariantVision. The Spearman correlation between the predicted effects and observed effects is 0.295. High variant effects were observed on the TF binding sites (highlighted). (B-C) A Random Forest model was trained based on TF disrupting values from 3DVariantVision to predict high-effect variants from the balanced background. The ROC curves showed that the 3DVariantVision’s ability to identify the top effect variants against the background in enhancer regions (B) and promoter regions (C), respectively. 57 3.3.5 3DVariantVision prioritized the SNP within CRISPRi-perturbed enhancers and the target gene of fine-mapped eQTLs 3DVariantVision's outstanding capability in de novo eQTL prediction opens up new avenues for prioritizing causal SNPs among the vast pool of nearby SNP candidates, a challenge previously hindered by LD in GWAS and traditional QTL calling methods. When multiple SNP candidates in enhancer are linked to the same target gene, 3DVariantVision calculates eQTL prediction scores, ultimately prioritizing the SNP-gene pair with the highest score (Figure 3.6 A). The validity of this approach is demonstrated using CRISPR-QTL datasets165, which are not constrained by LD and offer insights into potential causal genetic variants. CRISPR-QTL mapping is akin to conventional human eQTL studies, with individual humans replaced by individual cells, genetic variants substituted by unique combinations of 'unlinked' guide RNA (gRNA)-programmed perturbations per cell, and tissue-level RNA-seq of many individuals replaced by scRNA- seq of many cells. We applied both Enformer and 3DVariantVision to prioritize causal SNPs among CRISPRi-perturbed enhancers. To establish a baseline, we generated a background dataset for CRISPR-QTLs by assembling random enhancer-gene pairs with matching enhancer lengths and genomic distance distributions. eQTL prediction scores were calculated for the prioritized SNP-gene pairs in both CRISPR-QTLs and the background, using 3DVariantVision and Enformer separately. 3DVariantVision predicted a higher scores of CRISPR-QTL compared to scores of the background (Figure 3.6 B). In contrast, there is no significant difference between CRISPR-QTL and background based on Enformer predictions (Figure 3.6 B). These results highlight 3DVariantVision's significant discrimination in CRISPR-QTL compared to Enformer. Similar comparisons were carried out for enhancer-gene interactions (ABC-EPI) predicted by the Activity-by- Contact (ABC) model13, where 3DVariantVision consistently exhibited greater discrimination than Enformer. 3DVariantVision predicted a higher score of ABC dataset compared to scores of the background (Figure 3.6 C). In contrast, there is no significant difference between ABC dataset and background based on Enformer predictions (Figure 3.6 C). By adeptly modeling the joint features between enhancers and promoters, 3DVariantVision not only excels at predicting genetic variant effects in the local genomic 58 region but also possesses the capability to predict downstream target genes accurately. For any SNP of interest, 3DVariantVision generates eQTL prediction scores for all candidate genes paired with that SNP. The gene with the highest prediction score is prioritized as the target gene (Figure 3.6 D). To validate 3DVariantVision's proficiency in target gene identification, we employed four distinct fine-mapping datasets: SuSIE166, DAP-G167, CAVIAR168, and CaVEMaN169. Balanced control backgrounds were meticulously generated, ensuring that each SNP matched the nearest gene to the actual target genes of eQTLs. The resulting AUROC values for these fine-mapped eQTLs are as follows: 0.7410 for SuSIE, 0.7173 for DAP-G, 0.7200 for CAVIAR, and 0.7679 for CaVEMaN (Figure 3.6 E), underscoring the superior performance of 3DVariantVision in target gene identification. To further enhance stringency, we curated a more selective background by exclusively choosing the nearest coding genes of the actual target genes of eQTLs. Remarkably, 3DVariantVision continues to exhibit commendable accuracy across all four fine-mapped eQTL datasets, with AUROC values of 0.5529, 0.5346, 0.5310, and 0.5870, respectively (Figure B. 9). These results affirm the precision and reliability of 3DVariantVision in target gene identification. For instance, consider the eQTL rs6985508, where 3DVariantVision successfully pinpointed ENSG00000022567 as the target gene, distinguishing it from the nearby ENSG00000204882 (Figure 3.6 F). This eQTL is further supported by chromatin interactions from Capture-C data, and the target gene is bounded by pivotal TFs such as RUNX3, ELF1, HCFC1, NRF1, and MAZ. Additionally, in the case of eQTL rs1545837, 3DVariantVision precisely identified ENSG00000147439 as the target gene, effectively discriminating it from the nearby ENSG00000179388 (Figure 3.6 G). Notably, both the target gene and the background gene are associated with important TFs. However, a crucial distinction emerges - the actual target gene is bound by ELF1, while the background gene is not. Intriguingly, the occurrence of RUNX3 in the SNP region implies a potential mechanism of ELF1 repression, as RUNX3 has been proven to repress ELF1 in the CD8-TC cell line172. These two illustrative examples underscore 3DVariantVision's remarkable proficiency in identifying the target genes of genetic variants, shedding light on the intricate interplay between genetic variations, TFs, and gene expression regulation. 59 Figure 3.6. 3DVariantVision prioritized the SNP within CRISPRi-perturbed enhancers and the target gene of fine-mapped eQTLs. (A) Schematic figure of validation by CRISPR-QTL. The eQTL prediction scores of all the potential SNP-gene pairs located in the cis enhancer-gene pair are calculated. The SNP-gene pair with the highest prediction score is prioritized to represent the score of the corresponding CRISPR-QTL. (B-C) 3DVariantVision showed significantly higher prediction scores on two orthogonal datasets, CRISPR-QTL (B) and ABC (C), compared with genomic distance-controlled SNP-gene pairs. (D-E) 3DVariantVision prioritized the downstream target genes of genetic variants. (F) For the SNPs with interests, 3DVariantVision calculated the eQTL prediction scores with surrounding genes and prioritized the top effected one. (E) 3DVariantVision successfully identified target genes of 4 fine-mapped eQTL datasets from the closest gene to the real target genes with distance > 1kb. (F-G) Two eQTL examples of target gene identifications supported by chromatin interaction (F) and TF profiles (G). 60 3.3.6 3DVariantVision deciphers the relationships of effects between chromatin and expression 3DVariantVision excels in modeling long-range chromatin interactions, providing a means to quantify the chromatin effects of SNP-gene pairs by contrasting the reference genome with the alternative genome. While the preceding eQTL prediction section underscores the related yet distinct nature of chromatin interactions and gene expression associations, this section delves into quantifying their relationship by calculating correlations between chromatin effects and expression effects. GTEx supplies the slopes for all significant SNP-gene associations, serving as proxies for expression effects. Chromatin effects, in turn, are determined by assessing the chromatin changing ratio between the alternative and reference genomes, as derived from 3DVariantVision. The Spearman correlation of absolute values between expression effects and chromatin effects yields 0.1759, revealing a positive relationship wherein elevated chromatin effects correspond to increased expression effects. Notably, this correlation is even higher (0.2126) for a subset of SNP-gene associations characterized by genomic distances exceeding 500kb. To comprehensively explore this connection, we segment the significant SNP-gene associations into 20 evenly-distributed groups based on genomic distance. Strikingly, a statistically significant higher correlation emerges in the associations with longer genomic distances (p-value=0.0022) (Figure 3.7 A), suggesting that chromatin interaction may mediate long-range eQTLs. Continuing our investigation, we delve into the directionality of chromatin effects and expression effects. While the absolute values of these effects exhibit positive correlations, the directions reveal a more intricate, non-correlative relationship. We posit that distinct roles of TFs might underlie this phenomenon. Specifically, eQTLs influenced by activator TFs may exhibit consistent directions between chromatin effects and expression effects, whereas those influenced by repressor TFs may demonstrate opposite directions. To put this hypothesis to the test, we segregate significant SNP-gene associations from GTEx into groups based on different TF bindings at the SNP sites, referred to as TF-mediated groups. We assess the fraction of consistent directions between chromatin effects and expression effects for each TF-mediated group based on the top N effect values derived from expression effects. N varies from 20 to 2000. The 61 direction consistency is plotted against N for each TF-mediated group. At the top-effect eQTLs, some TF-mediated groups indeed exhibit consistent directions, while others manifest opposite directions (Figure B. 8). However, the consistency tends to approach neutrality when the expression effects of eQTLs are less pronounced. To glean insights into the roles of TFs, whether they activate or repress, we employ the consistency at the top 100 expression effects as a representation (Figure 3.7 B). Notably, all TF-mediated groups exhibit positive Spearman correlations between the absolute values of chromatin effects and expression effects. Two well-established examples, EBF1 as an activator173 and TBX21 as a repressor174, exemplify consistent and opposite directions between chromatin effects and expression effects, respectively (Figure 3.7 B). These results underscore 3DVariantVision's capacity to unveil the roles of TFs in both chromatin and expression regulation. Communicative Learning effectively modeled the intricate relationships among minuscule regions situated between enhancers and promoters, allowing for the prioritization of vital functional TF combinations that contribute to chromatin interactions. To assess the significance of these TF combinations, we focused on Capture-C supported enhancer-promoter interactions in heldout chromosomes (chr9, chr10, and chr11). We counted the occurrences of TF combinations within the top-priority minuscule region pairs using TF ChIP-seq peaks, and we created corresponding backgrounds. Subsequently, we calculated z-scores for each TF combination and visualized them as a heatmap (Figure B. 7). Notably, some of these enriched TF combinations were validated by Protein-Protein Interactions (PPIs)139, such as SIN3A-ZBTB40, YY1-JUNB, CHD1- SRF, and STAT3-ARNT, highlighting the role of PPIs in chromatin interactions (Figure 3.7 C). To comprehensively assess the significance of the TF combinations prioritized by 3DVariantVision, we conducted PPI enrichment analysis by comparing them against shuffled backgrounds. This analysis revealed a statistically significant enrichment of PPIs among the prioritized TF combinations, as confirmed by the z-test (p-value = 1.03e-2) (Figure 3.7 D). These findings collectively underscore 3DVariantVision's capacity to discover and prioritize crucial TF pairs essential for chromatin interactions. 62 Figure 3.7. 3DVariantVision deciphers the relationships of effects between chromatin and expression. (A-B) 3DVariantVision deciphered the chromatin effects of eQTLs. (A) The eQTLs were divided into 20 groups based on genomic distance and the Spearman correlation between chromatin effects and expression effect sizes was calculated separately. The eQTLs with longer genomic distances showed a higher correlation between effect sizes and chromatin effects. (B) The eQTLs were grouped by different harboring TFs located near the SNPs. The direction consistency on the top 100 expression effected eQTLs was calculated by the fraction of the consistent direction between expression effects and chromatin effects. The red dots in the scatter plot represent TFs with directions while the blue dots represent TFs with opposite directions. All of them have a positive Spearman correlation between chromatin effect sizes and expression effect sizes. The right panel highlights two TFs with high consistency and high opposite directions. (C-D) 3DVariantVision prioritized the TF combinations of chromatin interactions. For the minuscule regions of the Capture-C enhancer-promoter interactions 63 Figure 3.7 (cont’d) in chr9, chr10, chr11, the harboring TF combinations were counted based on TF ChIP- seq. The Z-score was calculated for each TF combination by comparing the top important minuscule region pairs based on the attention weights of 3DVariantVision and the random minuscule region pairs from the same enhancer-promoter interaction. The enriched TF combination block was shown in (D). (E) PPIs are enriched on the prioritized TF combinations (p-value of Z-score < 0.05), compared with the randomly shuffled TF pairs. 3.3.7 3DVariantVision helps to understand the disease associations Understanding Genome-Wide Association Studies (GWAS)3 and Transcriptome-Wide Association Studies (TWAS)157 is of paramount importance in unraveling the genetic underpinnings of complex diseases and advancing personalized medicine. To assess 3DVariantVision's capacity to elucidate the implications of genetic variants for disease associations within the context of GWAS, we categorized GTEx eQTLs into two distinct groups based on whether the associated SNPs were corroborated by GWAS findings. Notably, for the eQTL group supported by GWAS, 3DVariantVision exhibited markedly higher eQTL prediction scores in comparison to the group without GWAS support (Figure 3.8 A). Furthermore, TWAS aims to identify genes whose expression levels are influenced by genetic variants. To comprehensively assess 3DVariantVision's potential in prioritizing target genes associated with TWAS, we focused on a set of 28 statistically significant SNP-gene associations linked to lipid traits in whole blood cell type171. For each of these associations, we generated a corresponding background set by aggregating all coding genes located within a 1Mb region centered on the TSS of the TWAS-targeted gene. Subsequently, we calculated the rank percentile of the TWAS target gene within this background, utilizing the 3DVariantVision prediction score. Remarkably, our results revealed that the TWAS target genes consistently obtained significantly higher rank percentiles compared to all the coding genes in the 1Mb background (Figure 3.8 B). Those observations underscore the potential of 3DVariantVision to shed light on the intricate relationship between genetic variants, gene expression, and disease associations, offering valuable insights into the genetic basis of complex diseases and the development of precision medicine approaches. In this demonstration, we present notable examples of successful target gene prioritization for four distinct traits - high-density lipoprotein cholesterol (HDL-C), low- 64 density lipoprotein cholesterol (LDL-C), total cholesterol (TC), and triglycerides (TG). Consider the genetic variant rs12050262, associated with the HDL-C trait. Remarkably, 3DVariantVision effectively prioritized SETD3, the genuine target gene, out of a pool of 7 coding genes within the surrounding 1Mb background region (Figure 3.8 C). Similarly, for rs17599675, linked to the LDL-C trait, our model accurately pinpointed FBXO38 as the target gene among 11 coding genes within the 1Mb background region (Figure 3.8 D). Furthermore, rs11578696, associated with the TC trait, was successfully linked to TMEM79, the actual target gene, from a set of 38 coding genes located within the 1Mb background region (Figure B. 10 A). Lastly, for rs10889347, a genetic variant associated with the TG trait, 3DVariantVision precisely prioritized USP1 as the target gene, out of 6 coding genes situated in the 1Mb background region (Figure B. 10 B). These compelling examples underscore 3DVariantVision's efficacy in accurately identifying genes associated with specific traits, facilitating a deeper understanding of the genetic basis of complex phenotypes, and offering insights for potential therapeutic interventions. 65 Figure 3.8. 3DVariantVision interprets disease-associated traits. (A) eQTLs supported by GWAS showed significantly higher prediction scores based on 3DVariantVision compared with the background eQTLs. (B) For 28 TWAS associations with blood cell traits, all the coding genes within 2MB from the TSS of the TWAS targeted gene were collected as the background, and the corresponding eQTL prediction scores were 66 Figure 3.8 (cont’d) generated by 3DVariantVision. The y-axis represents the percentile rank based on eQTL prediction scores, showing 3DVariantVision prioritized the TWAS associations against the background. (C-D) Two examples of prioritizing target genes of TWAS SNPs based on 3DVariantVision within 1Mb. The barplot at the bottom panel shows the eQTL prediction scores. (C) and (D) are the examples of HDL and LDL trait associations, respectively. 67 SYSTEMATIC DELINEATION OF MULTI-LEVEL STRUCTURAL VARIABILITIES OF SPATIAL GENOME CONFORMATIONS 4.1 INTRODUCTION The three-dimensional (3D) organization of chromatin plays a crucial role in regulating gene expression, DNA replication, genome stability, and tissue differentiation175–177. In recent years, the rapid development of chromatin conformation capture assays, such as ChIA-PET107, Capture-C108, and Hi-C14, has enabled the quantitative analysis of 3D chromatin interactions, including enhancer-promoter interactions and genome-wide intra- chromosomal and inter-chromosomal contacts across various species and cell types. These studies have revealed structural components of chromatin at multiple scales, including chromatin loops, topologically associating domains (TADs), and chromatin compartments14. However, bulk-tissue measurements can only represent the average chromatin conformation, failing to capture the heterogeneity among millions of individual cells (Figure 4.1 A). The advent of single-cell chromatin conformation capture methods, such as single-cell Hi-C30–40, has opened new frontiers in the field, allowing researchers to investigate 3D genome structure at the single-cell level and unveil the dynamics and variability of chromatin organization across cells30–40. The analysis of 3D genome structures encompasses both intra-chromosomal and inter-chromosomal interactions. Within a chromosome, researchers investigate chromatin loops, TADs, and chromatin compartments14. When studying the inter-chromosomal structure of a diploid cell, researchers examine the relative positions of chromosomes and their spatial organization within the nucleus41,178–180. These multi-scale structures, both intra- and inter-chromosomal, are heavily influenced by variability at multiple levels (Figure 4.1 B). As cells differentiate from stem cells, cell-type variability leads to the emergence of distinct functions and shapes of different tissues181. Even within the same cell type, cell-level variability arises from differences in cell age and individual-level variations182. Moreover, within a single diploid cell, the two copies of the same chromosome, one from the paternal genome and the other from the maternal genome, can exhibit structural differences in allele-level, such as the active and inactive states of 68 chromosome X183,184. This multi-level variability poses significant challenges for both computational and experimental analyses of 3D chromatin regulation. Based on bulk-tissue unphased Hi-C data, which is more widely available compared to single-cell Hi-C, researchers have developed numerous methods for contact normalization185,186, imputation187, and interaction calling188, as well as further structure reconstruction26–28,189–192. These methods often employ a 'beads on a string' polymer model, where each chromosome is represented as a chain of 'beads' consisting of DNA fragments or loci, and the pairwise spatial distances between genomic loci are inferred from Hi-C contacts29. Structure reconstruction methods can be broadly categorized into two types: consensus methods and ensemble methods (Figure 4.1 C). Consensus methods, such as FLAMINGO28 and GAM-FISH192, reconstruct a single consensus structure for each chromosome by treating the intra-chromosomal Hi-C contact map as an average representation of pairwise distances between genomic loci without further justification29. However, these methods overlook cell-level and allele-level variability30,182. In contrast, ensemble methods25,193 decompose the bulk-tissue Hi-C contact map matrix into a large tensor, with an additional dimension representing individual cells, thereby generating thousands of different structures. While ensemble methods attempt to capture cell-to-cell variability, the resulting structures mix multi-level variability and are challenging to interpret due to the lack of a clear correspondence between the generated structures and the individual cells in the Hi-C experiment25,193. Consequently, the strategies and underlying assumptions of these methods require further systematic evaluation to provide guidance for future research. In this study, we provide a comprehensive analysis of the multi-level variability in 3D chromatin organization by leveraging a wide range of bulk-tissue and single-cell Hi-C datasets14,34,35 across different cell types, developmental stages, and alleles. Our findings reveal several key biological insights. First, we demonstrate that intra-chromosomal chromatin organization is more stable within the same cell type compared to the dynamics observed across cell types, with the large-scale skeleton being stable at low resolution and the short-scale detailed organization captured at higher resolutions. Second, we show that inter-chromosomal chromatin exhibits a higher degree of complexity and variability that cannot be captured by a single consensus structure, emphasizing the 69 importance of considering cell-to-cell variability in inter-chromosomal interactions. Third, we uncover an ordered hierarchy of structural variability, with tissue-level variability being more pronounced than age-level variability, followed by individual cell-level and allele- level variability for autosomes, while chromosome X exhibits a distinct order with allele- level variability being more prominent than individual cell-level variability. These findings highlight the importance of considering cell type-specific factors, developmental dynamics, and allele-specific differences when studying 3D chromatin organization and its relationship to gene regulation and cellular function. From a computational perspective, our analysis provides guidance for selecting appropriate resolutions and read depths when studying chromatin organization across different genomic scales, considering the trade-off between resolution and genomic distance. We also emphasize the need for advanced computational methods that can capture the complexity and variability of inter- chromosomal interactions beyond single consensus structures. Furthermore, we propose a framework for modeling and analyzing chromatin structures at different levels of inquiry based on the ordered hierarchy of variability sources. Our study highlights the limitations of existing methods, such as consensus and ensemble approaches, in addressing the multi-level variability in 3D chromatin organization and underscores the need for integrating multiple experimental techniques to remove biases and obtain a more accurate representation of chromatin interactions. In conclusion, our comprehensive analysis of multi-level variability in 3D chromatin organization provides valuable insights into the complex interplay between cell type-specific factors, developmental dynamics, and allele-specific differences, guiding future experimental designs and computational method development in this rapidly evolving field. 70 Figure 4.1. Overview of 3D chromatin structure modeling and analysis based on Hi-C. (A) Bulk-tissue Hi-C provides an average representation of both intra-chromosomal and inter-chromosomal chromatin organizations from millions of cells. The heatmap shows an example of a bulk-tissue Hi-C contact matrix, with warmer colors indicating higher interaction frequencies between genomic regions. (B) Multi-level variability in analyzing 3D genome structures. Intra-chromosomal contacts represent the 3D structure of a single chromosome, while inter-chromosomal contacts represent the interactions between 71 Figure 4.1 (cont’d) multiple chromosomes in diploid cells, which contain 23 maternal and 23 paternal alleles. The analysis of 3D genome structure is influenced by variability at different levels, including cell-type level, individual cell level, and allele level. (C) Summary of existing methods for 3D genome structure reconstruction. Consensus methods based on bulk- tissue unphased Hi-C data ignore cell-level and allele-level variability to reconstruct the 3D structure of a single chromosome using the intra-chromosomal distance matrix derived from Hi-C. Ensemble methods decompose the distance tensors to generate thousands of multi-chromosomal 3D structures, but the resulting models can be difficult to interpret biologically. 4.2 MATERIALS AND METHODS 4.2.1 Datasets and pre-processing Hi-C datasets from seven bulk tissues (GM12878, K562, KBM7, HUVEC, IMR90, HMEC, and NHEK) obtained from Rao et al14 were utilized to compare intra-chromosomal 3D chromatin organization across different cell types. For the analysis of inter-chromosomal interactions, both SPRITE194 and Hi-C data from the GM12878 cell type were employed. Single-cell Hi-C datasets from Dip-C experiments in human (GM12878)34 and mouse (MOE and retina) cells35 were used to investigate intra-chromosomal interactions at the single-cell level. All Hi-C datasets underwent mapping and normalization procedures as described in their respective original publications14,34. 4.2.2 Bulk-level Intra-chromosomal Hi-C cross cell-type analysis To assess cell-type level variability, Spearman correlation was calculated across different cell types. The Hi-C data was partitioned based on genomic distance bands to control for confounding factors such as missing rate and genomic distance. Let 𝑔 represent the genomic distance between two genomic fragments. The Hi-C data was divided into the following bands: (0,1Mb], (1Mb,2Mb], (2Mb,5Mb], and (5Mb,10Mb]. For Hi-C data from two cell types, the union set of available data was considered, and Spearman correlation was calculated within each chromosome separately. Various Hi-C resolutions were examined, including 1Mb, 500kb, 250kb, 50kb, 25kb, and 10kb. However, the correlation for 𝑔 ∈ (0,1Mb] at 1Mb resolution was not available. Additionally, the correlation for 𝑔 < 2Mb at 10kb resolution was not computed due to low data quality resulting from a high 72 missing rate. The primary replicate of GM12878 was used as the query cell type, and Spearman correlation was calculated between the query cell type and the target cell types, which included K562, KBM7, HUVEC, IMR90, HMEC, NHEK, and another replicate of GM12878. Spearman correlations between the primary replicate of GM12878 and another cell type were calculated for specific chromosomes, resolutions, and genomic distance bands. These correlations from all chromosomes were then grouped based on the corresponding resolution, genomic distance band, and target cell type, representing the overall correlation between GM12878 and the other cell type under specific resolution and genomic distance band settings. The cell type with the highest median correlation was used as the cell-type-specific boundary for a given resolution and genomic distance band, as all median correlations between GM12878 and other cell types fall below this threshold. This cell type is considered the most similar to GM12878 in terms of chromatin organization under specific resolution and genomic distance band settings. To quantify the influence of read depth on the data, a single replicate of GM12878 was used, and the original mapped reads were downsampled. For a given chromosome and resolution, the corresponding Hi-C matrix was obtained. Original reads were randomly downsampled at the following rates: 80%, 50%, 30%, 20%, 10%, 8%, 5%, 3%, 2%, and 1%. Based on the sampled reads, downsampled Hi-C matrices were constructed. The downsampling process was performed once for each specific chromosome and resolution combination. Spearman correlation between the original Hi-C matrix and the downsampled Hi-C matrices was then calculated separately for different genomic distance bands. 4.2.3 Bulk-level inter-chromosomal Hi-C data normalization To remove the bias in Hi-C interaction frequencies, two transformation functions were applied between Hi-C and SPRITE data. For the GM12878 cell type, both Hi-C and SPRITE data were collected. All interaction frequencies at 10Mb resolution were divided into intra-chromosomal and inter-chromosomal categories for Hi-C and SPRITE separately. First, for intra-chromosomal interaction frequencies, a linear transformation function 𝑔(𝑥) = 𝛽1𝑥 + 𝛽0 was used to convert SPRITE interaction frequencies to Hi-C 73 interaction frequencies using the equation 𝐼𝐹𝐻𝑖𝐶𝑖𝑛𝑡𝑟𝑎 = 𝑔(𝐼𝐹𝑆𝑃𝑅𝐼𝑇𝐸𝑖𝑛𝑡𝑟𝑎) , where 𝐼𝐹𝐻𝑖𝐶𝑖𝑛𝑡𝑟𝑎and 𝐼𝐹𝑆𝑃𝑅𝐼𝑇𝐸𝑖𝑛𝑡𝑟𝑎 represent the intra-chromosomal interaction frequencies in Hi- C and SPRITE, respectively, and 𝛽1, 𝛽0 are the learned parameters. Second, for inter- chromosomal interaction frequencies, a log transformation function 𝑓(𝑥) = 𝑎1(𝛼1 𝑙𝑜𝑔 (𝑥 + 1) + 𝛼0) + 𝑎0 was employed to convert Hi-C interaction frequencies to SPRITE interaction frequencies using the equation 𝐼𝐹𝑆𝑃𝑅𝐼𝑇𝐸𝑖𝑛𝑡𝑒𝑟 = 𝑓(𝐼𝐹𝐻𝑖𝐶𝑖𝑛𝑡𝑒𝑟), where 𝑎1, 𝑎0, 𝛼1 and 𝛼0 are the learned parameters. The log function was chosen over the linear function due to its better performance. The learned transformation functions were applied to normalize inter-chromosomal Hi-C interaction frequencies by removing the bias using the equation: 𝐼𝐹𝐻𝑖𝐶𝑖𝑛𝑡𝑟𝑎 = 𝑔(𝑓(𝐼𝐹𝐻𝑖𝐶𝑖𝑛𝑡𝑒𝑟)) The normalized Hi-C interaction frequencies were transformed into 3D Euclidean 𝜂, where 𝑑𝑖𝑗 represents the pairwise distances using the exponential function: 𝑑𝑖𝑗 = 𝐼𝐹𝑖𝑗 3D distance between DNA fragments 𝑖 and 𝑗, 𝐼𝐹𝑖𝑗 represents the interaction frequency between fragments 𝑖 and 𝑗, and 𝜂 is set to 0.25, as suggested by previous literature. A standard distance metric satisfies the triangle inequality: 𝑑𝑖𝑗 < 𝑑𝑖𝑘 + 𝑑𝑗𝑘, which can be used to quantify the quality of spatial distances transformed from inter-chromosomal interaction frequencies. For all genomic fragments in the whole genome at 10Mb resolution, a triplet of fragments 𝑖, 𝑗, and 𝑘 was randomly sampled to check whether the triangle inequality holds. This sampling process was repeated 1000 times to generate 1000 triplets, and the fraction of cases in which the triangle inequality held was calculated. The entire procedure was performed 100 times to obtain the distribution of these fractions. A higher fraction indicates a better quality of the distance from Hi-C. 4.2.4 Low-rank property evaluation The pairwise distance matrix is biologically generated by the underlying low-rank coordinate matrix of DNA fragments with rank ≤ 3. Thus, Due to the property of ranks for matrix addition, the observed Euclidean distance matrix has a rank ≤ 5. Singular value decomposition (SVD) was performed on the observed distance matrix transformed from the Hi-C interaction frequency. The SVD is given by: 74 𝑛 𝐷 = 𝑈𝛴𝑉𝑇 = ∑ 𝜎𝑖𝑢𝑖𝑣𝑖 𝑇 𝑖 |𝜎1| ≥ |𝜎2| ≥ . . . ≥ |𝜎𝑛|, Here, 𝐷 represents the observed distance matrix, 𝜎𝑖 are the singular values (SVs), 𝑢𝑖 and 𝑣𝑖 are the left and right singular vectors, respectively, and 𝑛 represents the dimension of 𝐷. According to the low-rank property of the distance matrix, the top 5 SVs and corresponding vectors are used to reconstruct the approximation of the distance matrix: 5 𝐷̂ = ∑ 𝜎𝑖𝑢𝑖𝑣𝑖 𝑖 𝑇 To assess the low-rank property in the observed distance matrix, we first calculated the explanation ratio of the top 𝑘 SVs: 𝑟𝑘 = 𝑘 ∑ |𝜎𝑖| 𝑖=1 𝑛 ∑ |𝜎𝑖| 𝑖=1 ∈ [0,1] The ideal explanation ratio of the top 5 SVs 𝑟5 should be close to 1, indicating that the low-rank property holds well when the ratio is closer to 1. Additionally, we checked the correlation between the observed distance matrix (𝐷) and the approximation matrix (𝐷̂) based on the top 5 SVs. Both Spearman and Pearson correlations were calculated. A higher correlation indicates a better low-rank property, as the approximation matrix closely represents the original observed distance matrix. 4.2.5 Allele-allele correlation in single-cell Dip-C, a high-throughput single-cell Hi-C technique, provided data for 17 cells in the human GM12878 cell type and approximately 400 cells in mouse, including retina and main olfactory epithelium (MOE) cell types. By leveraging unique single nucleotide polymorphisms (SNPs) in paternal and maternal alleles, the reads from Dip-C were mapped to specific alleles, enabling the construction of allele-specific interaction frequency matrices. Let 𝐴♂ and 𝐵♀ represent the paternal allele of chromosome A and the maternal allele of chromosome B, respectively. 𝐼𝐹𝐴♂𝐵♀ represents the interaction frequency matrix between 𝐴♂ and 𝐵♀. For example: 75 1) 𝐼𝐹𝐴♂𝐴♂ is the intra-chromosomal interaction frequency matrix within the paternal allele of chromosome A. 2) 𝐼𝐹𝐴♂𝐴♀ is the inter-chromosomal interaction frequency matrix between the paternal and maternal alleles of chromosome A. 3) 𝐼𝐹𝐴♂𝐵♀ is the inter-chromosomal interaction frequency matrix between the paternal allele of chromosome A and the maternal allele of chromosome B. Due to the high missing rate in single cells, inter-chromosomal interaction frequencies are almost unavailable. Therefore, we only used the single-cell intra-chromosomal interaction frequency matrices, 𝐼𝐹𝐴♂𝐴♂ and 𝐼𝐹𝐴♀𝐴♀, from Dip-C at 1Mb resolution. To control the confounding factors, including genomic distance, we normalized the read depths band-wisely using BandNorm195. For simplicity, the normalized 𝐼𝐹𝐴♂𝐴♂ and 𝐼𝐹𝐴♀𝐴♀ will be referred as 𝐼𝐹𝐴♂ and 𝐼𝐹𝐴♀, respectively. To distinguish the interaction frequency ♂ represent the paternal allele of chromosome A in cell i, matrices in different cells, let 𝐴𝑖 and 𝐼𝐹𝐴𝑖 ♂ represent the intra-chromosomal interaction frequency matrix of 𝐴𝑖 ♂. We calculated different correlations for a quantitive comparison: 1) Allele level correlation in the same cell (allele-level correlation): For chromosome A in cell i (𝐴𝑖), we computed the Spearman correlation between 𝐼𝐹𝐴𝑖 ♀, which represents the correlation between paternal and maternal ♂ and 𝐼𝐹𝐴𝑖 alleles of the same chromosome within a single cell. 2) Cell level correlation in the same allele (cell-level correlation): a. Paternal allele: For the paternal allele of chromosome A (𝐴♂), we calculated the Spearman correlation between 𝐼𝐹𝐴𝑖 ♂ and 𝐼𝐹𝐴𝑗 ♂, which represents the correlation of the paternal allele of the same chromosome between different cells i and j. b. Maternal allele: For the maternal allele of chromosome A, 𝐴♀, we calculated the Spearman correlation between 𝐼𝐹𝐴𝑖 ♀ and 𝐼𝐹𝐴𝑗 ♀, which represents the correlation of the maternal allele of the same chromosome between different cells i and j. 3) Overall correlation in different alleles and cells (overall correlation): 76 For chromosome A, we computed the Spearman correlation between 𝐼𝐹𝐴𝑖 𝐼𝐹𝐴𝑗 ♀, which represents the correlation of the same chromosome between different ♂ and alleles across different cells i and j. Allele-level correlations were calculated for different cells, while cell-level correlations and overall correlations were calculated for different cell pairs. Boxplots were used to visualize the correlation distributions across different chromosomes j. 4.2.6 Embedding of single-cell alleles and distance calculation We applied BandNorm to generate embeddings for each allele of a specific chromosome in single cells from mouse Dip-C data without imputation. For each chromosome, we collected the normalized interaction frequency matrices at 1Mb resolution of each allele across individual cells and performed principal component analysis (PCA). The loadings on the top 50 principal components (PCs) were used as the embedding of the corresponding allele in a single cell. To visualize the embeddings in a two-dimensional (2D) space, we employed two strategies: (1) using the loadings on the top 2 PCs directly, and (2) using Uniform Manifold Approximation and Projection (UMAP) to project the 50- dimensional embedding vectors onto a 2D space. 4.2.7 Multi-level variability quantification and comparison We compared different levels of chromatin variability based on single-cell allele embeddings using sampling strategies, including tissue, age, cell, and allele levels. To compare the tissue-level variability and age-level variability, we used Dip-C mouse cells from MOE and retina tissues. All cells were from male mice, including retina cells at postnatal day 7 (P7) and P28, and MOE cells at P28. For each chromosome, we randomly 7), one allele from a retina cell at P28 sampled one allele from a retina cell at P7 (𝐿𝑟𝑒 28), and one allele from an MOE cell at P28 (𝐿𝑚𝑜𝑒 (𝐿𝑟𝑒 arising from different alleles, all three sampled alleles were either paternal or maternal, 28). To remove confounding factors i.e., 𝐿𝑟𝑒 7, 𝐿𝑟𝑒 28, 𝐿𝑚𝑜𝑒 28 were either all paternal or all maternal. Euclidean distances were calculated based on the embeddings 28), and the tissue-level distance, 𝑑𝑖𝑠𝑡(𝐿𝑚𝑜𝑒 to determine 𝑑𝑖𝑠𝑡(𝐿𝑟𝑒 7, 𝐿𝑟𝑒 the age-level distance, 28, 𝐿𝑟𝑒 28). The distances were normalized by setting the age-level distance to 1 ( 𝑑𝑖𝑠𝑡𝑎𝑔𝑒 = 1 ) and calculating the 77 normalized tissue-level distance as𝑑𝑖𝑠𝑡𝑡𝑖𝑠𝑠𝑢𝑒 = 𝑑𝑖𝑠𝑡(𝐿𝑟𝑒 7, 𝐿𝑟𝑒 28) / 𝑑𝑖𝑠𝑡(𝐿𝑚𝑜𝑒 28, 𝐿𝑟𝑒 28). The sampling process was repeated 1,000 times, generating 1,000 𝑑𝑖𝑠𝑡𝑡𝑖𝑠𝑠𝑢𝑒 values for each chromosome. Boxplots were used to visualize the distribution of 𝑑𝑖𝑠𝑡𝑡𝑖𝑠𝑠𝑢𝑒 across different chromosomes. Similarly, to compare the age-level variability and cell-level variability, we used Dip- C mouse cells from retina tissues. All cells were from male mice, including retina cells at cells at P7 (𝐿7 P7, P28, and P56. For each chromosome, we randomly sampled two alleles from different 2), one allele at P28 (𝐿28), and one allele at P56 (𝐿56). All sampled alleles were either paternal or maternal. We calculated the Euclidean distances based on 1, 𝐿7 the embeddings: cell-level distance 𝑑𝑖𝑠𝑡(𝐿7 1, 𝐿7 2) , 28-day age-level distance 𝑑𝑖𝑠𝑡(𝐿28, 𝐿7 1) , and 49-day age-level distance 𝑑𝑖𝑠𝑡(𝐿56, 𝐿7 1) . The distances were normalized by setting the cell-level distance to 1 ( 𝑑𝑖𝑠𝑡𝑐𝑒𝑙𝑙 = 1 ) and calculating the normalized age-level distance as 𝑑𝑖𝑠𝑡𝑎𝑔𝑒 28 = 𝑑𝑖𝑠𝑡(𝐿28, 𝐿7 1) / 𝑑𝑖𝑠𝑡(𝐿7 1, 𝐿7 2) and 𝑑𝑖𝑠𝑡𝑎𝑔𝑒 49 = 𝑑𝑖𝑠𝑡(𝐿56, 𝐿7 1) / 𝑑𝑖𝑠𝑡(𝐿7 1, 𝐿7 2) . The sampling process was repeated 1,000 times, generating 1,000 𝑑𝑖𝑠𝑡𝑎𝑔𝑒 28 and 𝑑𝑖𝑠𝑡𝑎𝑔𝑒 49 values for each chromosome. In addition, to compare the cell-level variability and allele-level variability, we only used Dip-C female mouse cells at P5 from MOE tissues. For each chromosome, we randomly sampled two cells, with 𝑃1 and 𝑀1 representing the paternal and maternal alleles from the first cell, and 𝑃2 and 𝑀2 represent the paternal and maternal alleles from the second cell. Euclidean distances were calculated based on the embeddings to determine the allele-level distance, 𝑑𝑖𝑠𝑡(𝑀1, 𝑃1), the cell-level distance, 𝑑𝑖𝑠𝑡(𝑀1, 𝑀2), and the overall distance, 𝑑𝑖𝑠𝑡(𝑀1, 𝑃2). The distances were normalized by setting the allele-level distance to 1 (𝑑𝑖𝑠𝑡𝑎𝑙𝑙𝑒𝑙𝑒 = 1), calculating the normalized cell-level distance as 𝑑𝑖𝑠𝑡𝑐𝑒𝑙𝑙 = 𝑑𝑖𝑠𝑡(𝑀1, 𝑀2) / 𝑑𝑖𝑠𝑡(𝑀1, 𝑃1) , and normalized overall distance 𝑑𝑖𝑠𝑡𝑐𝑒𝑙𝑙+𝑎𝑙𝑙𝑒𝑙𝑒 = 𝑑𝑖𝑠𝑡(𝑀1, 𝑃2) / 𝑑𝑖𝑠𝑡(𝑀1, 𝑃1). The sampling process was repeated 1,000 times, generating 1,000 𝑑𝑖𝑠𝑡𝑐𝑒𝑙𝑙 and 𝑑𝑖𝑠𝑡𝑐𝑒𝑙𝑙+𝑎𝑙𝑙𝑒𝑙𝑒 values for each chromosome. 78 4.3 RESULTS 4.3.1 Cell-type specificity of intra-chromosomal chromatin across distinct scales and resolutions Cell-type-specific intra-chromosomal chromatin organization has been widely studied using Hi-C datasets, leading to the development of various techniques for data preprocessing, normalization, imputation, and structure reconstruction26–28,185,186,188–192. However, a comprehensive analysis of intra-chromosomal chromatin across different cell types, scales, and resolutions while controlling for confounding factors is still lacking. In this study, we aimed to address this gap by systematically investigating the intra- chromosomal chromatin organization in multiple cell types. As a sequence-based experiment, Hi-C relies on read depth to capture chromosome conformation, which limits data quality and resolution14. Limited read depth results in higher rates of missing data at higher resolutions, with the rates of missing data being strongly related to genomic distance. For example, analysis of bulk Hi-C data from the GM12878 cell type at 10kb resolution revealed that the available rate (1 - missing rate) substantially decreases with increasing genomic distance (Figure 4.2 A). The available rate is approximately 90% within 1Mb, 60% within 2Mb, 40% within 3Mb, 30% within 5Mb, and 10% within 10Mb. To control for the effect of genomic distance, we split the Hi-C matrix into different genomic distance bands: (0,1Mb], (1Mb,2Mb], (2Mb,5Mb], (5Mb,10Mb], (10Mb,20Mb], and (10Mb,50Mb]. The contact frequency distribution for available entries in each band was examined (Figure C. 1). We observed that when the genomic distance exceeds 10Mb, the entries become almost binary. Therefore, we focused our analysis on four genomic distance bands: (0,1Mb], (1Mb,2Mb], (2Mb,5Mb], and (5Mb,10Mb]. To investigate intra-chromosomal chromatin variability across cell types, we utilized eight bulk Hi-C datasets, including two replicates of GM12878 and one replicate each of K562, KBM7, HUVEC, IMR90, HMEC, and NHEK. All Hi-C data were obtained from Rao et al, with similar read depths to minimize experimental biases. We calculated the Spearman correlation between the primary replicate of GM12878 and the other six cell types to quantify the variability between cell types. Additionally, we calculated the 79 Spearman correlation between the two replicates of GM12878 to determine the variability within the same cell type. These calculations were performed at various resolutions ranging from 1Mb to 10kb to assess the multi-scale variability in four different genomic distance bands. Significantly higher correlations within the same cell type compared to across cell types indicate that chromatin organization is stable within the same cell type (Figure 4.2 B). At the short genomic distance band (0,1Mb], the correlation within GM12878 is higher than 0.9 under various resolutions from 500kb to 25kb, and higher than 0.85 at 10kb resolution, demonstrating stable chromatin organization at short genomic distances across various resolutions. However, as the genomic distance increases, the correlation decreases, suggesting that stable chromatin organization only holds at lower resolutions for longer genomic distances (Figure 4.2 B). These findings highlight the importance of considering both genomic distance and resolution when studying intra-chromosomal chromatin variability across cell types. The stability of chromatin organization within the same cell type and the decreasing stability with increasing genomic distance provide valuable insights into the multi-scale nature of chromatin organization and its relationship to cell type identity. To further check the influence of read depth on data quality and determine reliable resolutions, we established cell-type-specific correlation boundaries based on across- cell-type correlations at different genomic distance bands and resolutions. The correlation within the same cell type should not be lower than these boundaries under the same genomic distance and resolution settings, as a lower correlation would indicate that the chromatin organization is more dynamic than across cell types, suggesting unreliability. For the primary replicate of GM12878, we downsampled the mapped reads from 80% to 1% to generate Hi-C matrices with fewer reads and assessed the decrease in data quality based on the correlation between the downsampled and original matrices. At low resolutions, such as 1Mb and 500kb, even with downsampling rates as low as 1%, the correlation remained higher than the cell-type-specific correlation boundaries, demonstrating extremely stable chromatin organization across different genomic distance bands (Figure 4.2 C, Figure C. 2). Higher resolutions required higher read depths. For instance, at the genomic distance band (0,1Mb], the minimal downsampling rates for 80 resolutions of 100kb, 50kb, 25kb, and 10kb were 2%, 5%, 10%, and 20%, respectively (Figure 4.2 C). Furthermore, to maintain consistent correlations at the same resolution across different genomic distance bands, the read depth varied. For example, to achieve consistent correlations at 50kb or 100kb resolution, the downsampling rates could be 1% for (0,1Mb], 10% for (1Mb,2Mb], 30% for (2Mb, 5Mb], and 50% for (5Mb, 10Mb], indicating that longer genomic distance bands require more read depth to maintain consistent correlations at the same resolution (Figure C. 2). Similarly, to maintain a consistent correlation at the same read depth, the reliable resolutions varied across genomic distance bands. For instance, when the downsampling rate was 10%, 20%, or 30%, the reliable resolutions were 10kb for (0,1Mb], 50kb for (1Mb,2Mb], and 100kb for (2Mb, 10Mb]. When the downsampling rate decreased to 1%, the reliable resolutions were 10kb for (0,1Mb], 100kb for (1Mb,2Mb], 100kb for (2Mb, 5Mb], and 250kb for (5Mb, 10Mb] (Figure C. 2). 81 Figure 4.2. Cell-type specificity of intra-chromosomal chromatin across distinct scales and resolutions. (A) Missing rate based on different genomic distances based on Hi-C at 10kb resolution from GM12878 cell type. Hi-C map was split based on several genomic 82 Figure 4.2 (cont’d) distance bands including (0,1Mb], (1Mb,2Mb], (2Mb,5Mb], (5Mb,10Mb], (10Mb,20Mb], (20Mb,50Mb], (50Mb,inf]. (B) Bulk Hi-C correlation between different cell types based on different genomic distance bands and resolution. The boxplots show the Spearman correlation in single chromosome between GM12878 with another cell type including K562, KBM7, HUVEC, IMR90, HMEC, NHEK, and another replicate of GM12878. (C) Correlation decreased with fewer read depths based on different genomic distance bands and resolutions. The GM12878 Hi-C was downsampled based on different rates from 80% to 1%, and the Spearman correlation was calculated between the downsampled one and original one. Each subfigure shows the results based on one genomic band. The line chart shows the correlations in single chromosome changes based on different downsampling rates. Different lines with various colors show the correlation changing trend at specific resolution. The right panel of each subfigure is the boxplot showing the cell-type specific boundary as a reference to determine the reliable correlation threshold of the downsampled Hi-C. The observations from our analysis provide valuable biological insights into the variability of intra-chromosomal chromatin organization. We found that intra- chromosomal chromatin is more stable within the same cell type compared to the dynamics observed across cell types. This stability is particularly evident in the multi-scale nature of chromatin organization, where the large-scale skeleton is stable at low resolution, while the stable short-scale detailed organization can be captured at higher resolutions. Our findings also highlight the complex interplay between read depth, resolution, and genomic distance in determining the reliability of chromatin organization data. Read depth has a more significant influence on the reliability of longer-range chromatin interactions at higher resolutions. This analysis provides guidance for selecting appropriate resolutions and read depths when studying chromatin organization across different genomic scales. The results suggest that for any Hi-C experiment, the reliable resolution for longer genomic distances is lower compared to shorter genomic distances, due to limitations in read depth. This trade-off between resolution and genomic distance is an important consideration when designing and interpreting Hi-C experiments. To overcome this limitation, methods like FLAMINGO28 employ a hierarchical approach to reconstruct the skeleton of an entire chromosome at low resolution and capture within- domain structures at high resolution. In conclusion, our study reveals the stability of intra-chromosomal chromatin organization within cell types and the importance of considering the complex relationships 83 between read depth, resolution, and genomic distance when studying chromatin organization. These findings have implications for the design and interpretation of Hi-C experiments and the development of computational methods for multi-scale chromatin organization analysis. 4.3.2 Heterogeneity of inter-chromosomal chromatin contacts and spatial distances Moving on to cell-type-specific inter-chromosomal chromatin, we investigated the chromatin organization between single chromosomes. Inter-chromosomal spatial distances are much longer than intra-chromosomal spatial distances14,196,197, which results in 3C-like techniques capturing limited reads and causing extremely high missing rates for detailed inter-chromosomal organization information196. Moreover, due to the digestion and ligation procedures, Hi-C is biased towards intra-chromosomal interactions compared to inter-chromosomal interactions, as it lacks the ability to handle long spatial distances. SPRITE, which does not rely on digestion and ligation procedures, exhibits no bias between intra-chromosomal and inter-chromosomal interactions194. This property makes SPRITE a valuable tool for removing the bias in Hi-C data and enabling a fair comparison between intra-chromosomal and inter-chromosomal interaction frequencies (Figure 4.3 A). We plotted scatter plots showing interaction frequencies based on Hi-C and SPRITE for intra-chromosomal and inter-chromosomal interactions separately. At 1Mb resolution, inter-chromosomal interaction frequencies showed no significant correlation between Hi-C and SPRITE due to the high missing rate caused by limited read depth (Figure C. 3). However, at a lower resolution of 10Mb and after removing entries close to the centromere, Hi-C and SPRITE exhibited a clear regression relationship with a correlation of 0.956 and 0.776 for intra-chromosomal and inter-chromosomal interactions, respectively (Figure 4.3 A, Figure C. 3). We fitted separate regression models for intra-chromosomal and inter-chromosomal interactions to remove the bias in Hi-C data based on SPRITE at 10Mb resolution. To validate the transformation applied to Hi-C data using SPRITE, we leveraged the exponential relationship between bulk-level intra-chromosomal interaction frequencies of Hi-C and real spatial distances, as established in a previous study29. This relationship is widely used to convert interaction frequencies to Euclidean distances in 84 various computational tools27,28. After removing the bias in inter-chromosomal interaction frequencies of Hi-C, we applied the same exponential transformation suggested by the previous study to convert the corrected inter-chromosomal interaction frequencies to spatial distances. To assess whether the converted spatial distances constitute a valid distance metric, we employed the triangle inequality. By sampling triplet entries in the inter-chromosomal distance matrix, we observed significant improvements in the fraction of cases where the triangle inequality holds, compared to the matrix based on the original Hi-C interaction frequencies (Figure 4.3 B). This finding suggests that our transformation of interaction frequencies based on SPRITE effectively removed the bias in Hi-C data and rendered the exponential transformation, originally developed for intra-chromosomal distances, applicable to inter-chromosomal distances as well. Using the distance matrices converted from the corrected Hi-C interaction frequencies, we investigated whether a consensus 3D structure can represent them. Since the Euclidean distance matrix is calculated from 3D coordinates, it is expected to be low-rank, with a rank not exceeding 528. We compared the inter-chromosomal distance matrix at 10Mb resolution with the intra-chromosomal distance matrix of chr1 at 1Mb resolution, as they have similar dimensions. Singular Value Decomposition (SVD) was applied to both matrices separately, and the singular values were ordered by their absolute values. The top 5 singular values could only explain 55.4% of the total variance in the inter-chromosomal distance matrix, compared to 70.2% in the intra-chromosomal distance matrix (Figure 4.3 C). The explanation ratio of the inter-chromosomal distance matrix was significantly and consistently lower than that of the intra-chromosomal distance matrix across different numbers of top singular values (Figure C. 3). Additionally, we constructed a low-rank approximation of each distance matrix using the top 5 singular values and their corresponding singular vectors and compared the approximated distance matrices with the original ones. The approximated intra- chromosomal distance matrix was highly similar to the original matrix, with a Spearman correlation of 0.985 and a Pearson correlation of 0.977 (Figure 4.3 C), indicating that the bulk-level intra-chromosomal Hi-C distance matrix can be well represented by a consensus 3D structure of a single chromosome. In contrast, the approximated inter- chromosomal distance matrix differed from the original matrix, with a Spearman 85 correlation of 0.835 and a Pearson correlation of 0.661 (Figure 4.3 C). We further examined the correlation between the original matrices and their approximations based on different numbers of top singular values. The results clearly showed that the intra- chromosomal distance matrix has a rank no larger than 5, while the inter-chromosomal distance matrix has a rank much higher than 5 (Figure C. 4). This finding suggests that although the inter-chromosomal distance matrix represents an average across cells at the bulk level, it cannot be adequately represented by a single 3D structure. Figure 4.3. Heterogeneity of inter-chromosomal chromatin contacts in bulk tissue. (A) Removal of experimental biases in Hi-C data using SPRITE. The top panel shows schematic figures illustrating the correction of inter-chromosomal Hi-C contacts using 86 Figure 4.3 (cont’d) SPRITE by fitting two regression models. The bottom panel displays scatter plots comparing Hi-C and SPRITE contact frequencies for intra-chromosomal and inter- chromosomal contacts separately, with the fitted models overlaid. (B) Comparison of the fraction of cases where the triangle inequality holds in distances transformed from original Hi-C data and corrected Hi-C data. The boxplots show the distribution of the fraction of cases across multiple samples, with higher fractions indicating better agreement with the triangle inequality and more reliable distance transformations. (C) Low-rank properties of intra-chromosomal and inter-chromosomal distance matrices based on singular value decomposition (SVD) approximation. Two scenarios are presented: the distance matrix derived from intra-chromosomal contacts of chromosome 1 at 1Mb resolution (top panel) and the distance matrix derived from inter-chromosomal contacts of the whole genome at 10Mb resolution (bottom panel). The line plots show the cumulative explanation ratio as a function of the number of top singular values used in the approximation. The heatmaps display the original distance matrices and their corresponding approximations based on the top 5 singular values and singular vectors, with higher correlations between the original and approximated matrices indicating better low-rank approximations. Our results highlight the fundamental differences in the variability of intra- chromosomal and inter-chromosomal chromatin organization across cells within the same cell type. While intra-chromosomal chromatin can be well represented by a consensus 3D structure, inter-chromosomal chromatin exhibits a higher degree of complexity and variability that cannot be captured by a single structure due to the highly dynamic nature of interactions across cells. This finding suggests that the cell-to-cell variability in inter-chromosomal interactions is a crucial factor to consider when studying chromatin organization at this scale. Furthermore, our analysis indicates that inter- chromosomal interaction frequencies derived from Hi-C data are biased and noisy at high resolution. Unbiased experiments, such as SPRITE, can serve as a reference to remove these biases and provide a more accurate representation of inter-chromosomal interactions. This observation highlights the importance of integrating multiple experimental techniques to overcome the limitations of individual methods and obtain a more comprehensive understanding of chromatin organization. The findings of this study have significant implications for the development of computational methods aimed at modeling 3D chromatin structures based on bulk-tissue Hi-C data. While it is feasible to build models that reconstruct the consensus structure of single chromosomes within the same cell type, our results suggest that using a single consensus 3D structure to 87 represent inter-chromosomal chromatin organization based on bulk Hi-C data may be inadequate. The higher-rank nature of the inter-chromosomal distance matrix indicates the presence of substantial cell-to-cell variability, emphasizing the need for more advanced methods that can capture the complexity of inter-chromosomal interactions instead of relying on single consensus structures. In conclusion, our study reveals the inherent differences in the variability of intra- chromosomal and inter-chromosomal chromatin organization across cells within the same cell type. These findings underscore the importance of considering cell-to-cell variability when investigating inter-chromosomal interactions and highlight the need for integrating multiple experimental techniques to remove biases and obtain a more accurate representation of chromatin organization. Moreover, our results have significant implications for the development of computational methods, emphasizing the need for advanced approaches that can capture the complexity and variability of inter- chromosomal interactions beyond single consensus structures. These insights contribute to a deeper understanding of chromatin organization and provide valuable guidance for future research in this field. 4.3.3 Diploid-specific variability of chromatin coupled with diverse sources of biological factors Bulk Hi-C provides a cell-type-specific average representation of chromatin interactions from millions of diploid cells, each containing 46 chromosomes (23 maternal and 23 paternal alleles)198. Even within the same cell type or tissue, differences in cell age and allelic variation can result in different chromatin organizations. The rapid development of single-cell Hi-C (scHi-C) techniques has enabled the analysis of chromatin interactions at the single-cell level, providing the potential to investigate cell-to-cell intra-chromosomal chromatin variability30–40. However, compared to bulk Hi-C, scHi-C suffers from more missing data due to the extremely limited read depths, leading to lower resolution and increased challenges for analysis tools. Dip-C, an advanced scHi-C technique, generates a higher number of contacts with minimal false positives34. By detecting unique single- nucleotide polymorphisms (SNPs) based on paternal and maternal genomes, Dip-C distinguishes reads between the two haplotypes of each chromosome, allowing for the construction of diploid scHi-C maps. Additionally, Dip-C employs an algorithm to impute 88 the two chromosome haplotypes using reads without unique SNPs, assuming that the two homologs typically contact different partners34. This makes Dip-C a valuable resource for analyzing both cell-to-cell and allele-to-allele variability in chromatin organization. To quantify variability using the imputed diploid scHi-C maps from 17 GM12878 cells generated by Dip-C at 1Mb resolution34, we calculated Spearman correlations, with lower correlations indicating higher variability. We computed paternal-paternal and maternal- maternal correlations across cells to represent cell-specific variability, and paternal- maternal correlations within the same cell to represent allele-specific variability. Moreover, paternal-maternal correlations from different cells were calculated to represent overall variability, which is a combination of cell-specific and allele-specific variability. Our analysis revealed that paternal-maternal correlations within the same cell were consistently lower than paternal-paternal and maternal-maternal correlations across cells (Figure 4.4 A). This finding indicates that allele-specific variability in chromatin organization is higher than cell-specific variability, suggesting that differences between homologous chromosomes within a cell are more pronounced than differences between the same chromosome across cells of the same type based on the imputed diploid Hi-C maps. However, the conclusion drawn from the imputed diploid Hi-C data may be influenced by the biased assumption of the imputation algorithm. The imputed diploid Hi- C is based on the assumption that the two homologs typically contact different partners, which assigns reads with unknown haplotypes to different alleles by maximizing the difference between paternal and maternal alleles within a single cell. To address this potential bias, we re-examined the mapped reads within the homologous chromosomes without imputation (Figure 4.4 B). Our analysis revealed that the majority of reads (87.61%) do not harbor any unique SNPs, allowing us to determine the chromosome of origin but not the specific allele. Only 11.79% of reads are one-phased, meaning that one fragment of the read harbors unique SNPs that can determine the allele of origin for that fragment, while the allele of the other fragment remains unknown. The remaining 0.6% of reads are two-phased, indicating that both fragments of the read harbor unique SNPs, enabling us to determine the allele of origin for both fragments. Interestingly, among the two-phased reads, only 1.17% are inter-chromosomal contacts (Figure 4.4 B), which we 89 refer to as two-homolog reads. This finding suggests that the majority of contacts are intra-chromosomal. Therefore, although the specific allele of the other fragment in one- phased reads is unknown, we can reasonably impute them as originating from the same allele, considering them as intra-chromosomal contacts. Based on this rationale, we constructed single-cell diploid intra-chromosomal Hi-C maps using only one-phased reads without any biased imputation. To account for the important confounding factor of genomic distance, we applied BandNorm195 to normalize the single-cell diploid intra-chromosomal Hi-C maps and repeated the correlation analysis. After normalization, the previously observed significantly and consistently lower correlations for paternal-maternal comparisons within the same cell disappeared (Figure 4.4 C). Interestingly, the correlations for paternal- maternal comparisons within the same cell were slightly higher than the paternal-paternal and maternal-maternal correlations across cells, but the differences were not significant. These results demonstrate that the relationship between allele-specific variability and cell- specific variability is complex and requires further investigation. To gain deeper insights, we utilized additional Dip-C data from mouse cells. We generated embeddings for each allele of the chromosomes and visualized them across cells using UMAP for each chromosome separately. In the resulting visualizations, each dot represents either the paternal or maternal allele of a specific chromosome in a single cell (Figure 4.4 D). The colors of the dots in different subfigures were based on different tissues, sexes, alleles, and cell ages, respectively. Our analysis revealed clearly distinct patterns between different tissues, sexes, and cell ages. Interestingly, the differences between alleles were not significant in chromosome 1 to chromosome 22. However, chromosome X showed significant differences, which were mainly attributed to the differences in sex. These observations suggest that the overall differences in chromatin organization across cells arise from multiple levels of variability, including tissue-level, cell-age-level, cell-level, and allele-level variability. 90 Figure 4.4. Diploid single-cell-specific variability of chromatin based on Dip-C. (A) Spearman correlation between two alleles under four scenarios in different chromosomes 91 Figure 4.4 (cont’d) based on imputed Dip-C data. The boxplots show the distribution of correlations for each scenario: (1) paternal and maternal alleles within the same cell (red boxes), (2) paternal alleles between different cells (blue boxes), (3) maternal alleles between different cells (purple boxes), and (4) paternal allele and maternal allele in two different cells (green boxes). Higher correlations indicate greater similarity in chromatin organization between the compared alleles. (B) Counts of mapped reads from Dip-C experiments. The bar plot shows the number of reads in different categories, including all reads (red), one-phased reads (green), and two-phased reads (blue), and two-homologous reads (purple). (C) Spearman correlation between two alleles under four scenarios in different chromosomes based on one-phased reads from Dip-C data without imputation. The boxplots follow the same color scheme as in panel (A). (D) UMAP embeddings of mouse cells based on one- phased reads from normalized Dip-C data without imputation. The UMAPs are shown for different chromosomes separately, with each dot representing one allele of a specific chromosome. The colors of the dots represent different tissues, sexes, alleles, and cell ages, allowing for the visualization of cell-to-cell variability in chromatin organization across various biological factors. These findings underscore the importance of considering various sources of variability when analyzing cell-to-cell chromatin organization. Making biased assumptions, such as assuming that the two homologs typically contact different partners, can lead to misinterpretations of the data. By utilizing normalized single-cell diploid intra- chromosomal Hi-C maps and employing unbiased analysis methods, we can gain a more accurate understanding of the complex interplay between different levels of variability in chromatin organization. 4.3.4 Ordered hierarchy of structural variability guiding 3D chromatin analyses under different contexts To further investigate the relative contributions of tissue-level, cell-age-level, cell-level, and allele-level variability, we developed sampling methods based on the generated embeddings. By controlling for confounding factors, the sampling-based approach allows for a more rigorous and unbiased comparison of different levels of variability in chromatin organization. First, to compare tissue-level and age-level chromatin variability, we focused on male cells from the Dip-C dataset, including main olfactory epithelium (MOE) cells at postnatal day 28 (P28), and retina cells at P28 and P7. For each chromosome, we randomly sampled one allele from a retina cell at P28, one allele from a retina cell at P7, 92 and one allele from an MOE cell at P28. To control for confounding factors arising from different alleles, all three sampled alleles were either paternal or maternal. We then calculated Euclidean distances based on the embeddings to determine the age-level distance (i.e., the distance between retina cells at P28 and P7) and the tissue-level distance (i.e., the distance between retina and MOE cells at P28) (Figure 4.5 A). The distances were normalized based on age-level distance to remove sampling bias. We repeated the sampling process 1,000 times for each chromosome and found that the normalized tissue-level distance was significantly larger than the age-level distance across all chromosomes (Figure 4.5 B). This result indicates that tissue-level variability in chromatin organization is higher than age-level variability across different cells. Next, we compared age-level and cell-level chromatin variability using male retina cells from different cell ages, including P7, P28, and P56. Similarly, we randomly sampled one allele from cells at P28 and P56, and two alleles from cells at P7, while controlling for the allele origin (either paternal or maternal) to ensure a fair comparison. Euclidean distances were then calculated based on the embeddings to determine the age-level distances (i.e., the distance between P7 and P28, and the distance between P7 and P56) and the cell-level distance (i.e., the distance between two cells at P7) (Figure 4.5 C). To remove sampling bias, the distances were normalized based on the cell-level distance. We repeated the sampling process 1,000 times for each chromosome and found that the normalized age-level distance was significantly larger than the cell-level distance across all chromosomes (Figure 4.5 D). Interestingly, the magnitude of the difference in age- level distance increased with the age difference between the compared developmental stages (Figure 4.5 D). This result suggests that age-level variability in chromatin organization is more pronounced than cell-level variability across different cells, and that the extent of this difference is related to the magnitude of the age difference. Finally, we investigated cell-level variability and allele-level variability using female MOE cells at P7 to control for tissue-level and age-level variabilities. For each chromosome, we randomly sampled both paternal and maternal alleles from two cells separately. Euclidean distances were then calculated based on the embeddings to determine the cell-level distance (i.e., the distance between maternal alleles from two cells), the allele-level distance (i.e., the distance between maternal and paternal alleles 93 from the same cell), and the overall distance (i.e., the distance between the maternal allele of one cell and the paternal allele from another cell) (Figure 4.5 E). To remove sampling bias, the distances were normalized based on the allele-level distance. We repeated the sampling process 1,000 times for each chromosome and found that the normalized cell-level distance was significantly larger than the allele-level distance for chromosomes 1 through 22 (Figure 4.5 E). This result suggests that cell-to-cell differences contribute more to the overall variability in chromatin organization than allele- specific differences within individual cells for these chromosomes. Interestingly, chromosome X exhibited a contrasting pattern, with the allele-level distance being significantly larger than the cell-level distance (Figure 4.5 F). This observation indicates that, for chromosome X, allele-specific differences in chromatin organization are more pronounced than cell-to-cell differences. The distinct behavior of chromosome X compared to the autosomes highlights the unique regulatory mechanisms and evolutionary pressures associated with this sex chromosome. The larger allele-level distance in chromosome X suggests that the maternal and paternal copies of this chromosome may be subject to different regulatory environments or epigenetic modifications, leading to more pronounced differences in their chromatin organization. 94 Figure 4.5. Ordered hierarchy of structural variability guiding 3D chromatin analyses under different contexts. (A-B) Comparison of tissue-level and cell-age-level variability. 95 Figure 4.5 (cont’d) (A) Schematic figure illustrating the sampling strategy from embedding spaces. The tissue-level distance and age-level distance were calculated based on embeddings and normalized. (B) Boxplots showing the distribution of normalized tissue-level distances across different chromosomes. Higher values indicate greater tissue-level variability compared to age-level variability. (C-D) Comparison of cell-age-level and cell-level variability. (C) Schematic figure illustrating the sampling strategy from embedding spaces. The age-level distance and cell-level distance were calculated based on embeddings and normalized. (D) Boxplots showing the distribution of normalized age-level distances across different chromosomes, including age-level distances over 28 days (red) and 49 days (blue). Higher values indicate greater age-level variability compared to cell-level variability. (E-F) Comparison of cell-level and allele-level variability. (E) Schematic figure illustrating the sampling strategy from embedding spaces. The cell-level distance and allele-level distance were calculated based on embeddings and normalized. (F) Boxplots showing the distribution of normalized cell-level distances across different chromosomes. Higher values indicate greater cell-level variability compared to allele-level variability. Our analysis of single-cell Hi-C data reveals the relative contributions of different sources of variability in chromatin organization. We found that tissue-level variability is more pronounced than age-level variability, indicating that cell type identity plays a crucial role in shaping chromatin structure. Additionally, age-level variability is more significant than cell-level variability, suggesting that developmental changes have a greater impact on chromatin organization than cell-to-cell differences within a specific age group. Furthermore, cell-level variability is generally more pronounced than allele-level variability for autosomes, while chromosome X exhibits a contrasting pattern with larger allele-level variability. These findings highlight the importance of considering the ordered hierarchy of cell type-specific factors, developmental dynamics, and allele-specific differences when studying chromatin structure and its relationship to gene regulation and cellular function. For autosomes, the variability order is as follows: tissue-level > age-level > individual cell-level > allele-level. In contrast, chromosome X exhibits a different order: tissue-level > age-level > allele-level > individual cell-level. These variability orders have important implications for modeling and analyzing chromatin structures at different levels of inquiry. When investigating chromatin organization at a specific level, variability sources lower in the hierarchy can be considered less influential and potentially ignored, while variability sources higher in the hierarchy should be carefully accounted for in the analysis. 96 For example, when studying chromatin structure at the tissue level, age-related changes, individual cell-to-cell differences, and allele-specific variations may have a smaller impact on the overall organization and could be treated as less significant factors. However, when investigating chromatin organization at the allelic level, it is crucial to consider the effects of tissue-specific factors, developmental dynamics, and cell-to-cell differences, as these sources of variability rank higher in the hierarchy and can substantially influence the observed chromatin structure. In conclusion, the insights of variability orders provide a valuable framework for future studies aimed at unraveling the complex interplay between chromatin organization, gene expression, and cellular identity. 97 DISCUSSION AND FUTURE DIRECTIONS Deciphering the effects of non-coding genetic variants is critical for understanding human diseases. With the knowledge gained from cell type-specific 3D chromatin organization, researchers have developed numerous machine learning methods to boost the prediction of causal genetic variants associated with corresponding phenotypes, including cancers, compared to traditional statistical methods. In this thesis, we developed APRIL to leverage long-range chromatin regulatory networks for discovering disease-associated genes, demonstrating the vast potential and interpretability of understanding underlying mechanisms through chromatin regulation networks. We also developed 3DVariantVision to provide a comprehensive understanding of genetic variants based on 3D chromatin information, including eQTL prediction, local and 3D effect discovery, and even insights into disease associations. Additionally, we provided a comprehensive analysis of the multi-level variability in multi-scale 3D chromatin organization, offering valuable insights into the complex interplay between cell type-specific factors, developmental dynamics, and allele-specific differences, which can guide future experimental designs and computational method development in the field of 3D chromatin organization. Building upon these contributions, we propose the following future research directions: 1) Novel 3D chromatin structure reconstruction methods: Leveraging our comprehensive analysis of multi-level variability in multi-scale 3D chromatin organization, we aim to develop novel methods for reconstructing 3D chromatin structures that provide a more comprehensive chromatin network for long-range interactions. 2) Graph Neural Networks for disease-associated gene discovery: APRIL has demonstrated the importance of chromatin regulatory networks in discovering disease-associated genes. With the comprehensive chromatin network for long-range interactions reconstructed using our proposed methods, advanced techniques such as Graph Neural Networks can be applied to further investigate signals within the network, including disease-associated genes and GWAS hits. 3) Comprehensive modeling of genome-wide chromatin regulation networks: 3DVariantVision is a pioneering framework for comprehensive understanding of genetic variants based on 3D chromatin. However, it currently focuses on one-to-one 98 interactions (enhancer-promoter interactions) without considering the overall view of chromatin regulation. In the future, we plan to model the entire genome-wide chromatin regulation network to capture more complex and comprehensive underlying mechanisms. 4) Integration of Foundation Models for DNA sequences: With the rapid development of AI techniques, including Foundation Models for DNA sequences76–79, we can further utilize the embeddings directly from these models to enhance the performance and interpretability of our entire framework. By pursuing these future research directions, we aim to deepen our understanding of the complex relationships between genetic variants, 3D chromatin organization, and human diseases. The integration of advanced computational methods, such as Graph Neural Networks and Foundation Models, with the knowledge gained from our comprehensive analyses will pave the way for more accurate predictions and mechanistic insights, ultimately contributing to the development of precision medicine and improved clinical care. 99 BIBLIOGRAPHY 1. 2. 3. 4. 5. 6. Klein, R. J. et al. Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science (1979) 308, 385–389 (2005). Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five Years of GWAS Discovery. The American Journal of Human Genetics 90, 7–24 (2012). Hirschhorn, J. N. & Daly, M. J. Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6, 95–108 (2005). Tam, V. et al. Benefits and limitations of genome-wide association studies. Nat Rev Genet 20, 467–484 (2019). Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet 47, 291–295 (2015). Emmert-Streib, F., Dehmer, M. & Haibe-Kains, B. Gene regulatory networks and their applications: understanding biological and medical problems in terms of networks. Front Cell Dev Biol 2, (2014). 7. Misteli, T. Beyond the Sequence: Cellular Organization of Genome Function. Cell 128, 787–800 (2007). 8. 9. Buermans, H. P. J. & den Dunnen, J. T. Next generation sequencing technology: Advances and applications. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease 1842, 1932–1941 (2014). Luo, Y. et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res 48, D882–D889 (2020). 10. Zhang, F. & Lupski, J. R. Non-coding genetic variants in human disease. Hum Mol Genet 24, R102–R110 (2015). 11. Bernstein, B. E. et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol 28, 1045–1048 (2010). 12. Xu, H., Zhang, S., Yi, X., Plewczynski, D. & Li, M. J. Exploring 3D chromatin contacts in gene regulation: The evolution of approaches for the identification of functional enhancer-promoter interaction. Comput Struct Biotechnol J 18, 558–570 (2020). 13. Fulco, C. P. et al. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat Genet 51, 1664–1669 (2019). 100 14. Rao, S. S. P. et al. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell 159, 1665–1680 (2014). 15. Mumbach, M. R. et al. HiChIP: efficient and sensitive analysis of protein-directed genome architecture. Nat Methods 13, 919–922 (2016). 16. Wei, C.-L. et al. A Global Map of p53 Transcription-Factor Binding Sites in the Human Genome. Cell 124, 207–219 (2006). 17. Mifsud, B. et al. Mapping long-range promoter contacts in human cells with high- resolution capture Hi-C. Nat Genet 47, 598–606 (2015). 18. He, B., Chen, C., Teng, L. & Tan, K. Global view of enhancer–promoter interactome in human cells. Proceedings of the National Academy of Sciences 111, (2014). 19. Whalen, S., Truty, R. M. & Pollard, K. S. Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat Genet 48, 488– 496 (2016). 20. Singh, S., Yang, Y., Póczos, B. & Ma, J. Predicting enhancer-promoter interaction from genomic sequence with deep neural networks. Quantitative Biology 7, 122– 137 (2019). 21. Li, W., Wong, W. H. & Jiang, R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res 47, e60–e60 (2019). 22. Qi, Y. & Zhang, B. Predicting three-dimensional genome organization with chromatin states. PLoS Comput Biol 15, e1007024 (2019). 23. Schwessinger, R. et al. DeepC: predicting 3D genome folding using megabase- scale transfer learning. Nat Methods 17, 1118–1124 (2020). 24. Fudenberg, G., Kelley, D. R. & Pollard, K. S. Predicting 3D genome folding from DNA sequence with Akita. Nat Methods 17, 1111–1117 (2020). 25. Tjong, H. et al. Population-based 3D genome structure analysis reveals driving forces in spatial genome organization. Proceedings of the National Academy of Sciences 113, (2016). 26. Lesne, A., Riposo, J., Roger, P., Cournac, A. & Mozziconacci, J. 3D genome reconstruction from chromosomal contacts. Nat Methods 11, 1141–1143 (2014). 27. Trieu, T., Oluwadare, O. & Cheng, J. Hierarchical Reconstruction of High- Resolution 3D Models of Large Chromosomes. Sci Rep 9, 4971 (2019). 101 28. Wang, H., Yang, J., Zhang, Y., Qian, J. & Wang, J. Reconstruct high-resolution 3D genome structures for diverse cell-types using FLAMINGO. Nat Commun 13, 2645 (2022). 29. Wang, S. et al. Spatial organization of chromatin domains and compartments in single chromosomes. Science (1979) 353, 598–602 (2016). 30. Nagano, T. et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 502, 59–64 (2013). 31. Stevens, T. J. et al. 3D structures of individual mammalian genomes studied by single-cell Hi-C. Nature 544, 59–64 (2017). 32. Nagano, T. et al. Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature 547, 61–67 (2017). 33. Flyamer, I. M. et al. Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-zygote transition. Nature 544, 110–114 (2017). 34. Tan, L., Xing, D., Chang, C.-H., Li, H. & Xie, X. S. Three-dimensional genome structures of single diploid human cells. Science (1979) 361, 924–928 (2018). 35. Tan, L., Xing, D., Daley, N. & Xie, X. S. Three-dimensional genome structures of single sensory neurons in mouse visual and olfactory systems. Nat Struct Mol Biol 26, 297–307 (2019). 36. Ramani, V. et al. Massively multiplex single-cell Hi-C. Nat Methods 14, 263–266 (2017). 37. Kim, H.-J. et al. Capturing cell type-specific chromatin compartment patterns by applying topic modeling to single-cell Hi-C data. PLoS Comput Biol 16, e1008173 (2020). 38. Lee, D.-S. et al. Simultaneous profiling of 3D genome structure and DNA methylation in single human cells. Nat Methods 16, 999–1006 (2019). 39. Li, G. et al. Joint profiling of DNA methylation and chromatin architecture in single cells. Nat Methods 16, 991–993 (2019). 40. Nguyen, H. Q. et al. 3D mapping and accelerated super-resolution imaging of the human genome using in situ sequencing. Nat Methods 17, 822–832 (2020). 41. Zhang, R., Zhou, T. & Ma, J. Multiscale and integrative single-cell Hi-C analysis with Higashi. Nat Biotechnol 40, 254–261 (2022). 102 42. Zeng, Z. & Bromberg, Y. Predicting Functional Effects of Synonymous Variants: A Systematic Review and Perspectives. Front Genet 10, (2019). 43. Ritchie, G. R. S., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat Methods 11, 294–296 (2014). 44. Fu, Y. et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol 15, 480 (2014). 45. Lee, D., Karchin, R. & Beer, M. A. Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res 21, 2167–2180 (2011). 46. Ghandi, M., Lee, D., Mohammad-Noori, M. & Beer, M. A. Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. PLoS Comput Biol 10, e1003711 (2014). 47. Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat Genet 47, 955–961 (2015). 48. Ogata, J. D. et al. excluderanges: exclusion sets for T2T-CHM13, GRCm39, and other genome assemblies. Bioinformatics 39, (2023). 49. Dozmorov, M. G. et al. CTCF: an R/bioconductor data package of human and mouse CTCF binding sites. Bioinformatics Advances 2, (2022). 50. Carlson, A. L. et al. Infant gut microbiome composition is associated with non-social fear behavior in a pilot study. Nat Commun 12, 3294 (2021). 51. 52. Innocenti, F. et al. Next-generation sequencing (NGS) in metastatic colorectal cancer (mCRC): Novel mutated genes and their effect on response to therapy (Alliance). Annals of Oncology 30, v198–v199 (2019). Innocenti, F. et al. DNA Mutational Profiling in Patients With Colorectal Cancer Treated With Standard of Care Reveals Differences in Outcome and Racial Distribution of Mutations. Journal of Clinical Oncology 42, 399–409 (2024). 53. Chen, J., Mu, W., Li, Y. & Li, D. On the Identifiability and Interpretability of Gaussian Process Models. in Advances in Neural Information Processing Systems (eds. Oh, A. et al.) vol. 36 70267–70278 (Curran Associates, Inc., 2023). 54. Mu, W. et al. Airpart: interpretable statistical models for analyzing allelic imbalance in single-cell datasets. Bioinformatics 38, 2773–2780 (2022). 55. Lê Cao, K.-A. et al. Community-wide hackathons to identify central themes in single-cell multi-omics. Genome Biol 22, 220 (2021). 103 56. Mu, W. et al. bootRanges: flexible generation of null sets of genomic ranges for hypothesis testing. Bioinformatics 39, (2023). 57. Davis, E. S. et al. matchRanges: generating null hypothesis genomic ranges via covariate-matched sampling. Bioinformatics 39, (2023). 58. Ma, H., Zeng, D. & Liu, Y. Learning Optimal Group-structured Individualized Treatment Rules with Many Treatments. Journal of Machine Learning Research 24, 1–48 (2023). 59. Ma, H. et al. Disentangling sex-dependent effects of APOE on diverse trajectories of cognitive decline in Alzheimer’s disease. Neuroimage 292, 120609 (2024). 60. Ma, H., Liu, Y. & Wu, G. Elucidating Multi‐Stage Progression of Neuro‐degeneration Process in Alzheimer’s Disease. Alzheimer’s & Dementia 18, (2022). 61. Ma, H., Zeng, D. & Liu, Y. Learning Individualized Treatment Rules with Many Treatments: A Supervised Clustering Approach Using Adaptive Fusion. in Advances in Neural Information Processing Systems (eds. Koyejo, S. et al.) vol. 35 15956–15969 (Curran Associates, Inc., 2022). 62. Liu, R., Yuan, H., Johnson, K. A. & Krishnan, A. CONE: COntext-specific Network (2023) via Contextualized Graph Attention. bioRxiv Embedding doi:10.1101/2023.10.21.563390. 63. Liu, R. & Krishnan, A. Open Biomedical Network Benchmark: A Python Toolkit for Benchmarking Datasets with Biomedical Networks. in Proceedings of the 18th Machine Learning in Computational Biology meeting (eds. Knowles, D. A. & Mostafavi, S.) vol. 240 23–59 (PMLR, 2024). 64. Ding, J. et al. DANCE: a deep learning library and benchmark platform for single- cell analysis. Genome Biol 25, 72 (2024). 65. Liu, R. et al. Taxonomy of Benchmarks in Graph Representation Learning. in Proceedings of the First Learning on Graphs Conference (eds. Rieck, B. & Pascanu, R.) vol. 198 6:1–6:25 (PMLR, 2022). 66. Liu, R., Hirn, M. & Krishnan, A. Accurately modeling biased random walks on weighted networks using node2vec+. Bioinformatics 39, (2023). 67. Liu, R. & Krishnan, A. PecanPy: a fast, efficient and parallelized Python implementation of node2vec. Bioinformatics 37, 3377–3379 (2021). 68. Liu, R., Mancuso, C. A., Yannakopoulos, A., Johnson, K. A. & Krishnan, A. Supervised learning is an accurate method for network-based gene classification. Bioinformatics 36, 3457–3465 (2020). 104 69. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods 12, 931–934 (2015). 70. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 26, 990–999 (2016). 71. Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res 44, e107–e107 (2016). 72. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res 28, 739–750 (2018). 73. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet 50, 1171–1179 (2018). 74. Vaswani, A. et al. Attention is All you Need. in Advances in Neural Information Processing Systems (eds. Guyon, I. et al.) vol. 30 (Curran Associates, Inc., 2017). 75. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 18, 1196–1203 (2021). 76. Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021). 77. Zhou, Z. et al. DNABERT-2: Efficient Foundation Model and Benchmark For Multi- Species Genome. (2023). 78. Zhou, Z. et al. DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models. (2024). 79. Dalla-Torre, H. et al. The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics. bioRxiv 2023.01.11.523679 (2023) doi:10.1101/2023.01.11.523679. 80. Hwang, B., Lee, J. H. & Bang, D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp Mol Med 50, 1–14 (2018). 81. van der Wijst, M. G. P. et al. Single-cell RNA sequencing identifies celltype-specific cis-eQTLs and co-expression QTLs. Nat Genet 50, 493–497 (2018). 82. Granja, J. M. et al. Single-cell multiomic analysis identifies regulatory programs in mixed-phenotype acute leukemia. Nat Biotechnol 37, 1458–1465 (2019). 105 83. Fang, R. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat Commun 12, 1337 (2021). 84. Korkmaz, G. et al. Functional genetic screens for enhancer elements in the human genome using CRISPR-Cas9. Nat Biotechnol 34, 192–198 (2016). 85. Cong, L. et al. Multiplex Genome Engineering Using CRISPR/Cas Systems. Science (1979) 339, 819–823 (2013). 86. Fulco, C. P. et al. Systematic mapping of functional enhancer–promoter connections with CRISPR interference. Science (1979) 354, 769–773 (2016). 87. Pulido-Quetglas, C. et al. Scalable Design of Paired CRISPR Guide RNAs for Genomic Deletion. PLoS Comput Biol 13, e1005341 (2017). 88. Chanock, S. High marks for GWAS. Nat Genet 41, 765–766 (2009). 89. Billings, L. K. & Florez, J. C. The genetics of type 2 diabetes: what have we learned from GWAS? Ann N Y Acad Sci 1212, 59–77 (2010). 90. Gandhi, S. & Wood, N. W. Genome-wide association studies: the key to unlocking neurodegeneration? Nat Neurosci 13, 789–794 (2010). 91. Simmonds, M. J. GWAS in autoimmune thyroid disease: redefining our understanding of pathogenesis. Nat Rev Endocrinol 9, 277–287 (2013). 92. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015). 93. Lamparter, D., Marbach, D., Rueedi, R., Kutalik, Z. & Bergmann, S. Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLoS Comput Biol 12, e1004714 (2016). 94. Boyle, E. A., Li, Y. I. & Pritchard, J. K. An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell 169, 1177–1186 (2017). 95. Hudson, T. J. Wanted: regulatory SNPs. Nat Genet 33, 439–440 (2003). 96. Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015). 97. Civelek, M. & Lusis, A. J. Systems genetics approaches to understand complex traits. Nat Rev Genet 15, 34–48 (2014). 98. Nadeau, J. H. & Dudley, A. M. Systems Genetics. Science (1979) 331, 1015–1016 (2011). 106 99. Wiseman, F. K. et al. A genetic cause of Alzheimer disease: mechanistic insights from Down syndrome. Nat Rev Neurosci 16, 564–574 (2015). 100. Gotoda, T. From Association to Function in the Post-GWAS Era. J Atheroscler Thromb 22, 442–444 (2015). 101. Gallagher, M. D. & Chen-Plotkin, A. S. The Post-GWAS Era: From Association to Function. The American Journal of Human Genetics 102, 717–730 (2018). 102. Lee, T. I. & Young, R. A. Transcriptional Regulation and Its Misregulation in Disease. Cell 152, 1237–1251 (2013). 103. Calabrese, G. M. et al. Integrating GWAS and Co-expression Network Data Identifies Bone Mineral Density Genes SPTBN1 and MARK3 and an Osteoblast Functional Module. Cell Syst 4, 46-59.e4 (2017). 104. Gao, L. et al. Identifying noncoding risk variants using disease-relevant gene regulatory networks. Nat Commun 9, 702 (2018). 105. Lee, I., Blom, U. M., Wang, P. I., Shim, J. E. & Marcotte, E. M. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res 21, 1109–1121 (2011). 106. Lieberman-Aiden, E. et al. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science (1979) 326, 289–293 (2009). 107. Fullwood, M. J. & Ruan, Y. ChIP‐based methods for the identification of long‐range chromatin interactions. J Cell Biochem 107, 30–39 (2009). 108. Hughes, J. R. et al. Analysis of hundreds of cis-regulatory landscapes at high resolution in a single, high-throughput experiment. Nat Genet 46, 205–212 (2014). 109. Grubert, F. et al. Genetic Control of Chromatin States in Humans Involves Local and Distal Chromosomal Interactions. Cell 162, 1051–1065 (2015). 110. Jung, I. et al. A compendium of promoter-centered long-range chromatin interactions in the human genome. Nat Genet 51, 1442–1449 (2019). 111. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017). 112. Ernst, J. et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49 (2011). 107 113. Maurano, M. T. et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science (1979) 337, 1190–1195 (2012). 114. Huang, Y.-T., VanderWeele, T. J. & Lin, X. Joint analysis of SNP and gene expression data in genetic association studies of complex diseases. Ann Appl Stat 8, (2014). 115. Huang, Y., Liang, L., Moffatt, M. F., Cookson, W. O. C. M. & Lin, X. iGWAS: Integrative Genome‐Wide Association Studies of Genetic and Genomic Data for Disease Susceptibility Using Mediation Analysis. Genet Epidemiol 39, 347–356 (2015). 116. Tian, D., Zhang, R., Zhang, Y., Zhu, X. & Ma, J. MOCHI enables discovery of heterogeneous interactome modules in 3D nucleome. Genome Res 30, 227–238 (2020). 117. Spitz, F. & Furlong, E. E. M. Transcription factors: from enhancer binding to developmental control. Nat Rev Genet 13, 613–626 (2012). 118. Beagrie, R. A. et al. Complex multi-enhancer contacts captured by genome architecture mapping. Nature 543, 519–524 (2017). 119. Meyer, M. B., Benkusky, N. A. & Pike, J. W. Selective Distal Enhancer Control of the Mmp13 Gene Identified through Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR) Genomic Deletions. Journal of Biological Chemistry 290, 11093–11107 (2015). 120. Huang, J. et al. Dissecting super-enhancer hierarchy based on chromatin interactions. Nat Commun 9, 943 (2018). 121. Farh, K. K.-H. et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 518, 337–343 (2015). 122. Gjoneska, E. et al. Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease. Nature 518, 365–369 (2015). 123. Mells, G. F. & Hirschfield, G. M. Making the most of new genetic risk factors – genetic and epigenetic fine mapping of causal autoimmune disease variants. Clin Res Hepatol Gastroenterol 39, 408–411 (2015). 124. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012). 125. Cao, Q. et al. Reconstruction of enhancer–target networks in 935 samples of human primary cells, tissues and cell lines. Nat Genet 49, 1428–1436 (2017). 108 126. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res 47, D766–D773 (2019). 127. Kheradpour, P. & Kellis, M. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Res 42, 2976–2987 (2014). 128. Pinero, J. et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015, bav028–bav028 (2015). 129. Mattingly, C. J., Colby, G. T., Forrest, J. N. & Boyer, J. L. The Comparative Toxicogenomics Database (CTD). Environ Health Perspect 111, 793–795 (2003). 130. Gutiérrez-Sacristán, A. et al. PsyGeNET: a knowledge platform on psychiatric disorders and their genes. Bioinformatics 31, 3075–3077 (2015). 131. Rehm, H. L. et al. ClinGen — The Clinical Genome Resource. New England Journal of Medicine 372, 2235–2242 (2015). 132. Tamborero, D. et al. Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome Med 10, 25 (2018). 133. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018). 134. Kanz, C. The EMBL Nucleotide Sequence Database. Nucleic Acids Res 33, D29– D33 (2004). 135. Hoffman, M. M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods 9, 473–476 (2012). 136. Ernst, J. & Kellis, M. Chromatin-state discovery and genome annotation with ChromHMM. Nat Protoc 12, 2478–2492 (2017). 137. Lupiáñez, D. G. et al. Disruptions of Topological Chromatin Domains Cause Pathogenic Rewiring of Gene-Enhancer Interactions. Cell 161, 1012–1025 (2015). 138. Vandin, F., Upfal, E. & Raphael, B. J. Algorithms for Detecting Significantly Mutated Pathways in Cancer. Journal of Computational Biology 18, 507–522 (2011). 139. Stark, C. BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34, D535–D539 (2006). 140. Das, J., Felty, Q., Poppiti, R., Jackson, R. & Roy, D. Nuclear Respiratory Factor 1 Acting as an Oncoprotein Drives Estrogen-Induced Breast Carcinogenesis. Cells 7, 234 (2018). 109 141. Nowyhed, H. N. et al. ATP Binding Cassette Transporter ABCA7 Regulates NKT Cell Development and Function by Controlling CD1d Expression and Lipid Raft Content. Sci Rep 7, 40273 (2017). 142. Aikawa, T. et al. ABCA7 haplodeficiency disturbs microglial immune responses in the mouse brain. Proceedings of the National Academy of Sciences 116, 23790– 23796 (2019). 143. Jones, K. A., Kadonaga, J. T., Luciw, P. A. & Tjian, R. Activation of the AIDS Retrovirus Promoter by the Cellular Transcription Factor, Sp1. Science (1979) 232, 755–759 (1986). 144. Fernandez-Zapico, M. E. et al. A functional family-wide screening of SP/KLF proteins identifies a subset of suppressors of KRAS -mediated cell growth. Biochemical Journal 435, 529–537 (2011). 145. Dupuis-Maurin, V. et al. Overexpression of the Transcription Factor Sp1 Activates the OAS-RNAse L-RIG-I Pathway. PLoS One 10, e0118551 (2015). 146. Ilsley, M. D. et al. Krüppel-like factors compete for promoters and enhancers to fine- tune transcription. Nucleic Acids Res 45, 6572–6588 (2017). 147. Biffi, A. Genetic Variation and Neuroimaging Measures in Alzheimer Disease. Arch Neurol 67, 677 (2010). 148. Gaiteri, C., Mostafavi, S., Honey, C. J., De Jager, P. L. & Bennett, D. A. Genetic variants in Alzheimer disease — molecular and brain network approaches. Nat Rev Neurol 12, 413–427 (2016). 149. Ramos, P. S., Shedlock, A. M. & Langefeld, C. D. Genetics of autoimmune diseases: insights from population genetics. J Hum Genet 60, 657–664 (2015). 150. Zheng, S. L. et al. Cumulative Association of Five Genetic Variants with Prostate Cancer. New England Journal of Medicine 358, 910–919 (2008). 151. Wacholder, S. et al. Performance of Common Genetic Variants in Breast-Cancer Risk Models. New England Journal of Medicine 362, 986–993 (2010). 152. Zhang, B., Beeghly-Fadiel, A., Long, J. & Zheng, W. Genetic variants associated with breast-cancer risk: comprehensive research synopsis, meta-analysis, and epidemiological evidence. Lancet Oncol 12, 477–488 (2011). 153. Nicolae, D. L. et al. Trait-Associated SNPs Are More Likely to Be eQTLs: Annotation to Enhance Discovery from GWAS. PLoS Genet 6, e1000888 (2010). 110 154. Smith, A. K. et al. Methylation quantitative trait loci (meQTLs) are consistently detected across ancestry, developmental stage, and tissue type. BMC Genomics 15, 145 (2014). 155. McVicker, G. et al. Identification of Genetic Variants That Affect Histone Modifications in Human Cells. Science (1979) 342, 747–749 (2013). 156. Nica, A. C. & Dermitzakis, E. T. Expression quantitative trait loci: present and future. Philosophical Transactions of the Royal Society B: Biological Sciences 368, 20120362 (2013). 157. Wainberg, M. et al. Opportunities and challenges for transcriptome-wide association studies. Nat Genet 51, 592–599 (2019). 158. Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 53, 354–366 (2021). 159. Võsa, U. et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat Genet 53, 1300–1310 (2021). 160. Lonsdale, J. et al. The Genotype-Tissue Expression (GTEx) project. Nat Genet 45, 580–585 (2013). 161. Battle, A. et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res 24, 14–24 (2014). 162. Wang, H., Huang, B. & Wang, J. Predict long-range enhancer regulation based on protein–protein interactions between transcription factors. Nucleic Acids Res 49, 10347–10368 (2021). 163. Grundberg, E. et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat Genet 44, 1084–1089 (2012). 164. Wen, X., Luca, F. & Pique-Regi, R. Cross-Population Joint Analysis of eQTLs: Fine Mapping and Functional Annotation. PLoS Genet 11, e1005176 (2015). 165. Gasperini, M. et al. A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens. Cell 176, 377-390.e19 (2019). 166. Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A Simple New Approach to Variable Selection in Regression, with Application to Genetic Fine Mapping. J R Stat Soc Series B Stat Methodol 82, 1273–1300 (2020). 111 167. Wen, X., Pique-Regi, R. & Luca, F. Integrating molecular QTL data into genome- wide genetic association analysis: Probabilistic assessment of enrichment and colocalization. PLoS Genet 13, e1006646 (2017). 168. Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B. & Eskin, E. Identifying Causal Variants at Loci with Multiple Signals of Association. Genetics 198, 497– 508 (2014). 169. Brown, A. A. et al. Predicting causal variants affecting expression by using whole- genome sequencing and RNA-seq from multiple human tissues. Nat Genet 49, 1747–1751 (2017). 170. Kircher, M. et al. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat Commun 10, 3583 (2019). 171. Teslovich, T. M. et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–713 (2010). 172. Lotem, J. et al. Runx3-mediated Transcriptional Program in Cytotoxic Lymphocytes. PLoS One 8, e80467 (2013). 173. Vilagos, B. et al. Essential role of EBF1 in the generation and function of distinct mature B cell types. Journal of Experimental Medicine 209, 775–792 (2012). 174. Yang, R. et al. Human T-bet Governs Innate and Innate-like Adaptive IFN-γ Immunity against Mycobacteria. Cell 183, 1826-1847.e31 (2020). 175. Bickmore, W. A. The Spatial Organization of the Human Genome. Annu Rev Genomics Hum Genet 14, 67–84 (2013). 176. Cremer, T. & Cremer, C. Chromosome territories, nuclear architecture and gene regulation in mammalian cells. Nat Rev Genet 2, 292–301 (2001). 177. Sexton, T., Schober, H., Fraser, P. & Gasser, S. M. Gene regulation through nuclear organization. Nat Struct Mol Biol 14, 1049–1055 (2007). 178. Xiong, K., Zhang, R. & Ma, J. scGHOST: identifying single-cell 3D genome subcompartments. Nat Methods 21, 814–822 (2024). 179. Dekker, J. et al. Spatial and temporal organization of the genome: Current state and future aims of the 4D nucleome project. Mol Cell 83, 2624–2640 (2023). 180. Girelli, G. et al. GPSeq reveals the radial organization of chromatin in the cell nucleus. Nat Biotechnol 38, 1184–1193 (2020). 112 181. Till, J. E. & McCulloch, E. A. Hemopoietic stem cell differentiation. Biochimica et Biophysica Acta (BBA) - Reviews on Cancer 605, 431–459 (1980). 182. Snijder, B. & Pelkmans, L. Origins of regulated cell-to-cell variability. Nat Rev Mol Cell Biol 12, 119–125 (2011). 183. Sharp, A. J. et al. DNA methylation profiles of human active and inactive X chromosomes. Genome Res 21, 1592–1600 (2011). 184. Brown, C. J., Carrel, L. & Willard, H. F. Expression of Genes from the Human Active and Inactive X Chromosomes. The American Journal of Human Genetics 60, 1333– 1343 (1997). 185. Knight, P. A. & Ruiz, D. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis 33, 1029–1047 (2013). 186. Imakaev, M. et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat Methods 9, 999–1003 (2012). 187. Xiong, K. & Ma, J. Revealing Hi-C subcompartments by imputing inter- chromosomal chromatin interactions. Nat Commun 10, 5069 (2019). 188. Kaul, A., Bhattacharyya, S. & Ay, F. Identifying statistically significant chromatin contacts from Hi-C data with FitHiC2. Nat Protoc 15, 991–1012 (2020). 189. Zhang, Y., Liu, W., Lin, Y., Ng, Y. K. & Li, S. Large-scale 3D chromatin reconstruction from chromosomal contacts. BMC Genomics 20, 186 (2019). 190. Li, F.-Z. et al. Chromatin 3D structure reconstruction with consideration of adjacency relationship among genomic loci. BMC Bioinformatics 21, 272 (2020). 191. Hirata, Y., Oda, A., Ohta, K. & Aihara, K. Three-dimensional reconstruction of single-cell chromosome structure using recurrence plots. Sci Rep 6, 34982 (2016). 192. Abbas, A. et al. Integrating Hi-C and FISH data for modeling of the 3D organization of chromosomes. Nat Commun 10, 2049 (2019). 193. Boninsegna, L. et al. Integrative genome modeling platform reveals essentiality of rare contact events in 3D genome organizations. Nat Methods 19, 938–949 (2022). 194. Quinodoz, S. A. et al. Higher-Order Inter-chromosomal Hubs Shape 3D Genome Organization in the Nucleus. Cell 174, 744-757.e24 (2018). 195. Zheng, Y., Shen, S. & Keleş, S. Normalization and de-noising of single-cell Hi-C data with BandNorm and scVI-3D. Genome Biol 23, 222 (2022). 113 196. Maass, P. G., Barutcu, A. R., Weiner, C. L. & Rinn, J. L. Inter-chromosomal Contact Properties in Live-Cell Imaging and in Hi-C. Mol Cell 69, 1039-1045.e3 (2018). 197. Maass, P. G., Barutcu, A. R. & Rinn, J. L. Interchromosomal interactions: A genomic love story of kissing chromosomes. Journal of Cell Biology 218, 27–38 (2019). 198. Segal, M. R. Can 3D diploid genome reconstruction from unphased Hi-C data be salvaged? NAR Genom Bioinform 4, (2022). 114 APPENDIX A SUPPLEMENTARY FIGURES FOR CHAPTER 2 Figure A. 1. Summary of Immune-associated phenotypes to filter GWAS SNPs and disease-associated genes. 115 Figure A. 2. Constructed regulatory sub-networks using K562 ChIA-PET dataset. (A) Distance distribution of long-range chromatin interactions in K562. (B-E) Examples of regulatory sub-networks based on K562 ChIA-PET data. Nodes are annotated as expressed promoters (red), active enhancers (orange), other elements (blue), and TF nodes (purple). 116 Figure A. 3. Determination of the number of clusters of 3D chromatin modules using (A) elbow methods and (B) averaged degree of the expanded regulatory sub-networks. The tested number of clusters of 3D chromatin module range from 0 to 450. The hierarchical clustering tree is cut with different numbers of clusters. (A) Within cluster Sum of Squares (WSS) is calculated for each cluster assignment to find an optimal number of lusters. (B) Expanded regulatory sub-networks are constructed based on different clustering assignments, and the averaged diameter are calculated across all regulatory sub- networks to indicate the selection on the number of clusters. The selected number of clusters is highlighted as red dot. 117 Figure A. 4. eQTLs in neighboring nodes of disease-associated genes in regulatory sub- networks. Disease-associated genes are annotated by known disease-gene associations from DisGeNet. The significant GTEx eQTLs are overlaid with the sub-networks. Only eQTLs contained by the neighboring nodes of genes are considered. Box-plot shows the number of significant GTEx eQTLs contained by the neighboring nodes (Y axis) of each disease-associated gene and non-disease-associated gene (X axis). P-value is calculated based on one-sided Student's’ t-test. 118 Figure A. 5. Performance of disease-associated gene prediction using APRIL regulatory sub-networks based on label propagation using different sets of features. Graph features and TF features show important role in prediction. (A) Compare the performance of using only graph features and without graph features; (B) Compare the performance of using only TF features and without TF features; (C) Compare the performance of using only GWAS effect features and without GWAS effect features; (D) Compare the performance of using only activity features and without activity features. 119 Figure A. 6. Bar-plot of feature importance for Top 50 features in the random forest model. Graph features have highest importance among all features. The GWAS effect sizes of neighboring nodes, gene activity, and specific neighboring TF nodes also have high feature importance. 120 Figure A. 7. Example of newly discovered disease-associated gene TBKBP1. TBKBP1 has high expression in whole blood among 53 different tissues from GTEx. Besides, TBKBP1 has KLF16, ETV6, NFIC, NR2F1 and NR2F6 motifs validated by ChIP-seq data. Also, the neighboring enhancer linked to TBKBP1 in the APRIL sub-networks contains 2 significant GTEx eQTLs (P=3.2e-16, P=3.9e-19). 121 APPENDIX B SUPPLEMENTARY FIGURES FOR CHAPTER 3 Figure B. 1. Overview of 3-stage 3DVariantVision. Stage 1 (left) uses representation learning to generate sequence embedding using CNN by predicting epigenomic peaks including TF bindings, histone modifications, DNase peaks. Stage 2 (middle) transfers the learned sequence embedding from stage 1 to generate communication maps (joint features) of enhancer-promoter pairs using communicative learning by predicting enhancer-promoter interactions. Stage 3 (right) transfers the learned communicative model to generated communication maps of the reference genome and alternative genome of SNPs. The random forest model is used to predict eQTLs based on the communication maps. 122 Figure B. 2. Examples of predicted eQTLs based on 3DVariantVision in whole blood tissue. (A) eQTL rs79259450 is supported by Geuvadis with p-value = 3.7x10-11; (B) eQTL rs813000 is supported by Battle with p-value = 5.0x10-37. Both predicted eQTLs are supported by Capture-C chromatin interactions. The right panel shows the predicted TF binding effects and chromatin interaction effects of the corresponding SNPs, which were calculated by the differences of scores between the alternative allele and the reference allele normalized by dividing the reference scores. 123 Figure B. 3. AUPR comparison of epigenomic peak predictions between 3DVariantVision and DeepSEA based on (A) TF binding sites, and (B) Histone modification peaks. Each point represents one epigenomic peak and x, y-axis represent the AUPR of DeepSEA and 3DVariantVision, respectively. 124 Figure B. 4. Odds ratio comparison of chromatin interaction predictions between 3DVariantVision and ProTECT using shuffled enhancer-promoter pairs based on Capture-C. The higher odds ratio of 3DVariantVision demonstrates the ability to capture the joint features between enhancer and promoter. 125 Figure B. 5. Examples of predicted long-range enhancer-promoter interactions based on 3DVariantVision in GM12878 cell line. (A-B) The newly discovered EPI missed by Capture-C is validated by Hi-C. 126 Figure B. 6. ROC of eQTL prediction based on chromatin interaction changes and 3DVariantVision on SuSIE fine-mapped eQTL dataset. 127 Figure B. 7. Enrichment of TF combinations prioritized by 3DVariantVision in enhancer- promoter interactions from Capture-C. The enrichment of each TF combination was quantified by the z-scores of the occurrences in top-weighted tiny region pairs of enhancer-promoter interactions against random tiny region pairs. 128 Figure B. 8. Effect direction consistency of eQTLs with different TF bindings. The effect direction consistency is calculated by the fraction of consistency between the effect size direction of top effect eQTLs and the direction of the corresponding chromatin changing from 3DVariantVision. (A) and (B) show several represented examples of positive and negative consistent TF binding eQTL groups, respectively. 129 Figure B. 9. Performance of identifying target genes of eQTLs from background protein- coding genes. Four fine-mapped eQTL datasets were used to plot the ROC curves. 130 Figure B. 10. Examples of prioritizing target genes of TWAS SNPs based on 3DVariantVision. (A) and (B) is the examples of TC and TG trait associations, respectively. 131 APPENDIX C SUPPLEMENTARY FIGURES FOR CHAPTER 4 Figure C. 1. Hi-C contact frequencies at different genomic distance bands after removing missing data. The contacts tend to be binary values at long genomic distance bands (>5Mb). 132 Figure C. 2. The effect of read depth on the reproducibility of Hi-C contact maps across different genomic distance bands and resolutions. Hi-C data from the GM12878 cell line was downsampled to different read depths, ranging from 80% to 1% of the original read depth. Spearman correlation coefficients were calculated between the downsampled and original Hi-C contact maps for each resolution and genomic distance band. Each subfigure shows the correlation curves for a specific resolution, with different colors representing different genomic distance bands. The blue dashed line indicates a correlation threshold of 0.85, and the blue shaded regions in each subfigure highlight the downsampling rate range between 30% and 10% where the correlation is above the threshold. The red dashed lines in the subfigures for 100kb and 50kb resolution demonstrate the different downsampling rates required to maintain the same correlation across different genomic distance bands. 133 Figure C. 3. Scatter plot showing the relationship between inter-chromosomal interaction frequencies measured by Hi-C and SPRITE at 1Mb resolution (A) and 10Mb resolution (B). Each point represents a pair of genomic bins from different chromosomes. The highlighted regions in (B) correspond to genomic bins located close to the centromeres. 134 Figure C. 4. Approximation of intra-chromosomal and inter-chromosomal observed distance matrices using singular value decomposition (SVD). (A) Explanation ratios of the observed distance matrices for intra-chromosomal (red) and inter-chromosomal (blue) contacts using different numbers of top singular values. (B) Pearson correlation (left panel) and Spearman correlation (right panel) between the observed distance matrices and the approximated distance matrices using different numbers of top singular values and corresponding singular vectors for intra-chromosomal (red) and inter-chromosomal (blue) contacts. 135