A FRAMEWORK FOR BIOLOGICAL DATA INTEGRATION AND FEATURE SELECTION IN LARGE DATA SETS By Agustin Gonzalez-Reymundez A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Genetics—Doctor of Philosophy 2021 ABSTRACT A FRAMEWORK FOR BIOLOGICAL DATA INTEGRATION AND FEATURE SELECTION IN LARGE DATA SETS By Agustin Gonzalez-Reymundez The increasing volume of high-dimensional biological data (omics) has intensified the discovery of thousands of biomarkers across the different fundamental components of the cell (e.g., genome, transcriptome, proteome, epigenome) and allowed the characterization of complex phenotypes (e.g., metabolome, imaginome, phenome). However, the ability to integrate omics into informative results is constantly challenged by a seemingly ever-increasing volume of data. Furthermore, huge data sizes impose a tradeoff between how complex an omic integration algorithm can be and how much data it can handle (e.g., how fast can the algorithm be scaled to integrate large data sizes). In this dissertation, we explore statistical frameworks to face the challenges of modern omic data, including the integration of high-dimensional data of large sample sizes. We have developed a novel framework of competitive analytical performance compared with existing methods but suitable for omic data reaching biobank scales (i.e., hundreds of thousands of samples and variables). We implemented this method as an R package and showed its application on two traits of a complex molecular basis: cancer and regulation of energy intake and expenditure. In chapter one, we review the technologies and methods used to generate and integrate omic data. Chapter two describes our novel method and software of omic integration, shows examples in synthetic data, and evaluates its computational and analytical performance. Chapter three presents an application of our method to reveal a novel pan-cancer classification of tumors beyond the tissue of origin, regulated by distinct sets of molecular signatures. In chapter four, we present an application of our method to integrate phenomics data and identify patterns of energy balance regulated by genomic variation. Finally, in chapter five, we offer general conclusions to the entire thesis. To Elise and Solaris iv ACKNOWLEDGEMENTS A vast number of people contributed to making this thesis possible. Above all, I would like to thank my research advisor, Dr. Ana Ines Vazquez, for her constant support, encouragement, and instrumental professional advice. I also would like to thank the guidance committee members, Dr. Eran Andrechek, Dr. Molly Bray, and Dr. Andrea Doseff, for all their invaluable feedback and contributions. I would also like to thank the members of the QuantGen group at Michigan State University, particularly Dr. Gustavo de los Campos, Alexa Lupi, Marco Lopez-Cruz, Fernando Aguate, and Alexander Grueneberg. They greatly contributed to the development of many of the computational and statistical elements presented here. Finally, I would like to thank my fellow students and friends from the Genetics and Genome Sciences Program and former director Dr. Cathy Ernst for their support throughout this process. This research was funded by my adviser Dr. Vazquez and the Genetics and Genome Science Program, Research Alliance grants from Zoetis, and the NIH grant R01 DK 119836-01A1. Computational resources were provided by Michigan State University’s High-Performance Computing Cluster. v TABLE OF CONTENTS LIST OF TABLES ......................................................................................................... viii LIST OF FIGURES..........................................................................................................ix CHAPTER 1 INTRODUCTION ..................................................................................... 1 1.1 WHERE DO OMICS COME FROM? .................................................................. 1 1.2. HOW IS OMIC DATA INTEGRATED AND ANALYZED? .................................... 6 1.2.1 Matrix decomposition-based methods ......................................................... 6 1.2.2 Network-based methods .............................................................................. 8 1.2.3 Correlation-based methods ......................................................................... 9 1.2.4 Regression-based methods ....................................................................... 10 CHAPTER 2 MULTI-OMIC INTEGRATION WITH SPARSE SINGULAR VALUE DECOMPOSITION ........................................................................................................ 11 2.1. ABSTRACT ...................................................................................................... 11 2.2. INTRODUCTION.............................................................................................. 12 2.3. MATERIAL AND METHODS ............................................................................ 14 2.3.1. Statistical background and algorithms ....................................................... 14 2.3.2. Syntaxis of the main function moss ........................................................... 17 2.3.3. Performance evaluation ............................................................................. 21 2.4. RESULTS......................................................................................................... 23 2.4.1. Example of unsupervised omic integration with MOSS ............................. 23 2.4.2. Example of supervised omic integration with MOSS ................................. 28 2.4.3. Evaluation of MOSS analytical performance ............................................. 30 2.4.4. Benchmarking ............................................................................................ 31 2.5. DISCUSSION ................................................................................................... 34 CHAPTER 3 MULTI-OMIC SIGNATURES IDENTIFY PAN-CANCER CLASSES OF TUMORS BEYOND TISSUE OF ORIGIN ..................................................................... 37 3.1. ABSTRACT ...................................................................................................... 37 3.2. INTRODUCTION.............................................................................................. 39 3.3. MATERIAL AND METHODS ............................................................................ 41 3.3.1. Pan-cancer data ........................................................................................ 41 3.3.2. Omic integration, clustering, and features selection. ................................. 42 3.4. RESULTS......................................................................................................... 46 3.4.1. Data description ......................................................................................... 48 3.4.2. Classification of pan-cancer tumors after removing tissue-specific signals 53 3.4.3. Clinical and demographical characterization of tumor clusters .................. 54 3.4.4. Gene signatures characterizing tumor clusters.......................................... 58 vi 3.5. DISCUSSION ................................................................................................... 64 CHAPTER 4 PHENOMIC DATA INTEGRATION IN THE UK BIOBANK REVEALS GENETIC VARIANTS INVOLVED IN ENERGY BALANCE ......................................... 71 4.1. ABSTRACT ...................................................................................................... 71 4.2. INTRODUCTION.............................................................................................. 72 4.3. MATERIAL AND METHODS ............................................................................ 74 4.3.1. Cohort ........................................................................................................ 74 4.3.2. Statistical analysis ..................................................................................... 77 4.4. RESULTS......................................................................................................... 82 4.4.1. PEB variables were associated with specific groups of phenotypes. ......... 82 4.4.2. PEB-induced aggrupation of phenotypically distinct participants. .............. 87 4.4.3. Genomic variants associated with PEB. .................................................... 89 4.5. DISCUSSION ................................................................................................... 96 CHAPTER 5 CONCLUSIONS .................................................................................. 100 APPENDICES ............................................................................................................. 105 APPENDIX A SUPPLEMENTARY MATERIAL FOR CHAPTER 2 .......................... 106 APPENDIX B SUPPLEMENTARY MATERIAL FOR CHAPTER 3 .......................... 109 APPENDIX C SUPPLEMENTARY MATERIAL FOR CHAPTER 4 .......................... 135 BIBLIOGRAPHY ......................................................................................................... 140 vii LIST OF TABLES Table 3.1: Data description by cancer type after quality controls .................................. 50 Table 3.2: Clusters characterization after removing tissue effects ................................ 55 Table 4.1: Descriptive statistics of phenotypical variables............................................. 84 Table 4.2: Labels for PEB variables based on correlation with original phenotypes. .... 87 Table A.1: Code templates to replicate examples. ...................................................... 106 Table B.1: List of genes significantly deregulated in at least one pan-cancer cluster .. 109 viii LIST OF FIGURES Figure 2.1: Scree plot, PEV trajectory, and partial derivatives ...................................... 24 Figure 2.2: Omic contribution to selected features. ....................................................... 25 Figure 2.3: Cluster analysis ........................................................................................... 26 Figure 2.4: Association between clusters and SVD dimensions .................................... 27 Figure 2.5: Signature of features for two clusters. ......................................................... 28 Figure 2.6: PEV trajectory plot and partial derivatives for a PLS analysis. .................... 29 Figure 2.7: Output of the function moss_heatmap. ....................................................... 30 Figure 2.8: Performance of feature detection with MOSS. ............................................ 31 Figure 2.9:Benchmarking of omic integration methods ................................................. 33 Figure 3.1: Omic integration and features selection method……………………………...48 Figure 3:2: Pan-cancer clustering of tumor samples after adjusting for tissue effects… 54 Figure 3.3: Gene signatures for Clusters 1 and 4…………………………………………. 61 Figure 3.4: Gene signatures for Clusters 6, 7, and 8……………………………………… 63 Figure 4.1: Inclusion criteria and sample size ............................................................... 82 Figure 4.2: Relationship between PEB variables and original phenotypes.................... 86 Figure 4.3: PEB-based cluster analysis......................................................................... 89 Figure 4.4: Summary of GWAS loci within genic regions .............................................. 94 Figure A.1: Analytical performance of several omic integration methods……………... 108 Figure B.1: Clustering of tumor samples (no constraints)………………………………. 130 Figure B.2: Re-classification of tumors and previously reported molecular subtypes.. 131 Figure B.3: Survival curves by pan-cancer tumor clusters……………………………… 132 ix Figure B.4: Re-classification of tumors reveals differences in sample type……………133 Figure B.5: Expression and copy numbers for transcription factors and targets……... 134 Figure C.1: Plots of singular values trajectories .......................................................... 135 Figure C.2: Comparison between dense and sparse latent factors ............................. 136 Figure C.3: Annotated Manhattan plot for PEB 1 ........................................................ 136 Figure C.4: Annotated Manhattan plot for PEB 2 ........................................................ 137 Figure C.5: Annotated Manhattan plot for PEB 3 ........................................................ 137 Figure C.6: Annotated Manhattan plot for PEB 4 ........................................................ 138 Figure C.7: Annotated Manhattan plot for PEB 5 ........................................................ 138 Figure C.8: Summary of previously reported information for genes in Figure 4.4….....139 x CHAPTER 1 INTRODUCTION The word “omic” is nowadays used as a generic term to represent collections of biological data obtained with high-throughput technologies (e.g., tandem mass spectrometry, array and sequencing technologies, agriculture imaging, metagenomics). These collections have emerged as a way of describing complex biological systems as a whole [1]. In section “1.1 Where do omics come from?” we review the types of systems that omics characterize and the technologies used to generate omic data. Once omics are measured, their information can be integrated to discover novel biomarkers or understand the interaction between multiple phenotypes [2]. A review of omic integration methods is presented in section “1.2 How is omic data integrated and analyzed?”. 1.1 WHERE DO OMICS COME FROM? Perhaps the first application of the suffix –ome (a mass or totality of something) in genetics was in the word genome. The term is attributed to the German botanist H. Winkler, who chose it to represent an organism's haploid set of chromosomes and the genes it contains [3]. It has been suggested that the suffix –ome could have been chosen by Winkler as an analogy to broadly used botanical terms, such as microbiome, biome, and rhizome, each one representing an entire collection of biological entities of a certain kind [4]. The word genome gained broader popularity during the context of the apparent C-value paradox (the discrepancy between the amounts of DNA of an organism and the amount 1 needed to encode proteins). After discovering non-coding DNA in the 1940s and the solution of the paradox, the word genome started to be used as a synonym for the entire coding and non-coding DNA [5]. With the advent of Sanger’s sequencing method in 1977 (based on the production of all possible DNA fragments from a template, relying on chain termination by modified nucleotides and separation by electrophoresis) [6], it soon became possible to obtain complete genome sequences of microbes. This achievement inspired the creation of genomics as a new scientific discipline. The first generation of sequencing methods, together with the development of the polymerase chain reaction (PCR) [7] and recombinant DNA technology [8] during the 1980s and early 1990s [9], became instrumental tools for carrying out the Human Genome Project. During the early 2000s, the first draft of the reference human genome catalyzed the creation of the HapMap [10] and the Encyclopedia of DNA Elements (ENCODE) [11]. The formed aimed to study the common genetic variation between individuals while the latter aimed to annotate all functional elements within the human genome, such as genes and regulatory sequences. Both projects were fundamental for developing genotyping microarray technologies and the first genome-wide studies (GWAS – i.e., the inference of associations between thousands of genomic polymorphisms in a diverse population and a phenotype using linkage disequilibrium). Then, the word genome became a synonym for all the possible DNA sequences of a cell and their variants across populations. In the late 2000s, massively parallel sequencing technologies (the so-called second or next-generation – e.g., 454 Life Sciences, Illumina Genome Analyzer, ABI Solid) were developed. These novel technologies allowed lower reactions costs and longer 2 sequencing reads, enabling the sequencing of whole genomes in large cohorts of participants [12]. Some of these cohorts, like the 1K Genomes in 2008 and the UK Biobank (sequencing 50 thousand individuals) in 2010, have contributed to detecting rare and very rare variants, usually associated with significant health and behavioral traits [13,14]. Thus, the genome was conceived as the entire collection of DNA sequences plus their common and rare variation across populations. The close relationship between the evolving concept of the term genome, together with the development of high-throughput technologies, greatly influenced the creation of a plethora of analogous fields in biology. Each of this fields aimed to study the genome complement of a particular set of functional elements. The term transcriptome, for example, was first proposed by Victor Velculescu, who defined it in 1995 as the entire collection of RNA molecules of an organism [15]. The same year, Velculescu and collaborators introduced Serial Analysis of Gene Expression (SAGE), a revolutionary method to compare samples by taking a snapshot of the population of messenger RNA [16]. Previously, parallel gene expression analysis primarily relied on clonal DNA microarrays (i.e., cDNA samples hybridized against oligonucleotides matching known genes) [17]. By contrast, SAGE addressed transcript presence of known and unknown genes in a more accurate way than microarrays (which accuracy can suffer due to artifacts from background noise). With the advent of RNA-seq in 2008, the discovery of novel genes and the assessment of expression levels continued to improve by achieving higher coverage and depths than SAGE [18]. On the other hand, the term proteome is attributed to Marc Wilkins, who first used it in the early 1990s to represent the genomic complement of proteins [19]. The study of the 3 proteome was enhanced by more precise separation techniques (like capillary electrophoresis, liquid and gas electrophoresis), advances in mass spectrometry (for example, the creation of “soft” ionization approaches, such as electrospray ionization – ESI- and matrix-assisted laser desorption ionization –MALDI-) and the introduction of microarray technologies for protein analysis (e.g., immunoassays, functional microarrays, and reverse-phase array) [20]. These methods have vastly improved the ability to identify and quantify novel proteins and their interactions. Furthermore, due to its crucial role in phenotype expression, the proteome is composed of a much broader set of entities than the genome or transcriptome, including all possible peptides, context-dependent functions, and post-translational modifications. Another example is the term epigenome, popularized in the 1990s as a merge between epigenetics (a term attributed to Conrad Waddington, who proposed it in the 1940s to describe inheritable traits in response to environmental stimuli) and the suffix ome. As in the previous omics, epigenome refers to the genomic complement of epigenetic marks. Therefore, the epigenome represents a broader category, composed of elements from all the previous molecular types discussed: DNA (e.g., patterns of DNA methylation), RNA (non-coding regulatory RNAs), and proteins (e.g., histone modification and chromatin remodelers). This diversity of elements imposed the need for a broader set of methods to characterize the epigenome [21]. To study the genome-wide patterns of DNA methylation, for example, methods like DNA restriction endonuclease assay (to compare the relative size of restriction fragments between individuals, depending on the sensibility of the restriction enzymes to methylated residues), and chromatin immunoprecipitation (Chip, using methylation-specific antibodies to isolate methylation fragments) followed by 4 microarray genotyping or sequencing, were proposed. Another method is the treatment of DNA with bisulfite reaction (i.e., applying sodium bisulfite to turn methylated cytosines into uracil) coupled with array hybridization or next-generation sequencing. Similar Chip methods have also been applied to study RNA-protein interaction (RIP, which removes DNA from samples and captures RNA with specific antibodies). Mass spectrometry has also been widely used to detect histone modification and isoforms. The explosion of omic data in the last few years has also inspired the utilization of the - ome suffix to describe a much broader set of biological entities, derived from high- throughput methods, of high-dimensional nature, or alluding to the totality of items on a system. The term phenome, for example, was first proposed by Michael Soulé in 1967 as the collection of all possible phenotypes of an organism. However, with the advent of high-throughput phenotyping, the term is now applied to the set of all high-dimensional phenotypes acquired at an organism-wide scale [22]. Due to advances in phenotyping techniques, many projects have been able to produce extensive phenomic records for different organisms, including humans (e.g., UK Biobank [14]), mice (Euro Phenome [23]), and plants (International Plant Phenotyping for plants [24]). Technological advances that made extensive phenotyping possible include neuroimaging via structural MRI to study neuronal and cognitive functions [25], automated data loggers to record behavioral data [26], and spectroscopic imaging of crop plants to measure thousands of agroeconomic traits [27]. Soon it became clear that single-layer analyses could not truly capture the synergies between molecular factors across omics (e.g., how the expression of a gene is non- linearly affected by mutations and epigenetic alterations). Therefore, an integrative omic 5 approach has emerged [28]. The purpose of this approach is to represent multiple omics in a rational and unified way, highlighting the variability across subjects while minimizing the redundant signal from groups of related features (e.g., genes in the same ontology or pathway, linkage blocks) [29,30]. Several computational and statistical methods have been proposed to conduct this task. Next, we review some of the most popular algorithms and models available for omic integration. 1.2. HOW IS OMIC DATA INTEGRATED AND ANALYZED? Omic integration refers to a vast group of techniques, all conceived to explore the combined effects and synergies across different high-throughput types of biological data. Here, we review some of the most popular methods currently available for omic integration, organizing them as 1) matrix decomposition based, 2) graph-based, 3) correlation-based, and 4) regression-based. The first group will include methods that distill an extended matrix of omics 𝑋 (binding omics blocks by column) into a factor representing variability across subjects and a factor representing the contribution of each omic feature to that structure. The second group will include dimension reduction techniques, considered under the paradigm of graph embedding [31]. The third group comprises methods that explore associations among features within and across omics by explicitly exploiting the correlations between variables. The last group will include methods based on regression, where the response is an entire omic block. 1.2.1 Matrix decomposition-based methods This group includes techniques that work on an extended omic matrix 𝑋 = [𝑋1 … 𝑋𝐿 ] (with 𝑋𝑙=1,…,𝐿 being a matrix representing the 𝑙-th omic block attached by columns) and attempt a decomposition of it into two factors. The first factor collapses the 6 redundancies within and between omics -by creating orthogonal columns representing the independent signals across omic features (that we will call 𝑍)-. The second factor represents the contribution of each omic feature to this combined effect (we will call this factor 𝑊). Many methods assume a linear relationship between the two factors of the form 𝑋 ≈ 𝑍𝑊. Standard Singular Value Decomposition (SVD), Principal Component Analysis (PCA), and sparse versions fall under this category. Methods such as the Integrated Non-negative Matrix Factorization (iNMF [32]) assume a common 𝑍 among omics and minimizes ‖𝑋𝑙 − 𝑍𝑊𝑙 ‖2𝐹 (Where 𝑊𝑙 are the weights of the 𝑙-th omic). Other methods such as Integrative Clustering (iCluster [33]) and Joint and Integrated Variation Explained (JIVE [34]) assume a latent variable model instead. iCluster defines a model 𝑋 = 𝑍𝑊 + 𝜀 that imposes a LASSO penalty on the elements of 𝑍 and 𝑊. JIVE on the other hand, assumes a model 𝑋 = 𝑍𝑊 + 𝐼 + 𝜀, where 𝐼 is an extra term that represents the individual variation within omics. In all these cases, 𝑍 and 𝑊 (or 𝑊𝑙 ) are estimated by iterative procedures, such as expectation maximization algorithm (EM). Other methods use a Bayesian approach to find clusters of subjects while identifying regulatory modules across omics. These methods assume that 𝑋 comes from a finite mixture. The components of the mixtures can be taken as clusters, inferred via Markov Chain Monte Carlo (MCMC) methods. The Joint Bayes Factor (JBF [35]) method, for example, assumes a model of the form 𝑋𝑙 = 𝑍𝑙 (𝑊 + 𝑊𝑙 ) + 𝜀𝑙 , and imposes sparsity on the elements of 𝑊𝑙 and 𝑊 via a Bernoulli process. The method uses a Gibbs sampler to iteratively find 𝑊𝑙 , 𝑊, and 𝑍𝑙 . Methods like Multiple Dataset Integration (MDI [36]) and Bayesian Consensus Clustering (BCC [37]), on the other hand, assume a Dirichlet process, where the clusters are represented by an indicator matrix, sampled from a multinomial 7 distribution. MDI assumes 𝑊𝑙 is an indicator matrix of cluster membership for each omic. The contribution of each feature to each cluster is estimated by means of an “assignment probability”, assumed to have a Beta prior distribution. BCC extends this idea, but by explicitly modeling adherence to a global cluster. 1.2.2 Network-based methods Methods in this category can be conceived under the paradigm of graph embedding [31]. Under this framework, data points in a high dimensional space are represented by a graph. The graph’s connections represent similarities between points. The goal is then to map points onto a space of smaller dimensions while preserving similarity. This mapping can be represented by 𝑍 = 𝑓(𝑋, 𝐴), where 𝑓 is a function (which can be linear, non-linear, explicit, or implicit), 𝑋 is the extended omics matrix, as before, and 𝐴 is a matrix representing similarities between subjects. This framework includes methods such as Laplacian Eigenmap [38], ISOMAP [39], and Local Linear Embedding [40], together with linear counterparts, such as SVD. For instance, when 𝑓 is linear, 𝑍 ′ 𝑍 = 𝐼, 𝐴 = 𝑋𝑋 ′ , and the features are standardized, 𝑍 becomes the principal components of 𝑋. The graph embedding framework can also be extended to include non-linear mappings employing the “kernel trick”. The intuition behind this is to map 𝑋 onto a higher dimensional Hilbert space (we can think of 𝑓(𝑋) as a new matrix with row vectors in the new feature space), and then performing the linear algorithm in this new feature space (e.g., SVD on 𝑓(𝑋)). Then, the problem is solved by taking 𝑍 as the principal components of 𝑓(𝑋). A large body of literature frame this problem under reproducible kernel Hilbert spaces [41]. For example, a kernel matrix 𝐾 (e.g., 𝐾 = (𝑋𝑋 ′ + 𝑐)𝑡 ) can be used to define 𝑓 implicitly, taking 𝐴 to be equal to 𝐾 = 𝑓(𝑋)𝑓(𝑋)′ . Then, 𝑍 is obtained by computing the eigenvectors of 𝐾 8 (the so called “Kernel PCA”) [42]. Depending on the dimensions of the problem, one can define a separated kernel by layers of information (𝐾𝑙 ) and average them during the estimation of 𝑍. Methods like Similarity Network Fusion (SNF [43]) define 𝐴 as a scaled exponential kernel of the distance between samples. Others, such as regularized Multiple Kernel Learning for Dimension Reduction (rMKL-DR [44]), do so by adding constraints in the contribution of each kernel to 𝑍. The theory also allows to reformulate some supervised methods, such as linear discriminant analysis and support vector machines. For these, 𝐴 can be rendered to represent the similitudes within and between classes of subjects. Examples of these in omic integration are smooth t-statistics Support Vector Machines (stSVM [45]) and Features Selection Multiple Kernel Learning (FSMKL [46]). 1.2.3 Correlation-based methods The methods we describe in this section directly exploit the correlations between features and can be considered variations of the traditional canonical correlation analysis (CCV) and partial least squares (PLS). In all cases, each omic block can be modeled as before: 𝑋𝑙 = 𝑍𝑙 𝑊𝑙 + 𝜀𝑙 . The problem now is formulated by turning the omic blocks into vectors and multiplying them by the row vectors 𝑎𝑙 and 𝑎𝑙′ (𝑙 ≠ 𝑙′), so that 𝑓(𝑎𝑙 𝑍𝑙 𝑊𝑙 , 𝑎𝑙′ 𝑍𝑙′ 𝑊𝑙′ ) = 𝑓(𝑏𝑙 𝑊𝑙 , 𝑏𝑙′ 𝑊𝑙′ ) is maximized. Here, 𝑓 is the correlation function for CCV and a covariance function for PLS. In both cases, 𝑏. and 𝑊. are estimated. Variations of the problem include sparsity constraints during the estimation of 𝑊. (sCCV [47]), and embedded structure of groups of features (ssCCV [48], sgCCV [49]). In PLS, the problem can also be generalized to multiple blocks (MBPLS [50]), with 𝑓(𝑏𝑙 𝑊𝑙 , 𝑔(∑𝑙′ ≠𝑙 𝑏𝑙′ 𝑊𝑙′ , 𝜃)), where 𝑔 is a function of the linear combination of multiple sets of information, and 𝜃 is a vector of extra 9 parameters. Sparse version of this (sMBPLS [51]) add an iteration procedure and LASSO penalties on 𝑏. , 𝑊. , and 𝜃. 1.2.4 Regression-based methods Methods in this section treat the problem of omic integration as one of regression: 𝑌 = 𝑓(𝑋, 𝛿) + 𝜀, where the response is an omic block 𝑌, 𝑋 is one or more omic blocks, and 𝛿 is a matrix of coefficients. To exploit the true multivariate nature of the problem, methods such as the Reduced Rank Regression (RRR [52]) assume a linear 𝑓 and impose restrictions in the rank of 𝛿. This rank condition implies the existence of linear constraints due to dependencies within each omic block, and between them. Therefore, by estimating 𝛿, we can address which features are associated within and between omic blocks. Variations of the method provide sparse solution with biologically more interpretable results (sRRR [53]). In the Bayesian context (BsRRR [54]), a different prior distribution can be assumed separately for each omic block and their effects: 𝑌 = 𝑓(∑𝑙 𝑋𝑙 𝛿𝑙 ); 𝑋𝑙 ~𝑝(𝑋𝑙 |𝜃); 𝛿𝑙 ~𝑝(𝛿𝑙 |𝜔) (where 𝜃 and 𝜔 are hyper-parameters estimated from the data or assumed to have a distribution themselves). This approach has the effect of handling different scales by omic (e.g., discrete for SNP, counts for RNA-seq data) and considers different penalties for dealing with high dimensionality. The methods reviewed offer a general analytical framework to integrate different layers of data effectively. However, the computational performance of many of these methods suffers when data size becomes substantial [55,56]. The following chapters describe our method and R package for omic integration and present applications in two complex molecular basis traits: cancer and energy balance regulation. 10 CHAPTER 2 MULTI-OMIC INTEGRATION WITH SPARSE SINGULAR VALUE DECOMPOSITION This chapter was prepared alongside Alexander Grueneberg, Guanqi Lu, Felipe Couto Alves, Gonzalo Rincon, and Ana I. Vazquez. 2.1. ABSTRACT The availability of multi-layer omics data has drastically increased in the past years. Several methods have been developed to integrate these types of data effectively. However, our ability to integrate increasing volumes of omic data remains limited. This article presents multi-omic integration with Sparse Singular Value Decomposition (MOSS), a free and open-source R package to integrate multiple and large omics datasets. This package is computationally efficient and offers biological insight through cluster analysis and identification of biologically relevant omic features. Source code is freely available at CRAN. 11 2.2. INTRODUCTION Omic data is characterized by many parameters per sample (usually a much larger number of parameters than the sample size, the so-called p>>n). Thus, traditional methods (e.g., ordinal least squares) are insufficient to obtain significant insights from this multi-layer, high-dimensional data. To effectively integrate high-dimensional sets of data, novel methods have been developed [33,34,57–61]. These methods typically combine some form of projection onto a lower-dimensional space (e.g., to reveal structure among samples and features) with some form of feature engineering (e.g., to determine what genes or other molecular entities are most informative at explaining the differences and similarities between omics). These methods have profoundly contributed to our understanding of variation in complex traits across diverse levels of regulation (e.g., mutations in coding genes and epigenetic regulation) [62,63]. Thanks to ongoing biobank efforts, omic data also increases the number of available samples, providing higher prediction ability and statistical power [64]. However, more extensive data sizes make computations progressively lengthier or impossible to perform [65]. Moreover, extensive data sizes also compromise parallelizing complex algorithms (e.g., convolutional neural networks) [66]. To handle these limitations, we developed Multi-omic integration with Sparse Singular Value Decomposition (MOSS). MOSS is a free and open-source R package that performs data integration and feature selection on large data sets. It combines the flexibility of sparse singular value decomposition (sSVD) with parallel and in-disk computations to accommodate data sizes reaching biobank dimensions. In this article, we describe the package’s main capabilities and its mathematical and computational foundations. We 12 evaluate MOSS analytical performance using a realistic simulation of multi-omic data and benchmark it against state-of-the-art methods of omic integration. Instructions on how to download and install MOSS can be found at CRAN, as well as the package’s manual, vignette, and additional examples. 13 2.3. MATERIAL AND METHODS 2.3.1. Statistical background and algorithms Omic integration models: MOSS fits a partial least squares (PLS) [67] model, 𝐐 = 𝐖𝚺 + 𝛆, to find elements maximizing the associations between orthogonal projections of an omic working as response (represented by the matrix 𝐘 = (𝑦𝑖𝑙 )𝑙=1,…,𝑚 𝑖=1,…,𝑛 ) and omics working 𝑗=1,…,𝑝 as predictors (represented by the matrix 𝐙 = (𝑧𝑖𝑗 )𝑖=1,…,𝑛 ) omics. These projections are 𝑘=1,…,𝑞 𝑘=1,…,𝑞 represented by matrices 𝐐 = 𝐘𝐔 and 𝐖 = 𝐙𝐕, where 𝐔 = (𝑢𝑙𝑘 )𝑙=1,…,𝑚 and 𝐕 = (𝑣𝑗𝑘 )𝑗=1,…,𝑝 are orthonormal columns of loadings. The matrix 𝛆 = (𝜀𝑖𝑙 )𝑙=1,…,𝑚 𝑖=1,…,𝑛 represents uncorrelated residuals, with 𝜀𝑖𝑙 ~(0, 𝜎𝜀2 ), not following any particular distribution. The PLS is iteratively solved by least squares to find 𝐐 ,𝐖, and 𝚺. The rows of 𝐘 and 𝐙 are assumed to represent the same individuals or samples, while their columns are assumed to have zero means and unit variances. Data integration enters the model through 𝐙, as a set of normalized omic blocks appended column-wise, such as 1 1 𝐙 = [||𝐙 2 𝐙1 … 𝐙𝑡 ] 1 ||2 ||𝐙𝑡 ||2 2 where 𝑡 is an arbitrary integer representing the number of omic blocks, and ||. ||22 is the square of the Frobenius norm of a matrix. Models with covariates: To remove the effects of covariates, we use the model 𝐐 = 𝐗𝛅 + 𝑔=1,…,𝑠 𝐖𝚺 + 𝛆, where the columns of matrix 𝐗 = (𝑥𝑖𝑔 )𝑖=1,…,𝑛 represent a set of 𝑠 covariates and 𝑘=1,…,𝑞 𝛅 = (𝛿𝑔𝑘 )𝑔=1,…,𝑠 represent the effects of covariates on 𝐐. MOSS removes the effects of these covariates by pre-multiplying each term of the above equation by 𝐀 = 𝐈𝑛 − 𝐗(𝐗 𝑻 𝐗)− 𝐗 𝑻 , where 𝐈𝑛 = 𝐝𝐢𝐚𝐠( 1, ⏟… ,1 ). 𝑛−𝑡𝑖𝑚𝑒𝑠 14 Estimation of parameters: To estimate 𝐔, 𝐕, and 𝚺, MOSS minimizes the following loss function: 𝐿 = ‖𝐘𝐔 − 𝐙𝐕𝚺‖22 + 𝜆𝑈 ( 𝛼𝑈 ‖𝐔‖1 + (1 − 𝛼𝑈 )‖𝐔‖22 ) + 𝜆𝑉 ( 𝛼𝑉 ‖𝐕‖1 + (1 − 𝛼𝑉 )‖𝐕‖22 ) [Eq.2-1] where 𝐘 and 𝐙 are either the original or covariates-adjusted omics. The first term in the above sum is the L2 norm of approximation of 𝐐 by 𝐖𝚺, subject to 𝚺 = 𝐝𝐢𝐚𝐠(𝜎1 , … , 𝜎𝑞 ), 𝜎1 > ⋯ > 𝜎𝑞 > 0, and 𝐔𝑇 𝐔 = 𝐕 𝑇 𝐕 = 𝐈𝑞 . The second and third terms are Elastic Net (EN) [68] penalties on the elements of 𝐔 and 𝐕. The EN penalty balances well-established techniques of variable selection (zeroing out the noise and redundant signal between omic features) and shrinkage (to account for the high number of omic features that often exceed the number of samples). The expression ||. ||1 corresponds to the L1 norm. Here 𝜆. = 𝑞(𝜈. ), is considered as a monotonically decreasing function of 𝜈. (the number of desired elements different from zero) onto positive real numbers 𝜆. , and 𝛼. is any number between zero and one. The value of 𝛼. balances shrinking and variable selection. When sparsity is not imposed (i.e., 𝜆. = 0), solutions for [Eq.2-1] are obtained by taking partial derivatives on 𝐔, 𝐕, and 𝚺, and setting them to zero. Considering 𝐁 = 𝐙𝑇 𝐘, solutions for 𝐔, 𝐕, and 𝚺 can be obtained from the partial singular value decomposition of 𝐁 of rank 𝑞 ({𝐔 ̃, 𝐕 ̃, 𝚺 ̃} = 𝑆𝑉𝐷(𝐁, 𝑞)). When 𝜆. > 0, solutions are obtained iteratively from the following set of equations: 𝐔̃ −1𝜆𝑈 𝛼𝑈 𝐬𝐢𝐠𝐧(𝐔 ∗ )) ∗ 2 𝐔 = 1+𝜆𝑈 (1−𝛼𝑈 ) { 1 [Eq.2-2] ̃− 𝜆𝑉 𝛼𝑉 𝐬𝐢𝐠𝐧(𝐕 ∗ )) 𝐕 𝐕∗ = 2 1+𝜆𝑉 (1−𝛼𝑉 ) 15 Equations in [Eq.2-2] are solved using the algorithm in [69] extended to include the EN parameter 𝛼. , where 𝐔∗ = 𝐔𝚺 and 𝐕 ∗ = 𝚺𝐕. To ensure that 𝐔 and 𝐕 are orthonormal in each iteration, the QR decomposition of 𝐔∗ and 𝐕 ∗ (𝐔∗ = 𝐋(𝑈) 𝐑(𝑈) and 𝐕 ∗ = 𝐋(𝑉) 𝐑(𝑉) ) is used. In each iteration, 𝐔 is recovered as 𝐔 = 𝐋(𝑈) 𝐃(𝑈) , where 𝐃(𝑈) = 𝐝𝐢𝐚𝐠(𝑑1 , 𝑑2 , . . . , 𝑑𝑞 ), (𝑈) and 𝑑𝑘 = 1/||𝐋𝑘 ||2 (the inverse of the norm of each column of 𝐋(𝑈) ). The same steps are used to recover 𝐕. After a fixed number of iterations, or at convergence, the final values of 𝐔 and 𝐕 are used to recover 𝚺 (𝚺 = 𝐔𝑇 𝐁𝐕 𝑇 ). Tuning hyperparameters: The value 𝜆. is tuned following [69], modified to tune 𝜆𝑈 , 𝜆𝑉 , or both. Briefly, 𝜆. is chosen as the 𝜈. -th order statistic of 𝐔, or 𝐕, where 𝜈𝑈 and 𝜈𝑉 are fixed numbers representing the desired number of samples and features loadings different from zero (i.e., the degrees of sparsity), respectively. Then, the proportion of variance explained (PEV) by each one of a grid of values of 𝜈𝑈 and 𝜈𝑉 is calculated. The trajectory of PEV across values of 𝜈𝑈 and 𝜈𝑉 is then used to select an “optimal” 𝜆. value to solve [Eq.2-2]. This selection is made automatically via two alternative methods. The first 𝜕𝑃𝐸𝑉 method uses the first empirical partial derivative of PEV ( ) to choose the value of 𝜈. 𝜕𝜈. at which the change in PEV is maximum (“liberal” method). The second method choses the value of 𝜈. at which the change in PEV stabilizes (“conservative” method). Similarly, MOSS displays a classic plot of 𝜎1 , … , 𝜎𝑞 (Scree plot), to visualize the change in variance explained by each latent dimension. The number of latent dimensions 𝑞 is not tuned by MOSS internally. However, automatic suggestions are provided based on the above tuning methods, where the trajectory of 𝜎1 , … , 𝜎𝑞 is used instead of the PEV one. Cluster analysis: MOSS can use the columns 𝐐 to detect clusters of samples via Density- Based Spatial Clustering of Applications with Noise (DBSCAN) [70]. DBSCAN is one of 16 the most potent clustering techniques to delimit clusters of irregular shapes. Essentially, DBSCAN identifies groups of densely packed points without specifying the number of clusters a priori. In MOSS, neighborhoods of nearby points can then be tuned by evaluating different cluster partitions over a grid of values of e, a hyperparameter controlling the neighborhood. MOSS chooses the number of “optimal” clusters that maximizes the Silhouette score [71] over a grid of possible e values, as in [72]. Visualization of clusters: Additionally, MOSS can use t - Stochastic Neighbor Embedding (tSNE) to project a group of columns of 𝐐 onto a two-dimensional display [72]. Essentially, tSNE projects multiple dimensions onto a lower-dimensional display while conserving local neighborhoods (eventually representing data clusters) [73]. tSNE is an effective technique to reveal clusters [74]. The algorithm has two fundamental parameters: perplexity (which accounts for the adequate number of local neighbors) and cost (related to the difference between the neighborhood’s distribution in the higher and lower dimensional spaces). Since low costs are more likely to reveal clusters, MOSS tunes the tSNE projection by choosing the map of minimum cost among multiple random starts of the algorithm. 2.3.2. Syntaxis of the main function moss The package's primary function is called moss. This function works along with other auxiliary functions to integrate data sets, pre-process, integrate them, and generate plots. The details of each function can be obtained from the package help pages. Following, we describe the inputs and outputs of moss. Input: The input data must be passed through data.blocks as a list of omic blocks. Each row must represent a subject or sample, and each column an omic feature or variable. 17 MOSS allows each block to be of the class matrix, array, or filed-backed big matrix (FBM) [75]. Objects of class FBM can be passed to moss whenever data sets are too big to be handled in RAM. For this, the only requirement is for package bigstatsr [75] to be installed. Alternatively, omic blocks passed to moss as ‘matrix’ or ‘array’ can be internally turned into FBM by setting the argument use.fbm = TRUE. In our experience, this can speed up computations when matrices are small enough to fit in memory but still too large to be handled in a reasonable time. If covariates adjust omic blocks, these can be passed as a matrix, vector, or data frame, through argument covs. Standardization, normalization, and imputation: Arguments scale.arg and norm.arg control, respectively, the standardization of each column within an omic block (i.e., centering to zero mean and scaling to unit variance), and normalization. Omics within data.blocks are expected to have named rows. A warning message is displayed if at least one omic is missing row names, or the row names are inconsistent across blocks. In the presence of missing data, a simple imputation by the mean of each column is provided. Nevertheless, before calling moss, the user is recommended to run standard quality controls (such as calculating the proportion of missing data across rows and columns, zero variance features, and minor allele frequency). Methods: Without additional constraints, MOSS treats [Eq.2-1] as a partial least squares model (PLS, method=”pls”). To specify which omic will be used as responses 𝐘, a number from 1 to t (the number of omic blocks) must be passed to resp.block. By imposing additional constraints, more multivariate techniques can be performed. For example, when 𝐘 is assumed to be the identity matrix, MOSS treats [Eq.1-1] as a principal components analysis (PCA) (method=”pca”). Alternatively, if 𝐘 is a column matrix with 18 values representing different categories, [Eq.2-1] is treated as a linear discriminant analysis (LDA; [76]) (method=”lda”). Cluster analysis and visualization: If argument cluster=TRUE, package dbscan [77] is used to find clusters on Q. By default, this is done on all Q columns. However, a different set of columns can be passed as a vector of indexes through argument axes.pos. By default, the number of clusters is tuned on a grid of 100 values of e (eps_res), evenly spaced between 0 and 4 (eps_range). With this setting, clusters with less than two samples are discarded. Same options are obtained by setting cluster=list(eps_res=100, eps_range=c(0,4), min_clus_size=2). This option allows for alternative values of eps_res, eps_range, and min_clus_size. To obtain the two-dimensional embedding of Q (or a subset of columns indicated by axes.pos), users can do tSNE=TRUE, or tSNE=list(perp=50, n.iter=1e3, n.samples=1), where perp, n.iter, and n.samples are the perplexity parameter of tSNE, number of iterations, and number of random initial conditions, respectively. Sparsity constraints: Within MOSS, 𝜈. is specified by arguments nu.u (vector of integers between one and the total number of samples) and nu.v (vector of integers between one and the total number of features). If the values of nu.u and nu.v is not specified, only a standard (i.e., dense) SVD is computed. The values of 𝛼. are specified through arguments alpha.u and alpha.v. Argument exact.dg tells moss to chose 𝑓: 𝜈. → 𝜆. , such as the number of elements different from zero in each column of U and V is exactly 𝜈. . Argument lib.thresh=TRUE (default) tells moss to select the value of 𝜈. at which the change in PEV is maximum. If lib.thresh=FALSE, the value of 𝜈. at which PEV reaches a plateau is chosen. 19 Parallel computing: By default, the process of tuning the degree of sparsity is done in series. However, this can be changed by setting argument nu.parallel=TRUE. This option uses package future.apply, to allow for simple parallel distribution of tasks on a local machine or computer cluster [78]. Plots: If argument plot=TRUE, several high-level plots will be produced (see Results). This argument requires package ggplot2 [79] to be installed. Outputs: Function moss returns a named list with the results of the data integration plus additional analyses. The list includes matrices B and Q, along with two lists containing the results of the dense and sparse SVDs. The output list also has the plots with embedding, cluster analysis, selected items by omic, and signatures of features by a cluster of samples. Data: Analytical performance was evaluated on data generated with the R package MOSim [80]. MOSim uses existing omic data to sample pairs of genes and regulators from which differential expression is simulated for a given experimental design. We used MOSim’s accompanying mouse omic data from the STATegra project [81] to seed all omics. In all scenarios, three omics representing gene expression (RNA-seq count data), micro-RNA seq (miRNA-seq), and ATAC-seq data of DNAase I activity (DNase-seq) were simulated. Signal effects were imposed by assigning different proportions of miRNA-seq and DNase-seq features (5% and 20% of total features) to regulate the expression of 15% of total genes across three clusters of samples. A first simulation scenario with a small number of samples (100) and features (1,000) illustrated MOSS main capabilities. Then, simulations for an increasing number of samples and features were used to evaluate MOSS’ performance at recovering signals (i.e., groups of differentially 20 expressed genes and corresponding regulatory features) in more realistic scenarios. Due to a restriction of MOSim to simulate massive data sizes (hundreds of thousands of rows and millions of features), we benchmarked MOSS on representing rank-one bi-clusters, embedded in three synthetic omic blocks with Gaussian noise, and for increasing numbers of samples and features (see the help page for function simulate_data()). 2.3.3. Performance evaluation We evaluated the performance of MOSS in terms of its ability to detect differentially expressed genes and their regulatory features. To do that, we used MOSim with ten different random starts. We simulated the three omics described above for all combinations of 100, 1,000, and 10,000 samples across three clusters for 1,000 and 10,000 features in each random start. We used 5% and 20% of miRNA-seq and DNAse- seq as regulatory elements for all data sizes. In all scenarios, 15% of total genes were set as differentially expressed (DE). In each scenario, we calculated true positives (TP) as the number of signal features (DE genes and regulatory features) detected by moss, true negatives (TN), as noisy features (not DE genes or regulatory elements) not detected by moss, false positives (TP) as noisy features detected by moss, and false negatives, as signal features not detected by moss. These quantities were used to calculate the 𝑇𝑃 + 𝑇𝑁 𝑇𝑃 𝑇𝑁 accuracy ( ), sensitivity ( ), specificity ( ), and precision 𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁 𝑇𝑃 + 𝐹𝑁 𝑇𝑁 + 𝐹𝑃 𝑇𝑃 ( ). 𝑇𝑃 + 𝐹𝑃 To evaluate MOSS computational time in the context of other methods of omic integration, we used iCluster [33], NMF [82], SNFtool [43], mixOmics [58], and OmicsPLS [83] R packages to run in the same scenarios. The scenarios consisted of simulations for different combinations of 100, 1,000, 10,000 and 100,000 samples, and 1,000, 10,000, 21 100,000, and 1,000,000 features. We allowed all methods to have enough available memory (100 Gb) to produce results in hours rather than days. 22 2.4. RESULTS 2.4.1. Example of unsupervised omic integration with MOSS The following example shows how to perform omic integration with MOSS on a simulated data set with three omics and three clusters of samples. The method was a sparse PCA assuming ten latent dimensions. EN was used to tune the degree of sparsity of features. tSNE was used to embed the first three columns of Q onto two dimensions. DBSCAN was used to delimit clusters. Figure 2.1-A shows the PCA’s scree plot (out_moss$scree_plot). A clear jump in the singular values occurs between the first and the second dimension. Nevertheless, the scree plot trajectory reaches a plateau after the fourth dimension. Figure 2.2-A shows the trajectory for PEV and derivatives for varying degrees of sparsity for features. (out_moss$tun_dgSpar_plot). A degree of sparsity of 300 is suggested according to the conservative tuning method. 23 Figure 2.1: Scree plot, PEV trajectory, and partial derivatives. A) The left-most panel shows the singular values corresponding to the first 10 SVD dimensions (scree plot). The following two panels correspond to the first and second empirical partial derivatives of the scree plot. B) Similarly, the left-most panel shows the PEV trajectory on a grid of degrees of sparsity, with the center and right-most panels representing its first and second empirical partial derivatives, respectively. Figure 2.2 shows the results of the feature selection across omics. The first two latent dimensions had the largest number of features selected. Most of these features were DNA-seq, followed by gene expression. Conversely, the features selected in the third dimension were mainly miRNA-seq, followed by DNA-seq features. The highest absolute loading values were obtained for the first and third latent dimensions (out_moss$selected_items). 24 Figure 2.2: Omic contribution to selected features. The numbers of features selected by latent factors are shown in a log2 scale. The pie charts slides represent the relative contribution of each omic (RNA-seq, miRNA-seq, and DNase-seq) to the features selected by dimension (PC index). The pie charts ratio represents the quotient between the squared loadings of the selected features by dimension and their standard deviation. The tSNE map and cluster analysis are presented in Figure 2.3 (out_moss$clus_plot). The three simulated clusters are detected. From Figure 2.4, we can see that the first and third dimensions are the more relevant for cluster formation. 25 Figure 2.3: Cluster analysis. The three first SVD latent factors of the integrated omics were embedded onto two dimensions via tSNE. Clusters (labeled 1, 2, and 3) were delimited via DBSCAN. 26 Figure 2.4: Association between clusters and SVD dimensions. The plot shows the results of Kruskal-Wallis association tests between clusters and the first ten SVD dimensions (PC index). A different point shape and color represent each cluster. Lastly, to determine what selected features dominated each cluster, the user can look at signatures within out_moss$feat_signatures. If the number of features is too large to visualize correctly, the function moss_signatures can be called (Table A.1). Figure 2.5 shows the top 5% of candidate features. The percentage is based on the squared means of features values within the respective omic. Candidates are defined as features with standard error intervals excluding zero. 27 Figure 2.5: Signature of features for two clusters. The plot shows the top features (y- axis) selected by latent factor and omic block. Points correspond to the average feature values plus and minus one standard error. For clarity, only the top 1% of selected features and the first two clusters are shown. 2.4.2. Example of supervised omic integration with MOSS Setting method = “pls” and resp.block = 1 tells moss to run a PLS using gene expression as a multivariate response and the remaining omic blocks as multivariate predictors. The following example shows how to run a sparse PLS, where the degree of sparsity is separately tuned for both responses and predictors (code provided in Table A.1). The trajectory of PEV and derivatives for both responses and predictors are displayed in Figure 2.6 (out_moss$tun_dgSpar_plot). In this example, we have set lib.thresh = TRUE to use the liberal method of features selection. Figure 2.6 suggests that the liberal method would select less than ten genes and approximately 50 regulatory features. As before, 28 the contribution of each omic to the selected items, clusters, and features signatures can be obtained from out_moss. In addition, the accompanying function moss_heatmap can retrieve a heatmap of the covariance matrix between selected responses, and predictors can be retrieved using the accompanying function moss_heatmap. This function uses the ComplexHeatmaps R package [84] (Table A.1). Figure 2.6: PEV trajectory plot and partial derivatives for a PLS analysis. The top three panels represent the PEV trajectory on a grid of degrees of sparsity for predictor features and its first and second empirical partial derivatives, respectively. The bottom three panels show this information for responses. Here, responses and predictors were gene expression and regulatory omics, respectively. 29 Figure 2.7: Output of the function moss_heatmap. Rows and columns represent genes and regulatory features selected by a given combination of latent factors. In the example, the first and third dimensions of the sparse SVD of the covariance between gene expression and remaining omics were used. Features names are omitted for clarity. 2.4.3. Evaluation of MOSS analytical performance Results of performance evaluation are shown in Figure 2.8. The accuracy of MOSS to detect signal features was in the order of ~0.9 for all scenarios, except for the largest number of features and lowest signal intensity. Sensitivity was high for all scenarios of high signal intensity, dropping to ~0.6 for more than 100 features. When signal intensity was low, sensitivity dropped in all scenarios, especially for the largest number of features. However, increasing sample sizes corresponding with higher sensitivities. Specificity was highest for scenarios of low signal intensity and moderated to a high number of features. Higher signal intensities corresponded to lower specificities, but this difference was almost negligible for the number of features. Precision was high in all scenarios. However, it dropped for the lowest signal intensity values and the number of features. 30 Figure 2.8: Performance of feature detection with MOSS. Performance at detecting features was evaluated on simulated data with different combinations of numbers of samples (n) and features (p) for two alternative signal intensities (0.05 and 0.2), defined as the proportion of features with regulatory effect. The performance metrics were accuracy (ACC), sensitivity (SEN), specificity (SPE), and precision (PRE). Rectangles and bars represent metric values averages and average plus-minus standard errors across the simulations’ random starts, respectively. Adjacent columns by n tick mark represent alternative signal intensities. 2.4.4. Benchmarking Figure 2.9 shows the computational time taken by MOSS and other omic integration methods for different combinations of samples (n) and the number of features (p). In all cases, three omic blocks were simulated. We allowed all methods to have enough available memory (100 Gb) to produce results in hours rather than days. Nevertheless, for scenarios approaching bio-bank dimensions (i.e., hundreds of thousands of samples and omic features), all methods either crashed or could not finish before 24 hours. The exception to this pattern was the use of moss MOSS with use_fbm =TRUE. For a 31 relatively low number of samples (e.g., n <1x103), all methods, except iCluster, NMF, and mixOmics, produced results in less than one hour. In iCluster, p =1x105 causes the program to run for more than a day, and with p=1x106, to crash. NMF and mixOmics took approximately 3 and 10 hours, respectively, when n=1,000 and p=1e6. Finally, the method with the most substantial influence of sample size was SNFtool, with better performances for smaller sample sizes, with the opposite being true for larger sample sizes. 32 Figure 2.9:Benchmarking of omic integration methods. The plot shows the computation time taken by MOSS and five other omic integration methods (iCluster, NMF, SNFtool, mixOmics, and OmicsPLS) to integrate three omic blocks and perform feature selection in different simulated scenarios. Scenarios corresponded to a different combination of numbers of samples and features in simulated data. Column panels (n) represent the number of samples, and row panels (p) represent the number of features. Each rectangle corresponds to a different omic integration method. The rectangle's height represents computing time in hours. The symbols “*” and “†” represent a method running for more than a day or crashing, respectively. MOSS was used with regular matrices or filed-backed big matrices (FBM). 33 2.5. DISCUSSION Omic integration emerged as a group of techniques for analyzing multiple omic data layers collectively and retrieving helpful information of shared processes within the cell [63]. However, the computational and statistical tools used to carry out these tasks are constantly challenged by the vast amount of generated data [29,85]. As an essential step in understanding the biology of complex traits, omic integration methods should retrieve informative results in a reasonable amount of time. For that purpose, we have developed MOSS, a free and user-friendly tool that rapidly retrieves information about the principal axes of variation across omic data, identifies features of possible biological roles, detects clusters of individuals, and represents them in terms of features of potentially functional role. We evaluated MOSS in terms of several metrics representing the performance of feature selection. In terms of both accuracy (i.e., ability to collectively detect signal and miss noise) and precision (ability to detect true from the false signal), MOSS best performance occurred in scenarios where the proportion of the number of features (p) to the number of samples (n) was low to moderate. The lower accuracy and precision in the scenarios with large p/n and low proportion of regulatory features could have been a consequence of the heuristic used to select features. Like related least squares-based algorithms, this performance could improve if the value of n is increased [86]. However, lower performance in high p/n and low signal-high noise scenarios is an unsolved challenge among omic integration methods [56]. Knowing if MOSS’s decrease in performance for this setting is smaller than for competitive methods would require further simulation studies. 34 The evaluation of the analytical performance of competitive methods was out of the scope of this paper and can be found elsewhere (e.g., [56,65]). Nevertheless, to support the standalone performance of MOSS, we have included the results of a small simulation in Figure A.1. Since the proportion of differentially expressed genes remained the same across scenarios, increasing p/n could also increase the number of false positives. Hence, lower signals could imply that some regulatory features are missed, reducing true positives. Similarly, increasing false positives can explain the drop in sensitivity (i.e., the ability to detect signal correctly) in the high p/n scenario. Likewise, specificity (i.e., ability to correctly miss noisy features) increased with fewer regulatory features, increasing true negatives, regardless of the higher false positives expected in the high p/n scenario. A thorough evaluation of performance across alternative tuning methods for feature selection is outside the scope of this paper. Nevertheless, we acknowledge that performance on the feature selection approach, and an alternative tuning method could improve MOSS performance. However, heuristics based on training sets can improve computational times compared to traditional cross-validation [87]. We have shown the ability of MOSS to retrieve biologically meaningful results in different simulated scenarios, ranging from a few numbers of samples and features to data volumes approaching biobank scales. Although trying to create synthetic but realistic data via the MOSim package, as in any other simulation approach, we acknowledge that only a finite combination of scenarios was explored. Nevertheless, in an earlier publication, we have shown that MOSS can also retrieve biologically meaningful results on real data. For example, in [88], we have used MOSS to effectively integrate information from ~60,000 35 features from gene expression, DNA methylation, and copy numbers across ~5,000 tumors from different cancer diagnoses. In that work, we showed MOSS's ability to detect clusters of tumors beyond original diagnoses, which shared molecular features of potential therapeutic use. One of MOSS's essential capabilities is the handling of data sizes reaching biobank dimensions. However, even when regular R matrices are used, MOSS can perform in a short amount of time compared to other methods of omic integration and feature selection. In addition, package bigstatsr allows MOSS to perform a dense-partial SVD in data sets as big as the UK Biobank [75]. In addition, MOSS includes a convenient parallel computing scheme, as provided by the future.apply R package. Although this implies that the user loses some control on how parallel jobs are administered, since future.apply works in multiple platforms, this option reduces the guesswork and the dependency of the parallel computing strategy on operating systems used. MOSS is a flexible, fast, and robust tool to perform data integration. It shares capabilities with popular methods, including estimation of latent data dimensions, feature selection, and convenient graphical displays. Nevertheless, unlike these methods, MOSS integrates datasets too large to handle in RAM, requiring shorter amounts of time. 36 CHAPTER 3 MULTI-OMIC SIGNATURES IDENTIFY PAN-CANCER CLASSES OF TUMORS BEYOND TISSUE OF ORIGIN This chapter has been adapted from the article published in the open-access journal, Scientific Reports (DOI: 10.1038/s41598-020-65119-5). 3.1. ABSTRACT Despite recent advances in treatment, cancer continues to be one of the most lethal human maladies. One of the challenges of cancer treatment is the diversity among similar tumors that exhibit different clinical outcomes. Most of this variability comes from widespread molecular alterations that can be summarized by omic integration. We have identified eight novel tumor groups (C1-8) via omic integration, characterized by unique cancer signatures and clinical characteristics. C3 had the best clinical outcomes, while C2 and C5 had the poorest outcomes. C1, C7, and C8 were upregulated for cellular and mitochondrial translation and low proliferation. C6 and C4 were also downregulated for cellular and mitochondrial translation and had high proliferation rates. C4 was represented by copy losses on chromosome 6 and had the highest number of metastatic samples. Copy losses on chromosome 11 characterized C8, also having the lowest lymphocytic infiltration rate. C6 had the lowest natural killer infiltration rate and was represented by copy gains of genes in chromosome 11. C7 was represented by copy gains on chromosome 6 and had the highest upregulation in mitochondrial translation. 37 We believe that, since molecularly alike tumors could respond similarly to treatment, our results could inform therapeutic action. 38 3.2. INTRODUCTION Despite recent advances that have improved cancer treatment, it reigns as one of the most lethal human diseases. Cancer can be considered a highly heterogeneous set of diseases: while some tumors may have a good prognosis and are treatable, others are quite aggressive, lethal, or may not have a standard of care [89–91]. Cancer can also defy standard classification: a well-classified tumor may not respond to standard therapy, as expected, and may behave as a different cancer type [92–94]. Fortunately, with the advances of sequencing technologies, data has become available for research as never before. The Cancer Genome Atlas (TCGA), for instance, offers clinical and omic (e.g., genomic, transcriptomic, and epigenomic data) information from thousands of tumors across 33 different cancer types [95]. Much of this omic data can enable us to classify tumors and explain the striking variation observed in clinical phenotypes [96–99]. Omic integration has been successfully applied in previous classification efforts [72,100– 102]. These classifications have highlighted how molecular groups of tumors highly agree with human cell types. Alternatively, we hypothesize internal subtypes hidden by cell type and tissue characteristics influencing cell behavior. These subtypes could be distinguished by molecular alterations unlocking cancerous cell-transformation events. To test this hypothesis, we have developed a statistical framework that summarizes omic patterns in main axes of variation, describing the molecular variability among tumors. Key features characterizing each axis (i.e., features contributing the most to inter-tumor variability) are retained, while irrelevant ones are filtered. Retained features are then used to cluster tumors by molecular similarities and find specific molecular features representing each group. 39 Here we show that, after removing all tissue-specific effects, the cancer signal immediately emerges. The new molecular aggrupation, emphasizing shared tumor biology, can supply new insights into cancer phenotypes. We expect this novel classification to aid in developing therapeutic alternatives for tumors without a current standard of care. 40 3.3. MATERIAL AND METHODS 3.3.1. Pan-cancer data The TCGA offers a demographically diverse sample with comprehensive and modern multi-omic data. We retrieved data from 5,408 from 33 cancer types made available by the Genome Data Commons (GDC) repository [103] via the TCG-Assembler R package [104]. Omic data consisted of curated level-three data of genome-wide gene expression (GE), DNA methylation (METH), and copy number variants (CNV) profiles by tumor sample. GE profiles by sample corresponded with the logarithm of RNA-Seq counts by gene (Illumina HiSeq RNA V2 platform). METH profiles corresponded with CpG sites B- values from the Illumina HM450 platform, summarized at the CpG island level, using the maximum connectivity approach from the WGCNA R package [105], and further transformed into M-values (M=β/(1-β);[106]). CNV profiles corresponded to gene-level copy number intensity derived from Affymetrix SNP Array 6.0 platform, using human genome V19 as reference. The quality-control filtering process included excluding features with all zeros or coefficient of variation less than 1%. Samples or features disproportionally missing data (>20%) and single-sample batches were also excluded. Within the remaining samples, missing values were imputed by k-near neighbors, with k = 3. Finally, each omic block was adjusted by batch effects using ComBat [107]. The final sample size after retaining subjects with information for all three omics was n=5,408. Demographic information included gender, self-reported race and ethnicity, and patient's age at diagnosis (Table 3.1). Clinical information consisted of overall survival time and vital status at the final follow-up, sample type (from the primary tumor, metastases, or normal tissue), tumor-free fraction. We also used previous information from “The Immune 41 Landscape of Cancer” [108] and calculated significant differences between clusters using the Kruskal-Wallis tests [109]. These covariates included: intra-tumor heterogeneity fraction (as sub-clonal genome fraction), and rates of non-silent mutations, aneuploidy, homologous recombination defects (all three derived as deviations from the normal genome), proliferation (normalized difference between the number of dividing and non- dividing cells), and information from immune infiltrations (including scores for CD4+ cells, macrophages, lymphocytes, and natural killers) (See supplementary material in [108] for a detailed description of the scores’ calculation). Briefly, immune infiltration fractions were derived by CIBERSORT [110], assigned to different cell classes, and multiplied by the leukocyte fraction derived from methylation data [108]. 3.3.2. Omic integration, clustering, and features selection. The following four steps can conceptually describe our method. Step 1) Identification of major axes of variation and features selection. Integrative methods should capture combined effects across omic sites that could either span across omic layers (e.g., epigenetics, gene expression) or extend genome-wide (e.g., considering concomitantly contiguous CpG sites or even separated away sites). Let, 𝐗 = [𝐗 𝟏 , …, 𝐗𝐋] where 𝐗 𝑙 𝑙:{1,…, L} is a matrix representing the 𝑙-th omic, which row ith contains information representing a sample on one subject, and column jth represents an omic feature (e.g., a feature could be the expression of a specific gene or the methylation level for a given CpG site). Each group of features coming from a different omic block is centered, standardized, and divided by √𝑝𝑙 , where 𝑝𝑙 is the number of features from the 𝑙-th omic block. Normalization is done so larger groups of features do not dominate the data 42 integration step. Next, we conduct a sparse Singular Value Decomposition (sSVD) of 𝐗 to generate one factor that collapses the redundancies in the omics (by creating independent columns representing the independent signals across features) and one factor that collapses redundancies across samples, grouping subjects with similar signaling. This linear factorization can be represented as 𝐗 = 𝐙𝐖, where 𝐙 represents (linearly) independent axes of variability across subjects (i.e., a lower rank approximation), while 𝐖 represents loadings representing the contribution of each omic feature to this variability. This representation is familiar to many unsupervised omic integration methods but is independent of distributional assumptions on each element. In this formulation, 𝐙 and 𝐖 can be obtained by minimizing: ‖𝐗 − 𝐙𝐖‖22 + 𝑃𝜆,𝛼 (𝐖) [Eq.3-1] To the left of the plus sign is the Frobenius norm (a matrix analogous of Euclidean distance) of the difference between 𝐗 and the product of 𝐙 and 𝐖. To the right of the plus sign is a penalty on the elements of W to impose sparsity. The purpose of this penalty is to zero-out those features with minor contributions to the columns of Z. To remove the effect of tissues, or other covariates that can influence the selection of features, we pre- multiplied X by I – Q(QTQ)-1QT, where I is a diagonal matrix of ones, and Q is an indicator matrix to represent the membership to a given organ or tissue. Step 2) Identify omic features (expression of genes, methylation intensities, copy gains/losses) influencing the axes. The linear decomposition achieved by SVD is an intuitive and straightforward way of integrating omics. However, the variability across omics can be governed by just a few features (i.e., highly sparse data) or by groups of interdependent features (i.e., very redundant data). To handle these limitations, we chose 43 𝑃𝜆,𝛼 (𝐖) to be the Elastic Net penalty [68], 𝜆(𝛼‖𝐖‖1 + (1 − 𝛼)‖𝐖‖22 ), where 𝛼 balances the regularization between LASSO and ridge regression types of regularization, and 𝜆 is associated with the degree of sparsity (i.e., how many features enter in the model?). Unlike LASSO, EN can select groups of correlated features, while zeroing out the irrelevant ones [111]. Equation Eq.3-1 is solved by obtaining z1w1 (where z1 is the first column of Z and w1 is the first row of W) with coordinate descent for given values of 𝜆 and 𝛼, following the algorithm of [69], as implemented in [112], but with the following thresholding operator: sign(w1)| |w1| - 𝜆𝛼 |+ / 𝜆 (1 − 𝛼) (where |x|+ represents the positive part x). Consecutive layers are then obtained by subtracting the previous ones from X and repeating the same procedure, as many times as the number of desired axes of variation. The optimal value for 𝜆 was empirically determined, as suggested by [69]. We start by 1) calculating W over a dense grid of values for 𝜆 (lower 𝜆 yields less sparsity), 2) calculating the proportion of variance of X explained by ZW (PVX) for each 𝜆, and 3) choosing the 𝜆 at which PVX has its minimum second derivative. Since PVX decreases monotonically with 𝜆, this point represents a drastic drop on PVX, suggesting that the most relevant features accounting for the data variability are already incorporated [69]. The value 𝛼 was fixed to 0.5 to have an equal contribution of LASSO and Ridge penalties. Once a subset of features was selected, we mapped them onto genes using annotation data of genomic position downloaded from the USCE web browser tool (GRCh38 [113]). The enrichment of functional classes (ontologies, pathways, complexes) among these genes was tested using the Enrichr package[114]. Step 3) Mapping major axes of variation via tSNE and cluster definition by DBSCAN. Additionally, SVD can be coupled with non-linear embedding methods to deal with highly 44 heterogeneous data. Here, we applied t - Stochastic Neighbor Embedding (tSNE) on 𝐙 [72]. tSNE is a technique that efficiently takes on local neighborhoods present in high dimensions (eventually representing clusters of data) and conserves them while projecting onto a lower-dimensional display [73]. Hence, tSNE becomes a powerful technique to reveal clusters, even in very heterogeneous and convoluted data settings [74]. The algorithm has two fundamental parameters: perplexity (which accounts for the effective number of local neighbors) and cost (related to the difference between the neighborhood’s distribution in the higher and lower dimensional spaces). Since low cost indicates displays more likely to reveal clusters, we selected the maps corresponding with the lowest costs among perplexities of 50 and 100, using 100 thousand iterations to ensure convergence. We applied Density-Based Spatial Clustering of Applications with Noise (DBSCAN [77]) to identify clusters. DBSCAN is one of the most powerful clustering techniques to delimit clusters of irregular shape, such as the ones tSNE produces [74]. Essentially, DBSCAN identifies densely packed groups without specifying the number of clusters prior [77]. Neighborhoods of nearby points can then be tuned by evaluating different cluster partitions over a grid of possible neighborhood sizes. We tuned this parameter by maximizing the Silhouette score, as in [72]. Step 4) Molecular and clinical characterization of clusters. The association between clusters and scores representing genes and the selected functional classes was studied to define each cluster's signatures. Scores were calculated by tacking the columns of X mapping onto a gene, or functional class, and post-multiplying it by the corresponding elements of W’. Using the scores of each gene and functional class as a response and the clusters as explanatory variables, we conducted a series of ANOVA tests to determine 45 what genes or functional classes were significant in at least one cluster. All pairwise comparisons between significant genes and functional classes were studied via Tukey tests. Gene signatures were defined based on those genes significantly deregulated in a single cluster. For both types of tests, we used a Bonferroni multiple-test correction with P(type I –error) = 0.05 / {#selected genes and functional classes}. We used the STRING database of protein-protein interactions to discuss the possibility of physical or functional relationships between the genes in each signature [115]. We considered an interaction biologically meaningful when backed up by empirical data, such as immune precipitation, microarrays, and curated databases. Interactions suggested by text-mining (two genes reported in the same scientific publication) were not considered, except in the cases when a publication’s results gave evidence of interaction (e.g., genes co-expressing, co-locating). The association between clusters and phenotypes (e.g., clinical, demographic, and immunologic covariates) was evaluated via the Kruskal-Wallis test [109] (non-parametric analogous of ANOVA). The Dunn test further evaluated all significant results [116] for pairwise differences (non-parametric analogous of Tukey tests). All steps of our method were implemented in the R programming language [117], using irlba [112], dbscan [77], and Rtsne [118] packages. 3.4. RESULTS Signals coming from tissue and cell type strongly influence a naïve initial classification of tumors across cancer types. We performed omic integration based on sparse singular value decomposition, removed tissue effects, and sought to re-classify tumors based on subtler omic patterns. Our method can be illustrated in four steps (Figure 3.1, Materials 46 and Methods). Step 1 applies sparse Singular Value Decomposition (sSVD) to an extended omic matrix X, obtained from concatenating a series of scaled and normalized omic blocks for the same subjects. Briefly, the principal axes of variation across tumors (i.e., left principal components or scores) and X's matching features ‘activities’ (i.e., the right principal components or loadings) are found. Sparsity is then imposed on the activity values, so features with minor influence over the tumors' variability are removed. Step 2 consists of identifying what features (expression of genes, methylation intensities, copy gains/losses) influence these axes the most (i.e., features not removed by sSVD) and mapping them onto genes and functional classes (e.g., pathways, ontologies, targets of micro-RNA). Step 3 involves the identification of local clusters of tumors, following [72]. Step 4 involves characterizing clusters in molecular (e.g., genes, pathways, complexes) and clinical (e.g., survival probability, immune infiltration) information, distinguishing each cluster from the rest. 47 Figure 3.1: Omic integration and features selection method. Step 1) Singular value decomposition of a concatenated list of omic blocks and identification of major axes of variation. Step 2) Identification of omic features (expression of genes, methylation intensities, copy gains/losses) influencing the axes and mapping them onto genes and functional classes (e.g., pathways, ontologies, targets of micro-RNA). Step 3) Mapping major axes of variation via tSNE and cluster definition by DBSCAN. Step 4) Phenotypic characterization of each cluster of subjects. Using samples from 33 different cancer types provided by The Cancer Genome Atlas (TCGA) and accompanying information from whole-genome profiles of gene expression (GE), DNA methylation (METH), and copy number variant alterations (CNV), we re- classified tumors based on molecular similarities between the three omics. 3.4.1. Data description The data, including information of sample size and type of sample (i.e., from normal, metastatic, or primary tissue), demographics (age, sex, and ethnicity) and survival 48 information (overall survival status and times), are summarized in Table 3.1. In addition, omic data included information for gene expression (GE, as the standardized log of RNAseq data for 20,319 genes), methylation (METH, as standardized M-values summarized at the level of 28,241 CpG islands), and copy number variants (CNV, as the standardized log of copy/gain intensity, summarized at the level of 11,552 genes). 49 Table 3.1: Data description by cancer type after quality controls. Samples are described by cancer type (TCGA code and cancer name), in terms of relative sample size (n), percent of females (F%), ethnicities (percent of non-Hispanic Whites, Afro- descendants, and Asians), Age (at the moment of diagnosis, in years), type of sample (TS%, as a percent of normal –N- and metastatic –M- samples), and survival (Surv, as expected time to 50% survival, in years). Age and Surv are represented by median values, with first and third quartiles as measurements of dispersion. Code Type n F% AD W A Age N M Surv ACC Adrenocor 23 61 0 100 0 48 (35- 0 0 6.6 (2.5- tical 57) 6.6) BLCA Bladder 271 99 13 80 7 58 (49- 1 0 3.0 (1.2- urothelial 66) 3.0) BRCA carcinoma Breast 639 69 18 75 7 58 (46- 7 0 10.2 (6.5- invasive 71) 10.2) CESC Cervical 234 25 8 78 14 60 (53- 1 1 11.2 (3.1- squamous 69) 11.2) cell CHOL Cholangio 12 36 0 100 0 55 (46- 75 0 1.7 (0.7- carcinoma 67) 5.3) COAD Colon 264 36 12 79 9 58 (41- 7 0 8.3 (3.6- adenocarc 66) 8.3) inoma DLBC Lymphom 26 54 19 81 0 60 (54- 0 0 17.6 a 63) (17.6- ESCA Esophage 134 60 12 88 0 68 (59- 2 0 2.3 17.6) (1.1- al 73) 4.4) GBM Glioblasto 49 23 12 78 10 66 (60- 0 0 0.9 (0.4- ma 73) 1.2) multiforme HNSC Head and 89 48 8 91 1 61 (59- 1 0 5.9 (1.2- Neck 71) 5.9) squamous KICH Kidney 2 0 0 100 0 52 (50- 0 0 --** chromoph 54) obe KIRC Kidney 43 51 2 91 7 67 (62- 0 0 7.5 (7.5- renal clear 75) 7.5) cell 50 Table 3.1 (cont’d) KIRP Kidney 37 62 20 80 0 65 (59- 0 0 -- renal 72) papillary cell LAML Acute 28 0 0 94 6 60 (57- 0 0 -- myeloid 67) leukemia LGG Brain 93 42 11 88 1 70 (62- 0 0 9.5(3.1- lower 75) 12.2) grade glioma LIHC Liver 62 25 8 92 0 69 (61- 13 0 4.6 (1.6- hepatocell 74) 8.6) ular LUAD Lung 381 29 6 90 5 66 (59- 4 0 4.2 (2.1- adenocarc 72) 9.2) inoma LUSC Lung 289 28 9 89 2 57 (46- 0 0 4.7 (1.8- squamous 64) 10.5) cell MESO Mesotheli- 68 0 7 93 0 60 (53- 0 0 1.6 (0.9- oma 66) 2.4) OV Ovarian 5 0 0 100 0 60 (55- 0 0 2.9 (2.9- serous 61) 2.9) PAAD Pancreatic 151 24 4 76 20 67 (60- 3 0 1.6 (1.0- adenocarc 74) 4.1) inoma PCPG Pheochro- 144 0 0 100 0 61 (56- 0 1 -- mocytoma 65) and paragangli oma PRAD Prostate 490 36 5 94 1 62 (54- 6 0 9.6 (9.6- adenocar- 70) 9.6) cinoma READ Rectum 83 42 0 85 15 63 (54- 2 0 3.9 (3.9- adenocar- 73) 3.9) cinoma SARC Sarcoma 181 41 0 100 0 58 (46- 0 1 6.7 (3.1- 69) 6.7) 51 Table 3.1 (cont’d) SKCM Skin 378 85 15 83 2 61 (50- 0 75 7.4 (2.6- melanoma 70) 20.1) STAD Stomach 263 37 4 70 25 67 (58- 0 0 4.6 (1.3- adenocar- 73) 4.6) cinoma TGCT Testicular 134 0 4 92 4 31 (26- 0 0 -- germ 37) THCA Thyroid 501 73 6 80 13 46 (35- 8 1 -- 58) THYM Thymoma 106 45 6 85 9 58 (48- 1 0 9.6 (9.6- 68) 9.6) UCEC Uterine 146 100 43 57 0 65 (57- 14 0 9.2 (3.6- corpus 72) 9.2) UCS Uterine 4 100 0 75 25 63 (54- 0 0 1.4 (0.3- carcinosa- 74) 2.2) rcoma UVM Uveal 78 45 0 100 0 62 (51- 0 0 3.8 (2.4- melanoma 74) 3.8) *: Only the three most abundant ethnicities in the data set were considered to calculate the Survival **: percent. quantiles for cancer types with less than five death events were not calculated. The first 50 main axes of variations of the extended omics matrix were selected using an apparent bend in the scree plot of Eigen-values (Material and Methods). The projection of the 50 axes onto two dimensions is shown in Figure B.1. As expected, cell-of-origin effects dominate the clustering of tumors at a pan-cancer level, with clusters enriched by previously reported pan-cancer clusters (e.g., collection of gastric cancer, gliomas, kidney, and squamous tumors), types, and subtypes (e.g., Luminal and Basal breast tumors), and single cancer types (e.g., Thyroid carcinoma, Prostate adenocarcinoma). 52 3.4.2. Classification of pan-cancer tumors after removing tissue-specific signals Once the tissue signal was identified, it was removed from the extended omic matrix. Next, sparsity constraints were imposed on the omic features to zero out those with irrelevant contributions to axes of variation and cluster formation. The selected features (i.e., with non-zero effects) across the three omics corresponded with the 18th, 25th, 33rd, and 38th axes (sorted from more to minor variance explained) and mapped onto a total of 1200 genes. The cluster identification and projection onto two dimensions revealed eight classes (Figure 3.2). Because of removing the effects of tissue localization, all clusters were formed by samples from multiple cancer types. Some clusters differed statistically from their cancer types composition (Table 3.2). However, all cancer types overlapped with more than one cluster (Figure 3.2; Table 3.2, bottom). Furthermore, this overlap was not influenced by previously reported subtypes (Figure B.2). 53 Figure 3:2: Pan-cancer clustering of tumor samples after adjusting for tissue effects. Tumor clusters were obtained by sequential application of tSNE and DBSCAN algorithm for 5,408 samples across 33 cancer types. The contours reflect cluster membership, and the points’ colors and shapes represent similar anatomical sites and cancer types, respectively. The two-dimensional tSNE projection was obtained from the four deep principal axes of the extended omic matrix projected outside the tissue-specific effects after sSVD and removing the first two axes. After re-classifying tumors, the few samples from Kidney chromophobe tumors (KICH) did not map in any of the eight clusters obtained. 3.4.3. Clinical and demographical characterization of tumor clusters Clusters differed statistically in terms of patient age (with Cluster 3 and 8 containing samples from slightly younger patients) and sex (with Clusters 2 and 7 having significantly more females than Cluster 8, due to their slightly higher composition of gynecological cancers) (Table 3.2). However, none of the clusters were significantly associated with ethnicity (Table 3.2). 54 Table 3.2: Clusters characterization after removing tissue effects. The clusters produced by the integration of whole-genome profiles of gene expression (GE), copy number variants (CNV), and DNA methylation (METH) were characterized in terms of clinical, demographic, immune, and molecular information. The table shows those variables with significant differences in at least one cluster. For each variable, different letters represent significant differences between clusters. Clusters 1 2 3 4 5 6 7 8 Cancer type# bc c d ab ab ab bc a Metastasis (%) 5c 4de 3e 17ab 5de 7cd 12bc 21a Survival time 2a 2a 3b 2a 2ab 2ab 2ab 2a Clinical (years)* Stage [119] IVab IVbc IIIc IVab IIIabc IIIab IIIabc IVab Tumor-free 60a 70a 80b 60a 60a 60a 60a 60a information fraction (%) Intratumor heterogenity 13ab 14ab 4d 10c 15a 12abc 14ab 9bc (%) Proliferation 0.4a 0.3a -0.4b 0.3a 0.3a 0.4a 0.4a 0.5a Demographic Age (years) 61a 62a 57b 60ab 60ab 61ab 62a 57b information Sex (% of 52ab 54a 50ab 50ab 53ab 46b 58a 41b females) Non-silent 2bc 2bc 1d 3a 2abc 2c 3ab 2bc Genome mutation Aneuploidy 12a 12a 3b 10a 14a 11a 12a 10a Homologous instability recomb 22ab 16c 8d 23ab 22abc 25a 27a 19bc defects Th1 CD4+ -6b -6b -3a -7b -8b -7b -6b -6b cells (x102) Th2 CD4+ 3c 2c 2c 4ab 5abc 5ab 5ab 6a cells (x102) Th17 CD4+ -8b -8b 6a -15c -5b -5b -9b -9b Immune cells (x102) Activated natural killer 2bc 2bc 3a 3ab 2bc 1c 2bc 2bc infiltration cells (x10-1) Lymphocytes 5bc 6b 4a 4bc 5bc 3bc 5bc 3c (x10-2) Tumor- infiltrating 1.7b 1.7b 1.9a 1.7b 1.8ab 1.6b 1.8b 1.6b lymphocytes 55 Table 3.2 (cont’d) DNA -6d 6a -1bc 6a 4ab 7a -3c -2bc replication&¶,(1) (x10-1) Functional Mitochondrial 0.4d -0.3b 0.0c -0.9a 0.3cd -1.1a 1.9e 0.5d &¶,(2) translation classes ** mir-has-615b -1.1c 0.7a -0.1b 0.7a -0.2b 0.8a -1.1c -0.1b ꝉ ,(3) targets S phase and -1.5f 1.0b -0.1d 0.5c 0.3c 1.3a -0.4e -0.4e DNA synthesis¶,(4) # Cancer types by cluster (%). C1 COAD (14.2), LUAD (11.7), BRCA (10.7), SKCM (8.1), SARC (7.1), READ (6.4), PRAD (4.8), ESCA (4.6), CESC (4.1), LUSC (4.1), STAD (4.1), BLCA (3.8), PAAD (3.6), TGCT (2.5), ACC (2.3), MESO (2), LIHC (1.5), UCEC (1.5), PCPG (1), HNSC (0.8), KIRC (0.3), LGG (0.3), OV (0.3), and UVM (0.3). C2 BRCA (11.1), COAD (11.1), STAD (9.6), LUSC (7.4), LUAD (7.1), SKCM (6.1), CESC (5.6), BLCA (5.4), SARC (5.4), READ (4), ESCA (3.1), KIRP (2.5), PAAD (2.5), PRAD (2.5), PCPG (2.2), HNSC (1.7), LIHC (1.5), UVM (1.5), MESO (1.4), UCEC (1.4), ACC (1.3), KIRC (1.1), GBM (1), THYM (1), LGG (0.8), THCA (0.7), TGCT (0.6), DLBC (0.1), and LAML (0.1). C3 THCA (16.1), PRAD (13.2), BRCA (9.3), LUAD (6.3), SKCM (4.4), BLCA (4.3), LUSC (3.9), STAD (3.8), COAD (3.4), TGCT (3.4), UCEC (3.4), PAAD (3.3), CESC (3.2), THYM (3.2), PCPG (3.1), LGG (2.5), SARC (1.7), UVM (1.6), HNSC (1.3), LIHC (1.2), KIRC (1.1), MESO (1.1), ESCA (1), GBM (1), LAML (0.9), DLBC (0.7), READ (0.5), KIRP (0.4), CHOL (0.4), UCS (0.1), ACC (0.1), and OV (0.1). C4 SKCM (21.7), BLCA (13), CESC (9.6), LUAD (9.6), LUSC (8.7), BRCA (7.8), ESCA (4.3), UVM (4.3), MESO (3.5), HNSC (2.6), SARC (2.6), GBM (1.7), LIHC (1.7), STAD (1.7), UCEC (1.7), COAD (0.9), KIRP (0.9), PRAD (0.9), READ (0.9), TGCT (0.9), and THYM (0.9). C5 BLCA (18.4), LUAD (15.8), CESC (10.5), SKCM (10.5), PRAD (7.9), BRCA (5.3), ESCA (5.3), STAD (5.3), COAD (2.6), GBM (2.6), HNSC (2.6), LIHC (2.6), LUSC (2.6), PAAD (2.6), PCPG (2.6), and TGCT (2.6). C6 BRCA (31.5), LUSC (9.7), ESCA (8.6), SKCM (8.6), BLCA (8.2), STAD (6.5), LUAD (5.7), PRAD (5.7), HNSC (3.9), CESC (2.5), SARC (2.2), PAAD (1.8), GBM (0.7), LGG (0.7), UCEC (0.7), UVM (0.7), CHOL (0.4), DLBC (0.4), MESO (0.4), PCPG (0.4), READ (0.4), and TGCT (0.4). C7 SKCM (14.7), BRCA (11.5), LUSC (11), ESCA (8.4), STAD (7.3), SARC (6.8), CESC (5.8), LUAD (5.8), UVM (4.7), BLCA (4.2), PAAD (3.1), HNSC (2.6), COAD (2.1), PRAD (2.1), LIHC (1.6), MESO (1.6), READ (1.6), UCEC (1.6), TGCT (1), DLBC (0.5), GBM (0.5), LGG (0.5), OV (0.5), and THCA (0.5). 56 Table 3.2 (cont’d) C8 SKCM (24.8), BRCA (23.9), CESC (12.8), PCPG (6.8), BLCA (5.1), SARC (5.1), LUSC (4.3), HNSC (3.4), UCEC (2.6), COAD (1.7), ESCA (1.7), MESO (1.7), READ (1.7), TGCT (1.7), LUAD (0.9), OV (0.9), and UVM (0.9). *Values represent median survival times by cluster. Letters represent significant differences under the log-rank test to compare the entire survival curves of each cluster. **Databases: GO Biological process (&), miRTabrBase (ꝉ), Reactome (¶). Functional classes significant at FDR adj. p-value < 0.05. Overlap between a selected group of genes and databases: (1) : GINS1, POLD3, PRIM2, POLD4, PCNA, MCM8, and MCM3. (2) : MRPS26, MRPL2, MRPL51, MRPS35, MRPL16, MRPS18A, MRPS10, MRPL14, MRPL48, MRPL21 and MRPL11. (3) : PANK2, SF3B2, PCNA, HSP90AB1, NOP2, ATN1, CHD4, HOXC13, PRICKLE4, DPP3, C12ORF57, LDHB, CCND3, CCND2, STK35, RAB23, PPP6R3, IDH3B, RPS3, SIRPA, PSMF1, DNM1L, NKX2-5, PRNP, UVRAG, PPIL1, TPI1, DST, CSNK2A1, SMOX, YIPF3, DDX11, ENTPD6, MAD2L1BP, PPP2R5D, MUT, FBXL14, MRPL21, KLHL42, WNK1, RPL7L1, NCAPD2, FKBP4 and GAPDH. (4) : GINS1, POLD3, PRIM2, POLD4, PCNA, CDKN1B, CCND1, MCM8, MCM3, PSMF1 and CDC25B. The most notorious distinctions between clusters were their differences in prognosis and severity traits (Figure B.3). Cluster 3 (the largest cluster in Figure 3.2) was distinguished by better prognosis/less severity cancers than the remaining clusters, followed by Clusters 2, 5, 6, and 7. In general, clusters 4 and 8 had the worst prognosis and more aggressive tumors (Table 3.2). Cluster 3 was also the one with the fewest metastatic samples (Figure B.4), higher survival rates, highest tumor-free fraction, lowest stage, lowest intra-tumor heterogeneity (ITH, that estimates the fraction of sub-clonal and clonal genomes in each sample[108]), and lowest proliferation (Table 3.2, Figure B.3). By comparison, Clusters 4 and 8 had significantly more metastatic samples than Cluster 3. Cluster 8 also had higher ITH rates than Cluster 3. The highest ITH rates were found in Cluster 5. 57 Cluster 3 also had the lowest rates of non-silent mutations, aneuploidies, and homologous recombination dysfunction (HRD). The remaining clusters were very similar in terms of genome instability indicators, except for Cluster 2. This cluster had significantly higher rates of HRD than Cluster 3 but significantly lower rates than every other cluster (Table 3.2). Cluster 3 was characterized by the highest rates of tumor-suppressive immune cells and tumor-infiltrating lymphocytes (Table 3.2). In addition, Cluster 6 had the lowest infiltration of activated natural killer (ANK) cells. Cluster 8 also had the lowest lymphocytic and highest Th2 CD4+ infiltrations, respectively (Table 3.2). 3.4.4. Gene signatures characterizing tumor clusters. The clusters were also characterized by distinct sets of omic features, significantly enriched for functions involved in the cell cycle (DNA replication, DNA synthesis, and targets of hsa-mir-615-b, a micro-RNA involved in cell proliferation) and mitochondrial translation (initiation, elongation, and termination) (Table 3.2). To study the pairwise differences across clusters, these gene sets were projected onto scores for each gene, as linear combinations between the features’ values mapping onto the gene (i.e., its expression, methylation, and copy number values) and their related activities (i.e., the feature's effects arising from the sparsity constraints) (see Materials and Methods section). In general, Cluster 3 was characterized by intermediate values of these scores, while the remaining clusters were characterized by higher (i.e., gene set with higher expression than Cluster 3) or lower (gene sets with lower expression than in Cluster 3) gene set scores. Clusters 2, 4, and 6 had significantly higher scores for cell proliferation and significantly lower for mitochondrial translation. On the other hand, clusters 1, 7, and 8 had significantly lower scores of proliferation and higher for mitochondrial translation. 58 Sparse factorization of the extended omic matrix resulted in the selection of features mapping onto 1200 genes. From this list, 441 genes were significantly different in at least one cluster. These results were obtained by a series of analyses of variance (ANOVAs), using the scores of each gene as response variables and clusters as explanatory variables. This list included 34 validated cancer genes, including oncogenes (ERC1, HSP90AB1, NUMA1, PPFIBP1, ZNF384, CHD4, KRAS, HIST1H3B, CCND1, CCND2, PIM1, CCND3, HMGA1, HOXC11, HOXC13, KDM5A, SRSF3, TFEB), tumor suppressors (FANCE, CDKN1B, ASXL1, ETNK1) and fusion-proteins (ERC1, HSP90AB1, NUMA1, PPFIBP1, ZNF384). Many genes also mapped onto known transcription factors (including KDM5A, RELA, SRF, CTBP2, FOXA2, NONOG, FOLSL1, TEAD4, and FOXM1) and some of their targets (Figure B.5). However, the expressions of TFs and their targets were not significantly correlated within or between clusters (Figure B.5), suggesting mechanisms of control of the gene expression other than TFs regulation. We then interrogated all pair-wise comparisons between the scores of the 441 significant genes using Tukey tests (Table B.1). We identified a subgroup of 123 significant genes that distinguished each cluster from the rest (for example, POLH had significantly higher scores in Cluster 4 than in every other cluster). The genes characterizing each cluster were then used to define signatures. With this criterion, only Clusters 1, 4, 6, 7, and 8 were characterized by distinct signatures of 57, 4, 23, 24, and 15 genes each, respectively. Since the gene scores are combinations of omic features, we looked at the gene expression in each signature and the potential role of copy numbers and methylation in regulating it (Figures 3.3-4). 59 Cluster 1's signature was composed of genes mapped on chromosome 20. A group of 56 of the 57 genes exhibited significant copy losses in Cluster 1. Of this group, 50 genes (ATRN, AP5S1, TMEM230, MGME1, NDUFAF5, CENPB, CRLS1, CRNKL1, CSNK2A1, DDRGK1, DSTN, DTD1, ESF1, FAM110A, FASTKD5, FKBP1A, IDH3B, ITPA, SMIM26, MAVS, MCM8, MKKS, MRPS26, NAA20, NOP56, NRSN2, NSFL1C, PANK2, PCNA, POLR3F, PSMF1, PTPRA, RBBP9, RBCK1, RRBP1, SIRPA, SMOX, SNPH, SNRPB2, SNRPB, SNX5, SOX12, STK35, TBC1D20, TRMT6, UBOX5, VPS16, ZCCHC3, ZNF133 and ZNF343) were also downregulated. The genes with significant copy-losses and basal expression values (TGM6, SOX13, PROKR2, PRND, OXT, LRRN4, and FERMT1), LRRN4, and FERMT1 were also significantly hyper- and hypo-methylated, respectively (Figure 3.3). Cluster 4's signature was composed of four genes mapping onto chromosome 6: TDRD6, POLH, PAQR8, and GUCA1A. All these genes exhibited significant copy losses in Cluster 4, and all of them except GUCA1A were also downregulated. Additionally, POLH was hypo-methylated, while PAQR8 was hyper-methylated (Figure 3.3). 60 Figure 3.3: Gene signatures for Clusters 1 and 4. The genes significantly de-regulated exclusive of Clusters 1 and 4 were used to define signatures (y-axis). The features values (x-axis) of each gene are separated in gene expression (GE, first column of panels), copy number variants (CNV, second column of panels), and DNA methylation (METH, third column of panels), and summarized by Bonferroni confidence intervals (adjusting for all the 441 significant genes in at least one cluster). Dots represent the average of features values across samples. Cluster 6's signature was composed of 23 genes mapping onto chromosome 11: ALDH3B1, ANKRD13D, ANO1, AQP11, ARRB1, EMSY, CCND1, CTTN, KRTAP5-10, LRP5, LRRC32, TESMIN, MYO7A, NUMA1, PAK1, PPFIA1, RBM4, RPS6KB2, RSF1, 61 SHANK2, TMEM134, TPCN2, and USP35. These genes exhibited significant copy gains, and all of them were also significantly upregulated, except for three genes with basal expression in Cluster 6: MYO7A, LRRC32, and ALDH3B1. In addition, genes USP35, SHANK2, MYO7A, LRRC32, CTTN, CCND1, ARRB1, and ALDH3B1 were additionally hypo-methylated, while genes RSF1 and PPFIA1 were hyper-methylated (Figure 3.4). Cluster 7's signature was composed of 24 genes mapping onto chromosome 6. All of these genes (BTBD9, RRP36, CCND3, CNPY3, CUL7, FRS3, GUCA1A, BICRAL, KLC4, KLHDC3, LRFN2, MEA1, MED20, MRPL2, MRPS10, PEX6, PPP2R5D, RPL7L1, SRF, TAF8, TBCC, TOMM6, TRERF1, and UBR2) exhibited significant copy gains. In addition, all of them were significantly up-regulated, except by LRFN2, GUCA1A, BTBD9, which had basal levels in Cluster 7. Genes TRERF1, LRFN2, and FRS3 were additionally hypo- methylated, while GUCA1A was hyper-methylated (Figure 3.4). Cluster 8's signature was composed of 15 genes mapping onto chromosome 11. These genes (ALDH3B1, ANO1, CCND1, CPT1A, CTTN, LRP5, MRPL21, NADSYN1, PPFIA1, RNF121, RSF1, SHANK2, TPCN2, UNC93B1, and USP35) exhibited significant copy losses. All of them except ANO1 (with basal levels in cluster 7) were significantly downregulated. In addition, genes USP35 and NADSYN1 were significantly hypermethylated, while UNC93B1, RSF1, MRPL21, and ANO1 were hypo-methylated (Figure 3.4). 62 Figure 3.4: Gene signatures for Clusters 6, 7, and 8. The genes significantly de- regulated exclusively in Clusters 6, 7, and 8 were used to define signatures (y-axis). The features values (x-axis) of each gene are separated in gene expression (GE, first column of panels), copy number variants (CNV, second column of panels), and DNA methylation (METH, third column of panels), and summarized by Bonferroni confidence intervals (adjusting for all the 441 significant genes in at least one cluster). Dots represent the average of features values across samples. 63 3.5. DISCUSSION Most pan-cancer classifications rely on molecular alterations that discriminate between the tissue of origin [72,100–102,120]. However, as soon as tissue effects were removed, we have found that the cancer signal immediately emerged. Distinct cancer classes were formed, containing tumors from different cancer types. Particular functional groups of omic features also characterized these classes. An SVD of the extended omics matrix can result in a multitude of axes of variation. Such axes have the potential to explain different patterns of variability across subjects. In this study, we preceded our cluster analysis by selecting axes of variation (i.e., basis vectors spanning the features space of the concatenated omics) having features loadings different from zero (each axis of variation has an accompanying vector of loadings representing features activities). We have obtained the cluster display in Figure 3.2 as a result of this selection criterion. Furthermore, most of the variability between clusters of tumors associates with the canonical relationship between gene expression and copy number. According to this, the primary source of co-variability among features seemed to be dominated by positive covariation of expression and copy number (i.e., copy losses match with lower expression levels, and vice versa, Figure B.5). The expression of regulatory elements within the group of selected features (including transcription factors and the micro-RNA hsa-mir- 615b) was, on the other hand, not associated with the expression of their predicted targets. These observations support the role of copy numbers as a significant force affecting tumor progression [121–123]. Experimental evidence has shown large effects of methylation at characterizing both normal and tumor tissues [124–127]. Contrarily, epigenetics has an essential role during tissue differentiation, as well as in cancer. 64 However, our analysis might suggest a minor role in leading the cancer cluster differences. We believe that this minor role could be the result of an intense correction for tissue-specific effects. Other possible explanations include artifacts of data processing, such as summarizing methylation at the CpG island level. Although the map at the CGI level covered both genic and non-genic regions and facilitated computations, this summary could have come at the cost of washing out CpG site-specific effects on cancer. A third possibility is that the abnormal methylation patterns are essential but shared by two or more cancer clusters. Our features highlighted are the ones that differentiate between clusters. Regardless, we observed abnormal methylation patterns that might suggest a role in the expression of some genes characterizing tumor classes (e.g., expression of LRN4 and GUCA1A negatively correlated with average methylation of promoters’ CpG islands). The tumor clusters C1, C4, C6, C7, and C8, had exclusive signatures (i.e., different from every other cluster). Interestingly, the clusters without distinct individual signatures had more favorable outcomes (C3, C2, and C5). One possible explanation for this is the frequent correspondence between more dramatic molecular alterations and worse clinical outcomes [128,129]. To gain insights about possible biological interactions within each signature, we used the accompanying bibliographic results provided by the STRING database [115] (see Material and Methods section). The literature suggests a broad overlap between signatures in terms of gene functions (cell growth, division, small RNA metabolism, protein synthesis, maturation and transport, and mitochondrial dysfunction). In the case of signature C1 (most genes down-regulated), the literature suggested NOP56 (a core component of the small nucleolar ribonucleic protein) as a central element in the 65 signature; interacting with MKKS, NAA20, and PTPRA (genes with roles on mitotic division); ESF1, SNRPB, SNRPB2, POLR3F and CRNKL1 (involved in small RNA processing), PCNA and ITPA (involved in DNA replication and repair), UBOX5, RRBP1, RBCK1 and NRSN2 (protein synthesis, maturation, and antigen presentation), RBBPP9 (resistance to growth inhibition of TGF); SIRPA and DSTN (cell adhesion)[130–133]. In the signature C1, NOP56 could be a candidate for future therapeutic intervention. Tumor suppressors NRSN2 and RBCK1 could also be considered. The three downregulated genes from signature C4 were involved in small RNA maturation (TDRD6, micro-RNA expression, and maturation), cell proliferation (PAQR8, plasma membrane progesterone receptor), and DNA repair (POLH, DNA polymerase involved in DNA repair). From these genes, PAQR8 and TDRD6 could represent potential targets of therapy. Although neither of them has been directly related to cancer, other members of the PAQR family of progesterone receptors are known tumor suppressors, while TDRD6 has been reported as frequently down-regulated in breast cancer, suggesting its potential use as a biomarker [134]. In the case of signature C6 (most genes upregulated), the literature suggests CTTN as interacting with two groups of genes within the signature, either by co-expression or co-localization in amplicons. One group consisted of invasion and anti-apoptotic related genes (e.g., SHANK, PAK1, PPFIA1) and ion transport (ANO1 and TPCN2) [135,136]. The other group consisted of CCND1 (cell cycle checkpoints), LRPS (protein synthesis), RSF1 (chromatin remodeling), and USP35 (protein turnover; through amplicon-mediated overexpression in breast and gynecological cancers) [137,138]. Thus, patients with signature C6 could perhaps benefit from ANO1 inhibitory therapy [136]. 66 Signature C7 was characterized by multiple genes co-expressing with KLHDC3 (involved in homologous recombination): MEA1 (spermatogenesis), CNPY3 (protein folding, antigen presentation), PPP2R5D (direct catalytic activity), RRP36 (small RNA synthesis), CCND3 (cyclin, cell cycle checks points), and MED20 (transcription). KLHDC3 also belongs to the protein turnover and antigen presentation pathway, together with CUL7 and UBR2. The literature also suggests another group of co-expressing genes within signature C7, consisting of RPL7L (ribosome), MRPL2, and MRPS10 (mitochondrial ribosome). These genes have also been found to interact in cell culture [140,141] physically. Signature C8 genes remarkably overlapped with signature C6 genes but exhibited opposite regulation (i.e., down- instead of up-regulated). Additionally, the literature suggests the interaction between CCND1, NADSYN1, and MRPL20 in signature C8 [139,140]. NADSYN1 has been proposed as a target of inhibitory therapy in cancer [141], while MRPL20 has been suggested as a biomarker for gastric cancers [142,143]. The molecular classification of tumors generated clusters with clear differences in prognosis and severity, with C3 exhibiting better outcomes than the remaining clusters. C3 also resembled a previously reported “inflammatory” type in terms of immune infiltration and cancer type composition (enriched for prostate adenocarcinoma, thyroid, and pancreatic carcinomas and having elevated values of markers for CD4+ Th17 and Th1 cells and low genomic instability) [108]. Although the remaining clusters were clearly distinguished in terms of altered molecular processes, they were highly similar clinical and demographic characteristics. C3 also differed from the remaining clusters by lacking large CNV. In C3, we do not observe drastic genome alterations being systematically linked with worse cancer outcomes, either by causing loss of tumor-suppressing activities 67 (e.g., mitotic checkpoints, DNA instability sensing, pro-apoptotic activity), or gain of oncogenic function (e.g., duplication of mitotic factors). In either case, large CNV have been associated with worsened clinical outcomes, in contrast with those characterizing C3. This observation is somewhat supported by less aggressive cancers in C3 (e.g., a high frequency of prostate and thyroid cancers), co-located with low severity cases of more aggressive tumor types. Another example of less aggressive tumors in C3 are Her2+ breast cancer and proximal inflammatory lung adenocarcinomas, tumors of less severe outcomes than their luminal/basal and proximal proliferative subtypes, respectively [144][145]. Since similar signaling deregulation can arise in different cancers (e.g., dysregulated PI3K/AKT/mTOR pathway in gynecologic cancer) [146], further research on the link between shared molecular signatures within tumors in the same cluster could shed light on the development of novel therapies, or the repurpose and combination of existing ones. Given their small molecular weights, targeting oncogenes with common monoclonal antibodies and small-molecule tyrosine kinase inhibitors could aid in the treatment of tumors with overexpressed oncogenes [147]. For instance, tumors with signature C6 could benefit from combined therapy with indirubin and Ani1, inhibitors of CCND1 and ANO1 [148,149]. On the other side of the spectrum, targeting tumor suppressors on signatures of downregulated genes also presents exciting opportunities. For instance, tumors with signature C1 could benefit from target therapy for tumor suppressors NRSN2 and RBCK1. Classic approaches for targeting tumor suppressor genes include re-activation by either re-introducing a functional copy (e.g., gene therapy) or diminishing the repressive action of other players through small-molecule inhibition [150]. 68 Nevertheless, given the technical challenges of targeting loss of tumor-suppressing function, signatures exhibiting up-regulation could have more pharmacological potential. Similarly, signatures could also rapidly address differences in tumor heterogeneity (e.g., C8 and C5 were notoriously more heterogeneous than the rest). Finally, differences in immune infiltration (C6 with the lowest activated natural killers’ infiltration and C8 with the lowest lymphocytic one) could also imply the potential use of signatures to aid immunotherapeutic decisions. Our results included genes frequently duplicated in cancer (e.g., KRAS, CCND1). However, other frequently duplicated genes, like ERBB2, MYC, and FGFR1, were not present in our selected set of features. One possibility for this unexpected result could be a limitation of the EN penalty as a feature selection criterion. For example, an EN parameter of 0.5 could have been too stringent, and groups of correlated and relevant features could have been left out during the selection process. Another possibility is that the effects of these frequent duplication were washed out by the tissue correction. While removing the tissue "environment” evident in the omics, treating tissue as a systematic effect in a linear model, a different way of having defined tissue effects would have been to use tumor histology markers (e.g., mesenchymal, epithelial). Additionally, while removing dominant differences on tissues that may be unrelated to cancer, it could also eliminate differences that may allow certain cancers to progress in that specific tissue. The third possibility is that the effects of these events were not essential for the cluster partition. Given the possibility of unveiling different biological channels altered in tumors of similar clinical and molecular characteristics, we believe this novel pan-cancer classification 69 could aid in identifying therapies for cancers without a standard of care. However, extrapolation of results herein should be exerted with the following caution. Although our data included information from multiple studies, sexes, ages, and ethnicity, our results could be strongly influenced by factors such as the country of origin of each study and biased on demographic characteristics. Further application of our methods to tumors from patients from diverse populations and ages would be essential for an effective generalization of our results. 70 CHAPTER 4 PHENOMIC DATA INTEGRATION IN THE UK BIOBANK REVEALS GENETIC VARIANTS INVOLVED IN ENERGY BALANCE 4.1. ABSTRACT Excessive calorie intake and low physical activity contribute to a positive energy balance (EB), leading to obesity. Although EB is affected by several environmental and socioeconomic factors, a sizable part of its variation is still due to genetics. The study of EB as a target phenotype is challenging, as EB combines several layers of data, including whole-body size and composition, food selection and amount ingested engagement in physical activity, and metabolic profiles. Many genes (e.g., FTO, MCR4, ANKRD33, FIGNL2) have been identified as contributing to the variation observed in different components of EB. However, a complete set of genes is still missing. We integrated several phenotypes from the UK Biobank related to EB to increase our knowledge on EB's genetic and molecular basis. We have used sparse factor analysis to define patterns of energy balance (PEB) and determined genomic regions of interest with potential causal effects in EB. 71 4.2. INTRODUCTION Obesity is affected by multiple environmental and socioeconomic factors [151]. However, it is also considered a heritable condition [152,153]. As such, obesity has been linked to the variation of many genes affecting energy balance (EB, the difference between energy intake and energy expenditure), including FTO, MCR4, ANKRD33, FIGNL2 [154,155]. Nevertheless, due to the multiple phenotypes involved in EB, the complete set of genes affecting EB is still not available [156,157]. EB is determined by multiple variables affecting the thermic effects of feeding (e.g., frequency and quantity of different food items), resting metabolic rate (e.g., body size and composition), and physical activity (PA, e.g., intensity, frequency, and duration) [157]. Factor analysis methods can capture the complex covariation among all these different types of variables [158,159]. Under this framework, multiple phenotypes can be summarized by a few factors representing patterns of EB. For example, [160,161] have used sparse latent factor models to estimate dietary patterns and variations across an exercise intervention. On the other hand, [162] used principal functional components to estimate the temporal variation of PA, while [163] used reduced rank regression to combine metabolite and dietary patterns and their link with type I diabetes. Once factors related to EB are derived, they can be used as responses in genome-wide association studies (GWAS) [154,164]. Advantages of factors as GWAS responses include lower type I errors, increased power, and faster computations [165–167]. Several studies support the association between EB components and genomic variation [154,155]. For instance, selection experiments in mice have demonstrated strain variation in the predisposition to engage in physical activity of different intensity [168–170] and 72 preference of certain foods over others [171,172]. In addition, human studies on the genetics of PA have reported heritability values as high as 63% [173–176]. Furthermore, association studies have shown the putative roles of many genes involved in food preference and adherence to physical activity [164,177–179]. Moreover, specific genes (like the fat mass and obesity-associated protein, FTO) have also been shown to have pleiotropic effects on diet, physical activity, body size, and obesity risk [180–183]. In this work, we have used extensive phenomics and genomic information from the UK Biobank cohort to derive patterns of EB (PEB). These PEB captured distinct and relevant aspects of food preference, body size and composition types, physical activity tendencies, and metabolic profiles. Furthermore, the association between PEB and genomic variation has revealed novel genes affecting multiple components of EB. 73 4.3. MATERIAL AND METHODS 4.3.1. Cohort This study used data from a subset of 219,049 participants from The UK Biobank (UKB) cohort [14] to derive energy intake and expenditure patterns and study their associations with genotypic information. Edition and quality control criteria for the phenotypic variables criteria included the exclusion of participants of non-Caucasian/European ancestry, exclusion of related individuals (KING's kinship value lower than 0.03 to exclude 1st, 2nd, and 3rd degrees of relationship) [184], exclusion of participants with abnormal biomarker levels, and exclusion of participants reporting unrealistic values of food intake and energy expenditure (Figure 4.1). Phenotypes: We used 28 phenotypic variables with possible roles at affecting an individual's energy requirements and expenditure, split into four groups: 1) blood biomarkers, including glucose, creatinine, triglycerides, and cholesterol levels, 2) body composition, including body weight, standing height, waist circumference, fat mass and lean mass, 3) diet, recorded with the ACE touch screen questionnaire on frequency of consumed items in the last year, including cooked and raw vegetables, fresh fruit, oily and non-oily fish, poultry, beef, pork, lamb or lamb, and processed meats; cheese, bread, cereal, tea, coffee, water, and alcohol (see detail in [185], and 4) physical activity (as intensity, frequency, and duration). To correct the effects of drugs affecting blood biomarkers, we followed [186]. First, we used self-reported data on medication taken for cholesterol, blood pressure, diabetes, or exogenous hormones (data field 6153, ACE Touchscreen). Next, drug effects were estimated using the subset of participants that reported taking medicine after the first 74 recorded instance. For these participants, a drug effect correction factor was calculated as the mean difference between biomarker levels before and after taking the drug. The correction factor was then used to multiply the biomarker levels of those participants taking medicine at the time of recruitment. After the adjustment, any remaining values outside three absolute median deviations (MADs) were considered outliers and excluded from the study. Diet variables were either numerical (representing the number of servings by a period) or categorical (representing consumption or not of a specific item). The numerical variables were cooked and raw vegetable intake, fresh fruit, bread, cereal, tea, coffee, and water. The categorical variables corresponded to intakes of poultry, cheese, oily and non-oily fish, beef, pork, mutton/lamb, and alcohol. If frequencies for the numerical diet variables were outside 3 MADs, the participant was excluded. If any class within each categorical dietary variable was underrepresented after quality controls, we fused it with the previous ones until all classes had more than one percent of the total sample size. We fused the last three categories into oily fish, non-oily fish, beef, lamb/mutton, and pork intakes. For poultry and processed meat intakes, we only fuse the last two categories into one. An estimate of energy intake is present within the UKB for a subset of participants that answered a web-based 24-hour dietary recall in addition to the ACE touchscreen questionnaire (data field 100002). We used this subset of individuals to fit a linear model using this energy estimate as response and ACE's derived food items consumption as predictors. Finally, we estimated the total energy intake in the remaining participants missing calory intake estimations. Individuals with an estimated caloric intake larger than 3 MADs were excluded. 75 We additionally estimated the number of calories expended during PA data derived from IPAQ following [187]. We obtained these values by dividing the product between PA intensity and duration by PA frequency, dividing by 24 to get energy values within the calories/day range. Participants with PA energy expenditure higher than 3 MADs were excluded from our study. We have used a set of covariables to account for possible confounders of the associations between phenotypes and SNPs. The list included participants sex (31, reported by individual), age (21022, age of the participant on the day they attended an Initial Assessment Centre, truncated to whole year), age2 (the square of the participant age), the interaction between sex and age, the UKB assessment center (54, center at which individuals consented to participate in the UKB study), the first five principal components of the genotypic matrix, Townsend Deprivation Index (189, calculated immediately before participant joining UK Biobank, based on the primary national census output areas, and assigned depending on the output area in which the participant postal code is located), as a measurement of socioeconomic status, and type of genotyping array used. Genotypic data: Details on the genotypic data and quality controls provided by the UKB project can be found in [14]. Briefly, genotyping was done using two closely related arrays: UK BiLEVE (~50,000 participants) and the UK Biobank Axiom arrays (~450,000 participants). The UKB Axiom array had over 820,000 SNPs and indel markers, and the UK BiLEVE array was very similar with over 95% shared content. Quality control controls included: excluding low-quality SNPs (missing call rates, low DNA concentration), multi- allelic SNPs (SNP with more than two allelic variants), SNPs departing from Hardy- Weinberg expected frequencies, with sex effects., the array, and batch effects (all test 76 based on a rejection threshold on the p-values equal to 10-12). SNP genotypes were arranged in a BGData R object [188], where each cell with observed data coded as 0, 1, or 2, indicating the number of copies of the reference allele for each locus at each individual. 4.3.2. Statistical analysis Derivation of PEB latent variables: Since our set of phenotypes consisted of several continuous and categorical variables, we derived PEB variables by first using Factor Analysis of Mixed Data (FAMD; [189]) and imposed sparsity on the contribution of each phenotype to each PEB using sparse singular value decomposition (sSVD [69]). Essentially, FAMD can be seen as an application of standard singular value decomposition (SVD) on a transformed version of the data that assures the variance of categorical variables does not artificially dominate the construction of latent factors. The transformation involves standardization of each numerical variable (i.e., center to zero means and scale to unit variance) and redefinition of each categorical variable as a set of modified dummy variables. In these dummy variables, ones and zeros are replaced by 𝑛𝑐𝑎𝑡 𝑛𝑐𝑎𝑡 1– and − , respectively, were 𝑛𝑐𝑎𝑡 is the total number of samples in each category, 𝑛 𝑛 while 𝑛 is the total samples size. The transformed phenotypes were adjusted by covariables as described in [190]. We used the following model to derive PEB as latent factors of the transformed phenotypes: 𝐐 = 𝐘𝐔𝚺 + 𝐞 [Eq.4-1] To estimate the elements of 𝐐, 𝐔, and 𝚺, we minimized the following loss function: 𝐿 = ‖𝐐 − 𝐘𝐔𝚺‖22 + 𝜆𝑈 ( 𝛼𝑈 ‖𝐔‖1 + (1 − 𝛼𝑈 )‖𝐔‖22 ) [Eq.4-2] 77 Where 𝐐𝑛x𝑞 represents the PEB variables as a set of factors obtained as linear projections of the matrix 𝐘𝑛x𝑠 (with rows representing participants and columns representing the pre- adjusted and transformed phenotypes). The columns of the matrix 𝐔𝑠x𝑞 contain the loadings of each phenotype onto the columns of 𝐐 (i.e., values representing the contribution of each phenotypic variable to the construction of each factor), and 𝚺𝑞x𝑞 = 𝒅𝒊𝒂𝒈{𝜎1 ⋯ 𝜎𝑞 }, where 𝜎𝑘 is the 𝑘-th singular value, such as a 𝜎1 > ⋯ > 𝜎𝑞 > 0. The the matrix 𝐞𝑛x𝑞 contains the projection errors that depended on the selected number of latent factors (the value of 𝑞). The second term at the right-hand side of [Eq.4-2] is an Elastic Net (EN) penalty on the elements of 𝐔. The EN penalty balances well-established techniques to select variables (zeroing out the noise and redundant signal between omic features) and shrinkage (to account for the high number of omic features that often exceed the number of samples). The expressions ||. ||2 and ||. ||1 correspond to the L2 and L1 norms, respectively. The parameter 𝜆𝑈 is a real positive number controlling the amount of sparsity in elements of 𝐔, while 𝛼𝑈 is any number between zero and one. The value of 𝛼𝑈 balances shrinking and variable selection [68]. Heuristic methods typically select the value of 𝑞 for which the trajectory of 𝜎1 ⋯ 𝜎𝑞 explains a relatively large amount of variance, or at which the trajectory of 𝜎1 ⋯ 𝜎𝑞 bends drastically (e.g., elbow rule). We complemented these heuristics by fitting the model [Eq.4-1] 100 thousand times via bootstrap resampling to avoid distributional assumptions. We then used the bootstrap distribution of 𝜎1 ⋯ 𝜎𝑞 to estimate 95 % confidence intervals for 𝑞. Once the value of q was estimated, the remaining missing data after quality controls was imputed following [191], as implemented in the function imputeFAMD from R package missMDA [192]. The values of the hyperparameter 𝜆𝑈 was estimated using the heuristic 78 proposed in [190], whilst the value of 𝛼𝑈 was set to 0.5. The bootstrap distribution for [Eq.4-1] also yield 95% confidence intervals for the sparse solution of 𝐔 to determine what phenotyeps exactly contributed to the formation of each PEB. Association studies: We conducted two types of genome-wide associations studies (GWAS) in two types of variables. The first one involved the PEB variables (columns of matrix 𝐐) as responses, while the second used transformed and covariates-adjusted original phenotypes (columns of 𝐘). Since normality was not assured, we applied the Rank-Based Inverse Normal Transformation (RIN) to both for 𝐘 and 𝐐. Considering the columns of the matrix 𝐗 𝑛x𝑝 represent the genotypes of 𝑝 SNPs across the 𝑛 participants; the following models were adjusted, one marker and variable at a time: 𝑓(𝐐𝑘 ) = 𝐱𝑗 γ𝑗𝑘 + 𝛜𝑘𝑗 [Eq.4-3] and 𝑓(𝐘𝑟 ) = 𝐱𝑗 β𝑗𝑟 + 𝐞𝑟𝑗 [Eq.4-4] Where 𝑓 represent the RIN transformation, 𝐱𝑗 is the 𝑗 -th column of 𝐗 representing the 𝑗 - th SNP (𝑗 =1, …, 𝑝), γ𝑗𝑘 and β𝑗𝑟 are the effects of the 𝑗 -th SNP on the 𝑘-th PEB and the 𝑟-th phenotype, respectively; and 𝛜𝑘𝑗 and 𝐞𝑟𝑗 are vectors of models residuals. We called models [Eq.4-3] and [Eq.4-4] PEB-GWAS and ORIG-GWAS, respectively. Estimates of γ𝑗𝑘 and β𝑗𝑟 Moreover, p-values were obtained via the omnibus RNI omnibus test implemented in the R Package RNOmni [193]. In addition to the PEB-GWAS and ORIG- GWAS analyses, we studied the pleiotropic effect of each SNP on the original phenotypes contributing to each PEB, that is, those variables with loadings different from zero after imposing sparsity on the element of 𝐔. For this, we used the fast sequential test of pleiotropy proposed in [194]. We called the analysis PEB-PLEIO. 79 Summary of GWAS results: A standard threshold of 1x10-8 was used to define an SNP as significant. GWAS results were also summarized in "peaks", defined based on SNP's p-values and linkage disequilibrium (LD) among them. LD decay was calculated as the coefficient of determination 𝑅2 between significant and adjacent SNP within half megabase. GWAS peaks were defined imposing by imposing a threshold of 𝑅2 ≥ 0.01. The SNP with the lowest p-value within a peak was chosen as "lead". Peaks were annotated using functional information from the G37 genome assembly annotation provided by the UKB project and complemented by ENCODE annotation obtained via Bioconductor package biomaRt [195,196]. Additionally, R package LDlink [197] was used to retrieve information of the overlap between peaks and significant expression-QTL (eQTL) from the Genotype-Tissue Expression (GTEx; [198]). Enrichr [202] and Ingenuity Pathway Analysis (IPA; QIAGEN Inc) were used to determine the overlap between genes and pre-existing gene sets. Cluster analysis: To determine if the PEB would induce a separation of participants in biologically meaningful groups, we first embedded the columns of 𝐐 onto two dimensions using Uniform Manifold Approximation and Projection (UMAP; [199]). UMAP is a non- linear embedding technique suitable for large data sets, producing convenient two- dimensional representations of clusters existing in higher dimensions. UMAP creates a graph in multiple dimensions and projects it onto a lower number of dimensions attempting to conserver its structure. For this, the number of neighbors and the minimum distance among them defining a local neighborhood must be tuned. Tuning was done for 5, 10, and 20 neighbors, and minimum distances of 0.1%, 1%, and 10% of the average Euclidean distances between rows of 𝐐. UMAP was fitted using the R package umap 80 [199]. The map rendering the best clusters separation was considered the optimal one. Cluster delineation was done following [190]. To determine if clusters had biological meaning, we fitted the following linear models: 𝑓(𝐘𝑟 ) = 𝐙𝛕𝑟 + 𝜻𝑟 where 𝐙 is an indicator matrix representing the membership of a participant to a specific cluster, 𝛕𝑟 is the vector of effects of cluster membership on 𝑓(𝐘𝑟 ) such as ∑#𝐶𝑙𝑢𝑠𝑡𝑒𝑟𝑠 𝑐=1 𝜏𝑐𝑟 = 0, and 𝜻𝑟 is the vector of model residuals. Normality of 𝜻𝑟 was assumed by employing transformation 𝑓. Analysis of variance (ANOVA) was used to test for significant differences among clusters. Tukey test was used to test for significant pairwise 0.05 differences between clusters means, using a significant level of #𝐶𝑙𝑢𝑠𝑡𝑒𝑟𝑠 [200]. ( ) 2 81 4.4. RESULTS The primary purpose of this study was to identify genetic elements influencing energy balance. For this, we integrated variables potentially affecting the body's energy requirements, including physical activity, diet, body size and composition, and blood metabolites. Hence, we analyzed these data layers with a sparse latent factor. The derived factors exhibited commonalities between the layers. Finally, we studied genetic polymorphisms associated with these latent factors. Figure 4.1 shows the processing steps and resulting sample size of this study. Figure 4.1: Inclusion criteria and sample size. Sample size (n) after each of the inclusion criteria. MAD = median absolute deviation. 4.4.1. PEB variables were associated with specific groups of phenotypes. Summary statistics by phenotypic variables are presented in Table 4.1. In the following description of results, we will refer to these as "original" variables to distinguish them from 82 the PEB variables (i.e., the derived or latent variables). The variables considered were food consumption patterns from the touchscreen questionnaire, blood biomarkers (glucose, triglycerides, cholesterol, and creatinine blood levels), body size (weight, height, waist circumference), body composition (fat mass and lean mass), and physical activity measuring intensity (intensity of walking, moderate, and vigorous exercise) and periodicity of exercise (frequency and duration of exercise). Diet variables included low- caloric (water, vegetables, fresh fruit intakes), fish and meats (beef, pork, lamb, poultry, oily, and non-oily fish), low processed foods (coffee and tea intakes), moderately processed foods (bread and alcohol intakes), and highly processed foods (cereal, cheese, and processed meats intakes). 83 Table 4.1: Descriptive statistics of phenotypical variables. Numerical variables are summarized based on the median and median absolute deviation (MAD) on their original scales (units). Categorical variables are summarized by the percentage of samples within each category (%). For all categorical variables, except alcohol intake, the following coding was used to represent intake frequency: 0= Never, 1=Less that once a week, 2=Once a week, 3=Two or three times a week, 4=Five or six times a week, 5=Once or more daily. For alcohol, the following code was used: 1=Daily or almost daily, 2=Three or four times a week, 3=Once or twice a week, 4=One to three times a month, 5=Special occasions only, 6=Never. Missing data for all variables are expressed as a percentage of the total sample size. PA: physical activity. Numerical variables Median (MAD) Missing Glucose (mmol/L) 4.9 (0.49) 18.0 Cholesterol (mmol/L) 5.8 (1.1) data5.5 Creatinine (umol/L) 70 (14) 6.0 Triglycerides (mmol/L) 1.5 (0.72) 9.6 Weight (Kg) 77 (16) 0.4 Height (cm) 170 (10) 0.3 Waist circumference (cm) 91 (13) 0.2 Fat mass (Kg) 51 (13) 1.9 Lean mass (Kg) 24 (8.5) 2.1 Cooked vegetable 2 (1.5) 4.9 Salad/raw vegetable 2 (1.5) 8.5 (tablespoons/day) Fresh fruit (pieces/day) 2 (1.5) 4.6 (tablespoons/day) Bread (slices/week) 10 (5.9) 9.5 Cereal (bowls/week) 5 (3) 5.0 Tea (cups/day) 3 (3) 4.5 Coffee (cups/day) 2 (1.5) 10.0 Water (glasses/day) 2 (1.5) 12.0 Intensity of moderate PA 240 (360) 28.0 Intensity of vigorous PA 0 (0) 26.0 (METmin/week) Intensity of walking 690 (680) 25.0 (METmin/week) Intensity of all PA 1200 (1200) 26.0 (METmin/week) Frequency of PA (days) 9 (4.4) 22.0 (METmin/week) Duration of PA (min) 80 (67) 25.0 Categorical variables Category: % Missing Oily fish (serv) 0:11, 1:34, 2:38, 3:17 6.0 Non-oily fish (serv) 0:04, 1:29, 2:50, 3:17 data 4.0 Processed meat (serv) 0:08, 1:30, 2:30, 3:28, 4:04 17.0 Poultry (serv) 0:04, 1:11, 2:37, 3:46, 4:02 21.0 Beef (serv) 0:09, 1:46, 2:33, 3:12 4.2 Lamb mutton (serv) 0:17, 1:57, 2:22, 3:04 6.6 Pork (serv) 0:15, 1:59, 2:23, 3:03 6.2 Cheese (serv) 0:03, 1:16, 2:21, 3:45, 4:09, 5:06 2.4 Alcohol (g) 1:21, 2:23, 3:26, 4:11, 5:11, 6:08 1.0 84 PEB variables were derived from the first ten latent dimensions from the sparse FAMD. Since there was not a precise inflection point in the singular values to define a cutoff (see Figure C.1-A), we based the selection of latent variables on the threshold at which the rate of change in singular values, and therefore the size of phenotypic variances explained by each PEB, reached a plateau (Figure C.1 panel C and D). This value represented 20% of the total interindividual phenotypic variability (Fig C.1, panel B). Although differing at being derived from a smaller group of phenotypes, these PEB latent variables were highly correlated with their dense counterpart (i.e., latent factors from a FAMD without sparsity) and explained a similar proportion of variance (Figure C.2). The latent variables were constructed, imposing sparsity on the loadings of the FAMD. Thus, values near zero were excluded, limiting the possible number of phenotypes contributing to each PEB. The relationship between PEB and original phenotypes was studied using non-zero FAMD loadings and marginal correlations (Figure 4.2). Since only the first five PEB variables were related to phenotypes representing multiple components of EB, we retained them for further analysis. We indexed these PEB variables from one to five, with PEB 1 and 5 representing the highest and lowest amounts of variance explained, respectively. 85 Figure 4.2: Relationship between PEB variables and original phenotypes. A) Squared phenotype loadings. The contribution of each of the original phenotypes was obtained from bootstrap repetitions of the sparse factor analysis. Each panel shows the original phenotypes that contribute to each PEB variable. B) Marginal correlation between PEB and phenotypes. Similarly, the Bootstrap correlations and CI are shown for each pair of PEB and phenotypes. Higher values of PEB 1 represented active individuals of small body sizes and low blood triglycerides. The opposite was true for lower values of PEB 1. Lower PEB 1 values represented sedentary individuals of larger body sizes and higher values of blood triglyceride. Similarly, higher PEB 2 represented subjects with small body sizes and active diets, including red meats and vegetables. Higher PEB 3 represented subjects of large 86 body size, physically active, and with high levels of blood triglycerides and blood glucose. Higher values of PEB 4 represented individuals of average body size, predominantly meat-eaters. Finally, higher values of PEB 5 represented active individuals largely vegetarian. Table 4.2: Labels for PEB variables based on correlation with original phenotypes. Correlation with original phenotypes PEB Label Positive Negative Active, small Body size and body size, low Physical activity PEB 1 mass// Blood blood (PA) triglycerides triglycerides Active, small PA // meat and Waist body size, meat, PEB 2 vegetable circumference, and veggies consumption lean body mass intake Bodyweight and Active, large mass, waist body size, high circumference// PEB 3 blood -- PA//Blood triglycerides, triglycerides and and glucose glucose Average body Meat PEB 4 size, meat -- consumption intake Vegetables, fruit, water, cereal, and Active, largely Processed meat PEB 5 oily fish vegetarian consumption consumption // PA. 4.4.2. PEB-induced aggrupation of phenotypically distinct participants. The PEB variables can also induce a separation of individuals into phenotypically distinct groups. Figure 4.3 shows clusters of participants (Figure 4.3-A) and the association between clusters and original phenotypic variables (Figure 4.3-B). This representation was done by embedding the latent PEB variables to two dimensions using Uniform 87 Manifold Approximation and Projection (UMAP)[199]. There were seven distinct clusters of participants. The cluster with the highest degree of separation from the rest was Cluster 1. This cluster was dominated by participants exhibiting low body fat, blood biomarkers, meat consumption, poultry, and processed foods. Participants within this cluster also had high levels of physical activity and consumption of low caloric foods. Contrastingly, Cluster 5 was characterized by higher body fat levels, blood biomarkers, and alcohol, while Cluster 4 was characterized by higher consumption of meats and processed foods and relatively low physical activity. The remaining clusters had fewer striking differences. Cluster 2 followed a pattern like Cluster 1, but less marked, particularly having less consumption of low-calorie food. Cluster 3 followed Cluster 1 and 2, but even less markedly, characterized by more processed food consumption. Finally, clusters 6 and 7 represented average values across all contrasts, differing only at the consumption of alcohol, processed food (higher in Cluster 6), and low-calorie food (higher in Cluster 7). 88 Figure 4.3: PEB-based cluster analysis. A) UMAP projection of PEB variables. A two-dimensional UMAP projection of PEB was obtained from Factor Analysis of Mixed Data on the original set of phenotypic variables involved in energy balance and resulted in seven clusters. B) Contrasts between phenotypes and clusters. Each panel represents a statistical contrast between phenotypes and clusters. The colored rectangles and points represent the means of each contrast. The vertical bars represent plus and minus one standard deviation of each contrast. Abbreviations: Alcohol (consumption of alcohol), BodyFat (body fat, waist circumference, weight, and blood triglycerides), CoffeTea (consumption of tea or coffee), fish (consumption of oily or non-oily fish), GluChoCre (level of glucose, cholesterol, or creatinine in the blood), Meat (consumption of meat), PhysAct (levels of physical activity), poultry (consumption of poultry), ProcFood (consumption of processed food), and WaterVegFruit (consumption of water, vegetable, or fruit). 4.4.3. Genomic variants associated with PEB. We conducted three analyses to determine the relationship between PEB variables and the original phenotypes with genetic markers. Firstly, we conducted separated GWAS on the PEB variables (PEB-GWAS). Secondly, we studied the overlap between PEB-GWAS results and separated GWAS on the original variables (ORIG-GWAS). Lastly, we used 89 the information from PEB-GWAS to study pleiotropic effects on the groups of phenotypes contributing to each PEB (PEB-PLEIO). The PEB-GWAS resulted in significant associations for all PEB variables. The significant hits (using a threshold of 5x10-8) were 2113, 116, 12348, 324, and 354. These significant hits were further grouped into 50, 3, 185, 4, and 67 peaks defined by markers in linkage disequilibrium. Annotated Manhattan plots for PEB-GWAS analyses are presented in Figures C.3-7. Most of the PEB-GWAS peaks (15, 3, 60, 3, 3, and 7) were not present on ORIG-GWAS, demonstrating that combining correlated phenotypes increased the power to find these regions (Figure 4.4). Figure 4.4 shows the annotation information for PEB-GWAS, ORIG- GWAS, and PEB-PLEIO analyses, emphasizing the overlap with genic regions. Identified genes were enriched for five different gene sets: obesity, metabolic disease, cardiovascular disease, connective tissue development and function, and carbohydrates metabolisms (FDR p-value < 0.01) (Figure 4.4, see Materials and Methods). The gene associated with the highest number of PEB latent variables (PEB 1, PEB 3, and PEB 5) was the Fat Mass and Obesity-Associated Protein (FTO) (Figure 4.4). This peak for FTO was present in the ORIG-GWAS for weight, height, waist circumference, lean and fat mass. In addition, FTO had three leading SNPs (defined here as the SNP with the lowest p-value within an LD block): rs56094641, rs1421085, and rs11642015, respectively. All these SNPs were present in the PEB-PLEIO for weight and body mass. A group of genes mapped for both PEB 1 "Active, small body size, low blood triglycerides" and PEB 3 "Active, large body size, high blood triglycerides, and glucose". This group consisted of Protein lin-7 Homolog C (LIN7C), Centrosomal Protein POC5 (POC5), 90 INO80 complex subunit E (INO80E), Protein Lifeguard 2 (FAIM2), and G1/S-specific Cyclin-D2 (CCND2). Gene LIN7C had a peak present in the ORIG-GWAS of all body measurement variables. The leading PEB-GWAS SNPs at this peak were rs962369 and rs2049045, respectively. These peaks were also present ORIG-GWAS and PEB-PLEIO for all body measurement variables. The same peak was also present in the individual ORIG-GWAS for the intensity of moderate physical activity. Gene POC5 also had a peak (leading PEB-GWAS SNPs rs2307111 and rs67570751, respectively) present in the ORIG-GWAS for all body measurement variables. Gene INO80E had a peak present in all body measurements ORIG-GWAS, except for height. The leading PEB-GWAS SNPs in this peak were rs7190185 and rs35105141, respectively. This peak was present in the PEB-PLEIO of weight, waist circumference, and body mass. Gene CCND2 had a peak with rs76895963 as the leading SNP. This peak was present in the ORIG-GWAS of weight, fat mass, and lamb consumption. Also, the peak was present in the PEB-PLEIO for weight and fat mass. Several genes had PEB-GWAS peaks mapping exclusively for "Active, large body size, high blood triglycerides, and glucose". We focus the description on peaks that were present in the ORIG-GWAS and PEB-PLEIO analyses. Golgin A6 Family Like 5 - Pseudogene (GOLGA6L5P) had a PEB-GWAS peak with rs11259919 as lead SNP. This peak was also present in the PEB-PLEIO of body measurement variables, lamb, and pork consumption. The peak was also present in the ORIG-GWAS of weight, fat mass, and waist circumference but absent for height, lamb, and pork consumption. The peak at Leiomodin 1 (LMOD1, leading SNP rs2820322) followed the same behavior as GOLGA6L5P, except that lamb consumption was not part of the PEB-PLEIO results. The 91 peaks at Glutathione S-Transferase Mu 4 (GSTM4, leading SNP rs7550711) and Vacuolar Protein Sorting-Associated Protein 29 (VPS29, with a non-reference leading SNP at chromosome 12 and 110852370 bp) were both presents in the ORIG-GWAS of weight, height, and fat mass. This peak was present in the PEB-PLEIO of weight, height, fat mass, and intensity of moderate physical activity, while the lead SNP at VPS29 was present in the PEB-PLEIO of weight, height, fat mass, and physical activity from walking. The peak at Hedgehog Interacting Protein (HHIP, leading SNP rs4240326) was present in the ORIG-GWAS of weight, waist circumference, body mass, and lamb consumption. This peak was also present in the PEB-PLEIO of weight, waist circumference, and body mass. Some genes had peaks collectively present in the ORIG-GWAS and PEB-PLEIO of all body measurements and variables, including Hydroxycarboxylic Acid Receptor 1 (HCAR1, leading SNP rs7133768), BTB/POZ Domain-Containing Protein 7 (BTBD7, leading SNP rs4381522), Hydroxysteroid 17-Beta Dehydrogenase 12 (HSD17B12, leading SNP rs1061810), SCM Polycomb Group Protein Homolog 1 (SCMH1, lead SNP rs61780439), Family 5 Member 11 (SLC5A11, leading SNP rs12923476), Regulatory Associated Protein of MTOR Complex 1 (RPTOR, leading SNP rs11150745), and tRNA Splicing Endonuclease Subunit 15 (TSEN15, leading SNP rs74767794). A peak at SUMO2 Pseudogene 17 (SUMO2P17, leading SNP rs8082345) was exclusively present in the ORIG-GWAS and PEB-PLEIO of weight, waist circumference, and fat mass. Peaks at Axis Inhibition Protein 2 (AXIN2, leading SNP rs757558), Centrosomal Protein 120 (CEP120, leading SNP rs34732995), and p53 Tumor Suppressor Protein (TP53, leading SNP rs78378222) were present in the individual ORIG-GWAS and PEB-PLEIO of weight, 92 height, and fat mass. A peak at Protein Disulfide Isomerase Family A Member 2 (PDIA2, lead SNP rs12926311) was present in ORIG-GWAS of weight, height, and fat mass, and PEB-PLEIO of weight and fat mass. A peak at RAD9 Checkpoint Clamp Component A (RAD9A, leading SNP rs34560402) was present in ORIG-GWAS of waist circumference. Finally, peaks at Mitogen-Activated Protein Kinase 5 (MAP2K5, leading SNP rs12050481) and Potassium Voltage-Gated Channel Subfamily E Regulatory Subunit 4 (KCNE4, leading SNP rs1607246) were present in the ORIG-GWAS of height. Finally, a group of genes had peaks exclusively for PEB-GWAS of "Active, small body size, low blood triglycerides". Peaks at Cytochrome P450 Family 3 Subfamily A Member 5 (CYP3A5 lead SNP rs13311457) and UHRF1 Binding Protein 1 (UHRF1BP1 lead SNP rs2744977) were present for the PEB-GWAS and PEB-PLEIO of weight, waist circumference, and fat mass. Peaks at SUMO1 Activating Enzyme Subunit 1 (SAE1 lead SNP rs3810291) and Neurexin 2 (NRXN2 lead SNP rs12273892) were present in PEB-GWAS and PEB-PLEIO of weight, waist circumference, and lean mass. Three peaks were present at putative- uncharacterized proteins. A peak at RP11-866E20.3 (lead SNP rs12967135) was present for PEB-PLEIO of waist circumference, the intensity of moderate physical activity and walking, and ORIG-GWAS for the last two variables only. A peak at RP11-251G23.2 (lead SNP rs73190105) was present in ORIG-GWAS of weight, fat mass, and intensity of vigorous activity and present in PEB-PLEIO of the first two variables. A peak at RP4- 635E18.7 (lead SNP rs5019466) was present in the ORIG-GWAS of weight, waist circumference, fat mass, and PEB-PLEIO of weight and waist circumference. Another gene was Zinc Finger and BTB Domain Containing Protein 38 (ZBTB38). Gene ZBTB38 93 had a significant peak in PEB-GWAS for "Active, small body size, low blood triglycerides" (lead SNP rs6785012). Figure 4.4: Summary of GWAS loci within genic regions. The panels above summarize the overlap between linkage disequilibrium blocks containing significant SNP from GWAS between PEB latent variable and genic regions. The panels represent, in order, the overlap between significant peaks from PEB-GWAS and genic region, overlap between ORIG-GWAS and PEB-PLEIO analyses, and overlap with gene sets and regulatory regions. 94 As illustrated in Figure 4.4, pleiotropy was most abundant for body size and composition variables, followed by components of PA and dietary intake variables. Regarding enrichment results, most genes with significant peaks were involved in connective tissue development and function, followed by obesity and metabolic disease. The most significant peaks were also located in intronic regions of the gene. 95 4.5. DISCUSSION Excessive calorie intake and low physical activity generate a positive EB that results in obesity. We integrated several phenotypes from the UK Biobank related to EB. We used this information to get novel insights into the physiological reasons that lead to a positive EB. Sparse factor analysis of mixed data (sFAMD) and bootstrap inference was used to define patterns of energy balance (PEB) and to determine associations with common SNP. From these associations, a set of genes with possible links to EB patterns were identified. Previous studies have applied similar phenotypes-integration techniques to determine individuals' diet or physical activity patterns [160–163]. In contrast, our study focused on all aspects of EB, solving the challenges of including random variables of different phenotypic scales (e.g., numerical biomarkers values and body measurements versus categorical food frequencies) by combining conventional FAMD [189] with sparse SVD [201]. The application of sparse FAMD (sFAMD), followed by bootstrap, clearly determined what phenotypes contributed to each PEB. This analysis generated PEB as factors that capture data characteristics beyond the ones expected by naïve integration (e.g., averaging subjectively defined groups of phenotypes) and effectively captured different aspects of EB. Besides capturing variability due to EB's multiple components, PEBs were also associated with variability at the genomic level. For example, some PEB-GWAS peaks mapped on genes previously reported affecting obesity and metabolic diseases, such as FTO, ZBTB38, and POC5. For instance, the SNP rs56094641at FTO had effects on PEB 1 "Active, small body size, low blood triglycerides", PEB 3 "Active, large body size, high 96 blood triglycerides and glucose" and PEB 5 "Active, largely vegetarian". This SNP has also been previously reported as a cis-QTL in muscle (Fig C.8), as well as being present in GWAS of childhood obesity [202], body fat [203], and metabolic disease [204]. Additionally, the lead SNP rs6785012 at ZBTB38, mapping on PEB 1, has been associated with eczema and red blood count [205] and has also been reported as cis- eQTL in many tissues, including testis, adipose tissue, whole blood, spleen, pituitary, and thyroid glands (Figure C.8). Moreover, rs6785012 had pleiotropic effects on body measurements found here agree with a previously reported association of ZBTB38 with BMI and waist circumference [206]. Lastly, the SNP rs2307111 at POC5, mapping on PEB 1 and 3, had pleiotropic effects on body measurement variables. This finding can be supported by the presence of this SNP in previous GWAS studies on body measurements [205,207]. However, previously reported associations between this SNP and cholesterol were not confirmed in PLEIO-GWAS or OIRG-GWAS analyses. Some genes had pleiotropic effects on both body measurements and physical activity variables. For example, a non-synonym SNP at chromosome 12 and 110852370 bp mapping on VPS29 was associated with height, weight, fat mass, and duration of physical activity. Variations in VPS29 and other genes in the retrosome (part of the retrograde transport from the endosome to the Golgi apparatus) are typically linked with neurodegenerative disorders [208]. However, in mice models, variation in the retrosome components has been indirectly linked with normal growth and mammal development and muscle response to exercise and training through the Wnt and β-Catenin pathway [209,210]. The potential role of the variability of this SNP on muscle growth and response to exercise can be explained by the significant association with the expression of VPS29 97 in muscle (GTeX data presented in Figure C.8). Similarly, a peak on GSTM4 with rs7550711 as lead SNP had a pleiotropic effect on weight, height, fat mass, and walking intensity. This SNP has been previously associated with BMI in physically active adults [206] and phospholipid fatty acids in plasma [211]. A group of PEB-GWAS peaks was located in genes involved in carbohydrate metabolism and cell cycle, a general category with known effects on body composition [212] and physical activity performance [213]. One of these genes, SIX3, had a PEB-GWAS peak for PEB5 "Active, healthy diet" variable, with rs4953152 as lead SNP. This SNP had pleiotropic effects on cooked vegetables and water consumption. These associations have not been previously registered for either cooked vegetables or water consumption. However, evidence of an association between variations in rs4953152 and cognitive processes has been observed [214,215]. In addition, rs4953152 has been reported as a cis-eQTL of SIX3-As1 in the brain (Figure C.8). Lastly, there was a peak at TP53 (lead SNP rs78378222), with pleiotropic effects on weight, height, fat mass. Variations in TP53 are primarily linked to multiple cancer types [122]. Nevertheless, rs78378222 has also been reported as associated with fat mass [216,217] and cis-eQTL of TP53 adipose tissue (Figure C.8). Another SNP with pleiotropic effects on different groups of variables was rs11259919 at LMOD1. This SNP was associated with weight, fat mass, and consumption of pork. Variations in rs11259919 have been linked with the expression of LMOD1 in many tissues, including the digestive tract, nerve, muscle, brain, adrenal gland, thyroid, artery, and heart (Figure C.8) well as with appendicular fat [217]. Although a previous link with food consumption has not been reported, variations in LMOD1 expression in skeletal muscle have been linked with 98 differential susceptibility to cardiovascular disease in response to high-fat diets [218] and body composition [219]. Many of the putative QTL detected in this study have previous associations with phenotypes contributing to energy balance. For a relatively small number of phenotypes, like the ones used here, studying all possible combinations of pleiotropic effects could have been a possibility (i.e., studying all possible pleiotropic groups among variables). However, at the scale of the UK Biobank, this renders computationally prohibitive. For that reason, the alternative of sparse FMD used here offered a way to a) inform what groups of variables are most likely to collaborate to form the PEB variables and b) inform the study of pleiotropic effects in a more manageable number of phenotypes. Although most SNP had relatively minor effects on each PEB, the large sample size of this study was instrumental in detecting them. One limitation of this study is the lack of functional validation, especially for novel genes associated with EB. Although we have used data from GTeX to confirm the association between the change of allele and gene expression, an association between gene expression and phenotypes related to EB was not confirmed (e.g., VPS29 variant causing differential fat mobilization within the adipocyte). The results generated here contribute to understanding the complex biology of energy balance and the interrelation between its related traits [220,221]. 99 CHAPTER 5 CONCLUSIONS Increasing sizes and data density become a constant computational and statistical challenge for existing data integration methods [56,66]. To face these challenges, we introduced a new method of data integration, Multi-Omic Integration via Sparse Singular Value Decomposition (MOSS). MOSS exploits the benefits of large data sizes (i.e., many samples to increase power and many features to discover biologically relevant signals) while maximizing computational performance. We have written MOSS as an R package that can be freely available Comprehensive R Archive Network (CRAN). We review the capabilities and limitations of the MOSS package in chapter 2. However, three caveats of MOSS remain to be discussed, as they arise depending on the application. The first caveat involved convergency properties. MOSS relies on the NIPALS algorithm to extract SVD solutions, and therefore, convergence in supervised problems (e.g., PLS) is not always assured [222]. Although empirical results suggest that convergence is reached in most practical situations [222], future research on the analytical properties of MOSS should include a thorough analysis of its convergency properties. The second caveat involves the lack of statistical inference for the results of the features selection process (i.e., basing feature selection in, for example, confidence intervals). By considering the elements of SVD as random variables, both factors and loadings can be thought of as drawn from probabilistic distributions (examples of this are [223] [224] and 100 [225]). Future work in MOSS could benefit from considering the data integration model as a probabilistic one to evaluate the significance of its solutions. Finally, the third caveat comes from using elastic net (EN) penalty as a feature selection process. Fundamentally, EN zeroes out the noisy features while retaining correlated “signal” [68]. The intention is to find an optimal set of features representing biologically relevant groups (e.g., genes in pathways, food items in dietary patterns). Unfortunately, the performance of EN strongly depends on the tuning of the hyperparameter “α” [111]. In its current version, MOSS uses the fast heuristics in [69] to tune the degree of sparsity, but not α. Future versions of MOSS would require an alternative to tune all hyperparameters without compromising computational efficiency. Cancer research is one of the primary disciplines where the performance of omic integration tools will continue to be challenged. Thanks to progress in sequencing techniques, cancer genomics has advanced at an extraordinary pace. This pace has been evident ever since the early days of cancer genomics, where microarray experiments were rapidly complemented by new generation sequencing techniques in less than a decade [226]. These techniques have been essential at creating several large international repositories of cancer multi-omic data, like The Cancer Genome Atlas (TCGA) [95], the International Cancer Genome Consortium (ICGC) [227], and the Cancer Cell Line Encyclopedia (CCLE) [228]. Cancer research has benefited from these data in many applications, most notoriously the enhancement of risk predictions models [14–16] and improving tumor classification with molecular subtypes [72,102,108]. Chapter 3 has also benefited from these repositories. We have used data from TCGA consisting of approximately six thousand tumor samples and sixty thousand features 101 representing genome-wide gene expression, copy number variants, and DNA methylation values. In chapter 3, we used MOSS to detect shared molecular features acting across clusters of tumors. These clusters formed beyond the restrictions of the site of origin and exhibited similar clinical and immunologic characteristics, supporting the role of common molecular signatures across cancer types [88,229]. Despite these exciting findings, future applications of our method in cancer data would require a more robust estimation of tissue effects, for example, by using markers of tumor histology (e.g., mesenchymal, epithelial). In addition, newer classification efforts must emphasize validation avenues, such as knock-out and gene-drug interaction models. Another area of fast-growing pace is phenomics. Once deemed prohibitively expensive and time-consuming, gathering phenomic data is now a reality [22]. Advances in imaging techniques [25], mass spectrometry [230], and automated data loggers [26] in the last years have been instrumental in the creation of large-scale phenotyping projects. One of the most extensive ongoing efforts is the UK Biobank (UKB), with data across several phenomic layers and genotypic information for more than five hundred thousand individuals. Integration of UKB’s phenomic and genomic data has been conducted for multiple complex traits, including neurodegenerative disorders [231], and cardiovascular disease [232], and dietary habits [233]. Similarly, we focused our chapter 4 on integrating the UKB phenome to infer variation associated with different aspects of energy balance (EB) (e.g., the tendency of lean body types to engage in regular physical and healthy diets). We studied the association between this variation and genetics. We found known genes involved with EB (e.g., FTO, POC5, ZBTB38, INO80E) and novel ones (e.g., VPS29, SIX3, LMOD1) not present in the 102 GWAS of separated phenotypes. This work is the first to integrate all EB components from a phenomic point of view to the extent of our knowledge. Regardless of these compelling results, chapter 3 used only a small set of the phenomic data in the UKB. Although only information from a few blood metabolites is currently present in the UKB, ongoing efforts to produce detailed metabolic profiles for a large sample of individuals are on their way [234]. Moreover, further incorporation of metabolomic profiles in our set of phenotypes could profoundly impact our definition of EB factors and associated genes since metabolomics has been shown to efficiently complement and improve the assessment of dietary patterns [235]. Following, we propose some avenues for future research on omic integration. We have stated that MOSS does not rely on distributional assumptions and how this limits the possibility of statistical inference. One possibility for incorporating inference within MOSS is the adoption of Bayesian methods. Bayesian alternatives to SVD [223,225] could be extended to incorporate FBM to deal with large data sets. Additionally, adopting Bayesian methods would allow the incorporation of different prior distributions on the features’ loadings coming from different omic layers. The study of the choice of prior distribution on the overall performance could aid researchers in deciding what set of assumptions better describes their data. Another possible line of research is the study of alternative forms of sparsity on the performance of omic integration. Omic integration might not be as robust in biological scenarios where specific molecular events are not as drastic as in cancer. Therefore, the evaluation of sparsity (e.g., on pre-defined groups of features) and their impact on 103 performance could help decide what methods are best for problems where the signal-to- noise ratio is low. Alternatively, since cancer is dominated by extreme, sometimes widespread, molecular events, we foresee omic integration to continue being useful for cancer research. An exciting line of research can focus on evaluating methods’ performance and the proposal of new algorithmic shortcuts to handle huge data sets (e.g., recently generated data from the Pan-Cancer Analysis of Whole Genomes) [236]. 104 APPENDICES 105 APPENDIX A SUPPLEMENTARY MATERIAL FOR CHAPTER 2 Table A.1: Code templates to replicate examples. # Install and load MOSS. install.packages("MOSS") library("MOSS") # Example of unsupervised omic integration with MOSS out_sig <- moss_signatures(out_sim,out_moss$selected_items, clus_lab = out_moss$clus_plot$dbscan.res$cluster, only.candidates = TRUE, plot=TRUE, th=0.05) # Example of supervised omic integration with MOSS set.seed(345) out_moss <- moss(data.blocks = out_sim, method = "pls", resp.block = 1, scale.arg = T, norm.arg = T, K.X=10, nu.v = seq(1,500,by=2), nu.u = seq(1,100,by=2), alpha.v = 0.5, axes.pos = 1:2, alpha.u=0.5, exact.dg = TRUE, use.fbm = TRUE, nu.parallel = TRUE, tSNE=list("perp"=30, "n.iter"=1e3, "n.samples"=1), cluster = list(eps_range=c(0,1), eps_res=10, min_clus_size=2), plot=TRUE, lib.thresh = TRUE) 106 Table A.1 (cont’d) # Extract heatmap. moss_heatmap(out_moss$B, SVD = out_moss$sparse, axes.pos = c(1,3), col.lab = "miRNA-seq & ", row.lab = "RNA-seq") 107 Figure A.1: Analytical performance of several omic integration methods. Each panel corresponds to the accuracy (top-left), precision (top-right), sensitivity (bottom-left), and specificity (bottom-right) of each omic-integration method at detecting informative features (i.e., features with signal across and between omic blocks). Methods compare were FactoMineR (function MFA), iCluster (function tune.iCluster2), mixOmics (function tune.splsda) and MOSS (function moss with method=”pca” and method=”lrr”, respectively). Results were obtained across 10,000 random simulations. 108 APPENDIX B SUPPLEMENTARY MATERIAL FOR CHAPTER 3 Table B.1: List of genes significantly deregulated in at least one pan-cancer cluster. All the genes significantly different in at least one cluster are sorted by chromosome and genomic position. ANOVA’s p-values, adjusted for multiple comparisons, are displayed on the last column of the table (p-value). Average enrichment scores by cluster Gene C1 C2 C3 C4 C5 C6 C7 C8 p-value C1orf159 0.02 0.10 -0.07 0.15 0.21 0.41 0.00 -0.09 6.36E-13 b ab b ab ab a b b WASH2P -0.29 0.21 0.00 0.05 0.17 -0.16 -0.04 0.05 9.78E-14 b a b ab ab b b ab RAB6C 0.08 -0.04 -0.01 -0.09 -0.42 0.43 -0.12 -0.48 4.47E-17 ab bc b bc bc a bc c ITM2C 0.07 0.01 -0.06 0.15 0.02 0.41 0.07 -0.07 4.94E-11 ab b b ab ab a ab b HSP90A -0.19 0.02 0.05 0.28 0.12 -0.03 -0.51 -0.16 2.62E-15 B2P ab ab ab a ab ab b ab MCC -0.11 -0.03 -0.08 0.42 0.22 0.60 0.26 0.29 1.35E-33 c bc c ab abc a abc abc STK38 -0.23 -0.14 0.09 -0.26 -0.14 0.12 -0.38 -0.01 1.92E-18 ab ab a ab ab a b ab PPIL1 -0.11 -0.10 0.09 -0.23 -0.27 -0.04 -0.44 -0.12 3.22E-15 ab ab a ab ab ab b ab C6orf89 0.07 0.10 -0.03 0.05 0.19 0.05 -0.33 0.14 5.86E-06 ab a ab ab a ab ab a MTCH1 -0.09 0.08 0.00 0.12 0.06 0.11 -0.45 0.08 5.02E-09 ab a ab a ab a b ab TBC1D2 -0.03 0.06 0.03 -0.05 -0.15 -0.03 -0.50 0.07 2.49E-09 2B a a a ab ab a b a RNF8 -0.15 0.00 0.05 -0.19 -0.14 0.13 -0.46 -0.01 1.05E-11 ab a a ab ab a b ab CMTR1 -0.03 -0.02 0.05 -0.04 -0.10 -0.06 -0.46 -0.02 1.40E-08 ab a a ab ab ab b ab ZFAND3 -0.04 0.06 0.02 0.05 0.14 -0.09 -0.43 -0.03 1.92E-07 ab a ab ab a ab ab ab 109 Table B.1 (cont’d) BTBD9 -0.03 0.13 0.03 0.29 -0.10 0.05 -1.19 0.03 1.04E-61 a a a a a a b a GLO1 -0.21 0.07 0.05 0.00 0.02 -0.08 -0.50 -0.09 1.67E-14 ab a a ab ab ab b ab SAYSD1 -0.17 0.01 0.04 0.17 0.21 -0.02 -0.48 0.18 1.60E-12 ab a a a a a ab a LRFN2 -0.02 0.12 0.05 0.49 -0.14 0.05 -1.60 0.05 1.10E- b ab b a b b c b 118 UNC5CL 0.02 -0.05 0.01 0.00 0.02 0.18 -0.36 0.10 5.13E-06 ab ab ab ab ab a b ab OARD1 0.18 0.13 -0.03 0.19 0.21 0.17 -0.87 0.01 1.16E-37 a a a a a a b a NFYA -0.09 0.09 0.04 0.06 -0.14 -0.04 -0.77 0.05 4.39E-25 a a a a ab a b a FOXP4 0.02 0.07 0.01 0.27 0.27 -0.04 -0.67 0.10 1.26E-19 a a a a a a b a TFEB 0.16 0.09 -0.07 0.34 0.31 0.31 -0.31 0.15 2.71E-18 ab ab ab a ab ab b ab FRS3 -0.01 0.16 0.05 0.76 0.02 0.06 -2.04 0.13 9.81E- b b b a b b c b 207 PRICKL 0.13 0.06 -0.03 0.02 -0.07 0.13 -0.29 0.13 1.07E-05 E4 a a a a a a a a TOMM6 -0.11 0.03 0.08 0.32 -0.13 0.02 -1.24 -0.03 7.35E-70 a a a a a a b a USP49 -0.01 0.03 0.04 0.00 -0.05 -0.11 -0.65 0.06 2.95E-17 a a a a ab ab b a MED20 -0.05 0.02 0.09 0.15 -0.06 -0.08 -1.28 -0.06 2.09E-72 a a a a a a b a BYSL -0.17 -0.03 0.10 0.19 -0.28 0.00 -1.09 -0.18 3.39E-58 a a a a ab a b a CCND3 -0.01 0.14 -0.02 0.40 0.12 0.16 -0.77 0.09 1.24E-30 a a a a a a b a TAF8 0.08 0.12 0.00 0.29 0.02 0.10 -1.06 0.16 2.53E-50 a a a a a a b a GUCA1A -0.03 0.15 0.06 0.91 0.00 0.05 -2.16 0.04 5.46E- b b b a b b c b 242 GUCA1B 0.06 0.10 -0.04 0.36 -0.12 0.06 -0.29 0.13 3.81E-08 ab ab ab a ab ab b ab MRPS10 -0.11 0.03 0.08 0.14 -0.18 0.02 -1.22 -0.15 5.96E-67 a a a a a a b a TRERF1 -0.01 0.15 0.05 0.92 -0.03 0.10 -2.05 0.09 6.77E- b b b a b b c b 218 110 Table B.1 (cont’d) UBR2 0.03 0.11 0.03 0.22 0.11 -0.11 -0.94 0.06 4.27E-38 a a a a a a b a TBCC -0.08 0.06 0.04 0.40 -0.05 0.09 -1.01 0.04 2.47E-46 a a a a a a b a BICRAL 0.10 0.13 -0.02 0.28 0.32 -0.10 -0.60 0.16 9.26E-20 a a a a a a b a RPL7L1 -0.10 0.09 0.04 0.29 0.08 -0.01 -1.01 0.04 2.92E-45 a a a a a a b a C6orf226 0.07 0.04 0.00 0.29 -0.02 0.07 -0.68 0.22 2.97E-20 a a a a ab a b a CNPY3 0.06 0.09 -0.01 0.54 0.28 0.12 -0.96 0.05 3.29E-46 ab ab b a ab ab c ab GNMT 0.09 0.26 -0.10 0.48 0.18 0.02 -0.09 0.21 3.01E-22 abc ab c a abc abc bc abc PEX6 0.03 0.15 -0.02 0.34 -0.04 0.06 -0.64 0.07 1.66E-21 a a a a ab a b a PPP2R5 -0.08 0.09 0.05 0.32 -0.03 0.00 -1.18 -0.05 2.00E-61 D a a a a a a b a MEA1 -0.08 0.05 0.06 0.36 -0.08 -0.01 -1.22 0.00 3.13E-66 a a a a a a b a KLHDC3 -0.02 0.12 0.02 0.43 -0.10 0.17 -1.14 -0.04 3.64E-60 a a a a a a b a RRP36 -0.07 0.06 0.06 0.30 0.02 0.02 -1.18 -0.03 2.84E-61 a a a a a a b a CUL7 -0.01 0.05 0.04 0.25 0.05 -0.12 -0.92 0.20 6.30E-37 a a a a a a b a KLC4 -0.02 0.06 0.00 0.39 0.28 0.25 -0.86 0.15 3.94E-37 a a a a a a b a MRPL2 -0.08 -0.01 0.06 0.39 -0.29 0.17 -1.15 -0.03 2.73E-61 a a a a a a b a PTK7 0.00 0.01 0.02 0.18 -0.28 0.09 -0.56 0.05 1.39E-12 a a a a ab a b a SRF 0.11 0.09 -0.02 0.30 0.26 0.15 -0.75 0.06 6.28E-27 a a a a a a b a CUL9 0.09 0.08 0.01 0.33 -0.03 -0.01 -0.89 0.07 1.75E-34 a a a a a a b a DNPH1 0.02 -0.05 0.02 0.33 -0.12 0.21 -0.74 0.28 1.48E-27 a a a a ab a b a CRIP3 0.16 0.07 -0.08 0.30 0.25 0.29 -0.12 0.34 2.50E-16 ab ab ab a ab a ab a ZNF318 0.00 0.08 0.04 0.16 0.08 -0.11 -0.85 0.02 3.81E-30 a a a a a a b a ABCC10 -0.13 -0.02 0.07 0.25 0.00 -0.02 -0.90 0.05 4.20E-37 a a a a a a b a 111 Table B.1 (cont’d) TJAP1 -0.07 0.12 0.05 0.36 -0.26 -0.04 -1.15 -0.04 1.10E-60 a a a a a a b a LRRC73 0.09 0.02 -0.01 0.38 0.22 -0.11 -0.35 0.11 1.32E-08 ab ab ab a ab ab b ab POLR1C -0.12 0.00 0.06 0.37 -0.02 0.09 -1.15 0.09 3.73E-60 a a a a a a b a YIPF3 -0.01 0.11 0.02 0.47 0.21 0.04 -1.09 0.06 9.49E-55 b ab b a ab ab c ab XPO5 -0.17 -0.07 0.12 0.18 -0.11 -0.14 -1.15 -0.08 9.67E-68 a a a a a a b a POLH -0.01 0.16 0.06 0.94 -0.02 0.04 -2.14 0.06 4.63E- b b b a b b c b 239 GTPBP2 -0.16 -0.01 0.10 0.06 -0.35 -0.02 -1.00 -0.11 4.93E-48 b ab a ab bc ab c ab MAD2L1 -0.09 0.09 0.05 0.32 0.02 0.10 -1.11 -0.15 1.38E-55 BP a a a a a a b a MRPS18 0.02 0.01 0.04 0.46 -0.07 0.11 -1.14 -0.04 3.73E-59 A a a a a a a b a VEGFA -0.16 0.00 0.04 0.14 0.07 0.05 -0.50 -0.03 6.75E-12 ab ab ab a ab ab b ab MRPL14 -0.16 0.04 0.05 0.37 -0.25 0.16 -0.98 0.00 8.68E-47 b ab ab a bc ab c ab TMEM63 -0.04 0.06 0.07 0.20 -0.35 -0.16 -1.01 -0.21 1.97E-48 B a a a a ab a b a CAPN11 0.13 0.15 -0.07 0.42 -0.07 0.03 -0.09 0.25 3.67E-12 ab ab b a ab ab b ab SLC29A 0.00 -0.02 0.06 0.14 -0.30 -0.10 -0.74 0.08 6.62E-24 1 a a a a ab a b a HSP90A -0.19 0.03 0.10 0.30 -0.18 -0.15 -1.13 -0.19 1.20E-65 B1 a a a a a a b a SLC35B -0.09 0.01 0.05 0.33 -0.06 0.09 -1.12 0.15 7.46E-56 2 a a a a a a b a NFKBIE -0.06 0.00 0.01 0.25 -0.04 0.21 -0.64 0.22 5.80E-20 ab a a a ab a b a AARS2 -0.06 0.02 0.05 0.33 -0.25 0.07 -1.04 -0.02 8.22E-48 a a a a ab a b a CDC5L -0.07 0.08 0.04 0.25 0.03 0.04 -1.06 -0.06 1.54E-48 a a a a a a b a SUPT3H 0.12 0.04 0.00 0.22 0.20 0.06 -0.70 0.04 7.20E-21 a a a a a a b a SLC25A -0.05 0.00 -0.03 0.23 0.23 0.34 -0.11 0.11 1.55E-08 27 b b b ab ab a b ab 112 Table B.1 (cont’d) TDRD6 -0.02 0.09 0.05 1.29 -0.12 -0.06 -1.72 -0.06 1.46E- b b b a b b c b 176 CD2AP -0.05 0.04 0.03 0.23 0.02 -0.04 -0.68 0.11 3.75E-19 ab a a a ab ab b a MUT -0.03 0.08 0.01 0.39 -0.01 0.01 -0.57 -0.01 1.43E-15 ab a a a ab a b ab CENPQ -0.21 -0.05 0.10 0.10 -0.21 -0.06 -0.69 -0.26 1.37E-29 a a a a ab a b ab MCM3 -0.16 -0.05 0.10 0.21 -0.20 -0.07 -0.81 -0.28 3.84E-36 a a a a ab a b ab PAQR8 -0.08 0.16 0.02 1.52 -0.11 0.06 -1.79 0.02 1.29E- b b b a b b c b 209 EFHC1 0.05 0.04 -0.01 0.35 0.29 0.02 -0.52 0.16 2.00E-13 ab ab ab a ab ab b ab TRAM2 0.05 0.10 -0.01 0.45 0.08 -0.07 -0.56 0.08 1.72E-17 abc ab bc a abc bc c abc TMEM14 -0.08 -0.03 0.09 0.28 -0.14 -0.15 -0.79 -0.36 1.78E-35 A ab ab a a abc abc c bc GSTA4 0.11 0.08 -0.02 0.30 0.34 -0.11 -0.42 0.11 1.99E-11 ab ab ab a a ab ab ab ICK -0.02 0.10 0.02 0.44 0.23 -0.25 -0.66 0.04 4.90E-26 ab ab ab a ab bc c ab FBXO9 0.02 0.10 -0.02 0.61 0.13 -0.06 -0.55 0.13 1.42E-21 bc bc bc a abc bc c ab GCM1 -0.12 -0.06 0.03 0.35 -0.06 -0.10 -0.17 0.11 1.37E-05 a a a a a a a a ELOVL5 0.04 0.11 -0.02 0.51 0.28 -0.20 -0.28 0.03 1.81E-12 ab ab b a ab b b ab GCLC 0.02 -0.01 0.06 0.21 0.06 -0.26 -0.70 -0.12 1.37E-25 ab ab ab a ab ab b ab KLHL31 -0.03 0.01 0.02 0.32 -0.02 -0.14 -0.36 0.13 5.98E-08 ab ab ab a ab ab b ab LRRC1 -0.03 0.02 0.05 0.14 -0.13 -0.24 -0.59 0.08 2.03E-17 a a a a ab ab b a DST 0.09 0.06 0.02 0.03 -0.10 -0.13 -0.40 -0.18 4.53E-08 a a ab ab ab ab b ab KIAA158 -0.06 0.08 0.01 0.19 0.12 0.00 -0.58 0.16 2.59E-14 6 ab ab ab a ab ab b a ZNF451 -0.04 0.10 0.01 0.36 -0.12 -0.03 -0.55 0.07 6.26E-15 a a a a ab a b a BAG2 0.06 0.11 -0.06 0.17 0.08 0.24 -0.16 0.13 4.47E-08 ab ab b ab ab a b ab PRIM2 -0.18 0.05 0.05 0.21 -0.05 -0.05 -0.53 -0.16 1.93E-15 ab a a a ab ab b ab 113 Table B.1 (cont’d) GUSBP4 -0.03 0.02 0.03 0.27 -0.07 -0.03 -0.57 0.00 1.83E-13 a a a a ab a b a PHF3 0.01 0.13 -0.02 0.01 0.35 -0.06 -0.24 0.14 2.09E-05 a a a a a a a a THSD7A 0.04 0.20 -0.13 0.42 0.55 0.35 0.20 0.18 1.35E-31 ab a ab a a a ab ab INTS4P2 0.00 0.00 -0.05 0.13 -0.15 0.58 -0.01 -0.23 6.07E-21 b b b ab b a b b ARHGEF -0.17 0.06 -0.03 0.14 0.38 0.20 0.15 0.03 1.05E-06 10 a a a a a a a a HTRA4 -0.17 -0.07 0.03 -0.10 -0.09 0.23 -0.11 -0.14 2.05E-06 b ab ab ab ab a ab ab DIP2C 0.02 -0.01 -0.02 -0.12 -0.01 0.35 -0.04 -0.13 1.46E-06 ab b b b b a b b DNA2 0.02 0.00 -0.06 0.00 0.06 0.55 0.06 0.03 7.72E-18 b b b b ab a b b HKDC1 -0.06 -0.07 -0.04 0.06 0.03 0.55 0.11 0.23 4.08E-20 b b b ab ab a ab ab CTBP2 0.10 0.08 -0.10 0.13 0.08 0.51 0.17 0.09 5.97E-23 b b b ab b a ab b FAM196 -0.04 -0.10 0.02 -0.09 0.04 0.39 -0.21 -0.33 9.00E-14 A b b b b ab a b b CSNK2A -0.51 0.38 -0.08 0.23 0.18 0.22 0.20 0.21 8.41E-53 3 c a bc ab abc ab ab ab FOLH1 0.06 0.14 -0.06 0.31 0.28 0.10 -0.13 0.10 6.03E-08 a a a a a a a a MRPL16 -0.12 -0.13 0.00 -0.13 0.01 0.50 0.00 0.09 8.72E-17 b b b b ab a b ab EIF1AD 0.23 0.07 -0.09 0.27 0.02 0.28 0.18 -0.21 5.57E-17 ab ab b ab ab a ab b SF3B2 0.11 -0.03 -0.03 0.21 -0.20 0.21 0.07 -0.27 4.31E-06 a a a a a a a a PACS1 -0.11 -0.07 0.01 0.06 -0.15 0.32 0.02 -0.26 4.39E-08 b b b ab b a ab b KLC2 0.11 -0.03 -0.06 0.15 -0.06 0.58 0.05 -0.18 1.75E-22 b b b ab b a b b RAB1B 0.04 -0.08 0.00 0.06 -0.29 0.42 -0.06 -0.40 5.73E-14 b b b ab b a b b YIF1A 0.17 0.07 -0.05 0.00 -0.03 0.31 0.04 -0.42 3.28E-13 ab ab bc abc abc a abc c BRMS1 0.15 0.05 -0.07 0.16 0.03 0.38 0.06 -0.22 1.47E-13 ab ab b ab ab a ab b MRPL11 0.16 0.10 -0.07 0.19 0.00 0.28 0.06 -0.16 4.33E-11 ab ab b ab ab a ab b 114 Table B.1 (cont’d) PELI3 -0.02 -0.06 -0.03 0.06 -0.23 0.60 0.06 -0.34 1.59E-23 b b b b b a b b DPP3 0.17 0.09 -0.10 0.19 -0.19 0.54 0.06 -0.17 2.36E-27 b b b ab b a b b BBS1 -0.12 -0.09 0.04 0.06 -0.51 0.28 -0.07 -0.43 1.69E-12 b b ab ab b a ab b ZDHHC2 0.08 0.04 -0.05 -0.03 -0.23 0.63 -0.04 -0.52 1.25E-30 4 b b bc bc bc a bc c CCDC87 -0.02 -0.04 -0.01 0.00 -0.27 0.40 0.07 -0.29 1.31E-10 b b b ab b a ab b CCS 0.04 -0.08 0.01 -0.01 -0.13 0.31 -0.11 -0.50 2.67E-11 ab bc ab abc bc a bc c RBM4 -0.01 -0.09 0.00 0.07 -0.30 0.40 -0.01 -0.29 2.04E-11 b b b ab b a b b C11orf80 0.12 0.06 -0.06 0.12 0.14 0.49 -0.07 -0.31 2.06E-19 b b b ab ab a b b RCE1 0.09 0.10 -0.11 0.31 0.07 0.69 0.10 -0.22 4.34E-40 b b b ab b a b b PC 0.06 -0.12 -0.02 0.03 -0.07 0.43 0.05 -0.12 5.98E-12 ab b b ab b a ab b LRFN4 0.19 0.09 -0.09 0.01 0.21 0.47 -0.01 -0.11 8.26E-21 ab b b b ab a b b RHOD -0.04 -0.01 -0.01 0.13 0.10 0.30 -0.03 -0.28 2.78E-06 b ab b ab ab a b b KDM2A 0.03 -0.02 -0.06 0.20 -0.15 0.82 0.02 -0.40 6.02E-46 b b b b b a b b GRK2 0.09 -0.05 -0.03 -0.11 -0.23 0.56 -0.11 -0.26 1.20E-20 b b b b b a b b ANKRD1 0.11 0.01 -0.08 0.17 0.11 0.75 -0.06 -0.29 1.54E-39 3D b b b b b a b b SSH3 0.09 0.04 -0.07 0.21 0.13 0.63 -0.10 -0.34 1.02E-29 b b b ab ab a b b RAD9A 0.18 0.04 -0.09 0.19 0.10 0.76 -0.02 -0.45 1.33E-47 b bc bc b bc a bc c POLD4 -0.02 -0.06 0.00 -0.17 -0.37 0.56 -0.13 -0.49 1.67E-25 bc bc b bc bc a bc c PPP1CA 0.12 -0.01 -0.06 0.06 -0.13 0.75 -0.05 -0.41 1.86E-39 b bc bc bc bc a bc c RPS6KB 0.13 -0.01 -0.06 0.08 0.13 0.61 0.04 -0.49 7.68E-30 2 b b b b ab a b b CORO1B 0.07 -0.08 -0.02 -0.03 -0.42 0.66 0.00 -0.53 1.20E-33 b bc bc bc bc a bc c CABP4 -0.04 -0.10 -0.07 -0.04 -0.40 1.64 -0.15 -1.06 7.53E- b b b b bc a b c 214 115 Table B.1 (cont’d) TMEM13 -0.05 -0.10 -0.01 0.07 -0.35 0.70 -0.07 -0.43 2.53E-34 4 b b b b b a b b AIP 0.06 -0.04 -0.02 -0.03 -0.08 0.54 -0.09 -0.56 2.70E-24 b bc b bc bc a bc c PITPNM -0.08 -0.10 -0.05 -0.02 -0.32 1.51 -0.14 -1.03 6.17E- 1 b b b b bc a b c 181 CDK2AP 0.02 0.01 -0.03 0.00 -0.18 0.57 -0.03 -0.58 5.03E-26 2 b b bc bc bc a bc c NDUFV1 0.06 0.00 -0.03 -0.04 -0.06 0.58 -0.08 -0.54 3.38E-26 b b b bc bc a bc c NUDT8 -0.14 -0.06 0.02 0.10 0.02 0.27 -0.01 -0.33 1.31E-07 b ab ab ab ab a ab b ALDH3B 0.01 -0.01 -0.05 0.25 0.12 0.54 -0.03 -0.08 2.38E-18 2 b b b ab ab a b b UNC93B 0.01 -0.08 -0.07 -0.02 -0.40 1.66 -0.15 -1.22 6.81E- 1 b b b b b a b c 230 ALDH3B -0.01 -0.09 -0.07 -0.02 -0.37 1.72 -0.16 -1.28 6.74E- 1 b b b b b a b c 250 NDUFS8 0.05 0.00 -0.04 -0.03 -0.02 0.70 -0.06 -0.60 1.45E-37 b b bc bc bc a bc c TCIRG1 0.00 -0.10 0.01 -0.07 0.03 0.40 -0.17 -0.37 1.32E-13 b b b b ab a b b CHKA 0.07 0.00 -0.02 -0.21 -0.16 0.41 0.06 -0.66 1.82E-20 ab b b bc bc a ab c KMT5B -0.01 -0.11 -0.01 0.15 -0.41 0.74 0.00 -0.73 9.19E-48 b b b b bc a b c C11orf24 0.01 -0.02 -0.06 0.07 -0.02 0.78 0.13 -0.54 1.21E-44 bc bc bc bc bc a b c LRP5 -0.05 -0.12 -0.05 -0.08 -0.36 1.64 -0.11 -1.35 1.47E- b b b b b a b c 233 PPP6R3 0.04 -0.06 -0.05 0.16 -0.23 0.89 0.02 -0.68 2.65E-61 b b b b bc a b c TESMIN 0.07 0.01 -0.07 -0.05 -0.33 0.82 -0.01 -0.38 3.03E-45 b b b b b a b b CPT1A -0.03 -0.14 -0.09 -0.08 -0.03 1.85 -0.04 -1.04 3.87E- b b b b b a b c 270 MRPL21 0.02 -0.10 -0.11 0.01 -0.19 1.95 -0.06 -1.03 2.94E- b b b b b a b c 303 IGHMBP -0.03 -0.11 -0.09 0.01 -0.15 1.76 -0.12 -0.91 1.02E- 2 b b b b bc a b c 236 MRGPR -0.02 -0.11 -0.01 0.36 -0.10 0.35 -0.08 0.05 1.44E-10 D ab ab ab a ab a ab ab 116 Table B.1 (cont’d) TPCN2 -0.04 -0.12 -0.10 0.08 -0.18 1.97 -0.06 -1.05 0 b b b b b a b c CCND1 0.01 -0.15 -0.12 -0.11 -0.20 2.14 -0.12 -0.86 0 b b b b bc a b c ORAOV1 -0.01 -0.12 -0.12 -0.08 -0.06 2.00 -0.11 -0.76 2.27E- b bc bc bc bc a bc c 302 ANO1 -0.08 -0.14 -0.12 0.05 -0.02 2.12 -0.02 -0.82 0 b b b b b a b c FADD 0.07 -0.05 -0.11 0.18 -0.04 1.43 0.01 -0.59 1.82E- b bc bc b bc a b c 147 PPFIA1 0.01 -0.11 -0.14 0.17 -0.10 2.10 0.03 -0.86 0 b b b b b a b c CTTN 0.02 -0.13 -0.13 0.08 -0.02 2.02 0.00 -0.86 0 b b b b b a b c SHANK2 -0.10 -0.09 -0.11 0.25 0.17 1.78 0.07 -0.84 3.59E- b b b b b a b c 244 DHCR7 0.13 0.08 -0.09 0.00 -0.21 0.73 0.14 -0.41 5.83E-43 b b bc bc bc a b c NADSYN -0.09 -0.13 -0.06 0.08 -0.03 1.67 -0.09 -1.25 1.60E- 1 b b b b b a b c 233 KRTAP5 -0.01 -0.04 -0.04 0.08 -0.10 0.59 0.02 -0.13 2.17E-20 -7 b b b ab b a b b KRTAP5 -0.06 -0.03 -0.03 0.07 0.02 0.50 -0.01 -0.03 4.14E-14 -8 b b b ab ab a b b KRTAP5 -0.09 -0.04 -0.01 0.06 -0.09 0.45 -0.05 -0.22 4.90E-12 -9 b b b ab b a b b KRTAP5 -0.01 -0.04 -0.01 -0.10 -0.05 0.44 -0.06 -0.26 1.00E-11 -10 b b b b b a b b FAM86C -0.02 0.04 -0.07 0.15 -0.25 0.93 0.07 -0.75 3.71E-70 1 b b b b bc a b c RNF121 -0.05 -0.03 -0.03 0.18 -0.22 0.79 -0.02 -0.94 6.34E-60 b b b b bc a b c NUMA1 -0.11 -0.14 -0.02 -0.01 -0.45 1.55 -0.10 -1.71 1.49E- b b b b b a b c 246 LRTOMT -0.12 -0.08 0.02 0.02 -0.40 0.61 -0.11 -0.81 4.42E-40 b b b b bc a b c LAMTOR -0.10 -0.06 -0.01 0.08 -0.36 0.75 0.01 -0.94 1.23E-56 1 b b b b bc a b c ANAPC1 -0.11 -0.03 0.02 0.02 -0.19 0.46 -0.13 -0.72 3.16E-25 5 b b b b bc a b c INPPL1 0.16 -0.06 -0.05 0.01 -0.13 0.76 0.01 -0.75 5.98E-51 b b b b bc a b c CLPB 0.10 0.14 -0.10 0.14 0.01 0.74 0.09 -0.57 2.31E-50 bc b cd bc bcd a bc d 117 Table B.1 (cont’d) ARAP1 -0.08 -0.06 0.01 -0.11 -0.04 0.59 -0.15 -0.78 7.45E-36 bc bc b bc bc a bc c STARD1 0.03 -0.05 -0.01 -0.09 -0.23 0.66 -0.09 -0.67 8.25E-37 0 b b b bc bc a bc c ATG16L -0.06 -0.03 0.02 -0.01 0.11 0.34 -0.28 -0.52 1.68E-15 2 b b b b ab a b b FCHSD2 -0.14 -0.08 0.03 0.02 -0.09 0.48 -0.17 -0.72 3.54E-28 bc b b b bc a bc c ARHGEF -0.16 -0.10 0.05 -0.11 -0.14 0.31 -0.08 -0.48 4.06E-15 17 bc bc ab bc bc a abc c RELT 0.03 0.05 -0.05 0.11 0.10 0.41 -0.01 -0.25 1.60E-11 b b b ab ab a b b RAB6A 0.07 0.00 -0.06 0.15 -0.30 0.73 0.14 -0.68 5.61E-46 b b b b bc a b c MRPL48 0.06 0.06 -0.05 0.04 -0.18 0.55 0.16 -0.78 5.02E-35 b b b b bc a ab c COA4 0.07 0.03 -0.05 0.03 -0.24 0.68 0.11 -0.94 9.29E-51 b b b b bc a b c PAAF1 0.01 0.07 -0.03 -0.08 -0.16 0.54 0.00 -0.89 3.25E-36 b b b b bc a b c UCP3 -0.16 0.00 0.01 0.03 -0.15 0.30 -0.01 -0.46 8.15E-11 b ab ab ab ab a ab b C2CD3 -0.07 -0.02 -0.04 0.15 -0.32 0.74 0.14 -0.69 8.74E-47 b b b b bc a b c PPME1 0.07 0.05 -0.09 0.12 -0.25 0.88 0.19 -0.66 4.14E-64 b b b b bc a b c LIPT2 -0.02 0.03 -0.04 0.01 -0.45 0.52 0.21 -0.65 1.00E-28 b b b b bc a ab c POLD3 -0.03 0.02 -0.06 0.32 -0.12 0.56 0.22 -0.56 5.18E-31 b b b ab bc a ab c RNF169 -0.06 -0.05 -0.03 0.25 -0.26 0.63 0.11 -0.46 1.43E-30 bc bc bc ab bc a b c XRRA1 -0.01 -0.08 0.00 0.01 -0.42 0.66 0.05 -0.91 3.25E-47 b b b b bc a b c SPCS2 0.04 -0.01 -0.01 0.02 -0.24 0.55 -0.06 -1.04 4.36E-44 b b b b bc a b c NEU3 0.01 -0.02 -0.02 0.04 -0.20 0.45 0.03 -0.43 4.20E-15 b b b ab b a ab b ARRB1 -0.08 -0.06 -0.02 -0.11 -0.42 1.31 -0.02 -1.77 9.70E- b b b b b a b c 200 RPS3 -0.05 -0.03 0.02 -0.08 -0.39 0.41 -0.05 -0.91 3.08E-30 b b b b bc a b c UVRAG -0.10 0.02 -0.04 0.24 -0.48 0.69 0.11 -0.72 1.14E-44 b b b ab bc a b c 118 Table B.1 (cont’d) THAP12 0.01 -0.02 -0.02 0.04 -0.44 0.60 0.13 -0.83 4.51E-39 b b b b bc a b c EMSY -0.06 -0.07 -0.03 0.15 -0.31 0.69 0.08 -0.53 4.76E-36 bc bc b b bc a b c LRRC32 -0.06 -0.05 -0.04 -0.05 -0.45 1.34 -0.03 -1.51 4.02E- b b b b b a b c 180 TSKU -0.11 -0.12 0.02 0.02 -0.06 0.46 0.09 -0.53 6.54E-21 bc bc bc abc bc a ab c ACER3 -0.06 0.00 -0.04 0.14 -0.15 0.66 0.11 -0.55 8.76E-34 bc b b b bc a b c MYO7A -0.05 -0.05 -0.05 -0.03 -0.39 1.44 -0.03 -1.40 3.28E- b b b b b a b c 189 PAK1 0.00 -0.02 -0.04 0.13 -0.55 0.70 0.06 -0.55 2.36E-38 b b b b c a b c CLNS1A 0.08 0.04 -0.08 0.14 -0.42 0.82 0.14 -0.76 7.70E-61 b b b b bc a b c AQP11 -0.01 -0.03 0.00 -0.16 -0.36 0.40 -0.02 -0.45 7.05E-14 b b b b b a b b RSF1 -0.05 -0.06 -0.05 -0.02 -0.43 1.40 0.02 -1.26 1.19E- b b b b b a b c 171 AAMDC -0.06 -0.03 0.00 -0.07 -0.27 0.51 -0.07 -0.68 5.25E-26 b b b bc bc a bc c INTS4 -0.05 -0.01 -0.05 0.14 -0.36 0.99 0.06 -0.85 1.20E-79 b b b b bc a b c NDUFC2 0.03 -0.05 -0.04 -0.07 -0.33 0.87 0.01 -0.83 5.68E-64 b b b b bc a b c ALG8 0.06 0.07 -0.09 0.02 -0.24 1.05 0.05 -0.84 1.48E-91 b b b b bc a b c KCTD21 0.04 -0.08 -0.03 0.07 -0.20 0.80 -0.03 -0.74 2.07E-52 b b b b bc a b c USP35 0.03 -0.04 -0.05 -0.11 -0.31 1.34 -0.07 -1.33 1.49E- b b b b b a b c 165 NARS2 0.04 0.02 -0.03 0.09 -0.48 0.60 0.04 -0.75 4.23E-36 b b b ab bc a b c CCDC77 0.11 0.03 -0.05 0.34 -0.24 0.01 0.10 0.22 1.02E-05 a a a a a a a a WNK1 0.09 -0.01 -0.05 0.15 -0.34 0.28 0.08 0.11 1.27E-06 ab ab b ab b a ab ab ADIPOR 0.12 0.00 -0.06 0.24 -0.40 0.31 0.12 0.14 8.02E-11 2 ab ab b ab b a ab ab DCP1B -0.10 -0.03 0.05 0.00 -0.64 -0.08 -0.14 -0.08 1.78E-05 a a a a a a a a FKBP4 0.21 -0.01 -0.08 0.14 -0.38 0.38 0.12 0.37 3.77E-20 ab bc c abc c a abc ab 119 Table B.1 (cont’d) RHNO1 0.22 0.02 -0.07 0.24 -0.46 0.21 0.13 0.05 4.58E-11 a ab ab a ab ab ab ab TULP3 0.14 0.01 -0.05 0.30 -0.26 0.15 0.10 0.05 4.54E-06 a a a a a a a a TSPAN9 -0.08 -0.10 0.06 -0.07 -0.36 -0.01 -0.28 -0.06 4.44E-07 a a a a a a a a PARP11 -0.09 -0.06 0.05 -0.19 -0.68 0.02 -0.14 -0.11 1.08E-06 a a a a a a a a CCND2 -0.09 0.00 0.08 -0.31 -0.28 -0.38 -0.22 -0.20 5.54E-17 ab a a ab ab b ab ab RAD51A 0.25 0.13 -0.13 0.41 0.08 0.25 0.24 0.26 1.09E-27 P1 ab ab b a ab ab ab ab DYRK4 -0.16 -0.03 0.06 -0.16 -0.48 -0.10 -0.15 -0.19 8.96E-08 a a a a a a a a AKAP3 -0.15 -0.11 0.09 -0.20 -0.34 -0.24 -0.10 0.01 4.76E-12 ab ab a ab b b ab ab TNFRSF -0.02 -0.02 0.00 0.03 -0.86 0.22 -0.05 0.06 2.66E-07 1A a a a a b a ab a LTBR 0.14 0.07 -0.06 0.25 -0.25 0.21 0.08 -0.02 6.41E-08 a a a a a a a a NCAPD2 0.11 0.10 -0.10 0.35 -0.16 0.24 0.24 0.24 1.92E-16 ab ab ab a ab a a a GAPDH 0.16 0.11 -0.07 0.20 -0.14 0.07 0.08 0.24 1.14E-08 a a a a a a a a NOP2 0.15 0.13 -0.09 0.17 -0.19 0.24 0.19 0.12 1.03E-13 ab ab b ab b a ab ab ING4 -0.04 -0.06 0.04 0.25 -0.52 -0.18 -0.02 -0.04 2.19E-05 a a a a a a a a ZNF384 0.01 0.05 -0.05 0.22 -0.46 0.25 0.14 -0.02 3.32E-07 a a a a a a a a COPS7A 0.00 0.04 -0.04 0.21 -0.52 0.17 0.06 0.22 7.81E-06 a a a a a a a a MLF2 0.09 0.02 -0.04 0.06 -0.45 0.23 0.10 0.10 9.65E-06 a a a a a a a a CDCA3 0.27 0.17 -0.14 0.31 0.23 0.27 0.25 0.31 1.25E-33 a ab b a ab a a a USP5 0.06 0.09 -0.06 0.17 -0.36 0.15 0.13 0.27 9.67E-08 a a a a a a a a TPI1 0.17 0.13 -0.08 0.25 -0.23 0.00 0.12 0.25 1.25E-10 a a ab a ab ab ab a ATN1 0.03 -0.03 -0.02 0.08 -0.53 0.25 0.04 0.10 3.65E-05 a a a a a a a a C12orf57 -0.31 -0.22 0.11 -0.12 -0.56 0.02 -0.09 -0.07 4.00E-24 b b a ab b ab ab ab 120 Table B.1 (cont’d) SCARNA 0.14 0.12 -0.07 0.18 -0.13 0.15 0.06 0.01 1.25E-07 12 a a a a a a a a EMG1 0.18 0.11 -0.06 0.09 -0.19 0.08 0.08 -0.01 1.05E-06 a a a a a a a a LPCAT3 -0.07 0.05 -0.04 0.13 -0.37 0.31 0.12 -0.03 1.34E-07 b ab b ab b a ab ab NECAP1 0.11 -0.05 -0.01 0.00 -0.67 0.16 -0.02 0.11 2.30E-05 ab ab ab ab b a ab ab CLEC4A 0.01 0.01 -0.01 -0.03 -0.79 0.21 -0.04 0.07 4.06E-06 ab ab ab ab b a ab ab DDX12P 0.17 0.09 -0.10 0.33 0.10 0.26 0.13 0.14 4.89E-15 a ab ab a ab a ab ab GABAR -0.11 -0.08 0.05 -0.02 -0.65 -0.13 0.11 0.00 1.48E-06 APL1 a a a a a a a a MAGOH 0.19 0.16 -0.10 0.12 -0.16 0.20 0.06 0.26 1.96E-15 B a a ab ab ab a ab a LOH12C -0.13 -0.03 0.05 0.13 -0.59 -0.12 -0.03 -0.22 1.54E-06 R2 a a a a a a a a BORCS5 -0.01 0.04 -0.04 0.21 -0.61 0.13 0.16 0.20 2.00E-06 a a a a a a a a CREBL2 -0.10 -0.11 0.06 -0.06 -0.70 0.08 -0.12 -0.17 1.86E-08 a a a a a a a a GPR19 0.14 0.14 -0.11 0.13 0.21 0.23 0.21 0.37 5.50E-19 ab ab b ab ab a ab a DDX47 0.05 0.02 -0.04 0.22 -0.54 0.12 0.20 0.16 3.30E-06 a a a a a a a a FAM234 -0.04 0.01 -0.05 0.12 -0.11 0.41 0.07 0.13 2.99E-10 B b b b ab b a ab ab WBP11 0.12 0.01 -0.05 0.15 -0.60 0.23 0.10 0.19 1.13E-08 ab ab ab ab b a ab ab STRAP 0.15 0.08 -0.06 0.15 -0.51 0.10 0.17 0.17 1.35E-08 a a a a a a a a DERA 0.01 0.05 -0.03 0.37 -0.29 -0.04 0.22 -0.01 2.47E-05 a a a a a a a a RECQL 0.11 0.04 -0.05 0.30 -0.18 0.09 0.15 0.12 1.33E-05 a a a a a a a a GOLT1B 0.18 0.04 -0.07 0.32 -0.16 0.13 0.18 0.14 6.73E-10 a ab ab a ab ab ab ab CMAS 0.27 0.06 -0.07 0.23 -0.45 0.03 0.07 0.19 1.98E-11 a ab b ab b ab ab ab ETFRF1 -0.09 -0.10 0.07 0.17 -0.55 -0.12 -0.20 -0.12 6.68E-09 a a a a a a a a KRAS 0.12 0.02 -0.06 0.31 -0.26 0.15 0.21 0.15 3.64E-08 a a a a a a a a 121 Table B.1 (cont’d) INTS13 0.12 0.12 -0.07 0.31 -0.21 0.01 0.18 0.06 3.16E-08 a a a a a a a a MED21 0.04 0.02 -0.03 0.36 -0.51 0.05 0.13 0.07 2.47E-05 a a a a a a a a DDX11 0.08 0.09 -0.07 0.36 -0.15 0.17 0.11 0.17 2.73E-09 a a a a a a a a H3F3C 0.09 0.00 -0.03 0.07 -0.71 0.20 0.04 0.05 4.35E-06 ab ab ab ab b a ab ab DNM1L 0.13 0.03 -0.07 0.38 -0.34 0.19 0.16 0.11 3.68E-10 a a a a a a a a YARS2 0.16 0.09 -0.06 0.27 -0.22 0.05 0.06 0.07 2.30E-06 a a a a a a a a ALG10 0.01 0.02 -0.05 0.36 -0.25 0.26 0.16 0.05 1.38E-08 ab ab ab a ab a ab ab HOXC13 0.04 0.22 -0.14 0.43 0.38 0.38 0.17 0.39 7.99E-37 ab ab b a ab ab ab a HOXC11 -0.05 0.21 -0.10 0.33 -0.14 0.28 0.19 0.21 1.84E-21 ab a ab a ab a ab ab HOXC8 -0.03 0.22 -0.10 0.26 -0.06 0.31 0.06 0.15 3.51E-19 ab ab b ab ab a ab ab GPR132 -0.22 0.09 -0.04 0.19 0.35 0.18 0.18 0.17 4.84E-10 ab ab ab a a a ab ab WASH3P -0.25 0.22 -0.04 0.24 0.36 -0.02 0.01 0.07 4.07E-14 ab a ab a a ab ab ab MYO1C -0.15 0.12 -0.03 -0.06 -0.27 0.26 0.06 -0.04 9.07E-08 b ab ab ab b a ab ab NTN1 0.02 0.05 0.03 -0.09 -0.24 -0.18 -0.37 -0.06 1.54E-07 ab a a ab ab ab b ab TRIM16L -0.21 0.08 -0.05 0.15 0.26 0.29 0.43 -0.04 1.67E-17 b ab b ab ab a a ab ANKRD1 0.06 0.25 -0.12 0.27 0.31 0.20 0.10 0.27 3.07E-23 3B ab a ab a a a ab a SPACA3 -0.38 -0.01 0.02 0.12 -0.08 0.18 0.12 0.05 4.45E-13 b ab ab ab ab a a ab SCARNA -0.27 0.12 0.01 0.05 0.14 -0.03 0.04 -0.22 6.98E-08 17 ab a ab ab a ab ab ab ABCA7 0.02 0.19 -0.08 0.21 0.15 0.21 0.06 0.00 8.46E-12 ab a ab a ab a ab ab EPS15L1 -0.16 -0.20 0.05 0.01 -0.47 0.35 0.02 -0.28 6.44E-19 bc c b bc c a bc c CEACA -0.32 0.02 0.01 0.09 0.33 0.13 0.19 -0.02 5.35E-10 M8 ab ab ab ab a a a ab JOSD2 -0.01 0.00 -0.04 0.13 -0.09 0.41 -0.01 -0.04 2.56E-09 b b b ab b a b b 122 Table B.1 (cont’d) FKBP1A -0.28 0.19 -0.02 0.05 0.29 0.07 0.03 -0.08 3.41E-11 P1 ab a ab ab a ab ab ab DEFB12 -0.10 0.10 -0.07 0.34 0.13 0.31 0.19 0.05 7.72E-13 6 ab ab ab a ab a ab ab C20orf96 -0.58 0.36 0.00 0.02 0.08 -0.08 -0.07 -0.12 2.44E-47 b a b ab ab b b b ZCCHC3 -0.86 0.52 -0.04 0.18 0.36 -0.03 0.20 0.09 1.21E- c a b ab ab b ab ab 111 SOX12 -0.53 0.35 -0.07 0.09 0.35 0.24 0.24 0.17 5.60E-51 c a b ab ab ab ab ab NRSN2 -0.54 0.28 -0.03 0.05 0.26 0.12 0.17 0.17 1.29E-38 c a b ab ab ab ab ab TRIB3 -0.14 0.33 -0.10 -0.12 0.44 0.16 0.26 0.25 4.92E-31 ab a ab ab a ab a ab RBCK1 -0.80 0.62 -0.06 0.05 0.36 -0.01 0.13 0.09 1.42E- c a b b ab b b b 122 TBC1D2 -0.91 0.51 -0.04 0.11 0.21 0.14 0.17 0.09 6.61E- 0 c a b b ab b ab b 117 CSNK2A -0.77 0.58 -0.10 0.24 0.26 0.21 0.28 0.27 1.93E- 1 d a c abc abc bc ab abc 121 SRXN1 -0.44 0.39 -0.10 0.08 0.31 0.25 0.42 0.21 5.15E-57 b a b ab ab ab a ab SLC52A -0.24 0.22 -0.04 0.06 0.16 0.17 0.07 -0.14 1.34E-13 3 c a bc abc abc ab abc bc FAM110 -0.51 0.39 -0.07 0.13 0.49 0.22 0.10 0.23 3.39E-53 A b a a a a a a a PSMF1 -1.05 0.67 -0.07 0.11 0.44 0.20 0.27 0.11 3.66E- c a b b ab b b b 186 TMEM74 -0.12 0.29 -0.05 0.07 0.32 -0.08 0.06 -0.10 2.59E-15 B ab a ab ab a ab ab ab C20orf20 -0.18 0.21 -0.02 -0.20 0.21 -0.05 0.08 -0.05 5.42E-09 2 ab a ab ab a ab ab ab SNPH -1.77 1.12 -0.09 0.04 0.78 0.21 0.24 0.26 0 e a d cd ab bc bc bc SDCBP2 -0.33 0.17 0.00 0.15 0.14 -0.10 0.10 0.01 3.23E-13 b a ab ab ab ab ab ab FKBP1A -0.76 0.44 -0.03 0.05 0.48 0.03 0.22 0.11 1.65E-83 b a a a a a a a NSFL1C -1.02 0.64 -0.08 0.27 0.56 0.15 0.22 0.28 1.93E- c a b ab ab b b ab 175 SIRPA -0.25 0.13 0.00 0.12 0.02 -0.04 -0.01 -0.11 8.68E-07 b a ab ab ab ab ab ab 123 Table B.1 (cont’d) STK35 -0.63 0.56 -0.13 0.26 0.31 0.33 0.26 0.39 4.34E- c a b ab ab a ab a 109 TGM6 -1.82 1.13 -0.09 0.02 0.76 0.26 0.25 0.28 0 e a d cd ab bc bc bc SNRPB -0.61 0.58 -0.12 0.22 0.64 0.25 0.24 0.30 8.01E- c a b ab a ab ab ab 109 ZNF343 -0.81 0.57 -0.09 0.21 0.34 0.24 0.21 0.16 8.30E- c a b b ab b b b 123 TMC2 -0.30 0.18 0.00 0.00 0.10 0.03 -0.10 -0.02 3.89E-11 b a ab ab ab ab ab ab NOP56 -0.64 0.55 -0.10 0.08 0.49 0.15 0.22 0.29 3.45E-96 c a b b ab b ab ab IDH3B -0.91 0.62 -0.06 0.18 0.42 0.00 0.19 0.27 1.62E- c a b b ab b b ab 143 EBF4 -0.40 0.08 0.03 -0.13 -0.20 0.07 -0.07 0.17 1.39E-14 b a a ab ab a ab a PCED1A -0.64 0.46 -0.03 0.00 0.52 0.03 0.02 -0.15 6.69E-69 b a a a a a a ab VPS16 -0.92 0.61 -0.08 0.16 0.56 0.17 0.20 0.25 2.78E- c a b b ab b b ab 148 PTPRA -0.88 0.52 -0.04 0.10 0.27 0.18 0.11 0.12 6.66E- c a b b ab b b b 115 MRPS26 -0.76 0.57 -0.06 0.05 0.42 0.08 0.15 0.07 3.16E- c a b b ab b b b 106 OXT -1.81 1.15 -0.10 0.04 0.76 0.25 0.24 0.28 0 e a d cd ab bc bc bc UBOX5 -0.90 0.51 -0.03 -0.02 0.31 0.22 0.11 -0.05 1.43E- c a b b ab ab b b 115 FASTKD -0.78 0.55 -0.07 0.16 0.40 0.12 0.20 0.14 1.38E- 5 c a b ab ab b ab ab 107 DDRGK1 -0.84 0.52 -0.01 -0.13 0.14 0.02 -0.02 -0.02 3.47E- c a b b ab b b b 104 ITPA -0.76 0.51 -0.05 0.18 0.39 0.02 0.17 0.09 1.49E-94 c a b ab ab b ab ab SLC4A1 -0.37 0.14 0.02 -0.09 0.04 0.03 -0.02 -0.13 5.10E-13 1 b a ab ab ab ab ab ab C20orf19 -0.56 0.12 0.02 0.12 0.16 0.15 0.06 -0.07 2.36E-28 4 ab a a a a a a ab ATRN -0.74 0.36 -0.05 0.17 0.17 0.32 0.31 0.13 2.33E-78 c a b ab ab a a ab ADAM33 -1.83 1.13 -0.09 0.03 0.74 0.24 0.24 0.33 0 e a d cd ab bc bc bc HSPA12 -1.74 1.11 -0.10 0.08 0.73 0.22 0.26 0.37 0 B d a c bc ab b b b 124 Table B.1 (cont’d) C20orf27 -1.71 1.13 -0.11 0.07 0.81 0.23 0.25 0.32 0 e a d cd ab bc bc bc SPEF1 -0.36 0.23 0.00 0.03 0.11 -0.04 -0.03 -0.05 1.58E-16 b a b ab ab b b b CENPB -0.88 0.57 -0.04 0.07 0.37 0.06 0.06 0.16 2.18E- c a b b ab b b ab 121 CDC25B -0.27 0.45 -0.10 0.18 0.49 0.07 0.14 0.15 4.42E-46 ab a ab ab a ab ab ab AP5S1 -0.87 0.60 -0.05 -0.01 0.47 0.03 0.10 0.04 6.14E- c a b b ab b b b 128 MAVS -0.73 0.52 -0.06 0.00 0.17 0.24 0.17 0.02 1.68E-95 d a c bc abc ab abc bc PANK2 -0.94 0.49 -0.03 0.19 0.38 0.01 0.19 0.18 1.77E- c a b ab ab b ab ab 118 RNF24 -0.53 0.30 -0.05 0.15 0.24 0.21 0.22 0.06 6.54E-42 b a b ab ab ab ab ab SMOX -1.79 1.14 -0.10 -0.01 0.75 0.23 0.23 0.30 0 e a d cd ab bc bc bc PRNP -0.46 0.17 0.01 0.13 0.31 -0.04 0.04 0.01 6.22E-21 ab a ab a a ab a ab PRND -1.79 1.13 -0.10 0.04 0.77 0.27 0.21 0.33 0 e a d cd ab bc bc bc SLC23A -0.54 0.18 -0.01 0.16 0.16 0.22 0.04 0.06 2.17E-30 2 b a ab ab ab a ab ab TMEM23 -0.87 0.58 -0.05 0.19 0.18 0.13 0.07 0.04 5.73E- 0 c a b ab ab b b b 124 PCNA -0.48 0.50 -0.12 0.31 0.58 0.17 0.22 0.35 3.74E-78 b a a a a a a a CDS2 -0.82 0.51 -0.06 0.11 0.42 0.28 0.12 0.09 1.04E- c a b b ab ab b b 108 PROKR2 -1.82 1.15 -0.10 0.05 0.76 0.27 0.22 0.27 0 e a d cd ab bc bc bc GPCPD1 -0.48 0.32 0.00 0.11 -0.09 -0.11 -0.17 0.05 2.83E-35 c a b ab bc bc bc ab C20orf19 -0.63 0.41 0.01 -0.08 -0.09 -0.15 -0.13 0.02 1.42E-59 6 c a b b bc bc bc ab CHGB -0.24 0.04 0.05 -0.20 -0.24 -0.12 -0.12 0.05 1.59E-07 ab ab a ab ab ab ab a TRMT6 -0.65 0.63 -0.13 0.47 0.66 0.14 0.23 0.37 2.67E- c a b a a ab a a 127 MCM8 -0.45 0.48 -0.14 0.39 0.53 0.29 0.28 0.33 3.12E-81 b a a a a a a a CRLS1 -0.74 0.56 -0.04 0.09 0.16 -0.05 0.13 -0.06 3.67E-99 c a b b ab b b b 125 Table B.1 (cont’d) LRRN4 -1.82 1.13 -0.09 0.02 0.73 0.25 0.21 0.25 0 e a d cd ab bc bc bc FERMT1 -1.71 1.08 -0.09 0.08 0.70 0.22 0.22 0.21 0 d a c bc ab b b bc BMP2 -0.31 0.14 0.03 0.10 0.07 -0.23 -0.06 -0.02 6.35E-13 b a ab ab ab b ab ab TMX4 -0.61 0.28 0.02 0.10 -0.17 -0.14 0.02 0.03 1.08E-42 c a b ab bc b ab ab PLCB1 -0.37 0.16 0.00 0.17 -0.16 -0.01 0.08 -0.03 3.30E-14 b a a a ab ab a ab PLCB4 -0.16 0.10 0.04 -0.06 -0.68 -0.22 -0.17 -0.09 6.38E-11 a a a a a a a a ANKEF1 -0.48 0.41 -0.08 0.28 0.17 0.30 0.10 0.06 3.15E-56 c a bc ab abc a abc abc SNAP25 -0.22 0.05 0.04 0.12 -0.04 -0.08 -0.21 -0.07 3.55E-06 ab a a a ab ab ab ab MKKS -0.92 0.59 -0.06 0.20 0.30 0.09 0.19 0.22 3.32E- c a b ab ab b b ab 139 SLX4IP -0.43 0.31 -0.07 0.29 0.22 0.32 0.18 0.08 4.54E-41 c a bc ab abc a abc abc JAG1 -0.42 0.22 -0.02 0.19 0.16 0.13 -0.07 0.13 1.37E-22 c a bc ab abc abc bc abc BTBD3 -0.58 0.30 -0.02 0.28 0.14 -0.03 0.16 0.12 2.76E-44 c a bc ab abc bc ab abc SPTLC3 -0.44 0.10 0.06 0.16 -0.20 -0.24 -0.16 -0.03 4.76E-23 b a a a ab ab ab ab TASP1 -0.75 0.49 -0.05 0.26 0.29 0.08 0.11 0.08 2.45E-89 c a b ab ab b b b ESF1 -0.58 0.49 -0.10 0.32 0.45 0.20 0.18 0.21 1.21E-80 c a b ab ab ab ab ab NDUFAF -0.81 0.59 -0.06 0.21 0.45 0.00 0.08 0.18 2.31E- 5 c a b ab ab b b ab 117 MACRO -0.36 0.10 0.06 0.06 -0.09 -0.22 -0.12 -0.20 4.52E-17 D2 b a ab ab ab ab ab ab FLRT3 -0.35 0.15 0.02 0.20 0.06 -0.11 -0.15 0.14 8.51E-15 b a ab a ab ab ab ab KIF16B -0.53 0.39 -0.07 0.22 0.18 0.25 0.13 0.18 1.10E-55 c a bc ab abc a abc ab SNRPB2 -0.74 0.67 -0.10 0.26 0.62 0.00 0.17 0.24 1.35E- c a b ab ab b b ab 133 BFSP1 -0.37 0.35 -0.07 0.26 0.40 -0.03 0.23 0.26 3.55E-38 ac a abc ab a abc ab ab DSTN -0.66 0.36 -0.02 0.13 0.18 0.04 0.12 0.08 1.05E-57 c a b ab ab ab ab ab 126 Table B.1 (cont’d) RRBP1 -1.72 1.09 -0.09 0.07 0.78 0.22 0.20 0.26 0 e a d cd ab bc bc bc SNX5 -0.79 0.46 -0.02 0.06 0.30 0.07 0.11 -0.13 7.95E-87 c a b b ab b ab b SNORD1 -0.42 0.25 -0.04 0.10 0.41 0.18 0.18 0.01 1.50E-28 7 b a ab a a a a ab MGME1 -0.68 0.58 -0.13 0.41 0.71 0.25 0.31 0.25 9.72E- c a b a a ab a ab 124 ZNF133 -0.81 0.55 -0.07 0.22 0.53 0.11 0.15 0.15 5.22E- c a b ab ab b b ab 113 DZANK1 -0.51 0.42 -0.04 0.14 0.07 0.01 -0.10 0.09 5.46E-50 c a bc ab abc bc bc ab POLR3F -0.75 0.60 -0.09 0.32 0.45 0.09 0.15 0.23 1.19E- c a b ab ab b b ab 117 RBBP9 -0.74 0.43 -0.02 0.25 0.10 -0.03 0.02 0.06 2.23E-76 c a b ab ab b b ab SEC23B -0.63 0.37 -0.03 0.06 0.02 0.14 0.08 0.08 1.19E-55 c a bc ab abc ab ab ab SMIM26 -0.90 0.58 -0.03 0.09 0.26 -0.02 0.05 0.01 5.27E- c a b b ab b b b 126 DTD1 -0.58 0.46 -0.04 0.12 0.26 -0.18 0.16 0.19 1.65E-66 c a b ab ab b ab ab LINC006 -0.16 0.30 -0.04 -0.04 -0.05 -0.07 -0.11 0.07 1.24E-15 52 b a b b b b b ab SLC24A -0.19 0.09 0.03 -0.04 -0.02 -0.01 -0.30 -0.15 2.92E-07 3 ab a ab ab ab ab b ab RIN2 -0.46 0.31 -0.04 0.15 0.23 0.05 0.11 0.07 1.21E-32 b a b ab ab ab ab ab NAA20 -0.61 0.60 -0.13 0.25 0.53 0.27 0.22 0.22 3.61E- c a b b ab b b b 110 CRNKL1 -0.73 0.50 -0.07 0.23 0.15 0.20 0.15 0.15 5.95E-93 c a b ab ab ab ab ab DEFB12 -0.10 0.10 -0.07 0.34 0.13 0.31 0.19 0.05 -7.80E- 7 ab ab ab a ab a ab ab 11 C20orf97 -0.58 0.36 0.00 0.02 0.08 -0.08 -0.07 -0.12 -7.88E- b a b ab ab b b b 11 ZCCHC4 -0.86 0.52 -0.04 0.18 0.36 -0.03 0.20 0.09 -7.96E- c a b ab ab b ab ab 11 SOX13 -0.53 0.35 -0.07 0.09 0.35 0.24 0.24 0.17 -8.03E- c a b ab ab ab ab ab 11 NRSN3 -0.54 0.28 -0.03 0.05 0.26 0.12 0.17 0.17 -8.11E- c a b ab ab ab ab ab 11 TRIB4 -0.14 0.33 -0.10 -0.12 0.44 0.16 0.26 0.25 -8.19E- ab a ab ab a ab a ab 11 127 Table B.1 (cont’d) RBCK2 -0.80 0.62 -0.06 0.05 0.36 -0.01 0.13 0.09 -8.27E- c a b b ab b b b 11 TBC1D2 -0.91 0.51 -0.04 0.11 0.21 0.14 0.17 0.09 -8.34E- 1 c a b b ab b ab b 11 CSNK2A -0.77 0.58 -0.10 0.24 0.26 0.21 0.28 0.27 -8.42E- 2 d a c abc abc bc ab abc 11 SRXN2 -0.44 0.39 -0.10 0.08 0.31 0.25 0.42 0.21 -8.50E- b a b ab ab ab a ab 11 SLC52A -0.24 0.22 -0.04 0.06 0.16 0.17 0.07 -0.14 -8.57E- 4 c a bc abc abc ab abc bc 11 FAM110 -0.51 0.39 -0.07 0.13 0.49 0.22 0.10 0.23 -8.65E- A b a a a a a a a 11 PSMF2 -1.05 0.67 -0.07 0.11 0.44 0.20 0.27 0.11 -8.73E- c a b b ab b b b 11 TMEM74 -0.12 0.29 -0.05 0.07 0.32 -0.08 0.06 -0.10 -8.81E- B ab a ab ab a ab ab ab 11 C20orf20 -0.18 0.21 -0.02 -0.20 0.21 -0.05 0.08 -0.05 -8.88E- 3 ab a ab ab a ab ab ab 11 SNPH -1.77 1.12 -0.09 0.04 0.78 0.21 0.24 0.26 -9E-11 e a d cd ab bc bc bc SDCBP3 -0.33 0.17 0.00 0.15 0.14 -0.10 0.10 0.01 -9.04E- b a ab ab ab ab ab ab 11 FKBP1A -0.76 0.44 -0.03 0.05 0.48 0.03 0.22 0.11 -9.12E- b a a a a a a a 11 NSFL1C -1.02 0.64 -0.08 0.27 0.56 0.15 0.22 0.28 -9.19E- c a b ab ab b b ab 11 SIRPA -0.25 0.13 0.00 0.12 0.02 -0.04 -0.01 -0.11 -9.27E- b a ab ab ab ab ab ab 11 MUC6 -0.11 0.15 -0.06 0.43 0.17 0.10 -0.04 0.32 1.99E-12 b ab b a ab ab ab ab KAT14 -0.83 0.50 -0.03 0.04 0.20 0.08 0.10 0.14 4.33E- c a b b ab b b ab 101 LOC653 -0.02 -0.03 -0.01 0.05 -0.29 0.67 0.03 -1.04 2.59E-53 566 b b b b bc a b c LOC100 -0.42 0.37 -0.04 0.13 0.28 0.03 -0.04 -0.05 2.85E-36 270804 b a b ab ab b b b LOC100 0.01 -0.10 -0.01 0.10 -0.27 0.54 -0.03 -0.49 2.12E-23 130987 b b b ab b a b b ProSAPi -0.35 0.27 0.00 0.09 -0.10 -0.16 0.06 -0.25 9.78E-23 P1 b a b ab b b ab b LOC645 -0.11 0.00 0.00 -0.08 -0.40 0.50 -0.06 -0.53 2.62E-22 332 bc b b bc bc a bc c LOC100 -0.12 0.12 -0.07 0.27 0.50 0.24 0.16 0.23 3.90E-14 289673 ab ab ab a a a ab ab 128 Table B.1 (cont’d) LOC730 -0.06 -0.08 0.06 0.28 0.23 -0.10 -0.47 -0.15 4.63E-14 101 ab ab ab a ab ab b ab ATPGD1 -0.04 0.09 -0.07 0.32 0.27 0.36 0.02 0.09 1.63E-13 b ab b ab ab a ab ab WASH5P -0.25 0.21 -0.03 0.23 0.17 -0.04 0.01 0.04 3.19E-12 ab a ab a ab ab ab ab C20orf46 -0.06 0.17 -0.07 0.29 -0.04 0.19 0.10 0.21 7.38E-12 ab a ab a ab a ab a C19orf22 -0.08 0.01 -0.04 0.30 -0.05 0.38 0.00 -0.07 2.20E-10 b b b ab b a b b SUV420 -0.13 -0.11 0.04 -0.06 -0.39 0.23 -0.01 -0.38 2.32E-10 H1 b b ab ab b a ab b LINC015 -0.02 -0.01 0.04 -0.02 0.02 -0.02 -0.53 0.04 2.49E-10 12 ab ab a ab ab ab b a LOC149 -0.22 0.15 0.02 -0.16 -0.14 0.00 -0.14 -0.16 3.95E-08 837 b a ab ab ab ab ab ab SINHCA 0.08 0.09 -0.04 0.40 -0.37 -0.11 0.17 0.05 6.33E-08 F a a a a a a a a LOC100 -0.21 0.17 -0.01 0.13 0.10 -0.10 0.01 -0.02 2.18E-07 134868 b a ab ab ab ab ab ab LOC642 0.11 0.05 -0.07 0.25 0.10 0.18 0.11 0.11 1.25E-06 846 a a a a a a a a KIAA102 0.10 -0.03 0.03 0.09 -0.03 -0.22 -0.29 0.01 2.81E-06 6 a a a a a a a a LOC374 -0.09 -0.07 0.00 0.05 -0.28 0.31 0.00 -0.02 6.53E-06 443 b b b ab b a b b 129 Figure B.1: Clustering of tumor samples (no constraints). Tumor clusters were obtained by sequential application of tSNE and DBSCAN algorithm for 5,408 samples from 33 cancer types. The contours reflect cluster membership, and the points’ colors and shapes represent similar anatomical sites and cancer types, respectively. After removing the first two, the two-dimensional tSNE projection was obtained from the first 50 principal axes of the extended omic matrix. Extended omic matrix contained appended values of gene expression, DNA methylation, and copy number variant intensity. Integers represent individual clusters. Clusters were also annotated in terms of their most enriched histological/molecular subtypes. 130 Figure B.2: Re-classification of tumors and previously reported molecular subtypes. 131 Figure B.3: Survival curves by pan-cancer tumor clusters. The figure shows Kaplan- Meier curves highlighting the survival probability by time in years for each cluster. Log- rank tests were performed to determined significant differences between curves. The legend shows the results of multiple comparisons between survival curves. Statistical differences are represented with different letters. 132 Figure B.4: Re-classification of tumors reveals differences in sample type. The relative position and number of primary, normal, and metastatic tissue samples are shown. The figure at the left shows the location of the samples by clusters. The figure at the right shows the relative proportion of sample types by cluster. 133 Figure B.5: Expression and copy numbers for transcription factors and targets. The expression and copy number variation by genes is shown by cluster (C1-8). The colors by gene name represent groups defined by different transcription factors and their targets (e.g., black represents the group of FOXM1 and its targets KRAS and SPTBN2). TFs names are shown with italic and larger font sizes. The number at the left of the dendrogram represents a grouping of genes based on k-means clustering. 134 APPENDIX C SUPPLEMENTARY MATERIAL FOR CHAPTER 4 Figure C.1: Plots of singular values trajectories. A) Scree plot by latent SVD dimension. B) Cumulative proportion of variance explained by latent dimension. C) First empirical partial derivative of singular values by latent dimension. D) Second empirical partial derivative of singular values by latent dimension. The points and bars in each panel represent average and standard deviations (SD) from 1x105 bootstrap repetitions of the sparse SVD applied to the matrix of phenotypic variables. 135 Figure C.2: Comparison between dense and sparse latent factors Figure C.3: Annotated Manhattan plot for PEB 1 136 Figure C.4: Annotated Manhattan plot for PEB 2 Figure C.5: Annotated Manhattan plot for PEB 3 137 Figure C.6: Annotated Manhattan plot for PEB 4 Figure C.7: Annotated Manhattan plot for PEB 5 138 Figure C.8: Summary of previously reported information for genes in Figure 4.4. Each row in the heatmap corresponds to gene mapping onto a significant peak from the GWAS on each PEB variable. The first sets of panels summarize the average effects of significant GWAS peaks with cis-eQTL in GTEx. The color of each cell represents the magnitude and sign of the cis-eQTL effect on the expression of that gene in each of a set of tissues. The order of the subpanels corresponds to a broader aggrupation of tissues in main organs and systems: digestive, reproductive-urinary, neuro-muscular; skin, and adipose tissues, circulatory and glandular. The following panels correspond to results from epigenomic experiments conducted in previous studies. The second panel shows where the peak was in a zone of positive DNase1 activity (as captured by ATAC-seq assays). The Third panel shows whether transcription factors were bound to the region (as captured by CHIP-seq assays. Each column represents a broad classification of transcription factors within families representing binding motifs. The last panel shows the presence of evidence for histone modifications (as captured by CHIP-seq). Each column represents a different histone mark. 139 BIBLIOGRAPHY 140 BIBLIOGRAPHY 1. Vailati-Riboni M, Palombo V, Loor JJ. What Are Omics Sciences? Periparturient Dis Dairy Cows A Syst Biol Approach. 2017; 1–7. 2. Karczewski KJ, Snyder MP. Integrative omics for health and disease. Nat Rev Genet. 2018;19: 299–310. 3. Zelenin A V., Rodionov A V., Bolsheva NL, Badaeva ED, Muravenko O V. Genome: Origins and evolution of the term. Mol Biol. 2016;50: 542–550. 4. Yadav SP. The wholeness in suffix -omics, -omes, and the word om. J Biomol Tech. 2007;18: 277. 5. Eddy SR. The C-value paradox, junk DNA and ENCODE. Curr Biol. 2012;22: R898-9. 6. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A. 1977;74: 5463–7. 7. Saiki R, Gelfand D, Stoffel S, Scharf S, Higuchi R, Horn G, et al. Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science (80- ). 1988;239: 487–491. 8. Cohen SN, Chang AC, Boyer HW, Helling RB. Construction of biologically functional bacterial plasmids in vitro. Proc Natl Acad Sci U S A. 1973;70: 3240–4. 9. Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, et al. Fluorescence detection in automated DNA sequence analysis. Nature. 1986;321: 674– 679. 10. Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu F, Zhang H, et al. The International HapMap Project. Nature. 2003;426: 789–796. 11. ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science (80- ). 2004;306: 636–640. 12. Heather JM, Chain B. The sequence of sequencers: The history of sequencing DNA. Genomics. 2016;107: 1–8. 13. Auton A, Abecasis GR, Altshuler DM, Durbin RM, Bentley DR, Chakravarti A, et al. A global reference for human genetic variation. Nature. 2015;526: 68–74. 141 14. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562: 203–209. 15. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science. 1995;270: 484–7. 16. Lowe R, Shirley N, Bleackley M, Dolan S, Shafee T. Transcriptomics technologies. PLOS Comput Biol. 2017;13: e1005457. 17. Eom EM, Lee JY, Park HS, Byun YJ, Ha-Lee YM, Lee DH. Comparison between SAGE and cDNA microarray for quantitative accuracy in transcript profiling analyses. J Plant Biol. 2006;49: 498–506. 18. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10: 57–63. 19. Wilkins MR, Pasquali C, Appel RD, Ou K, Golaz O, Sanchez J-C, et al. From Proteins to Proteomes: Large Scale Protein Identification by Two-Dimensional Electrophoresis and Arnino Acid Analysis. Nat Biotechnol. 1996;14: 61–65. 20. Tomizaki K, Usui K, Mihara H. Protein-protein interactions and selection: array- based techniques for screening disease-associated biomarkers in predictive/early diagnosis. FEBS J. 2010;277: 1996–2005. 21. Tollefsbol TO. Advances in epigenetic technology. Methods Mol Biol. 2011;791: 1–10. 22. Houle D, Govindaraju DR, Omholt S. Phenomics: the next challenge. Nat Rev Genet 2010 1112. 2010;11: 855–866. 23. Morgan H, Beck T, Blake A, Gates H, Adams N, Debouzy G, et al. EuroPhenome: a repository for high-throughput mouse phenotyping data. Nucleic Acids Res. 2010;38: D577–D585. 24. Pieruschka R, Schurr U. Plant phenotyping: Past, present, and future. Plant Phenomics. 2019;2019. 25. Poldrack RA, Congdon E, Triplett W, Gorgolewski KJ, Karlsgodt KH, Mumford JA, et al. A phenome-wide examination of neural and cognitive function. Sci Data 2016 31. 2016;3: 1–12. 26. Vyssotski AL, Serkov AN, Itskov PM, Dell’Omo G, Latanov A V., Wolfer DP, et al. Miniature neurologgers for flying pigeons: Multichannel EEG and action and field potentials in combination with GPS recording. J Neurophysiol. 2006;95: 1263–1273. 142 27. Montes JM, Melchinger AE, Reif JC. Novel throughput phenotyping platforms in plant genetic studies. Trends Plant Sci. 2007;12: 433–436. 28. Yugi K, Kubota H, Hatano A, Kuroda S. Trans-Omics: How To Reconstruct Biochemical Networks Across Multiple ‘Omic’ Layers. Trends Biotechnol. 2016;34: 276– 290. 29. Gomez-Cabrero D, Abugessaisa I, Maier D, Teschendorff A, Merkenschlager M, Gisel A, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol. 2014;8 Suppl 2: I1. 30. Ebrahim A, Brunk E, Tan J, O’Brien EJ, Kim D, Szubin R, et al. Multi-omic data integration enables discovery of hidden biological regularities. Nat Commun. 2016;7: 13091. 31. Yan S, Xu D, Zhang B, Zhang H, Yang Q, Lin S. Graph Embedding and Extensions: A General Framework for Dimensionality Reduction. IEEE Trans Pattern Anal Mach Intell. 2007;29: 40–51. 32. Zhang S, Liu C-C, Li W, Shen H, Laird PW, Zhou XJ. Discovery of multi- dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res. 2012;40: 9379–9391. 33. Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009;25: 2906–12. 34. Lock EF, Hoadley KA, Marron JS, Nobel AB. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat. 2013;7: 523–542. 35. Ray P, Zheng L, Lucas J, Carin L. Bayesian joint analysis of heterogeneous genomics data. Bioinformatics. 2014;30: 1370–1376. 36. Kirk P, Griffin JE, Savage RS, Ghahramani Z, Wild DL. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics. 2012;28: 3290–3297. 37. Lock EF, Dunson DB. Bayesian consensus clustering. Bioinformatics. 2013;29: 2610–2616. 38. Wang J. Laplacian Eigenmaps. Geometric Structure of High-Dimensional Data and Dimensionality Reduction. Berlin, Heidelberg: Springer Berlin Heidelberg; 2012. pp. 235–247. 39. Bowen GJ. Isoscapes: Spatial Pattern in Isotopic Biogeochemistry. Annu Rev Earth Planet Sci. 2010;38: 161–187. 143 40. Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000;290: 2323–6. 41. Wahba G. An introduction to reproducing kernel hilbert spaces and why they are so useful. IFAC Proc Vol. 2003;36: 525–528. 42. Schölkopf B, Smola A, Müller KR. Kernel principal component analysis. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 1997. pp. 583–588. 43. Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11: 333–337. 44. Speicher NK, Pfeifer N. Integrating different data types by regularized unsupervised multiple kernel learning with application to cancer subtype discovery. Bioinformatics. 2015;31: i268-75. 45. Cun Y, Fröhlich H. Network and Data Integration for Biomarker Signature Discovery via Network Smoothed T-Statistics. Boccaletti S, editor. PLoS One. 2013;8: e73074. \ 46. Seoane JA, Day INM, Gaunt TR, Campbell C. A pathway-based data integration framework for prediction of disease progression. Bioinformatics. 2014;30: 838–845. 47. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol. 2009;8: Article28. 48. Chen X, Liu H. An Efficient Optimization Algorithm for Structured Sparse CCA, with Applications to eQTL Mapping. Stat Biosci. 2012;4: 3–26. 49. Lin D, Zhang J, Li J, Calhoun VD, Deng H-W, Wang Y-P. Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics. 2013;14: 245. 50. Wangen LE, Kowalski BR. A multiblock partial least squares algorithm for investigating complex chemical systems. J Chemom. 1989;3: 3–20. 51. Li W, Zhang S, Liu C-C, Zhou XJ. Identifying multi-layer gene regulatory modules from multi-dimensional genomic data. Bioinformatics. 2012;28: 2458–2466. 52. Izenman AJ. Reduced-rank regression for the multivariate linear model. J Multivar Anal. 1975;5: 248–264. 53. Vounou M, Nichols TE, Montana G, Alzheimer’s Disease Neuroimaging Initiative the ADN. Discovering genetic associations with high-dimensional neuroimaging phenotypes: A sparse reduced-rank regression approach. Neuroimage. 2010;53: 1147– 59. 144 54. Goh G, Dey DK, Chen K. Bayesian sparse reduced rank multivariate regression. J Multivar Anal. 2017;157: 14–28. 55. Lu LJ, Xia Y, Paccanaro A, Yu H, Gerstein M. Assessing the limits of genomic data integration for predicting protein networks. Genome Res. 2005;15: 945–53. 56. Tini G, Marchetti L, Priami C, Scott-Boyer M-P. Multi-omics integration—a comparison of unsupervised clustering methodologies. Brief Bioinform. 2017 [cited 13 Feb 2019]. 57. Chuanchao Zhang, Juan Liu, Qianqian Shi, Xiangtian Yu, Tao Zeng, Luonan Chen. Integration of multiple heterogeneous omics data. 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2016. pp. 564–569. 58. Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: An R package for ‘omics feature selection and multiple data integration. Schneidman D, editor. PLOS Comput Biol. 2017;13: e1005752. 59. González-Reymúndez A, De Los Campos G, Gutiérrez L, Lunt SY, Vazquez AI. Prediction of years of life after diagnosis of breast cancer using omics and omic-by- treatment interactions. Eur J Hum Genet. 2017;25: 538–544. 60. Vazquez AI, Veturi Y, Behring M, Shrestha S, Kirst M, Resende MFR, et al. Increased proportion of variance explained and prediction accuracy of survival of breast cancer patients with use of whole-genome multiomic profiles. Genetics. 2016;203: 1425– 1438. 61. Vazquez A, Wiener H, Shrestha S, Tiwari H, de los Campos G. Integration of Multi- Layer Omic Data for Prediction of Disease Risk in Humans. Proceedings, 10th World Congress of Genetics Applied to Livestock Production. 2014. p. 6. 62. Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of integrating data to uncover genotype–phenotype interactions. Nat Rev Genet. 2015;16: 85–97. 63. Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome Biology. BioMed Central Ltd.; 2017. pp. 1–15. 64. Müller H, Dagher G, Loibner M, Stumptner C, Kungl P, Zatloukal K. Biobanks for life sciences and personalized medicine: importance of standardization, biosafety, biosecurity, and data management. Curr Opin Biotechnol. 2020;65: 45–51. 65. Mangul S, Martin LS, Hill BL, Lam AKM, Distler MG, Zelikovsky A, et al. Systematic benchmarking of omics computational tools. Nat Commun. 2019;10: 1–11. 145 66. Chiroma H, Abdullahi UA, Abdulhamid SM, Abdulsalam Alarood A, Gabralla LA, Rana N, et al. Progress on Artificial Neural Networks for Big Data Analytics: A Survey. IEEE Access. 2019;7: 70535–70551. 67. Lorber A, Wangen LE, Kowalski BR. A theoretical foundation for the PLS algorithm. J Chemom. 1987;1: 19–31. 68. Zou H, Zou H, Hastie T. Regularization and variable selection via the Elastic Net. J R Stat Soc Ser B. 2005;67: 301–320. 69. Shen H, Huang JZ. Sparse principal component analysis via regularized low rank matrix approximation. J Multivar Anal. 2008;99: 1015–1034. 70. Ester M, Kriegel H-P, Sander J, Xu X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. 1996. pp. 226–231. 71. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20: 53–65. 72. Taskesen E, Huisman SMH, Mahfouz A, Krijthe JH, de Ridder J, van de Stolpe A, et al. Pan-cancer subtyping in a 2D-map shows substructures that are driven by specific combinations of molecular characteristics. Sci Rep. 2016;6: 24949. 73. van der Maaten L, Hinton G. Visualizing Data using t-SNE. J Mach Learn Res. 2008;9: 2579–2605. 74. Linderman GC, Steinerberger S. Clustering with t-SNE, provably. arXiv.org. 2017 [cited 28 Nov 2018]. 75. Privé F, Aschard H, Ziyatdinov A, Blum MGB. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics. 2018;34: 2781–2787. 76. Fisher RA. The use of multipole measurements in taxonomic problems. Ann Eugen. 1936;7: 179–188. 77. Hahsler M, Piekenbrock M. dbscan: Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms. CRAN; 2017. 78. Bengtsson H. A Unifying Framework for Parallel and Distributed Processing in R using Futures. 2020 [cited 27 May 2021]. 79. Wickham H. Ggplot2 : elegant graphics for data analysis. Springer; 2009. 146 80. Tarazona S, Martinez C. Bioconductor - MOSimMulti-Omics Simulation (MOSim). 2021. 81. Gomez-Cabrero D, Tarazona S, Ferreirós-Vidal I, Ramirez RN, Company C, Schmidt A, et al. STATegra, a comprehensive multi-omics dataset of B-cell differentiation in mouse. Sci Data 2019 61. 2019;6: 1–15. 82. Gaujoux R, Seoighe C. A flexible R package for nonnegative matrix factorization. BMC Bioinformatics. 2010;11: 367. 83. el Bouhaddani S, Uh HW, Jongbloed G, Hayward C, Klarić L, Kiełbasa SM, et al. Integrating omics datasets with the OmicsPLS package. BMC Bioinformatics. 2018;19: 371. 84. Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016;32: 2847–2849. 85. Conesa A, Beck S. Making multi-omics data accessible to researchers. Sci Data. 2019;6: 1–4. 86. Zhang T. On the Consistency of Feature Selection using Greedy Least Squares Regression. J Mach Learn Res. 2009. 87. Zhang JM, Harman M, Guedj B, Barr ET, Shawe-Taylor J. Perturbation Validation: A New Heuristic to Validate Machine Learning Models. 2019 [cited 19 Jun 2021]. 88. González-Reymúndez A, Vázquez AI. Multi-omic signatures identify pan-cancer classes of tumors beyond tissue of origin. Sci Rep. 2020;10: 8341. 89. Jamal-Hanjani M, Quezada SA, Larkin J, Swanton C. Translational Implications of Tumor Heterogeneity. Clin Cancer Res. 2015;21: 1258–1266. 90. Lawrence MS, Stojanov P, Polak P, Kryukov G V., Cibulskis K, Sivachenko A, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499: 214–218. 91. Burrell RA, McGranahan N, Bartek J, Swanton C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature. 2013;501: 338–45. 92. Langlands FE, Horgan K, Dodwell DD, Smith L. Breast cancer subtypes: response to radiotherapy and potential radiosensitisation. Br J Radiol. 2013;86: 20120601. 93. McGranahan N, Swanton C. Clonal Heterogeneity and Tumor Evolution: Past, Present, and the Future. Cell. 2017;168: 613–628. 147 94. Abdullah LN, Chow EK-H. Mechanisms of chemoresistance in cancer stem cells. Clin Transl Med. 2013;2: 3. 95. Chang K, Creighton CJ, Davis C, Donehower L, Drummond J, Wheeler D, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45: 1113– 1120. 96. Behring M, Shrestha S, Manne U, Cui X, Gonzalez-Reymundez A, Grueneberg A, et al. Integrated landscape of copy number variation and RNA expression associated with nodal metastasis in invasive ductal breast carcinoma. Oncotarget. 2018;9: 36836–36848. 97. Vazquez AI, Veturi Y, Behring M, Shrestha S, Kirst M, Resende MF, et al. Increased Proportion of Variance Explained and Prediction Accuracy of Survival of Breast Cancer Patients with Use of Whole-Genome Multi-omic Profiles. Genetics. 2016; genetics–115. 98. Bernal Rubio YL, González Reymúndez A, Wu K-HH, Griguer CE, Steibel JP, de Los Campos G, et al. Whole-Genome Multi-omic Study of Survival in Patients with Glioblastoma Multiforme. G3 (Bethesda). 2018; g3.200391.2018. 99. González-Reymúndez A, de los Campos G, Gutiérrez L, Lunt SY, Vazquez AI. Prediction of years of life after diagnosis of breast cancer using omics and omic-by- treatment interactions. Eur J Hum Genet. 2017 [cited 13 Mar 2017]. 100. Sánchez-Vega F, Gotea V, Margolin G, Elnitski L. Pan-cancer stratification of solid human epithelial tumors and cancer cell lines reveals commonalities and tissue-specific features of the CpG island methylator phenotype. Epigenetics Chromatin. 2015;8. 101. Hoadley KA, Yau C, Stuart JM, Benz CC, Correspondence PWL. Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer. Cell. 2018;173: 291–304. 102. Hoadley KA, Yau C, Wolf DM, Cherniack AD, Tamborero D, Ng S, et al. Multiplatform Analysis of 12 Cancer Types Reveals Molecular Classification within and across Tissues of Origin. Cell. 2014;158: 929–944. 103. Grossman RL, Heath AP, Ferretti V, Varmus HE, Lowy DR, Kibbe WA, et al. Toward a Shared Vision for Cancer Genomic Data. N Engl J Med. 2016;375: 1109–1112. 104. Zhu Y, Qiu P, Ji Y. TCGA-Assembler: open-source software for retrieving and processing TCGA data. Nat Methods. 2014;11: 599–600. 105. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9: 559. 148 106. Du P, Zhang X, Huang C-C, Jafari N, Kibbe WA, Hou L, et al. Comparison of Beta- value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010;11: 587. 107. Lazar C, Meganck S, Taminau J, Steenhoff D, Coletta A, Molter C, et al. Batch effect removal methods for microarray gene expression data integration: a survey. Brief Bioinform. 2013;14: 469–490. 108. Thorsson VV, Gibbs DL, Brown SD, Wolf D, Bortone DS, Ou Yang T-H, et al. The Immune Landscape of Cancer. Immunity. 2018;48: 812-830.e14. 109. Kruskal WH, Wallis WA. Use of Ranks in One-Criterion Variance Analysis. J Am Stat Assoc. 1952;47: 583–621. 110. Chen B, Khodadoust MS, Liu CL, Newman AM, Alizadeh AA. Profiling Tumor Infiltrating Immune Cells with CIBERSORT. Methods Mol Biol. 2018;1711: 243–259. 111. Waldmann P, Mészáros G, Gredler B, Fuerst C, Sölkner J. Evaluation of the la sso and the elastic net in genome-wide association studies. Front Genet. 2013;4: 270. 112. Baglama J, Reichel L, Lewis BW. irlba: Fast Truncated Singular Value Decomposition and Principal Components Analysis for Large Dense and Sparse Matrices. CRAN R project; 2018. 113. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The human genome browser at UCSC. Genome Res. 2002;12: 996–1006. 114. Jawaid W. enrichr: Gene enrichment using Enrichr in enrichR: Provides an R Interface to “Enrichr.” CRAN R project; 2017. 115. Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2015;43: D447–D452. 116. Dunn OJ. Multiple Comparisons Using Rank Sums. Technometrics. 1964;6: 241– 252. 117. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2017. 118. Krijthe JH. Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes- Hut Implementation. CRAN R project; 2015. 119. Worrall SF. TNM Classification of Malignant Tumours. Br J Oral Maxillofac Surg. 2000;38: 244. 149 120. Yang X, Gao L, Zhang S. Comparative pan-cancer DNA methylation analysis reveals cancer common and specific patterns. Brief Bioinform. 2016; bbw063. 121. Mishra S, Whetstine JR. Different Facets of Copy Number Changes: Permanent, Transient, and Adaptive. Mol Cell Biol. 2016;36: 1050–63. 122. Zack TI, Schumacher SE, Carter SL, Cherniack AD, Saksena G, Tabak B, et al. Pan-cancer patterns of somatic copy number alteration. Nat Genet. 2013;45: 1134–1140. 123. Henrichsen CN, Chaignat E, Reymond A. Copy number variants, diseases and gene expression. Hum Mol Genet. 2009;18: R1-8. 124. Gao Y, Widschwendter M, Teschendorff AE. DNA Methylation Patterns in Normal Tissue Correlate more Strongly with Breast Cancer Status than Copy-Number Variants. EBioMedicine. 2018;31: 243–252. 125. Teschendorff AE, Relton CL. Statistical and integrative system-level analysis of DNA methylation data. Nat Rev Genet. 2018;19: 129–147. 126. Maloney R, Budiman M, Korshunova Y, Monte J, Bacher B, Lakey N, et al. Tissue- specific DNA methylation patterns are frequent targets of epigenetic change in multiple cancer types. Cancer Res. 2008;68: LB-256. 127. Witte T, Plass C, Gerhauser C. Pan-cancer patterns of DNA methylation. Genome Med. 2014;6: 66. 128. Hanahan D, Weinberg RA. Hallmarks of Cancer: The Next Generation. Cell. 2011;144: 646–674. 129. Stephens PJ, Greenman CD, Fu B, Yang F, Bignell GR, Mudie LJ, et al. Massive Genomic Rearrangement Acquired in a Single Catastrophic Event during Cancer Development. Cell. 2011;144: 27–40. 130. Shen AL, Moran SA, Glover EA, Drinkwater NR, Swearingen RE, Teixeira LB, et al. Association of a Chromosomal Rearrangement Event with Mouse Posterior Polymorphous Corneal Dystrophy and Alterations in Csrp2bp, Dzank1, and Ovol2 Gene Expression. Anderson MG, editor. PLoS One. 2016;11: e0157577. 131. Xu M-D, Liu S-L, Feng Y-Z, Liu Q, Shen M, Zhi Q, et al. Genomic characteristics of pancreatic squamous cell carcinoma, an investigation by using high throughput sequencing after in-solution hybrid capture. Oncotarget. 2017;8: 14620–14635. 132. Pei Y-F, Ren H-G, Liu L, Li X, Fang C, Huang Y, et al. Genomic variants at 20p11 associated with body fat mass in the European population. Obesity. 2017;25: 757–764. 150 133. Ewing RM, Chu P, Elisma F, Li H, Taylor P, Climie S, et al. Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol Syst Biol. 2007;3: 89. 134. Shah MA, Denton EL, Arrowsmith CH, Lupien M, Schapira M. A global assessment of cancer genomic alterations in epigenetic mechanisms. Epigenetics Chromatin. 2014;7: 29. 135. Wanitchakool P, Wolf L, Koehl GE, Sirianant L, Schreiber R, Kulkarni S, et al. Role of anoctamins in cancer and apoptosis. Philos Trans R Soc B Biol Sci. 2014;369: 20130096. 136. Ayoub C, Wasylyk C, Li Y, Thomas E, Marisa L, Robé A, et al. ANO1 amplification and expression in HNSCC with a high propensity for future distant metastasis and its functions in HNSCC cell lines. Br J Cancer. 2010;103: 715–726. 137. Wang X, Sheu JJ-C, Lai M-T, Yin-Yi Chang C, Sheng X, Wei L, et al. RSF-1 overexpression determines cancer progression and drug resistance in cervical cancer. BioMedicine. 2018;8: 4. 138. Sircoulomb F, Bekhouche I, Finetti P, Adélaïde J, Hamida A Ben, Bonansea J, et al. Genome profiling of ERBB2-amplified breast cancers. BMC Cancer. 2010;10: 539. 139. Peña-Chilet M, Blanquer-Maceiras M, Ibarrola-Villava M, Martinez-Cadenas C, Martin-Gonzalez M, Gomez-Fernandez C, et al. Genetic variants in PARP1 (rs3219090) and IRF4(rs12203592) genes associated with melanoma susceptibility in a Spanish population. BMC Cancer. 2013;13: 160. 140. Hao J-J, Shi Z-Z, Zhao Z-X, Zhang Y, Gong T, Li C-X, et al. Characterization of genetic rearrangements in esophageal squamous carcinoma cell lines by a combination of M-FISH and array-CGH: further confirmation of some split genomic regions in primary tumors. BMC Cancer. 2012;12: 367. 141. Chowdhry S, Zanca C, Rajkumar U, Koga T, Diao Y, Raviram R. NAD Metabolic Dependency Determines Therapeutic Sensitivity in Cancer. Cancer Discov. 2019;9: OF14–OF14. 142. Kim H-J, Maiti P, Barrientos A. Mitochondrial ribosomes in cancer. Semin Cancer Biol. 2017;47: 67–81. 143. Sotgia F, Lisanti MP, Sotgia F, Lisanti MP. Mitochondrial biomarkers predict tumor progression and poor overall survival in gastric cancers: Companion diagnostics for personalized medicine. Oncotarget. 2017;8: 67117–67128. 144. Collisson EA, Campbell JD, Brooks AN, Berger AH, Lee W, Chmielecki J, et al. Comprehensive molecular profiling of lung adenocarcinoma. Nature. 2014;511: 543–550. 151 145. Collisson EA, Campbell JD, Brooks AN, Berger AH, Lee W, Chmielecki J, et al. Comprehensive molecular profiling of lung adenocarcinoma: The cancer genome atlas research network. Nature. 2014;511: 543–550. 146. Janku F, Wheler JJ, Westin SN, Moulder SL, Naing A, Tsimberidou AM, et al. PI3K/AKT/mTOR inhibitors in patients with breast and gynecologic malignancies harboring PIK3CA mutations. J Clin Oncol. 2012;30: 777–782. 147. Hoelder S, Clarke PA, Workman P. Discovery of small molecule cancer drugs: Successes, challenges and opportunities. Molecular Oncology. John Wiley and Sons Ltd; 2012. pp. 155–176. 148. Bonelli P, Tuccillo FM, Borrelli A, Schiattarella A, Buonaguro FM. CDK/CCN and CDKI alterations for cancer prognosis and therapeutic predictivity. Biomed Res Int. 2014;2014. 149. Seo M, Seo M, Goldschmidt-clermont PJ, West M. Of mice and men: Sparse statistical modelling in cardiovascular genomics. Ann Appl Stat. [cited 12 Feb 2018]. 150. Guo X, Ngo B, Modrek A, Lee W-H. Targeting Tumor Suppressor Networks for Cancer Therapeutics. Curr Drug Targets. 2014;15: 2–16. 151. Bray GA, Kim KK, Wilding JPH. Obesity: a chronic relapsing progressive disease process. A position statement of the World Obesity Federation. Obes Rev. 2017;18: 715– 723. 152. Loos RJF. Recent progress in the genetics of common obesity. Br J Clin Pharmacol. 2009;68: 811–829. 153. MO G. Genetics of obesity: what genetic association studies have taught us about the biology of obesity and its complications. lancet Diabetes Endocrinol. 2018;6: 223– 236. 154. Jiang L, Penney KL, Giovannucci E, Kraft P, Wilson KM. A genome-wide association study of energy intake and expenditure. PLoS One. 2018;13. 155. Romieu I, Dossus L, Barquera S, Blottière HM, Franks PW, Gunter M, et al. Energy balance and obesity: what are the main drivers? Cancer Causes Control. 2017;28: 247– 258. 156. Diels S, Berghe W Vanden, Hul W Van. Insights into the multifactorial causation of obesity by integrated genetic and epigenetic analysis. Obes Rev. 2020;21: e13019. 157. Hall KD, Heymsfield SB, Kemnitz JW, Klein S, Schoeller DA, Speakman JR. Energy balance and its components: implications for body weight regulation. Am J Clin Nutr. 2012;95: 989–994. 152 158. Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, et al. Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol. 2018;14: e8124. 159. Ubbens J, Cieslak M, Prusinkiewicz P, Parkin I, Ebersbach J, Stavness I. Latent space phenotyping: Automatic image-based phenotyping for treatment studies. Plant Phenomics. 2020;2020. 160. Joo J, Williamson SA, Vazquez AI, Fernandez JR, Bray MS. Advanced Dietary Patterns Analysis Using Sparse Latent Factor Models in Young Adults. J Nutr. 2018;148: 1984–1992. 161. Joo J, Williamson SA, Vazquez AI, Fernandez JR, Bray MS. The influence of 15- week exercise training on dietary patterns among young adults. Int J Obes. 2019; 1. 162. Xu SY, Nelson S, Kerr J, Godbole S, Johnson E, Patterson RE, et al. Modeling temporal variation in physical activity using functional principal components analysis. Stat Biosci. 2019;11: 403–421. 163. Johnson RK, Vanderlinden L, DeFelice BC, Kechris K, Uusitalo U, Fiehn O, et al. Metabolite-related dietary patterns and the development of islet autoimmunity. Sci Rep. 2019;9: 1–11. 164. Guénard F, Bouchard-Mercier A, Rudkowska I, Lemieux S, Couture P, Vohl M-C. Genome-Wide Association Study of Dietary Pattern Scores. Nutrients. 2017;9: 649. 165. Tang CS, Ferreira MAR. A gene-based test of association using canonical correlation analysis. Bioinformatics. 2012;28: 845–850. 166. Basu S, Zhang Y, Ray D, Miller MB, Iacono WG, McGue M. A rapid gene-based genome-wide association test with multivariate traits. Hum Hered. 2014;76: 53–63. 167. Aschard H, Vilhjálmsson BJ, Greliche N, Morange PE, Trégouët DA, Kraft P. Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. Am J Hum Genet. 2014;94: 662–676. 168. Lightfoot JT, Turner MJ, Daves M, Vordermark A, Kleeberger SR. Genetic influence on daily wheel running activity level. Physiol Genomics. 2004;19: 270–276. 169. Koteja P, Garland T, Sax JK, Swallow JG, Carter PA. Behaviour of house mice artificially selected for high levels of voluntary wheel running. Anim Behav. 1999;58: 1307–1318. 170. Lerman I, Harrison BC, Freeman K, Hewett TE, Allen DL, Robbins J, et al. Genetic variability in forced and voluntary endurance exercise performance in seven inbred mouse strains. J Appl Physiol. 2002;92: 2245–2255. 153 171. West DB, York B. Dietary fat, genetic predisposition, and obesity: lessons from animal models. Am J Clin Nutr. 1998;67: 505S-512S. 172. Ellacott KLJ, Morton GJ, Woods SC, Tso P, Schwartz MW. Assessment of feeding behavior in laboratory mice. Cell Metab. 2010;12: 10–17. 173. Barron R, Bermingham K, Brennan L, Gibney ER, Gibney MJ, Ryan MF, et al. Twin metabolomics: the key to unlocking complex phenotypes in nutrition research. Nutr Res. 2016;36: 291–304. 174. Rintala M, Lyytikäinen A, Leskinen T, Alen M, Pietiläinen KH, Kaprio J, et al. Leisure-time physical activity and nutrition: a twin study. Public Health Nutr. 2011;14: 846–852. 175. Beunen G, Thomis M. Genetic determinants of sports participation and daily physical activity. Int J Obes. 1999;23: S55–S63. 176. Maia JAR, Thomis M, Beunen G. Genetic factors in physical activity levels: a twin study. Am J Prev Med. 2002;23: 87–91. 177. Lightfoot JT. Current understanding of the genetic basis for physical activity. J Nutr. 2011;141: 526–30. 178. Herring MP, Sailors MH, Bray MS. Genetic factors in exercise adoption, adherence and obesity. Obes Rev. 2014;15: 29–39. 179. Kwon SM, Cho H, Choi JH, Jee BA, Jo Y, Woo HG. Perspectives of integrative cancer genomics in next generation sequencing era. Genomics Inform. 2012;10: 69–73. 180. Rankinen T, Rice T, Teran-Garcia M, Rao DC, Bouchard C. FTO Genotype Is Associated With Exercise Training–induced Changes in Body Composition. Obesity. 2010;18: 322–326. 181. Scott RA, Bailey MES, Moran CN, Wilson RH, Fuku N, Tanaka M, et al. FTO genotype and adiposity in children: Physical activity levels influence the effect of the risk genotype in adolescent males. Eur J Hum Genet. 2010;18: 1339–1343. 182. Park SL, Cheng I, Pendergrass SA, Kucharska-Newton AM, Lim U, Ambite JL, et al. Association of the FTO obesity risk variant rs8050136 with percentage of energy intake from fat in multiple racial/ethnic populations. Am J Epidemiol. 2013;178: 780–790. 183. Wardle J, Carnell S, Haworth CMA, Farooqi IS, O’Rahilly S, Plomin R. Obesity associated genetic variation in FTO is associated with diminished satiety. J Clin Endocrinol Metab. 2008;93: 3640–3643. 154 184. Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen W-M. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26: 2867–2873. 185. Bradbury KE, Young HJ, Guo W, Key TJ. Dietary assessment in UK Biobank: an evaluation of the performance of the touchscreen dietary questionnaire. J Nutr Sci. 2018;7: 1–11. 186. Sinnott-Armstrong N, Tanigawa Y, Amar D, Mars NJ, Aguirre M, Venkataraman GR, et al. Genetics of 38 blood and urine biomarkers in the UK Biobank. bioRxiv. bioRxiv; 2019. p. 660506. 187. Maddison R, Ni Mhurchu C, Jiang Y, Vander Hoorn S, Rodgers A, Lawes CMM, et al. International physical activity questionnaire (IPAQ) and New Zealand physical activity questionnaire (NZPAQ): A doubly labelled water validation. Int J Behav Nutr Phys Act. 2007;4: 62. 188. Grueneberg A, de los Campos G. BGData - A suite of R packages for genomic analysis with big data. G3 Genes, Genomes, Genet. 2019;9: 1377–1383. 189. Roux B Le, Rouanet H. Geometric data analysis: From correspondence analysis to structured data analysis. Geometric Data Analysis: From Correspondence Analysis to Structured Data Analysis. Springer Netherlands; 2005. 190. Gonzalez-Reymundez A, Grueneberg A, Vazquez AI. MOSS: Multi-Omic Integration via Sparse Singular Value Decomposition. CRAN R-project. 2021 [cited 19 Jan 2021]. 191. Audigier V, Husson F, Josse J. A principal components method to impute missing values for mixed data. Adv Data Anal Classif. 2013;10: 5–26. 192. Josse J, Husson F. missMDA: A package for handling missing values in multivariate data analysis. J Stat Softw. 2016;70: 1–31. 193. McCaw Z. RNOmni: Rank Normal Transformation Omnibus Test. CRAN R-project. 2020 [cited 1 Jul 2021]. 194. Aguate FM, Vazquez AI, Merriman TR, de los Campos G. Mapping pleiotropic loci using a fast-sequential testing algorithm. Eur J Hum Genet. 2021; 1–12. 195. Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, et al. BioMart and Bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics. 2005;21: 3439–3440. 155 196. Durinck S, Spellman P, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc. 2009;4: 1184–1191. 197. Machiela MJ, Chanock SJ. LDlink: A web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics. 2015;31: 3555–3557. 198. Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, et al. The Genotype- Tissue Expression (GTEx) project. Nature Genetics. NIH Public Access; 2013. pp. 580– 585. 199. McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv.org. 2018 [cited 14 Jun 2021]. 200. Tukey JW. Comparing Individual Means in the Analysis of Variance. Biometrics. 1949;5: 99. 201. Zou H, Hastie T, Tibshirani R. Sparse Principal Component Analysis. J Comput Graph Stat. 2006;15: 265–286. 202. Bradfield JP, Vogelezang S, Felix JF, Chesi A, Helgeland Ø, Horikoshi M, et al. A trans-ancestral meta-analysis of genome-wide association studies reveals loci associated with childhood obesity. Hum Mol Genet. 2019;28: 3327–3338. 203. Liu Y, Zhang X, Lee J, Smelser D, Cade B, Chen H, et al. Genome-wide association study of neck circumference identifies sex-specific loci independent of generalized adiposity. Int J Obes. 2021;45: 1532–1541. 204. Lind L. Genome-Wide Association Study of the Metabolic Syndrome in UK Biobank. Metab Syndr Relat Disord. 2019;17: 505–511. 205. Kichaev G, Bhatia G, Loh PR, Gazal S, Burch K, Freund MK, et al. Leveraging Polygenic Functional Enrichment to Improve GWAS Power. Am J Hum Genet. 2019;104: 65–75. 206. Graff M, Scott RA, Justice AE, Young KL, Feitosa MF, Barata L, et al. Genome- wide physical activity interactions in adiposity ― A meta-analysis of 200,452 adults. PLoS Genet. 2017;13: 130. 207. Berndt SI, Gustafsson S, Mägi R, Ganna A, Wheeler E, Feitosa MF, et al. Genome- wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nat Genet. 2013;45: 501–512. 208. Trousdale C, Kim K. Retromer: Structure, function, and roles in mammalian disease. European Journal of Cell Biology. Elsevier GmbH; 2015. pp. 513–521. 156 209. Bentzinger CF, Wang YX, Rudnicki MA. Building muscle: molecular regulation of myogenesis. Cold Spring Harbor perspectives in biology. Cold Spring Harbor Laboratory Press; 2012. 210. Newmire D, Willoughby DS. Wnt and β-Catenin Signaling and Skeletal Muscle Myogenesis in Response to Muscle Damage and Resistance Exercise and Training. Int J Kinesiol Sport Sci. 2015;3: 40–49. 211. Wu JHY, Lemaitre RN, Manichaikul A, Guan W, Tanaka T, Foy M, et al. Genome- wide association study identifies novel loci associated with concentrations of four plasma phospholipid fatty acids in the de novo lipogenesis pathway: Results from the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) consortium. Circ Cardiovasc Genet. 2013;6: 171–183. 212. Singla P. Metabolic effects of obesity: A review. World J Diabetes. 2010;1: 76. 213. Sherman WM. Metabolism of sugars and physical performance. Am J Clin Nutr. 1995;62: 228S-241S. 214. Nagel M, Watanabe K, Stringer S, Posthuma D, Van Der Sluis S. Item-level analyses reveal genetic heterogeneity in neuroticism. Nat Commun. 2018;9. 215. Davies G, Lam M, Harris SE, Trampush JW, Luciano M, Hill WD, et al. Study of 300,486 individuals identifies 148 independent genetic loci influencing general cognitive function. Nat Commun. 2018;9. 216. Hübel C, Gaspar HA, Coleman JRI, Finucane H, Purves KL, Hanscombe KB, et al. Genomics of body fat percentage may contribute to sex bias in anorexia nervosa. Am J Med Genet Part B Neuropsychiatr Genet. 2019;180: 428–438. 217. Hernandez Cordero AI, Gonzales NM, Parker CC, Sokolof G, Vandenbergh DJ, Cheng R, et al. Genome-wide Associations Reveal Human-Mouse Genetic Convergence and Modifiers of Myogenesis, CPNE1 and STC2. Am J Hum Genet. 2019;105: 1222– 1236. 218. Nanda V, Wang T, Pjanic M, Liu B, Nguyen T, Matic LP, et al. Functional regulatory mechanism of smooth muscle cell-restricted LMOD1 coronary artery disease locus. PLoS Genet. 2018;14: e1007755. 219. Ahmad RS, Imran A, Hussain MB. Nutritional Composition of Meat. Meat Science and Nutrition. InTech; 2018. 220. Goni L, Cuervo M, Milagro FI, Martínez JA. Future perspectives of personalized weight loss interventions based on nutrigenetic, epigenetic, and metagenomic data. J Nutr. 2016;146: 905S-912S. 157 221. Ramos-Lopez O, Milton-Laskibar I, Martínez JA. Precision nutrition based on phenotypical traits and the (epi)genotype: nutrigenetic and nutrigenomic approaches for obesity care. Curr Opin Clin Nutr Metab Care. 2021;24: 315–325. 222. Henseler J. On the convergence of the partial least squares path modeling algorithm. Comput Stat 2009 251. 2009;25: 107–120. 223. Tipping ME, Bishop CM. Probabilistic Principal Component Analysis. J R Stat Soc Ser B (Statistical Methodol. 1999;61: 611–622. 224. Tarantino G, Monica S, Bergenti F. A probabilistic matrix factorization algorithm for approximation of sparse matrices in natural language processing. ICT Express. 2018;4: 87–90. 225. Guan Y, Dy JG. Sparse probabilistic principal component analysis. J Mach Learn Res. 2009;5: 185–192. 226. Idris SF, Ahmad SS, Scott MA, Vassiliou GS, Hadfield J. The role of high- throughput technologies in clinical cancer genomics. http://dx.doi.org/101586/erm131. 2014;13: 167–181. 227. International Cancer Genome Consortium TICG, Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, et al. International network of cancer genome projects. Nature. 2010;464: 993–8. 228. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483: 603–7. 229. Heim D, Budczies J, Stenzinger A, Treue D, Hufnagl P, Denkert C, et al. Cancer beyond organ and tissue specificity: Next-generation-sequencing gene mutation data reveal complex genetic similarities across major cancers. Int J Cancer. 2014;135: 2362– 2369. 230. Yates III JR. A century of mass spectrometry: from atoms to proteomes. Nat Methods. 2011;8: 633–637. 231. Wu HM, Goate AM, O’Reilly PF. Heterogeneous effects of genetic risk for Alzheimer’s disease on the phenome. Transl Psychiatry 2021 111. 2021;11: 1–9. 232. Li X, Meng X, He Y, Spiliopoulou A, Timofeeva M, Wei W-Q, et al. Genetically determined serum urate levels and cardiovascular and other diseases in UK Biobank cohort: A phenome-wide mendelian randomization study. PLOS Med. 2019;16: e1002937. 158 233. Cole JB, Florez JC, Hirschhorn JN. Comprehensive genomic analysis of dietary habits in UK Biobank identifies hundreds of genetic associations. Nat Commun. 2020;11: 1467. 234. UK Biobank adds the first tranche of data from a study into circulating metabolomic biomarkers to its biomedical database. [cited 10 Aug 2021]. 235. Guasch-Ferré M, Bhupathiraju SN, Hu FB. Use of Metabolomics in Improving Assessment of Dietary Intake. Clin Chem. 2018;64: 82. 236. The era of massive cancer sequencing projects has reached a turning point. Nature. 2020;578: 7–8. 159