DATA-DRIVEN COMPUTATIONAL APPROACHES TO UNRAVEL AND INTERPRET
THE POLYGENIC ARCHITECTURE OF HUMAN COMPLEX TRAITS AND DISEASES

By

Alexander Patrick McKim

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Computational Mathematics, Science, and Engineering—Doctor of Philosophy
Genetics and Genome Sciences—Dual Major

2024

ABSTRACT

The genetics and mechanisms underlying most human traits and diseases can be very

complex, where the number of true gene associations can number in the many

hundreds. Thus, when trying to better understand a trait/disease, researchers are faced

with two main challenges; missing knowledge of which genes are truly related to the

trait/disease, and understanding how those genes work together through molecular

pathways. These knowledge gaps make it hard to translate large scale genetic

information into actionable hypotheses. The overarching goal of the research presented

in this dissertation is to develop methods that address these challenges in order to gain

a better understanding of the etiology of complex traits and diseases. We worked

towards this goal by developing general-purpose computational frameworks that

leverage vast publicly-available datasets — genome-scale gene networks, gene

functional annotations, thousands of gene expression signatures, and

experimentally-derived gene-phenotype associations in humans and model organisms

— to resolve large gene lists associated with highly polygenic disease into relevant

genes, pathways, and critical interactions. Together, these findings reveal nuanced

understanding of disease mechanisms. Overall, this research helps get away from

treating each disease as a single well-defined condition, and instead find

mechanism-based disease subtypes and use these insights to find novel diagnostic and

treatment avenues.

This dissertation is dedicated to my father, who encouraged me from the beginning until
the end.

iii

ACKNOWLEDGEMENTS

I would like to acknowledge the people that most helped me get this dissertation written

and completed. Dr. Christopher Mancuso, thank you for helping me not just with the

projects here, but keeping me sane and really pushing me forward to complete this. To

Keenan Manpearl, thank you for coming through for me while I was finishing this. This

thesis would not have been completed without you two. I would also like to thank Dr.

Kayla Johnson for talking to me about thesis worries and encouraging me that I am not

in a hopeless position. I also want to thank her for helping to teach me biology when I

was starting out and super interested in the subject, but knew nothing. I thank Dr.

Stephanie Hickey for being my co-first author on the published work in chapter 2, and

for also teaching so many cool things about biology. I thank Hao Yuan for being

amazing to talk to and for providing great feedback over the years on my work. I also

want to thank my committee members Dr. Wen Huang, Dr. Shin-han Shiu, and Dr.

Jianrong Wang for their feedback on the ideas I have presented and in discussing with

me development opportunities for my post-PhD life. I would also like to thank Dr. Janani

Ravi for introducing me to topics involving the intersection of human genetics and

infectious disease, and for all the conversations about how my work can contribute. Last

but not least, I want to thank my advisor, Dr. Arjun Krishnan, for taking a chance on a

student who knew nothing about biology to start with and has taught me so much.

iv

TABLE OF CONTENTS

CHAPTER 1: BACKGROUND, RESEARCH QUESTIONS, AND
JUSTIFICATION……….………………………………………………………. ……………….1
REFERENCES………………………………………………………………………... 14

CHAPTER 2: A NETWORK-BASED APPROACH FOR ISOLATING THE CHRONIC
INFLAMMATION GENE SIGNATURES UNDERLYING COMPLEX DISEASES
TOWARDS FINDING NEW TREATMENT OPPORTUNITIES……………………………22
REFERENCES………………………………………………………………………….46
APPENDIX A2: INFLAMMATION..……………………………………………………50

CHAPTER 3: MODGENEPLEXUS: A MODULAR NETWORK-BASED APPROACH
FOR GENE CLASSIFICATION IMPROVES POST-OMICS AND POST-GWAS
HYPOTHESIS GENERATION FOR DISEASE GENES AND
MECHANISMS…………………………………………………………………………………57
REFERENCES………………………………………………………………………..132
APPENDIX A3: MODGENEPLEXUS..….……………………………………….…136

CHAPTER 4: DISCOVERING CORE AND PERIPHERAL GENES USING A
NETWORK-BASED OMNIGENIC MODEL AND TRANSLATING FINDINGS ACROSS
SPECIES.. ……………………….……………………………………………………………140
REFERENCES .……………………………………………………………………...202
APPENDIX A4: CORE GENES……….……………….…………………... ………210

CHAPTER 5: SUMMARY, LIMITATIONS, REFLECTION, AND FUTURE
DIRECTIONS…………………...…………………………………………………………….220
REFERENCES .……………………………………………………………………...227

CHAPTER 6: INTEGRATING GENE PRIORITIZATION METHODS FOR SUMMARY
GWAS STATISTICS……………………………………………………………….………....228
REFERENCES……………………………………………………………………….232

v

CHAPTER 1: BACKGROUND, RESEARCH QUESTIONS, AND JUSTIFICATION

Background

Over the past 15 years, large-scale studies, including gene expression, genome-wide

association (GWAS), and long-term biobank research have shown that complex

diseases (e.g., type II diabetes and coronary artery disease) are linked to hundreds of

genes. Further, individuals with the same disease can exhibit a range of phenotypes.

However, we still lack a thorough understanding of how disease genes lead to

deregulated molecular pathways and cellular functions and which mechanisms underlie

specific disease phenotypes. Closing this knowledge gap is crucial because we need a

mechanism-based framework to understand how, despite having the same disease,

patients can have unique genetic mutations and how these mutations lead to different

functional and phenotypic disruptions that then lead to the disease. Consequently,

addressing this gap also has an impact on moving diagnoses and treatments from the

current "one size fits all" paradigm towards designing measures and interventions that

work for individual patients that may be different from the status quo based on

population-level understanding of disease. Therefore, there is a critical need for

developing general-purpose computational approaches that can analyze large,

heterogeneous data collections to connect genes, pathways, cell functions, phenotypes,

diseases, and drug candidates.

The Post-GWAS era

The last fifteen years have seen major advancements in the knowledge of the genetic

architecture underneath complex diseases, including genetic mutations, genes,

mechanisms, and clinical phenotypes. Examples include the growing catalog of millions

of genetic variants identified in GWAS experiments for thousands of complex diseases1,2

and the growing number of gene function annotations in tissue-specific networks3. A

significant challenge in the post-GWAS era is learning the mechanistic relationships that

underlie complex diseases and trait variation across individuals. This problem is due to

the extreme polygenicity4,5 behind complex traits that GWAS has revealed and the lack

1

Figure 1.1: An overview of the biological data and concepts used and discussed in this

dissertation. The explosion of various types of diverse -omics data, the new knowledge

gained and that addition to new genesets, and the ability to cheaply sequence

individuals and populations has created the foundation for creating computational

methods to unravel the biology of human complex traits and disease.

of functional understanding of the thousands of important single nucleotide

polymorphisms (SNPs) found by GWAS without additional biological annotation5. When

GWAS is conducted, techniques like fine mapping5,6 are often employed to predict

causal SNPs for further study. However, the omnigenic model, motivated by the extreme

polygenicity behind complex traits revealed through GWAS, for example challenges

many of the assumptions behind the usefulness of fine mapping variants alone7.

Isolating causal variants if there is truth behind all loci in the genome contributing to trait

manifestation at some level8 may be at odds with underlying biology. Even with

simplifying biological reality, there has been genuine progress computationally utilizing

new genomic data consortiums to elucidate our functional knowledge of many complex

diseases. An example of this is annotating significant SNPs from a GWAS to the closest

(or a very close) gene for downstream analysis9,10. While the integration of a variety of

different data sources, such as eQTL data in conducting TWAS or chromatin capture

data7,9,11,12 are vital tools for learning genes that are associated with a phenotype, it is

also true that SNPs very often have their causative gene or relatively close to its

physical location13,14 . Biswas and colleagues further argue that genes associated with

2

relevant disease pathways also are often near significant SNPs15. Thus, predicting

genes near the significant SNPs is often reasonable. Integrating more biological

information is not because gene prioritization metrics based on only proximity performs

poorly, but because using additional biological data is complementary to simpler gene

proximity methods. Integrating other biological data finds relationships would not have

the statistical power to find consistently16. The challenge is figuring out the best

biological information to help interpret results into something actually meaningful and

usable.

Post-GWAS methods

Important methods have been developed using GWAS data to relate common variant

data to other biological data types, aiming to unravel human genetics for disease and

design treatments17. Linking implicated loci from GWAS to important biological

pathways9,10,18 provides a more interpretable explanation behind disease phenotypes.

One approach to interpret GWAS results has been to utilize networks. Numerous

methods have been implemented to map implicated GWAS loci with gene-gene

networks to prioritize disease genes and pathways19–22. Discovered GWAS variants can

be linked to expression data from both bulk RNA and single-cell RNA studies23–27 to

refine predictions about important genes and discover what each data set alone cannot.

Polygenic risk scores —calculated using variant association data from GWAS— are

able to perform reasonably well for at least some diseases because of the ability to

predict disease risk of individual patients based on what variants they specifically

have28,28,29. Polygenic risk scores have powerful applications for precision medicine

when they can begin to work for patients of all backgrounds and risk factors30,31. In

summary, GWAS results are powerful not only to interpret disease at the variant level

but for translating genome scale results to the gene and pathway level to study

mechanisms behind human disease and traits. While these diverse findings and

methods using many different models have opened the door for many avenues of study,

there are still key questions about polygenic diseases that must be investigated.

Disease Heterogeneity

A complex disease is not a single well defined condition, and the more complex it is,

the less defined it will be. This is because each disease manifests uniquely in each

3

individual, leading to extreme heterogeneity of the disease across the population.

Further, translating quantitative models for disease risk across individuals becomes

even harder when the discovery and target individuals have different genetic ancestries,

i.e. low cross-ethnic portability29. An example of this heterogeneity can be seen with

autism spectrum disorder (ASD), which has extreme patient stratification and subtypes.

Many of these differences are explained because of a different genetic makeup across

individuals32. ASD has hundreds of genes associated with it33, but individuals do not

have mutations that target each of those hundreds of associated genes. Rather, a

patient has mutations in a subset of genes that lead to a general diagnosis of autism,

but in actuality manifests as a specific subtype of autism within the spectrum and within

that patient. The genetic variation across patients in understanding disease mechanism

and treating individuals. Pharmaceutical companies recognize the heterogeneity that

can be seen in a complex disease, and are undertaking designing drug treatments for

sub-groups of patients34. This can be seen in defining subtypes of diseases as

endotypes. While what makes a good definition for an endotype is still controversial35,

the motivation to classify patients with the same disease into different subtypes is in

order to facilitate better treatment in groups of people with different types of gene and

mechanistic disruptions.

Cross-Disease relationships

It is clear that complex diseases do not operate in isolation from one another. Cross trait

work is motivated by finding similar genetic architectures between traits or diseases,

which potentially leads to similar treatments36. There has been great success in

demonstrating this shared architecture across many complex traits and diseases36–38.

However, relationships between complex diseases are often not obvious. Seemingly

distinct or unrelated complex diseases often manifest together more often than

expected by chance, such as inflammatory bowel syndrome and cardiovascular

disease39. Another type of comorbidity involves phenotypes that are either symptoms of

the disease or are common traits that are involved in cellular pathways relating to a

complex disease, known as endophenotypes40. Endophenotypes are typically complex,

having multiple relevant genes or cellular mechanisms, but are less polygenic compared

to the relevant complex disease. They often can be better annotated in terms of

4

underlying cellular mechanisms41. Endophenotypes can be shared across distinct

complex diseases. For example chronic inflammation, a commonly observed disease

phenotype, is involved with autoimmune disorders, type II diabetes, coronary artery

disease, and various cancers42. Additionally, while an endophenotype should share

genetic variance with the disease of interest41, it can be difficult to define if an

endophenotype is a symptom or a cause of a disease. Despite this, investigating

underlying phenotypes can lead to developing treatments that work commonly across

diseases.

Context-Specificity

It is crucial to find and annotate disease subtypes and underlying mechanisms to

relevant tissue and cell types. This is because proteins and processes are important in

specific contexts43,44. Understanding relevant contexts of complex diseases gives insight

to the disruptions that lead to disease manifestation, which then helps identify treatment

opportunities for a larger percentage of patients. Zhu et. al. point out that genetic

variants influence the phenotypes that underlie complex diseases in a context-specific

manner45. For an individual disease, associated variants do not necessarily influence

the same context – implying that complex diseases can have multiple relevant tissue

and cell types. This means that even if you know a priori tissues and cells that variants

work in for a disease, newly discovered variants do not necessarily influence the same

contexts. Thus, improving treatment opportunities requires researchers to understand

not only what contexts a disease operates in, but context-specific functions.

Genotypic vs phenotypic first approaches

A common goal is to gain better functional understanding of complex traits through

deconvolution of heterogeneity. However, methodologies for understanding the

functional makeup of complex diseases and the genotype-phenotype relationships can

differ46. With the advent of high-throughput sequencing, the genotype-first approach has

allowed researchers to use less functionally interpretable biological information like

variants to both define diseases and to learn about the functional makeup of a disease.

GWAS is an example of a phenotype-first approach, where the study is conducted with

patients chosen on observed disease status or some measured quantitative

phenotype5. After the case/control groups are determined, associated variants are

5

identified and annotated through various methods such as mendelian randomization47

and colocalization48 to more biologically interpretable units like pathways. Conversely, a

genetics-first approach is PHeWAS49,50, which selects SNPs of interest and then finds

associated phenotypes. Both approaches have advanced the understanding of complex

diseases, including genetic annotations and pathways that underlie them, but additional

analysis has to be done to predict effective treatments on an individual level.

Network biology

The vast complexity of biological entities and systems has led to the development of

biological networks and graphs. These networks are used to capture relationships

between entities – such as genes or proteins. Various networks, including

protein-protein interaction networks51,52, coexpression networks53,54, and gene regulatory

networks55, serve this purpose. Network biology aims to understand genetics by utilizing

biological interactions in a multi-dimensional framework. Interactions/connections

(edges) between genes or proteins, or other entities represent functional association or

relationship56,57. Networks are vital for interpreting complex disease biology because

highly connected genes or proteins likely participate in similar biological mechanisms

due to these functional relationships58. Networks are also diverse, such as being able to

represent specific biological contexts. For example, protein-protein interaction networks

differ across biological contexts because proteins interact differently in different cells.

One specific use case is a study in which networks were created representing specific

cellular and tissue contexts, and this has shed light on implicated disease-associated

processes and mechanisms59,60 that a context-naive network could not. Another useful

property is that biological networks are modular in nature61, where genes within a

module are more densely connected with one another than other genes in the network.

Complex trait/disease genes often map to network neighborhoods because

disease-gene associations are not randomly scattered62. This occurs because disease

genes are known to work together in disease-relevant processes and pathways. The

discovered modules/clusters of interacting genes can implicate important phenotypes

involved with the disease. For complex diseases involving hundreds of genes, multiple

distinct modules are likely to be found, each enriched for distinct biological pathways

and phenotypes that contribute to the disease63–65. Ultimately, the utility of networks is to

6

provide context to gene lists retrieved from experiments or some dataset, showing how

the important genes discovered interact with one another in high-dimensional space.

Definitions of Biological Network Modules

As previously cited, biological networks tend to be modular in nature56,63. Researchers

can discover how their genes of interest fall into modules within a network, and multiple

distinct but related types of modules can be uncovered. Topological modules are locally

dense neighborhoods in the network, found using only the network structure. This

network is created through biological data, but additional outside information is not

being used to influence module discovery. Functional modules consist of nodes that

work within some specific biological concept, typically defined by user-provided external

data, that are in a network neighborhood. Disease modules correspond to genes

relevant to disease phenotype manifestation - i.e all disease associated genes. These

concepts are interconnected, as functional gene modules will have relationship to

topological modules in a network, and these will be relevant in disease modules the

genes are annotated to. Given the high-dimensional nature of complex diseases and

biology, there is no gold standard of “true” modules63. Disease modules are particularly

challenging to discover due to network and data limitations. An important observation is

that within a truly complete disease module including every relevant disease gene, this

module would contain multiple submodules that relate to distinct functional processes,

where submodule genes are more interconnected in the network relative to other

disease genes. Therefore, many module detection methods discover topological

modules within a network and use user-defined genes to identify functionally related

topological modules63,66,67. An additional strategy is to define an initial disease module

as the user seed list, and expand utilizing network properties62. In short, disease

modules are those made of all disease genes - and this disease module will contain

submodules that are related to distinct functions, phenotypes, cell types, tissues, and

other biological concepts. It is clear that network modules and gene relationships are

crucial for understanding complex disease biology. Surprisingly however, they are even

relevant for mendelian and rare disease. While mendelian diseases are typically defined

as being caused by single points of mutation, it has been observed that some specific

mendelian diseases such as sickle cell disease are heterogeneous in the population

7

and associated with multiple phenotypes56,68. This suggests that even mendelian

diseases potentially have a “true disease module” of genes related to functional

processes disrupted by network effects of gene perturbations. This elucidates the

challenge of developing effective treatment for many diseases and the importance of

networks in biologically interpreting the unique mutations that lead to phenotype

heterogeneity.

Network-based gene classification

Gene classification is the task of computationally predicting the association of genes to

biological genesets, such as traits, processes, pathways, or diseases. These

computational predictions are possible because of the emergence of large databases of

publicly available data relating to gene function69,70, and the integration of this functional

information into genome wide networks43,52,71,72. This task is vital for two primary

reasons: 1) Complex diseases can be highly polygenic on the order of hundreds of loci,

variants, and genes8,73. Our known gene associations for very complex diseases are

incomplete, with most genetic heritability of complex disease still being unexplained73–75.

2) Many genes are understudied in experimental settings76–80. Additionally, even if all

genes were known, gene classification has utility in finding the genes most functionally

related to experimental results that may be missing due to lack of power, noise, or

immeasurability. Biological networks address these issues by providing functional

context to genes of interest. Using a guilt-by-association approach20,58, if a gene has

ample edge connections to known positive genes, that gene is likely associated with the

biological geneset due to these network relationships. This principle is based on the

observation that genes in biological networks highly connected to one another have

functional relationships and participate in the same or similar higher level biological

concepts such as processes and phenotypes81. Using functional networks for gene

classification assumes that discovering genes that work together in similar contexts or

have similar function are likely to also be disease associated. Multiple studies and

experiments have helped validate this assumption and the validity of this

approach3,33,82–86.

GenePlexus

GenePlexus is a supervised learning approach for network-based gene

8

classification81,82,87,88. It has been robustly demonstrated that supervised learning

outperforms label propagation for gene classification81. Given a list of genes,

GenePlexus predicts the association of all genes in a genome-wide network to that list

based on the list’s genes connectivity to other genes in the network. The model is a

L2-regularized logistic regression classifier that distinguishes between positive and

negative labels. The user-supplied genelist are used as positive labels. Negative labels

are chosen through a hypergeometric test – where in taking the positive labels, a

hypergeometric test is conducted with all gene sets in gene set collections (GSCs) such

as DisGeNet89,90 and GO70. If the positive genelist has significant overlap (p < .05) with a

geneset, the genes in that set which are not part of the initial user geneset are defined

as neutral. All other genes are given negative labels. Multiple gene-level feature vectors

can be used. One feature type being the rows of the adjacency matrix, where each row

corresponds to a gene. The rows of the node embedding matrix determined from

node2vec91,92 is another example. GenePlexus offers two major benefits: prioritizing

genes that are good candidates for experimental study by investigating the top-ranked

genes, and giving each gene in the genome a prediction of how related that gene is to

the user geneset.

Omnigenic model

Fully explaining disease mechanisms is challenging for complex diseases due to their

extremely polygenic nature. The omnigenic model8,93,94 is motivated by the particular

challenges revealed in GWAS studies, where most statistically significant loci have a

small effect size, and these small effect size SNPs explain most of the genetic

heritability. In addition, GWAS results are highly dependent on sample size. As

experiments get bigger, the number of small effect size loci discovered increases, while

large effect size loci actually have been found to decrease with larger sample sizes14.

The question raised here is what these observations imply about the genetic

architecture that underlies complex traits, and this motivates the omnigenic model –

viewing diseases as networks. Specifically, the model proposes there are a relatively

small number of “core” disease genes with direct, mechanistically interpretable genes

that are influenced by perturbations in a much larger number of “peripheral” disease

genes. From a GWAS perspective, a multitude of low effect size loci would be targeting

9

peripheral genes and the relatively high effect size loci would be enriched with core

gene relationships14. A challenge associated with this model is delivering refined,

objective definitions of what are truly core and peripheral genes. There is no gold

standard list where the genes of the disease are categorized and validated in this way.

Some distinctions that have some agreement are that core genes will be those which

are more relevant for designing treatments – and will be enriched for drug targets95.

Additionally they will be thought of as typically being more conserved96. Multiple

noticeable and novel attempts to define candidate core genes have been created95,97,98.

However, methods to either “discover” core genes or investigate biological statistics

such as conservation typically utilize small numbers of validated important genes or

pathways – and neglect the modular nature of diseases within a network – where

multiple distinct modules enriched for distinct phenotypes and pathways are discovered

when mapping disease genes. In other words, they assume important known genes are

core and evaluate them in the context of other data99. Because the definition of core

genes includes the set being “small”, it is tempting to think of only a singular pathway

being of note. Given the number of disease genes and the modular nature of both

networks and disease genes when mapped to networks, it is unlikely a singular set of

core genes would have direct effects on all of these diverse phenotypes. This suggests

the classical definition and way of thinking of core genes actually underestimates the

genetic complexity and heterogeneity of disease.

Research Questions

The explosion of -omics data for complex traits and diseases has created enormous

opportunities for fueling new computational models. In ideal circumstances,

computational models can both interpret lab experiment results and guide researchers

in planning new ones. A primary challenge of working with human genetic data is that

for some complex diseases and traits, the polygenic genetic architecture is enormously

complex on a large multi-dimensional scale of hundreds or thousands of loci.

Developing methods that manage and leverage that complexity is a vital challenge that

must be addressed. Nearly every patient with a disease has unique mutations – where

the loci do not interact independently but influence and interact with one another.

Unraveling that complexity is what will allow precision medicine to become a truly viable

10

way to treat patients, where the patient with completely unique, unseen mutations can

be stratified to a specific disease subtype100. Two outstanding, essential questions are

thus: How do we implement computational methods for seemingly impossible-to-unravel

genetic complexity? and How do we interpret diseases at the gene and pathway level?

Our proposal is that complex traits are best interpreted as smaller, meaningful

subnetworks of genes. This is true across the large number of complex traits that exist

out there, and can be leveraged to build general-purpose methods that work across

many datasets. Working with these subnetworks is what will provide viable and better

answers to a wide range of computational problems from gene classification and

prioritization, to discovering relevant pathways, and discovering which disease genes

are likely to have important and notable perturbation effects.

Dissertation Contributions and Significance

Firstly, we show a use case of the sheer complexity that human diseases can have.

Inflammation is a common pathway implicated in many diseases. While some diseases

are defined by inflammation pathways, such as autoimmune diseases, many complex

diseases have inflammation components in a way that is less obvious genetically but

has been phenotypically observed. Some common examples of this include alzheimer's

disease, coronary heart disease, and endometriosis. We investigate whether networks

and modules can unravel the inflammation components of these non-autoimmune

complex diseases, and define the specific inflammation pathways involved. Additionally,

we predict approved drugs in autoimmune diseases that could be repurposed to target

discovered inflammation components in complex disease. This project demonstrates

the utility of networks in discovering meaningful subsets of genes to unravel specific

pathways and phenotypes of note within highly complex human disease. We show that

networks are a useful tool for making novel discoveries at a refined level – enabling

researchers to investigate specific processes in disease subnetworks rather than the

disease as a whole.

Secondly, we address the challenges of interpreting biology, namely the discovery of

novel genes, from large gene sets. Real world experiments – such as differential

expression – of complex disease data not only have hundreds of relevant gene results

meeting significance thresholds, but also have noise and false positive results. To

11

address these issues, we implemented ModGenePlexus, which is an extension of the

GenePlexus method, to discover modules of a disease for gene discovery experimental

data. We leverage discovering functional modules of diseases to not only find

meaningful biological subsets of genes, but to additionally perform a form of label

propagation - a robust, well validated semi-supervised form of gene classification. This

removes genes that aren’t well connected in a network – false positives – and finds

genes that have highly robust connections – false negatives. These semi-supervised

results are then run with GenePlexus where a supervised learning model is created for

each discovered module. We demonstrate that performing gene classification in this

way is superior to running GenePlexus on the initial experimental results as a whole, an

uncovers unique biology of diseases that is missed when considering the entire gene

list as a singular unit. Our method removes poorly connected genes, and discovers

genes that were not found in the experiment, performing geneset refinement for

downstream analysis.

Thirdly, we use ModGenePlexus to predict candidate core and peripheral genes by

finding and defining a proposed disease module using GWAS data. Assuming this

module has every gene that could ever possibly be associated with the disease in

question, from an omnigenic perspective the likely ‘core’ genes are those which have a

large number of network connections to other disease genes within the module. This is

because disease core genes are influenced by a much larger number of disease

peripheral genes – whose main reason for disease association is through network

connection to the core genes. This definition allows us to categorize both core and

peripheral genes within the proposed disease module, and we demonstrate that we

predict meaningful core genes for atrial fibrillation. Similarly, our method allows us to

make predictions for important genes of other species utilizing network connections in a

multi-species network framework. Starting with a human disease gene list, we are able

to find the genes of other organisms that are highly connected to the human disease

gene list. These model organism genes and their network connections are used to shed

light on how the human core gene orthologs transfer into the model organism space,

which has major implications for designing experiments and predicting genes with

functional relevance across species.

12

Across all these projects, we have either released or plan to release code to both

reproduce our methods and expand on them. Code for chapter 2 is already publicly

available on github repository (https://github.com/krishnanlab/chronic-inflammation) and

in Zenodo record (https://zenodo.org/record/6858073). Chapter 2 has additionally been

published100. ModGenePlexus will be integrated into the publicly released software

package PyGenePlexus88 python package and into the GenePlexus webserver87,

enabling researchers to obtain top gene hits on a gene module basis - benefiting

common and existing computational pipelines.

Dissertation Structure and Research Summary

The rest of this dissertation is organized as follows: Chapter 2 demonstrates utilizing a

computational approach that integrates networks, large complex disease associations,

and drug-target information to isolate aspects of diseases that correspond to chronic

inflammation phenotypes and genes. We integrate a drug prioritization method with our

module predictions to discover inflammation related gene targets of these diseases.

Chapter 3 describes an innovation to the GenePlexus method, allowing GenePlexus to

be used with large scale experimental -omics data. We demonstrate that our new

method systematically outperforms original GenePlexus performance for real world

experiments by using semi-supervised classification to discover gene modules for

supervised learning and that unique, additional biology is uncovered when using this

new method. Chapter 4 dives into interpreting complex diseases utilizing the omnigenic

model by predicting and categorizing core genes within the discovered disease module

using GWAS experimental data. We demonstrate the functional relevance of the

predicted core genes and elucidate how they relate to other biological data. Additionally

we demonstrate the utility of discovering cross species phenotypes through

demonstrating how they relate to mechanistically relevant human genes. Chapter 5

discusses the broader impact, some noticeable limitations that we designed our

methods to work around, and proposes future directions possible based on the results

of this dissertation. Lastly, Chapter 6 gives an overview of some gene prioritization

methods that can be used with GWAS summary statistics data, and goes over a project

that motivated the biological concepts and some methods discussed in this dissertation.

13

REFERENCES

1. Caliskan, M., Brown, C. D. & Maranville, J. C. A catalog of GWAS fine-mapping
efforts in autoimmune disease. Am. J. Hum. Genet. 108, 549–563 (2021).

2. Sudlow, C. et al. UK Biobank: An Open Access Resource for Identifying the Causes
of a Wide Range of Complex Diseases of Middle and Old Age. PLOS Med. 12,
(2015).

3. Greene, C. S. et al. Understanding multicellular function and disease with human

tissue-specific networks. Nat. Genet. 47, 569–576 (2015).

4. Visscher, P. M. et al. 10 Years of GWAS Discovery: Biology, Function, and

Translation. Am. J. Hum. Genet. 101, 5–22 (2017).

5. Uffelmann, E. et al. Genome-wide association studies. Nat. Rev. Methods Primer 1,

1–21 (2021).

6. Genetic Investigation of ANthropometric Traits (GIANT) Consortium et al.

Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies
additional variants influencing complex traits. Nat. Genet. 44, 369–375 (2012).

7. Broekema, R. V., Bakker, O. B. & Jonkers, I. H. A practical view of fine-mapping and

gene prioritization in the post-genome-wide association era. Open Biol. 10, 190221
(2020).

8. Boyle, E. A., Li, Y. I. & Pritchard, J. K. An Expanded View of Complex Traits: From

Polygenic to Omnigenic. Cell 169, 1177–1186 (2017).

9. Leeuw, C. A. de, Mooij, J. M., Heskes, T. & Posthuma, D. MAGMA: Generalized
Gene-Set Analysis of GWAS Data. PLOS Comput. Biol. 11, e1004219 (2015).

10. Lamparter, D., Marbach, D., Rueedi, R., Kutalik, Z. & Bergmann, S. Fast and

Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary
Statistics. PLOS Comput. Biol. 12, e1004714 (2016).

11. Watanabe, K., Taskesen, E., van Bochoven, A. & Posthuma, D. Functional mapping
and annotation of genetic associations with FUMA. Nat. Commun. 8, 1826 (2017).

12. GTEx Consortium et al. A gene-based association method for mapping traits using

reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).

13. Brodie, A., Azaria, J. R. & Ofran, Y. How far from the SNP may the causative genes

be? Nucleic Acids Res. 44, 6046–6054 (2016).

14. Sinnott-Armstrong, N., Naqvi, S., Rivas, M. & Pritchard, J. K. GWAS of three

molecular traits highlights core genes and pathways alongside a highly polygenic
background. eLife 10, e58615 (2021).

15. Biswas, S., Pal, S., Majumder, P. P. & Bhattacharjee, S. A framework for pathway
knowledge driven prioritization in genome‐wide association studies. Genet.

14

Epidemiol. 44, 841–853 (2020).

16. Gerring, Z. F., Mina-Vargas, A., Gamazon, E. R. & Derks, E. M. E-MAGMA: an

eQTL-informed method to identify risk genes using genome-wide association study
summary statistics. Bioinformatics 37, 2245–2249 (2021).

17. Adam, Y., Samtal, C., Brandenburg, J.-T., Falola, O. & Adebiyi, E. Performing
post-genome-wide association study analysis: overview, challenges and
recommendations. F1000Research 10, 1002 (2021).

18. Salvi, R., Gawde, U., Idicula-Thomas, S. & Biswas, B. Pathway Analysis of Genome
Wide Association Studies (GWAS) Data Associated with Male Infertility. Reprod.
Med. 3, 235–245 (2022).

19. Kim, S. S. et al. Genes with High Network Connectivity Are Enriched for Disease

Heritability. Am. J. Hum. Genet. 104, 896–913 (2019).

20. Lee, I., Blom, U. M., Wang, P. I., Shim, J. E. & Marcotte, E. M. Prioritizing candidate

disease genes by network-based boosting of genome-wide association data.
Genome Res. 21, 1109–1121 (2011).

21. Vanunu, O., Magger, O., Ruppin, E., Shlomi, T. & Sharan, R. Associating Genes and
Protein Complexes with Disease via Network Propagation. PLoS Comput. Biol. 6,
e1000641 (2010).

22. Taşan, M. et al. Selecting causal genes from genome-wide association studies via

functionally coherent subnetworks. Nat. Methods 12, 154–159 (2015).

23. De Vries, D. H. et al. Integrating GWAS with bulk and single-cell RNA-sequencing
reveals a role for LY86 in the anti-Candida host response. PLOS Pathog. 16,
e1008408 (2020).

24. Jagadeesh, K. A. et al. Identifying disease-critical cell types and cellular processes
by integrating single-cell RNA-sequencing and human genetics. Nat. Genet. 54,
1479–1492 (2022).

25. Boltz, T. et al. Cell type deconvolution of bulk blood RNA-Seq to reveal biological

insights of neuropsychiatric disorders. BioRxiv Prepr. Serv. Biol. 2023.05.24.542156
(2023) doi:10.1101/2023.05.24.542156.

26. Mai, J., Lu, M., Gao, Q., Zeng, J. & Xiao, J. Transcriptome-wide association studies:

recent advances in methods, applications and available databases. Commun. Biol.
6, 899 (2023).

27. Zhou, Y. et al. Integrating RNA-Seq With GWAS Reveals a Novel SNP in

Immune-Related HLA-DQB1 Gene Associated With Occupational Pulmonary
Fibrosis Risk: A Multi-Stage Study. Front. Immunol. 12, 796932 (2022).

28. Collister, J. A., Liu, X. & Clifton, L. Calculating Polygenic Risk Scores (PRS) in UK
Biobank: A Practical Guide for Epidemiologists. Front. Genet. 13, 818574 (2022).

15

29. Hu, X. et al. Polygenic transcriptome risk scores for COPD and lung function

improve cross-ethnic portability of prediction in the NHLBI TOPMed program. Am. J.
Hum. Genet. S0002929722001033 (2022) doi:10.1016/j.ajhg.2022.03.007.

30. Richardson, T. G., O’Nunain, K., Relton, C. L. & Davey Smith, G. Harnessing Whole
Genome Polygenic Risk Scores to Stratify Individuals Based on Cardiometabolic
Risk Factors and Biomarkers at Age 10 in the Lifecourse—Brief Report. Arterioscler.
Thromb. Vasc. Biol. 42, 362–365 (2022).

31. Patel, A. P. et al. A multi-ancestry polygenic risk score improves risk prediction for

coronary artery disease. Nat. Med. 29, 1793–1803 (2023).

32. Sandin, S. et al. The Heritability of Autism Spectrum Disorder. JAMA 318, 1182

(2017).

33. Krishnan, A. et al. Genome-wide prediction and functional characterization of the
genetic basis of autism spectrum disorder. Nat. Neurosci. 19, 1454–1462 (2016).

34. Hartl, D. et al. Translational precision medicine: an industry perspective. J. Transl.

Med. 19, 245 (2021).

35. Genkel, V. V. & Shaposhnik, I. I. Conceptualization of Heterogeneity of Chronic

Diseases and Atherosclerosis as a Pathway to Precision Medicine: Endophenotype,
Endotype, and Residual Cardiovascular Risk. Int. J. Chronic Dis. 2020, 5950813
(2020).

36. Caberlotto, L. et al. Cross-disease analysis of Alzheimer’s disease and type-2
Diabetes highlights the role of autophagy in the pathophysiology of two highly
comorbid diseases. Sci. Rep. 9, 3965 (2019).

37. Pividori, M. et al. PhenomeXcan: Mapping the genome to the phenome through the

transcriptome. Sci. Adv. 6, eaba2083 (2020).

38. Guo, M. et al. Analysis of disease comorbidity patterns in a large-scale China

population. BMC Med. Genomics 12, 177 (2019).

39. Román, A. L. S. Comorbidity in inflammatory bowel disease. World J. Gastroenterol.

17, 2723 (2011).

40. Kendler, K. S. & Neale, M. C. Endophenotype: a conceptual analysis. Mol.

Psychiatry 15, 789–797 (2010).

41. Iacono, W. G., Malone, S. M. & Vrieze, S. I. Endophenotype best practices. Int. J.

Psychophysiol. 111, 115–144 (2017).

42. Furman, D. et al. Chronic inflammation in the etiology of disease across the life

span. Nat. Med. 25, 1822–1832 (2019).

43. Wong, A. K., Krishnan, A. & Troyanskaya, O. G. GIANT 2.0: genome-scale

integrated analysis of gene networks in tissues. Nucleic Acids Res. 46, W65–W70

16

(2018).

44. Guan, Y. et al. Tissue-Specific Functional Networks for Prioritizing Phenotype and

Disease Genes. PLoS Comput. Biol. 8, e1002694 (2012).

45. Zhu, H., Shang, L. & Zhou, X. A Review of Statistical Methods for Identifying

Trait-Relevant Tissues and Cell Types. Front. Genet. 11, (2021).

46. Kohane, I. S. Finding a new balance between a genetics-first or phenotype-first

approach to the study of disease. Neuron 109, 2216–2219 (2021).

47. Benn, M. & Nordestgaard, B. G. From genome-wide association studies to

Mendelian randomization: novel opportunities for understanding cardiovascular
disease causality, pathogenesis, prevention, and treatment. Cardiovasc. Res. (2018)
doi:10.1093/cvr/cvy045.

48. Kanduri, C., Sandve, G. K., Hovig, E., De, S. & Layer, R. M. Editorial: Genomic
Colocalization and Enrichment Analyses. Front. Genet. 11, 617876 (2021).

49. Denny, J. C. et al. PheWAS: demonstrating the feasibility of a phenome-wide scan
to discover gene-disease associations. Bioinformatics 26, 1205–1210 (2010).

50. Legault, M.-A., Perreault, L.-P. L. & Dubé, M.-P. ExPheWas: A Browser for

Gene-Based pheWAS Associations.
http://medrxiv.org/lookup/doi/10.1101/2021.03.17.21253824 (2021)
doi:10.1101/2021.03.17.21253824.

51. Safari-Alighiarloo, N., Taghizadeh, M., Rezaei-Tavirani, M., Goliaei, B. & Peyvandi,

A. A. Protein-protein interaction networks (PPI) and complex diseases.
Gastroenterol. Hepatol. Bed Bench 7, 17–31 (2014).

52. Szklarczyk, D. et al. STRING v11: protein–protein association networks with

increased coverage, supporting functional discovery in genome-wide experimental
datasets. Nucleic Acids Res. 47, D607–D613 (2019).

53. Ovens, K., Eames, B. F. & McQuillan, I. Comparative Analyses of Gene

Co-expression Networks: Implementations and Applications in the Study of
Evolution. Front. Genet. 12, 695399 (2021).

54. Russell, M., Aqi, A., Saitou, M., Gokcumen, O. & Masuda, N. Gene communities in
co-expression networks across different tissues. ArXiv arXiv:2305.12963v2 (2023).

55. Emmert-Streib, F., Dehmer, M. & Haibe-Kains, B. Gene regulatory networks and
their applications: understanding biological and medical problems in terms of
networks. Front. Cell Dev. Biol. 2, (2014).

56. Barabási, A.-L., Gulbahce, N. & Loscalzo, J. Network medicine: a network-based

approach to human disease. Nat. Rev. Genet. 12, 56–68 (2011).

57. Vidal, M., Cusick, M. E. & Barabási, A.-L. Interactome Networks and Human

17

Disease. Cell 144, 986–998 (2011).

58. Wang, X., Gulbahce, N. & Yu, H. Network-based methods for human disease gene

prediction. Brief. Funct. Genomics 10, 280–293 (2011).

59. Cvijovic, M. & Polster, A. Network medicine: facilitating a new view on complex

diseases. Front. Bioinforma. 3, 1163445 (2023).

60. Krishnan, A., Taroni, J. N. & Greene, C. S. Integrative Networks Illuminate Biological
Factors Underlying Gene–Disease Associations. Curr. Genet. Med. Rep. 4, 155–162
(2016).

61. Alcalá-Corona, S. A., Sandoval-Motta, S., Espinal-Enríquez, J. & Hernández-Lemus,

E. Modularity in Biological Networks. Front. Genet. 12, 701331 (2021).

62. Ghiassian, S. D., Menche, J. & Barabási, A.-L. A DIseAse MOdule Detection

(DIAMOnD) Algorithm Derived from a Systematic Analysis of Connectivity Patterns
of Disease Proteins in the Human Interactome. PLOS Comput. Biol. 11, e1004120
(2015).

63. The DREAM Module Identification Challenge Consortium et al. Assessment of

network module identification across complex diseases. Nat. Methods 16, 843–852
(2019).

64. Chen, Y. et al. Variations in DNA elucidate molecular networks that cause disease.

Nature 452, 429–435 (2008).

65. Pe’er, D., Regev, A., Elidan, G. & Friedman, N. Inferring subnetworks from perturbed

expression profiles. Bioinformatics 17, S215–S224 (2001).

66. Levi, H., Elkon, R. & Shamir, R. DOMINO: a network‐based active module

identification algorithm with reduced rate of false calls. Mol. Syst. Biol. 17, e9593
(2021).

67. Levi, H., Elkon, R. & Shamir, R. DOMINO: A Novel Algorithm for Network-Based

Identification of Active Modules with Reduced Rate of False Calls.
http://biorxiv.org/lookup/doi/10.1101/2020.03.10.984963 (2020)
doi:10.1101/2020.03.10.984963.

68. Kato, G. J. et al. Sickle cell disease. Nat. Rev. Dis. Primer 4, 18010 (2018).

69. Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new

perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45,
D353–D361 (2017).

70. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet.

25, 25–29 (2000).

71. Oughtred, R. et al. The BioGRID database: A comprehensive biomedical resource

of curated protein, genetic, and chemical interactions. Protein Sci. Publ. Protein Soc.

18

30, 187–200 (2021).

72. Szklarczyk, D. et al. The STRING database in 2021: customizable protein–protein

networks, and functional characterization of user-uploaded gene/measurement sets.
Nucleic Acids Res. 49, D605–D612 (2021).

73. Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five Years of GWAS

Discovery. Am. J. Hum. Genet. 90, 7–24 (2012).

74. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461,

747–753 (2009).

75. Tam, V. et al. Benefits and limitations of genome-wide association studies. Nat. Rev.

Genet. 20, 467–484 (2019).

76. Dolgin, E. The most popular genes in the human genome. Nature 551, 427–431

(2017).

77. Dunham, I. Human genes: Time to follow the roads less traveled? PLOS Biol. 16,

e3000034 (2018).

78. Edwards, A. M. et al. Too many roads not taken. Nature 470, 163–165 (2011).

79. Hoffmann, R. & Valencia, A. Life cycles of successful genes. Trends Genet. 19,

79–81 (2003).

80. Haynes, W. A., Tomczak, A. & Khatri, P. Gene annotation bias impedes biomedical

research. Sci. Rep. 8, 1362 (2018).

81. Liu, R., Mancuso, C. A., Yannakopoulos, A., Johnson, K. A. & Krishnan, A.

Supervised learning is an accurate method for network-based gene classification.
Bioinformatics 36, 3457–3465 (2020).

82. Mancuso, C. A., Johnson, K. A., Liu, R. & Krishnan, A. Joint representation of
molecular networks from multiple species improves gene classification. PLOS
Comput. Biol. 20, e1011773 (2024).

83. Leiserson, M. D. M. et al. Pan-cancer network analysis identifies combinations of
rare somatic mutations across pathways and protein complexes. Nat. Genet. 47,
106–114 (2015).

84. Guan, Y., Ackert-Bicknell, C. L., Kell, B., Troyanskaya, O. G. & Hibbs, M. A.
Functional Genomics Complements Quantitative Genetics in Identifying
Disease-Gene Associations. PLoS Comput. Biol. 6, e1000991 (2010).

85. Köhler, S., Bauer, S., Horn, D. & Robinson, P. N. Walking the Interactome for

Prioritization of Candidate Disease Genes. Am. J. Hum. Genet. 82, 949–958 (2008).

86. Warde-Farley, D. et al. The GeneMANIA prediction server: biological network

integration for gene prioritization and predicting gene function. Nucleic Acids Res.
38, W214-220 (2010).

19

87. Mancuso, C. A. et al. GenePlexus: a web-server for gene discovery using

network-based machine learning. Nucleic Acids Res. 50, W358–W366 (2022).

88. Mancuso, C. A., Liu, R. & Krishnan, A. PyGenePlexus: a Python package for gene
discovery using network-based machine learning. Bioinformatics 39, btad064
(2023).

89. Piñero, J. et al. DisGeNET: a comprehensive platform integrating information on

human disease-associated genes and variants. Nucleic Acids Res. 45, D833–D839
(2017).

90. Piñero, J. et al. The DisGeNET knowledge platform for disease genomics: 2019

update. Nucleic Acids Res. 48, D845–D855 (2020).

91. Grover, A. & Leskovec, J. node2vec: Scalable Feature Learning for Networks. in
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining - KDD ’16 855–864 (ACM Press, San Francisco,
California, USA, 2016). doi:10.1145/2939672.2939754.

92. Liu, R. & Krishnan, A. PecanPy: a fast, efficient and parallelized Python
implementation of node2vec. Bioinformatics 37, 3377–3379 (2021).

93. The Omnigenic Model: Response from the Authors. J. Psychiatry Brain Sci. (2017)

doi:10.20900/jpbs.20170014S8.

94. Wray, N. R., Wijmenga, C., Sullivan, P. F., Yang, J. & Visscher, P. M. Common

Disease Is More Complex Than Implied by the Core Gene Omnigenic Model. Cell
173, 1573–1580 (2018).

95. Ratnakumar, A., Weinhold, N., Mar, J. C. & Riaz, N. Protein-Protein interactions

uncover candidate ‘core genes’ within omnigenic disease networks. PLOS Genet.
16, e1008903 (2020).

96. Mostafavi, H., Spence, J. P., Naqvi, S. & Pritchard, J. K. Systematic differences in

discovery of genetic effects on gene expression and complex traits. Nat. Genet. 55,
1866–1875 (2023).

97. Ratajczak, F. et al. Speos: an ensemble graph representation learning framework to
predict core gene candidates for complex diseases. Nat. Commun. 14, 7206 (2023).

98. The Schizophrenia Working Group of the Psychiatric Genomics Consortium 2, et al.
The role of polygenic risk score gene-set analysis in the context of the omnigenic
model of schizophrenia. Neuropsychopharmacology 44, 1562–1569 (2019).

99. Zhang, W., Reeves, G. R. & Tautz, D. Testing Implications of the Omnigenic Model
for the Genetic Analysis of Loci Identified through Genome-wide Association. Curr.
Biol. 31, 1092-1098.e6 (2021).

100.Hickey, S. L., McKim, A., Mancuso, C. A. & Krishnan, A. A network-based approach

for isolating the chronic inflammation gene signatures underlying complex diseases

20

towards finding new treatment opportunities. Front. Pharmacol. 13, 995459 (2022).

21

CHAPTER 2: A NETWORK-BASED APPROACH FOR ISOLATING THE CHRONIC

INFLAMMATION GENE SIGNATURES UNDERLYING COMPLEX DISEASES

TOWARDS FINDING NEW TREATMENT OPPORTUNITIES

Abstract

Complex diseases are associated with a wide range of cellular, physiological, and

clinical phenotypes. To advance our understanding of disease mechanisms and our

ability to treat

these diseases,

it

is critical

to delineate the molecular basis and

therapeutic avenues of specific disease phenotypes, especially those that are

associated with multiple diseases.

Inflammatory processes constitute one such

prominent phenotype, being involved in a wide range of health problems including

ischemic heart disease, stroke, cancer, diabetes mellitus, chronic kidney disease,

non-alcoholic fatty liver disease, and autoimmune and neurodegenerative conditions.

While hundreds of genes might play a role in the etiology of each of these diseases,

isolating the genes involved in the specific phenotype (e.g. inflammation “component”)

could help us understand the genes and pathways underlying this phenotype across

diseases and predict potential drugs to target the phenotype. Here, we present a

computational approach that integrates gene interaction networks, disease-/trait-gene

associations, and drug-target

information to accomplish this goal. We apply this

approach to isolate gene signatures of complex diseases that correspond to chronic

inflammation and use SAveRUNNER to prioritize drugs to reveal new therapeutic

opportunities.

Introduction

Acute inflammation is an organism's healthy response to invasion by pathogens or to

cellular damage caused by injury1. Systemic chronic inflammation (CI) occurs when

these inflammatory responses do not resolve, resulting in persistent, low-grade immune

activation that causes collateral damage to the affected tissue over time2. While the

direct connection of CI to auto-immune diseases has been well known for some time,

only recently has the medical community uncovered the prevalence of CI in a multitude

of complex diseases and disorders2,3. Therefore, it is imperative to better understand the

different molecular mechanisms of CI manifestation across diseases.

22

Network-based methods are a powerful collection of tools in both elucidating specific

pathways and processes that may underlie a complex phenotype4–6 as well as for drug

repurposing7–9. For instance, HotNet2 is a pan-cancer network analysis in which active

network modules in a genome-wide molecular network are determined by guiding the

module detection algorithm with thousands of genes scored with how prevalent they are

across 12 cancers in TCGA4. HotNet2 is then able to determine if any module is

enriched for a given cancer type, pathway, or process.

In a similar vein, another

approach, DIAMOnD, starts with a genome-wide network, and then creates a disease

specific network using an expanded set of known disease-gene annotations5. This

disease specific network is then analyzed and compared to other disease specific

networks generated using the same technique. Both approaches find regions of a

genome-wide network that are enriched for disease-related genes.

Inflammation is an example of an endophenotype, or intermediate phenotype, of a

complex disease. Ghiassian et al. studied endophenotype network models by starting

with a genome-wide network and constructing modules for sets of seed genes related to

three endophenotypes: inflammation, thrombosis, and fibrosis6. The authors showed

that the network modules derived from the three endophenotypes have strong overlap

in the network and that these modules are enriched for genes differentially expressed in

various complex diseases. While the above methods provide invaluable insight

in

disease mechanisms using a disease-focussed and a phenotype-focussed approach,

respectively, they raise the critical question of identifying phenotypic signatures specific

to individual diseases. For instance, can we identify the CI-signature that is specific to a

given disease and use that to find avenues for therapeutic intervention?

23

Figure 2.1: Schematics describing the experimental pipeline. (A) Describes predicting

new disease related genes (step 1), clustering the disease specific interaction network

(step 2) as well as 5,000 networks made from randomly-selected degree-matched

genes for each disease, identifying CI-enriched clusters (steps 4 and 5), and calculating

the proportion of diseases with at least one CI-enriched cluster. These steps were

performed for each gene-gene interaction network in combination with each

inflammation gene set described in the methods. (B) Describes using SAveRUNNER to

find groups of CI-enriched clusters from all diseases with similar CI-signatures (steps 1

24

Figure 2.1 (cont’d)

and 2), and prioritize treatments for the CI component of complex diseases (step 3).

Using ConsensusPathDB with the high-confidence GeneShot derived CI

gene set resulted in the highest proportion of autoimmune diseases and the lowest

proportion of non-disease traits with at least one CI-enriched cluster. Therefore, steps

1-3 were performed with clusters from that network-CI gene st combination only.

In this work, we address this question using a network-based approach. We first

generate a network consisting of only genes associated with a single disease (Figure

2.1A, steps 1-2) Here,

like in DIAMOnD5, we expand our original disease-gene

annotations to build more robust networks and glean insight into unstudied genes. We

use a network-based supervised machine learning model to expand our gene sets10,

which has been shown to systematically outperform label propagation methods like

DIAMOnD. We then cluster the disease specific network, and find clusters that are

significantly enriched for known CI genes, and compare these CI signatures across

diseases (Figure 2.1A, 2.1B steps 1-2). We then use the SAveRUNNER7 method on

these enriched clusters to predict drugs that might help treat the CI-component specific

to a given disease (Figure 2.1B, step 3).

Methods

Selection of complex and autoimmune diseases and associated seed genes

We searched the literature2,11–15

 and curated examples of 17 complex diseases

associated with chronic inflammation (CI) and 9 common autoimmune diseases. Some

of these diseases are quite broad (i.e “Malignant neoplasm of lung”), and to add more

narrowly defined diseases to our list, we used the Human Disease Ontology16 to identify

child terms of

these diseases. The chosen diseases were not meant

to be

comprehensive, but examples of autoimmune diseases and complex diseases thought

to have immune components. We then identified genes annotated to each disease by

the DisGeNet database, which is a database that stores a collection of disease-gene

annotations from expert curated repositories, GWAS catalogs, animal models and the

scientific literature (Piñero et al. 2020). To ensure that our disease gene sets were

largely non-overlapping, we created a network such that nodes were diseases and an

edge was created between two diseases if the two gene sets had

≥0. 6

overlap (

25

|𝐴∩𝐵|/𝑚𝑖𝑛(|𝐴|,  |𝐵|)

). We then chose the most representative disease from each

connected component. This resulted in 10 autoimmune diseases and 37 complex

diseases (Table S1).

Selection of non-disease traits

Two lab members manually curated 113 non-disease-traits that are unlikely to be

related to SNPs associated with CI (i.e. handedness, coffee intake, and average

household income) from the list of traits with GWAS results from the UK Biobank17 to be

used as negative controls. Based on GWAS summary statistics from the Neale group18,

we used Pascal19

(upstream and downstream windows of 50 KB with the

sum-of-chi-squared statistics method; only autosomal variants) to associate genes with

the non-disease traits. Genes with

𝑝 < 0. 001

were included as seed genes for that trait.

GenePlexus

To predict new genes associated with a set of input seed genes, we used GenePlexus,

a tool that builds an L2-regularized logistic regression model using features from a gene

interaction network10. As input features, we used the adjacency matrices from STRING,

STRING with only experimentally derived edges (STRING-EXP)20, BioGRID21, and

ConsensusPathDB22. For predicting disease genes, positive examples were

disease/trait seed genes and negative example genes were generated by: (i) finding the

union of all genes annotated to all diseases in DisGeNET23, (ii) removing genes

annotated to the given seed genes, and (iii) removing genes annotated to any disease

in the collection that significantly overlapped with the given seed genes (𝑝 < 0. 05

based on the one-sided Fisher’s exact test)10. We tested the performance of the above

features for predicting new genes associated with our diseases and traits of interest

using three-fold cross validation and only included diseases in subsequent analyses if

the diseases/traits had

≥15

associated genes and median

𝑙𝑜𝑔2(𝑎𝑢𝑃𝑅𝐶/𝑝𝑟𝑖𝑜𝑟) ≥1

. (i.e.

the area under the precision-recall curve ‘

𝑎𝑢𝑃𝑅𝐶

’ is at least twice as much as expected

by random chance ‘

𝑝𝑟𝑖𝑜𝑟

’10). See Figure 2.1A, step 1.

Identifying clusters of interacting genes within a disease-specific network

One list of disease-associated genes was formed for each of

the four biological

networks used as features in GenePlexus. Specifically, we added genes with a

GenePlexus prediction probability of

≥0. 80

on the network of interest to the original

26

disease or trait seed gene list to create our final set of associated genes for each

disease or trait for that network. We formed disease/trait-specific networks by subsetting

a given network to include only the disease/trait associated genes and any edges

connecting those genes based on direct interactions (Figure 2.1A, step 2). We tested

five prediction-network–cluster-network combinations: Genes predicted on each of the

four networks were clustered on the same network. Genes predicted on STRING were

also clustered on both STRING and STRING-EXP   to test if using the full network for

novel gene prediction but only experimentally derived gene-gene associations for

clustering would improve performance. We then used the Leiden algorithm24 to partition

the disease/trait-specific networks into clusters (Figure 1A, step 2). Specifically, we

used the leiden_find_partition function from the leidenbase R package (v 0.1.3)

(https://github.com/cole-trapnell-lab/leidenbase)

with

100

iterations

and

ModularityVertex Partition as the partition type. We retained clusters containing ≥5

genes.

Cluster GOBP enrichment analysis

We used the R package topGO25 (v 2.44.0) to find enrichment of genes annotated to

GO biological processes (min size = 5, max size = 100) among disease gene clusters.

The annotations were taken from the Genome wide annotation for Human bioconductor

annotation package, org.Hs.eg.db26 (v 3.13.0). The background gene set included all

human genes present in the network of interest.

Defining chronic-inflammation-associated genes

We tested several different sets of chronic inflammation associated genes for this study

including the GO27 biological process

(GOBP)

terms GO:0002544 (”chronic

inflammatory response”) and GO:0006954 (“inflammatory response”). These were

collected from the Genome wide annotation for Human bioconductor annotation

package, org.Hs.eg.db26

(v 3.13.0) with and without propagation of gene-term

relationships

from the

descendent

terms

(org.Hs.egGO2ALLEGS

and

org.Hs.egGO2EG, respectively). GO:0006954 was also filtered to retain gene-term

relationships inferred from experiments (evidence codes EXP, IDA, IPI, IMP, IGI, IEP,

HTP, HDA, HMP, HGI, and HEP). As GO:0002544 without propagation contained < 15

genes,

this list was ultimately not

included in the study. We also identified genes

27

associated with chronic inflammation using Geneshot which, given the search term

“chronic inflammation”, searches Pubmed using manually collected GeneRif gene-term

associations to return a ranked list containing genes that have been previously

published in association with the search term28. We tested both the entire Geneshot

generated list, and the subset of genes with

> 10

associated publications (“High

confidence GeneShot”). As with the disease genes, we predicted additional

chronic-inflammation-associated genes using GenePlexus with features from each

network. Negative examples for GenePlexus were derived from non-overlapping GOBP

terms. We added genes with a prediction probability of

≥0. 80

to the seed gene list to

create our final sets of CI-associated genes.

Creating random traits

After running GenePlexus to predict new genes for each trait, the gene lists for each

trait were used to generate 5,000 random gene lists that have matching node degree

distributions to the original traits (Figure 2.1A, step 3). That is, a random gene list was

generated for a given trait by replacing each of its genes in the network of interest with a

(randomly chosen) gene that has the same node degree, or a gene that has a close

node degree if there are a small number of genes with the exact node degree4,7. We

clustered the random traits as described in section 2.3. Only clusters with

≥ 5

genes

were included. Real traits with no corresponding permuted traits with clusters containing

≥ 5

genes were excluded from the analysis.

Finding CI-gene enriched disease clusters

For each prediction-network—cluster-network pair and each CI gene list expanded on

the prediction network of interest, for each disease and random trait cluster containing

≥5

genes, we calculated an enrichment score score,

𝐸  =  𝑙𝑜𝑔

( (𝐶𝐺 ∩ 𝐶𝐼)/𝐶𝐺
𝐶𝐼/𝑏𝑎𝑐𝑘𝑔𝑟𝑜𝑢𝑛𝑑 )
2

where

𝐶𝐺

are the genes in a disease cluster,

𝐶𝐼

are the CI genes, and

𝑏𝑎𝑐𝑘𝑔𝑟𝑜𝑢𝑛𝑑

are all of the

genes present in the clustering network (Figure 2.1A, step 4). For each real disease or

trait cluster, we used the matching random trait clusters to calculate a p-value,

𝑝 = 𝑛 𝑟𝑎𝑛𝑑𝑜𝑚 𝑐𝑙𝑢𝑠𝑡𝑒𝑟 𝑠𝑐𝑜𝑟𝑒 ≥ 𝑟𝑒𝑎𝑙  𝑐𝑙𝑢𝑠𝑡𝑒𝑟 𝑠𝑐𝑜𝑟𝑒 

𝑛 𝑟𝑎𝑛𝑑𝑜𝑚  𝑐𝑙𝑢𝑠𝑡𝑒𝑟𝑠  + 1

4,7. We corrected for multiple comparisons across

clusters within a disease using the Benjamini-Hochberg procedure29 (Figure 2.1A, step

5).

Clusters

with

an

𝐹𝐷𝑅 < 0. 05

and

𝐸  > 0

were

considered

28

chronic-inflammation-associated disease clusters and were deemed to represent the ‘CI

signature’ of the disease.

Identifying the optimal prediction-network/cluster-network/CI gene source

combination

We chose the network/inflammation gene set combination that resulted in the highest

proportion of autoimmune diseases and lowest proportion of non-disease traits with at

least

one CI-enriched

cluster

of

any

network/CI-gene

set

combination,

ConsensusPathDB and the Geneshot generated list, subset with genes with associated

publications.

Comparing chronic inflammation signatures across diseases

For CI-enriched clusters identified using ConsensusPathDB and the high-confidence

Geneshot CI genes, we used the SAveRUNNER R package to quantify the similarity

between each pair of CI-enriched clusters using ConsensusPathDB as the base

network7 (Figure 2.1B, step 1). For each pair, SAveRUNNER computes the average

shortest path between each gene in

𝑐𝑙𝑢𝑠𝑡𝑒𝑟 𝐴

and the closest gene in

𝑐𝑙𝑢𝑠𝑡𝑒𝑟 𝐵

and

uses this value to calculate an adjusted similarity score. Then, a p-value is estimated

based on a null distribution of adjusted similarity scores between randomly generated

clusters with the same node degree distributions as

𝑐𝑙𝑢𝑠𝑡𝑒𝑟𝑠 𝐴

and

𝐵

. Because the

similarity scores and p-values are not symmetric,

ie.

𝐴  →  𝐵  ≠  𝐵  →  𝐴

, we used

Stouffer’s method to combine p-values for the same pair of clusters and averaged the

adjusted similarities. We then used the Leiden algorithm as described in section 2.3 to

group related clusters (Figure 2.1B, step 2). For each group, we took the union of the

genes belonging to the resident CI-enriched clusters. Using genes unique to each

group, with all of the ConsensusPathDB genes as background, we used TopGO as in

section 2.4 to identify enriched GOBPs.

Identifying expert-curated drug-target associations

The known drug-gene interactions used in this study are the subset of the interactions

present

in the DrugCentral database30

that are also among the expert curated

interactions in the Drug-Gene Interaction database (DGIdb)31. Specifically, we used the

DGIdb API to retrieve only drug-gene interactions that were marked “Expert curated”

(based on the source trust levels endpoint). Intersecting these interactions with those in

29

DrugCentral (through a list of drug synonyms from DrugCentral) resulted in the final list

of expert-curated drug-gene pairs.

Treatment prediction and scoring

We predicted treatment opportunities for the inflammatory component of complex

diseases by using the SAveRUNNER R package 7 (Figure 1B, step 3). SAveRUNNER

builds a bipartite drug-disease network by utilizing the previously determined

expert-curated drug targets,

the CI-associated cluster disease genes, and the

ConsensusPathDB network as a human interactome. Network similarity scores returned

by SAveRUNNER represent the proximity between disease and drug modules, where a

high similarity score means that the disease and drug modules have high proximity in

ConsensusPathDB. SAveRUNNER calculates a p-value where a significant value

represents the disease genes and drug targets are nearby in the network more than

expected by chance (based on an empirical null distribution built using 200 pairs of

randomly selected groups of genes with the same size and degree distribution of the

original sets of disease genes and drug targets). Using the list of

final predicted

associations after normalization of network similarity, the p-values were corrected for

multiple comparisons within each disease using the Benjamini-Hochberg procedure.

Drugs were associated to diseases based on the disease cluster with the lowest

𝐹𝐷𝑅

.

Predicted treatments are disease-drug pairs with an

𝐹𝐷𝑅 < 0. 01

.

Evaluating SAveRUNNER prediction performance

We

calculated

𝑙𝑜𝑔2(𝑎𝑢𝑃𝑅𝐶/𝑝𝑟𝑖𝑜𝑟)

by

ranking

disease-drug

pairs

by

− 𝑙𝑜𝑔

(𝑆𝐴𝑣𝑒𝑅𝑈𝑁𝑁𝐸𝑅 𝐹𝐷𝑅)

10

and using either previously indicated drug-disease pairs

(both approved and off-label) or drug-disease pairs tested in a clinical trial as positive

labels. Approved and off-label drug-disease pairs were collected from DrugCental30.

Only drugs with expert curated target genes were included (see section 2.6.1). The

Unified Medical Language System (UMLS) Concept Unique Identifiers (CUI) were

limited to diseases (T047) and neoplastic processes (T191), and our diseases were

matched to diseases in DrugCentral using UMLS CUIs. Drug-disease pairs tested in a

clinical trial were collected from the database for Aggregate Analysis of Clinical Trials

(AACT) 32. AACT reports the Medical Subject Headings (MeSH) vocabulary names for

diseases. We used disease vocabulary mapping provided by DisGeNET to translate

30

UMLS CUIs for our diseases to MeSH vocabulary names, further restricted to only

those that were present in AACT. We filtered AACT for trials with “Active, not recruiting”,

“Enrolling by invitation”, “Recruiting”, or “Completed” status.

Enrichment of predicted drug-disease pairs among previously indicated

drug-disease pairs

To check for an enrichment of predicted drug-disease pairs among previously indicated

drug-disease pairs for each disease, we tallied the total number of unique drugs

previously indicated to any disease, the number of those drugs indicated to the disease

of interest, the number of drugs predicted to treat the disease by our method, and the

number of drugs predicted to treat the disease by our method that were also previously

indicated for that disease. We calculated a p-value using a one tailed Fisher’s exact

test, and corrected for multiple comparisons within each disease across drugs using the

Benjamini-Hochberg procedure.

Enrichment of anti-inflammatory drugs and immunosuppressants among

predicted treatments

We searched the DrugBank database for the ATC codes for anti-inflammatory drugs

and immunosuppressants including Immunosuppressants (L04), Corticosteroids for

systemic use (H02), Antiinflammatory and antirheumatic products (M01), and

Antihistamines (R06)33. We used these codes to pull all of the drugs in these categories

from our expert curated drug to target gene database. For each disease we ranked

predicted drugs by

  − 𝑙𝑜𝑔

(𝑆𝐴𝑣𝑒𝑅𝑈𝑁𝑁𝐸𝑅 𝐹𝐷𝑅)

and used the fgsea R package (v

10

1.20.0) to perform gene set enrichment analysis for drugs belonging to each of the four

classes34,35.

Reference to tables created for this chapter

In this chapter we reference multiple tables that display genelists for the complex

diseases and UK Biobank GWAS. These tables are too big to display in the chapter and

can be downloaded from the Github repository:

(https://github.com/krishnanlab/chronic-inflammation/tree/main/figures/supplemental).

31

Results

Expanding lists of disease-related genes and identifying disease-specific gene

subnetworks

Our first goal was to establish a comprehensive list of genes associated with the

complex diseases of interest and resolve the genes linked to each disease into subsets

of tightly-connected genes in an underlying molecular network. Towards this goal, we

selected 37 complex diseases associated with underlying systemic inflammation (see

methods). To ensure that we correctly isolate chronic inflammation (CI) signatures, we

devised a set of positive and negative controls. We selected 10 autoimmune disorders

as positive controls because autoimmune disorders are characterized by CI and should

have an easily identifiable CI gene signature. For negative controls, we selected 113

traits from UK Biobank17

that are unlikely to be associated with CI (i.e. Right

handedness, filtered coffee intake, and distance between home and workplace). The full

list

of

traits

is

included

in

supplementary material

in Zenodo

record

(https://zenodo.org/record/6858073).

While thousands of genes may play a role in the etiology of a chronic disease, it is

unlikely that all of these genes have been cataloged in available databases such as

DisGeNET or

identified

by GWAS. Hence, we

expanded

the

lists

of

disease-or-trait-associated genes using GenePlexus10 (Figure 2.1A, step 1). Briefly,

GenePlexus performs supervised machine learning using network-based features to

predict novel genes related to a set of input seed genes. Here, we built one GenePlexus

model per disease using disease-associated genes from DisGeNET or trait-associated

genes from the UK Biobank GWAS as seed genes (positive examples). To test the

robustness of this method for identifying CI enriched clusters, we tested four different

biological interaction networks of varying sizes and edge densities — STRING, STRING

with

only

experimentally

derived

edges

(STRING-EXP)20, BioGRID21,

and

ConsensusPathDB22 (Figure 2.1A, step 1, see methods section “GenePlexus”) Genes

predicted by the GenePlexus model with a probability

≥0. 80

were added to the seed

gene list to create an expanded list of disease- or trait-associated genes. Figure 2.2

shows results for ConsensusPathDB. The proportion of genes predicted by GenePlexus

32

Figure 2.2: (A) Number of genes per disease/trait. (B) Proportion of the genes per

disease/trait

that were predicted by GenePlexus.

(C) Number of clusters per

disease/trait containing at least 5 genes. (D) Proportion of total genes assigned to a

cluster containing at least 5 genes. (E) Proportion of clusters per disease/trait enriched

with genes from at least one GO biological process.

for the non-disease traits is lower than those for the autoimmune and complex diseases

(Figure 2.2B). This observation indicates that genes associated with a specific

autoimmune/complex disease tend to have more similar network neighborhoods than

genes associated with non-disease traits. All disease-associated genes after

GenePlexus prediction are listed in Table S3.

Next, for each disease/trait, we clustered the expanded lists of genes based on their

interactions in the gene-gene interaction network (Fig 2.1 A step 2 and Fig 2.2C; Table

33

S3). On ConsensusPathDB, the complex diseases had the highest proportion of genes

grouped into clusters of

≥5

genes, followed by autoimmune diseases and non-disease

traits (Figure 2.2D). To assess whether clusters are biologically meaningful, we

performed an enrichment analysis between every cluster and hundreds of GO Biological

Process (GOBP) gene sets. We theorize that significant enrichment of a cluster with a

GOBP means the genes in the cluster likely function together to carry out a specific

cellular process or pathway. On ConsensusPathDB,

for autoimmune and complex

diseases, the median proportion of GOBP enriched clusters are

> 0. 75

and

> 0. 60

,

respectively, suggesting most clusters are biologically relevant (Figure 2.2E).

In

contrast, most clusters in non-disease traits are not enriched for a GOBP (Figure 2.2E).

Isolating CI-enriched disease clusters

Clusters of related, disease-associated genes on functional gene interaction networks

are likely to correspond to the pathways and biological processes disrupted during

disease progression. For complex disorders, multiple pathways are likely to be affected.

Our next goal was to identify which cluster(s) within a set of disease-associated genes

corresponds to the CI component of

the disease. For this analysis, similar to the

expansion of disease- or trait-associated genes, we used GenePlexus to predict novel

inflammation genes for each of the 5 sets of inflammation-related seed genes (see

methods section “Defining chronic-inflammation-associated genes”, Table S4). We

then scored the enrichment of CI genes in each disease cluster and performed a

permutation test using 5,000 random gene sets for each disease to determine the

significance of the enrichment score (see methods sections “Creating random traits”

and “Finding CI-gene enriched disease clusters”, Figure 2.1A steps 3-5, Table S5).

With various base networks and CI gene sources, we tested all network–CI-geneset

combinations and chose the one that resulted in the highest proportion of autoimmune

diseases and lowest proportion of non-disease traits with at least one CI-enriched

cluster. Based on this test, we picked ConsensusPathDB as the base network and

‘high-confidence Geneshot’ as the source of CI genes (Figure A2.2). We were able to

identify clusters enriched for CI genes in all of the autoimmune disorders surveyed (9/9),

while finding no CI-enriched clusters among the non-disease traits (Figure 2.3A). We

identified at least one CI-enriched cluster in 18 of 30 of the complex diseases (Figure

34

2.3A). Twelve out of the 27 diseases with at least one CI-enriched cluster had two or

more CI-enriched clusters, and the median proportion of CI-enriched clusters out of the

total clusters is higher for autoimmune diseases than complex diseases (Figure 2.3B).

The number of diseases with at least one CI-enriched cluster varied with different

combinations of prediction network, cluster network, and inflammation gene set (Table

S6). In every case, however, the proportion of autoimmune diseases with at least one

CI-enriched cluster was higher than that for non-disease traits suggesting that our

method is robust to changes in base-network and inflammation geneset (Figure A2.2).

Figure 2.3: (A) Number of diseases/traits with at least one cluster overlapping the

expanded chronic inflammation (CI) geneset (dark pink), out of the total number of

diseases/traits. (B) The proportion of CI-enriched disease clusters among all disease

clusters per disease. (C) Mean probability that genes with no known relationship with

chronic inflammation residing in a CI-enriched cluster or non-CI-enriched cluster are

associated with CI. P-value calculated using a one-sided Fisher’s Exact test.

35

We hypothesized that, through guilt-by-association, even the genes with no known

relationship with chronic inflammation residing in a CI-enriched cluster should have a

higher probability of being CI-associated than those in non-CI-enriched clusters. To test

this hypothesis, we used GenePlexus with features from each gene-gene interaction

network to calculate the probability that every gene is associated with each

inflammation gene set. Then, focusing on the genes in disease clusters that were not

present in the inflammation geneset, we found that the mean CI probability of these

genes in CI-enriched clusters is significantly higher for CI-enriched clusters than

non-enriched clusters in 24 out of 25 network/CI-geneset combinations (Figure

A2.3-S7), including ConsensusPathDB with the high-confidence Geneshot CI geneset

(Figure 2.3C). This observation suggests that the CI-enriched clusters as a whole, and

not just the genes in the high-confidence Geneshot CI geneset residing within them, are

CI-associated in the disease of interest. Knocking out putative inflammation associated

genes in animal models of the diseases the genes have already been associated with

and testing for an increase in known inflammation markers would confirm this result.

Comparing CI gene signatures across diseases

To determine if related diseases have similar chronic inflammation signatures, we used

a network-based approach to quantify the similarity between each pair of

ConsensusPathDB/high-confidence GeneShot CI-enriched disease clusters across

diseases and grouped similar clusters together using the Leiden algorithm24,36 (Figure

2.1B, steps 1-2). Several diseases have more than one CI-enriched cluster and none of

these diseases have clusters belonging only to one group (Figure 2.4A, Table S7).

Moreover, diseases belonging to the same broad category — i.e. autoimmune, cancer,

or cardiovascular disease — do not have a larger proportion of clusters belonging to a

particular group than expected by chance (one-sided Fisher’s exact test, Figure 2.4A).

This suggests that one disease can harbor more than one type of chronic-inflammation

signature, and that the same signatures can be found in very different diseases. For

example,

rheumatoid arthritis, myocardial

ischemia, atherosclerosis and chronic

obstructive airway disease all have CI-enriched clusters belonging to each of the three

signature groups. To determine the biological significance of these signature groups, we

performed enrichment analyses for genes unique to each group among GO biological

36

processes (Figure 2.4B, Table S8). The top 10 significantly enriched terms for each

group are largely distinct, with group 1 being enriched for immune relevant signaling

pathways, group 2 for regulation of immune cell proliferation, and group 3 for regulation

of immune cell chemotaxis (Figure 2.4B).

Figure 2.4: (A) Number of CI-enriched clusters per disease colored by CI-signature

group. (B) Top ten enriched GOBP categories by Benjamini-Hochberg procedure

corrected FDR for each CI-signature group — the group is denoted by the colored

blocks to the left of

the heatmap. The heatmap shows the -log10(FDR) of

the

enrichment for each CI-signature group — * denotes

.
𝐹𝐷𝑅  <  . 05

Predicting novel treatment opportunities

Our final goal was to leverage the ConsensusPathDB/high confidence GeneShot

CI-enriched disease clusters we discovered to find potential avenues for repurposing

approved drugs to therapeutically target systemic inflammation underlying complex

37

diseases (Fig 2.1B, step 3). Towards this goal, we used SAveRUNNER to find

associations between CI-enriched clusters and FDA approved drugs through each

drug’s target genes36. We found that SAveRUNNER predictions for known treatments

were better than random chance —

𝑙𝑜𝑔

(𝑎𝑢𝑃𝑅𝐶/𝑝𝑟𝑖𝑜𝑟  >  0)
2

— for diseases with more

than five known treatments (Figure 2.5A). Moreover, with the exception of myocardial

ischemia, SAveRUNNER predicted drugs in Phase IV clinical trials better than random

chance (Figure 2.5A)32. Drugs in Phase IV are those that have already been proved

effective for treating a disease (in Phase III) and are being monitored for long-term

safety and efficacy.

Figure 2.5:

(A)

𝑙𝑜𝑔2(𝑎𝑢𝑃𝑅𝐶/𝑝𝑟𝑖𝑜𝑟)

of SAveRUNNER predictions using drugs

previously indicated for the disease (top) or drugs ever in Phase IV clinical trials for a

38

Figure 2.5 (cont’d)

disease (bottom) as positive examples. The dotted line is at

.
𝑙𝑜𝑔2(𝑎𝑢𝑃𝑅𝐶/𝑝𝑟𝑖𝑜𝑟) =  0

𝑙𝑜𝑔2(𝑎𝑢𝑃𝑅𝐶/𝑝𝑟𝑖𝑜𝑟)  >  0

denotes predictions better than random chance. (B) Number

of SAveRUNNER predicted genes

(Benjamini-Hochberg procedure corrected

𝐹𝐷𝑅  <  . 01

) per disease.

Figure 2.6:

(A) Heat map showing the enrichment of anti-inflammatory and

immunomodulating drugs among highly ranked SAveRUNNER predicted drugs (gene

set enrichment analysis, * denotes adjusted p-value < .05). (B) Bar plot showing the

adjusted similarity scores of antihistamines for complex diseases with at least one

39

Figure 2.6 (cont’d)

antihistamine among drugs predicted by SAveRUUNER to treat

the disease — *

denotes

𝐹𝐷𝑅  <  . 01

. HRH1 specific antihistamines are those listed in our

high-confidence drug target database as only targeting HRH1.

SAveRUNNER predicted between 3 and 178 high-confidence (

𝐹𝐷𝑅  <  . 01

) treatments

for each disease and identified previously indicated drugs for 5 of the 9 autoimmune

disorders (Figure 2.5B, Table S9), with significant enrichment among drug predictions

for celiac disease (one-sided Fisher’s exact

test, BH corrected

𝐹𝐷𝑅  <  . 001

).

SAveRUNNER found previously indicated treatments for only 3 of the 18 complex

diseases (Figure 2.5B, Table S9). This result

is expected given that, unlike for

autoimmune disorders, most known treatments for these complex disorders are not

likely to target the immune system. Treatments previously tested in a clinical trial were

predicted for 6 autoimmune disorders and 7 of the complex disorders (Figure 2.5B).

We tested for enrichment of drugs belonging to four immune-related drug classes

among treatment predictions highly ranked by SAveRUNNER for each complex disorder

(Figure 2.6A). SAveRUNNER allows for drug prioritization based both on the p-value

and on the adjusted similarity score between drug target genes and CI-enriched cluster

genes. Highly scoring drug-cluster pairs have genes that are closely related in the gene

interaction network, which increases the likelihood that the drug will be on-target for the

paired disease36. We found that antihistamines as a whole are enriched for 6 of the 18

complex disorders (Figure 2.6A). Antihistamines that specifically target H1-receptor

(HRH1) have the highest adjusted similarity score for 6 of the 7 complex disorders with

any antihistamine among their high-confidence targets (Figure 2.6B). SAveRUNNER

predicted that cyproheptadine, which targets both HRH1 and the serotonin 5-HT(2A)

receptor gene, HTR2A, instead of HRH1 alone would be the best antihistamine for

treating non-alcoholic fatty liver disease (Figure 2.6B). While cyproheptadine is also a

high-confidence predicted treatment

for atherosclerosis, myocardial

ischemia, and

chronic obstructive airway disease,

it

is unlikely to be an effective treatment

for

myocardial

infarction or malignant mesothelioma (Figure 2.6B). Interestingly, of the 8

diseases, only myocardial

infarction and malignant mesothelioma do not have a

CI-enriched cluster belonging to CI-signature group 2 (Figure 2.4A). This finding

40

suggests that, even among drugs in the same class, we are able to predict

disease-specific treatments for the chronic inflammation component of the disease

etiology.

Discussion

Complex diseases exhibit a staggering amount of heterogeneity, being associated with

hundreds of genes and with a range of phenotypes. Therefore, to continue advancing

our understanding of disease mechanisms and our ability to treat these diseases, it is

critical to deconvolve disease heterogeneity by: a) resolving subset of disease genes

(and

cellular

processes/pathways)

that

underlie

specific

disease-associated

phenotypes, and b) identifying avenues to diagnostically and/or therapeutically target

those specific phenotypes. Here, we present a computational data-driven approach to

address this critical need (Figure 2.1). We used our approach to study chronic

inflammation (CI) — a major phenotype present across many complex diseases. We

generated comprehensive lists of (known and predicted) disease-associated genes and

identified and classified the CI signal among these genes. We used these signatures to

predict novel treatment options to target the inflammatory components of 18 complex

diseases.

Validating our method with autoimmune disorders

A key aspect of our approach is ensuring its sensitivity to detect CI disease signatures

using autoimmune diseases as positive controls. In autoimmune diseases, the immune

system mistakenly attacks healthy tissue causing long-term systemic inflammation.

Thus, we expect that the underlying CI disease signatures would be easily identifiable

by a valid approach. Indeed, in each of the nine autoimmune diseases analyzed, our

approach isolated gene clusters enriched for CI genes (Figure 2.1A), and identified

drugs already used to treat a number of these disorders (Figure 2.5B). This finding is

encouraging given that we conservatively matched drugs to diseases only based on

expert-curated drug-target data from DGIDb31

rather

than using all drug-target

information in DrugCentral30.

Non-disease GWAS are contrasted to disease GWAS in biological networks

To show that our method was not erroneously uncovering CI signals where there were

none, we identified UK Biobank traits not patently associated with CI (along with their

41

genes) to use as negative controls. Following this analysis, we found that the median

fraction of trait-associated genes predicted by GenePlexus and the median fraction of

genes assigned to sizable clusters were lower for these traits than for autoimmune and

complex diseases (Figure 2.2C and D). Given that GenePlexus is a method that

leverages network connectivity for predicting new genes belonging to a set, these

results suggest that the genes associated with non-disease traits may not be as highly

connected to one another in ConsensusPathDB as the autoimmune and complex

disease genes. Moreover, most of the non-disease trait clusters were not enriched with

genes annotated to GO biological processes, suggesting that these clusters are diffuse

and that

the member genes are unlikely to work together to support a coherent

biological task. While non-disease traits like coffee intake and handedness have been

associated with inflammation37,38,

this analysis (using GWAS-based trait-associated

genes) suggests it

is unlikely that SNPs in a coordinated inflammation pathway

influence non-disease traits and more likely that any association with inflammation is

environmental, not genetic. Taken together, these results suggest that these chosen

traits serve as reasonable negative controls and offer a way to meaningfully contrast the

results from complex diseases.

Ideally, diseases or

traits with no underlying

inflammatory component but with associated genes that cluster in a network (as well as

the autoimmune and complex disease) will serve as better negative controls. Given how

common inflammatory processes are in disease, however, such diseases are difficult to

definitively identify.

CI pathways are shared across diseases

Complex disorders like cardiovascular diseases, diabetes, cancer, and Alzheimer’s

disease are among the leading causes of death and disability among adults over 50

years of age, and all are associated with underlying systemic inflammation 2,3. Patients

with systemic inflammation caused by autoimmune disorders are more likely to have

another CI disorder like cardiovascular disease, type 2 diabetes mellitus, and certain

types of cancer11–13. Further, treating one chronic-inflammatory disease can reduce the

risk of contracting another, suggesting a common underlying pathway 39. For example,

treating rheumatoid arthritis with tumor necrosis factor (TNF) antagonists lowers the

incidence of Alzheimer’s disease and type II diabetes14,40.

42

To better understand how CI-associated disorders relate to one another, we used a

network-based approach to quantify the similarity between their CI-enriched clusters.

We hypothesized, for example, that Crohn's disease and “malignant tumor of colon”

would have similar CI-signatures, given that patients with inflammatory bowel disease

are at

increased risk for developing colorectal cancer41. However, Crohn’s disease

CI-enriched clusters are members of signature groups 1 and 2, while the “malignant

tumor of colon” CI-enriched cluster belongs to group 3 (Figure 2.4A). Instead of sharing

CI-signatures, related CI diseases may,

instead, have complementary signatures.

Indeed, the group 1 signature, which characterizes two of the three Crohn’s disease

CI-enriched clusters,

is enriched for genes that positively regulate proinflammatory

cytokines like tumor necrosis factor (TNF) and in interferon-gamma (IFNɣ) (Figure

2.4B). When these cytokines bind to their respective receptors, reactive oxygen species

are generated causing oxidative stress42. Oxidative stress,

in turn,

induces

DNA-damage that can induce tumor formation. Colorectal tumors are infiltrated with

lymphocytes, which mediate the recruitment of

immune cells that suppress tumor

growth43.

Immune cell

infiltration likely leads to our ability to detect

the group 3

CI-signature among genes associated with “malignant tumor of colon”, given that group

3 is enriched for

immune cell migration and chemotaxis (Figure 2.4A and B).

Alternatively, there is a possibility that every CI-associated disease actually exhibits all

three CI-signatures, and our method is only sensitive enough to detect these in a

handful of diseases.

Predicting treatments for CI components of disease reveals meaningful

relationships

Common

treatments

for

systemic

inflammation,

including

non-steroidal

anti-inflammatory drugs (NSAIDs), corticosteroids, and biologics like tumor necrosis

factor (TNF) antagonists, can cause adverse effects when used long term. For instance,

patients treated with corticosteroids or TNF antagonists have increased risk of

infection41,44,45, and corticosteroid use increases both the risk of fracture 46,47 and the risk

of developing type II diabetes48. NSAIDs present a unique set of side effects, particularly

in elderly patients, including gastrointestinal problems ranging from indigestion to gastric

bleeding, and kidney damage49–51. Therefore, the search for better treatment options for

43

CI

is ongoing. Here, we leverage the CI-signatures to identify novel

treatment

opportunities for the CI-component of 18 complex diseases (Figure 2.5B). Interestingly,

antihistamines were among the top drug associations for 6 of 18 complex diseases

(Figure 2.6A),

including atherosclerosis. Atherosclerosis is characterized by the

deposition of cholesterol plaques on the inner artery walls. Mast cells, immune cells

best known for their response to allergens, are recruited to arteries during plaque

progression, where they release histamines. Histamines then activate the histamine

H1-receptor, increasing vascular permeability, which allows cholesterol easier access to

arteries promoting plaque buildup52. Mepyramine, one of

the HRH1-specific

antihistamines highly associated with atherosclerosis, has already been shown to

decrease the formation of atherogenic plaques in a mouse model of atherosclerosis52.

Interestingly, it is not predicted as a treatment for myocardial ischemia, which occurs

when plaque buildup obstructs blood flow to a coronary artery, suggesting

disease-specific antihistamine efficacy even among related diseases. Cetirizine and

fexofenadine

are

also HRH1-specific

antihistamines

highly

associated with

atherosclerosis but neither prevented or reduced atherosclerosis progression in a

mouse model of atherosclerosis, and both increased atherosclertotic lesions at low

doses53. In the expert-curated drug-target database used in this study, the histamine

H1-receptor is the only target listed for all three drugs; however, the contradictory results

from Rosenberg et al. and Raveendran et al. suggests that drug-specific off-target

effects are mediating atherosclerosis treatment outcomes. A more complete

understanding of drug-gene targets would allow for better predictions of novel disease

treatments.

For example, unlike the other diseases with antihistamines as predicted treatments,

only cyproheptadine, and not

the HRH1-specific drugs, is likely to be an effective

treatment for non-alcoholic fatty liver disease (NAFLD) (Figure 2.6B). Cyproheptadine

is an antagonist for both the HRH1 and the serotonin 5-HT(2A) receptor (HTR2A),

suggesting that blocking 5-HT(2A) could be specifically helpful

for ameliorating

symptoms of NAFLD.

Indeed,

liver-specific Htr2a knockout mice are resistant

to

HFD-induced hepatic steatosis,

increased fat

in the liver 54, and increased serum

44

serotonin levels were correlated with increased disease severity in patients with

NAFLD55.

Conclusion

Overall, we have shown that our method is capable of isolating the chronic inflammation

gene signature of a complex disease using a network-based strategy and, by integrating

information across multiple complementary sources of data, it can predict and prioritize

potential

therapies for the systemic inflammation involved in that specific disease.

Importantly, our approach provides a blueprint for identifying and prioritizing therapeutic

opportunities for any disease endophenotype. This work has been published56

This project was foundational for validating the usefulness of using gene modules and

networks to isolate particular processes within diseases. The pipeline has multiple

steps,

including module discovery, gene classification, and analyzing important

processes and phenotypes at the module level. We used well validated methods for this

task – being leiden clustering for module discovery and GenePlexus for gene

classification. However, during this project we started considering how to improve

various aspects of this pipeline. For module discovery, we are heavily reliant on the

gene annotations from DisGeNet and did not implement a method to remove genes that

external sources imply are low evidenced. Similarly, we run GenePlexus on each

disease as a whole. However, given that we view diseases as collections of biologically

related phenotypes and cellular pathways, this may be an inappropriate way to run

GenePlexus. For example, in the analysis performed in Figures A2.3-7, we see that

non-CI genes in CI clusters are more likely to be associated with CI than genes in

non-CI clusters. If it is known that CI is part of a disease, it may make more sense to

perform gene classification on subsets of a disease that correspond to specific

phenotypes, rather than the disease as a whole. This would have implications for doing

enrichment on clusters and discovering/annotating phenotypes at a modular level.

Deeply considering the steps of chapter 2’s pipeline is what motivates chapters 3 and 4

of this dissertation.

45

REFERENCES

1. Rock, K. L. & Kono, H. The inflammatory response to cell death. Annu Rev Pathol 3,

99–126 (2008).

2. Furman, D. et al. Chronic inflammation in the etiology of disease across the life

span. Nat Med 25, 1822–1832 (2019).

3. Vos, T. et al. Global burden of 369 diseases and injuries in 204 countries and

territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study
2019. The Lancet 396, 1204–1222 (2020).

4. Leiserson, M. et al. Pan-Cancer Network Analysis Identifies Combinations of Rare
Somatic Mutations across Pathways and Protein Complexes. Nature Genetics 47,
106–114 (2015).

5. Ghiassian, S. D., Menche, J. & Barabási, A.-L. A DIseAse MOdule Detection

(DIAMOnD) Algorithm Derived from a Systematic Analysis of Connectivity Patterns
of Disease Proteins in the Human Interactome. PLOS Computational Biology 11,
e1004120 (2015).

6. Ghiassian, S. D. et al. Endophenotype Network Models: Common Core of Complex

Diseases. Sci Rep 6, 27414 (2016).

7. Fiscon, G. & Paci, P. SAveRUNNER: an R-based tool for drug repurposing. BMC

Bioinformatics 22, 150 (2021).

8. Chen, H., Zhang, H., Zhang, Z., Cao, Y. & Tang, W. Network-Based Inference
Methods for Drug Repositioning. Computational and Mathematical Methods in
Medicine 2015, e130620 (2015).

9. Cheng, F. et al. Network-based approach to prediction and population-based

validation of in silico drug repurposing. Nat Commun 9, 2691 (2018).

10. Liu, R., Mancuso, C. A., Yannakopoulos, A., Johnson, K. A. & Krishnan, A.

Supervised-learning is an accurate method for network-based gene classification.
Bioinformatics (2020) doi:10.1093/bioinformatics/btaa150.

11. Dregan, A., Charlton, J., Chowienczyk, P. & Gulliford, M. C. Chronic Inflammatory
Disorders and Risk of Type 2 Diabetes Mellitus, Coronary Heart Disease, and
Stroke. Circulation 130, 837–844 (2014).

12. Armstrong, A. W., Harskamp, C. T. & Armstrong, E. J. Psoriasis and the Risk of

Diabetes Mellitus: A Systematic Review and Meta-analysis. JAMA Dermatology 149,
84–91 (2013).

13. Yashiro, M. Ulcerative colitis-associated colorectal cancer. World J Gastroenterol 20,

16389–16397 (2014).

14. Chou, R. C., Kane, M., Ghimire, S., Gautam, S. & Gui, J. Treatment for Rheumatoid

46

Arthritis and Risk of Alzheimer’s Disease: A Nested Case-Control Analysis. CNS
Drugs 30, 1111–1120 (2016).

15. Autoimmune Diseases: Causes, Symptoms, What Is It & Treatment. Cleveland

Clinic https://my.clevelandclinic.org/health/diseases/21624-autoimmune-diseases.

16. Schriml, L. M. et al. Human Disease Ontology 2018 update: classification, content

and workflow expansion. Nucleic Acids Res 47, D955–D962 (2019).

17. Sudlow, C. et al. UK Biobank: An Open Access Resource for Identifying the Causes
of a Wide Range of Complex Diseases of Middle and Old Age. PLOS Medicine 12,
(2015).

18. Abbot, L. et al. UK Biobank GWAS. nealelab http://www.nealelab.is/uk-biobank/.

19. Lamparter, D., Marbach, D., Rueedi, R., Kutalik, Z. & Bergmann, S. Fast and

Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary
Statistics. PLOS Computational Biology 12, (2016).

20. Szklarczyk, D. et al. The STRING database in 2017: quality-controlled

protein–protein association networks, made broadly accessible. Nucleic Acids Res
45, D362–D368 (2017).

21. Stark, C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids

Res 34, D535–D539 (2006).

22. Kamburov, A., Stelzl, U., Lehrach, H. & Herwig, R. The ConsensusPathDB

interaction database: 2013 update. Nucleic Acids Research 41, D793–D800 (2013).

23. Piñero, J. et al. The DisGeNET knowledge platform for disease genomics: 2019

update. Nucleic Acids Research 48, D845–D855 (2020).

24. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing

well-connected communities. Scientific Reports 9, 5233 (2019).

25. Alexa, A. & Rahnenfuhrer, J. topGO: Enrichment Analysis for Gene Ontology.
Bioconductor version: Release (3.14) https://doi.org/10.18129/B9.bioc.topGO
(2022).

26. Carlson, M. org.Hs.eg.db: Genome wide annotation for Human. (2019).

27. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat Genet 25,

25–29 (2000).

28. Lachmann, A. et al. Geneshot: search engine for ranking genes from arbitrary text

queries. Nucleic Acids Research 47, W571–W577 (2019).

29. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and
Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society:
Series B (Methodological) 57, 289–300 (1995).

47

30. Avram, S. et al. DrugCentral 2021 supports drug discovery and repositioning.

Nucleic Acids Research 49, 1160–1169 (2021).

31. Freshour, S. et al. Integration of the Drug-Gene Interaction Database (DGIdb 4.0)
with open crowdsource efforts. Nucleic Acids Research 49, 1144–1151 (2021).

32. AACT Database | Clinical Trials Transformation Initiative.

https://aact.ctti-clinicaltrials.org/.

33. Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for

2018. Nucleic Acids Research 46, D1074–D1082 (2018).

34. Korotkevich, G. et al. Fast gene set enrichment analysis. 060012 Preprint at

https://doi.org/10.1101/060012 (2021).

35. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach
for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102,
15545–15550 (2005).

36. Fiscon, G., Conte, F., Farina, L. & Paci, P. SAveRUNNER: A network-based
algorithm for drug repurposing and its application to COVID-19. PLOS
Computational Biology 17, e1008686 (2021).

37. Paiva, C. et al. Consumption of coffee or caffeine and serum concentration of

inflammatory markers: A systematic review. Crit Rev Food Sci Nutr 59, 652–663
(2019).

38. Searleman, A. & Fugagli, A. K. Suspected autoimmune disorders and

left-handedness: Evidence from individuals with diabetes, Crohn’s disease and
ulcerative colitis. Neuropsychologia 25, 367–374 (1987).

39. Fullerton, J. N. & Gilroy, D. W. Resolution of inflammation: a new therapeutic

frontier. Nat Rev Drug Discov 15, 551–567 (2016).

40. Antohe, J. l. et al. Diabetes mellitus risk in rheumatoid arthritis: Reduced incidence
with anti–tumor necrosis factor α therapy. Arthritis Care & Research 64, 215–221
(2012).

41. Shah, S. C. & Itzkowitz, S. H. Colorectal Cancer in Inflammatory Bowel Disease:
Mechanisms and Management. Gastroenterology 162, 715-730.e3 (2022).

42. Chatterjee, S. Chapter Two - Oxidative Stress, Inflammation, and Disease. in
Oxidative Stress and Biomaterials (eds. Dziubla, T. & Butterfield, D. A.) 35–58
(Academic Press, 2016). doi:10.1016/B978-0-12-803269-5.00002-4.

43. Idos, G. E. et al. The Prognostic Implications of Tumor Infiltrating Lymphocytes in
Colorectal Cancer: A Systematic Review and Meta-Analysis. Sci Rep 10, 3360
(2020).

44. Rosenblum, H. & Amital, H. Anti-TNF therapy: Safety aspects of taking the risk.

48

Autoimmunity Reviews 10, 563–568 (2011).

45. Murdaca, G. et al. Infection risk associated with anti-TNF-α agents: a review. Expert

Opinion on Drug Safety 14, 571–582 (2015).

46. Kanis, J. A. et al. A Meta-Analysis of Prior Corticosteroid Use and Fracture Risk.

Journal of Bone and Mineral Research 19, 893–899 (2004).

47. Mitra, R. Adverse Effects of Corticosteroids on Bone Metabolism: A Review. PM&R

3, 466–471 (2011).

48. Blackburn, D., Hux, J. & Mamdani, M. Quantification of the Risk of

Corticosteroid-induced Diabetes Mellitus Among the Elderly. Journal of General
Internal Medicine 17, 717–720 (2002).

49. Griffin, M. R. Epidemiology of Nonsteroidal Anti-inflammatory Drug–Associated
Gastrointestinal Injury. The American Journal of Medicine 104, 23S-29S (1998).

50. Griffin, M. R., Yared, A. & Ray, W. A. Nonsteroidal antiinflammatory drugs and acute

renal failure in elderly persons. Am J Epidemiol 151, 488–496 (2000).

51. Marcum, Z. A. & Hanlon, J. T. Recognizing the Risks of Chronic Nonsteroidal

Anti-Inflammatory Drug Use in Older Adults. Ann Longterm Care 18, 24–27 (2010).

52. Rozenberg, I. et al. Histamine H1 receptor promotes atherosclerotic lesion formation
by increasing vascular permeability for low-density lipoproteins. Arterioscler Thromb
Vasc Biol 30, 923–930 (2010).

53. Raveendran, V. V. et al. Chronic Ingestion of H1-Antihistamines Increase

Progression of Atherosclerosis in Apolipoprotein E-/- Mice. PLoS One 9, e102165
(2014).

54. Choi, W. et al. Serotonin signals through a gut-liver axis to regulate hepatic

steatosis. Nat Commun 9, 4824 (2018).

55. Wang, L. et al. Gut-Derived Serotonin Contributes to the Progression of

Non-Alcoholic Steatohepatitis via the Liver HTR2A/PPARγ2 Pathway. Frontiers in
Pharmacology 11, (2020).

56. Hickey, S. L., McKim, A., Mancuso, C. A. & Krishnan, A. A network-based approach
for isolating the chronic inflammation gene signatures underlying complex diseases
towards finding new treatment opportunities. 2022.02.10.479987 Preprint at
https://doi.org/10.1101/2022.02.10.479987 (2022).

49

APPENDIX A2: INFLAMMATION

Figure A2.1: Bar charts displaying the number of gene annotations and predicted

genes (from GenePlexus) of five different sets for chronic inflammation genes. 1)

chronic_inflammation_gene_shot includes all genes from Geneshot for chronic

inflammation. 2) chronic_inflammation_gene_shot_pubs_greater10 includes genes from

Geneshot that are in at least 10 publications. 3)

chronic_inflammatory_response_GO2ALLEGS is the propagated chronic inflammatory

response GOBP genes. 4) inflammatory_response_GO2ALLEGS is the propagated

inflammatory response GOBP genes. 5) inflammatory_response_GO2EGS is the

propagated inflammatory response GOBP genes.

50

Figure A2.2: Showing proportion of clusters of each trait

type with at

least one

CI-enriched cluster for each network-CI-geneset combination.

51

Figure A2.3: For disease gene clusters created from running GenePlexus on the

diseases utilizing the BioGRID network, and then clustered using BioGRID, plotting the

mean CI probability of the genes of CI-enriched clusters (TRUE) with non-CI enriched

clusters (FALSE).

52

Figure A2.4: For disease gene clusters created from running GenePlexus on the

diseases utilizing the ConsensusPathDB network, and then clustered using

ConsensusPathDB, plotting the mean CI probability of the genes of CI-enriched clusters

(TRUE) with non-CI enriched clusters (FALSE).

53

Figure A2.5: For disease gene clusters created from running GenePlexus on the

diseases utilizing the STRING-EXP network, and then clustered using STRING-EXP,

plotting the mean CI probability of the genes of CI-enriched clusters (TRUE) with non-CI

enriched clusters (FALSE).

54

Figure A2.6: For disease gene clusters created from running GenePlexus on the

diseases utilizing the STRING network, and then clustered using STRING, plotting the

mean CI probability of the genes of CI-enriched clusters (TRUE) with non-CI enriched

clusters (FALSE).

55

Figure A2.7: For disease gene clusters created from running GenePlexus on the

diseases utilizing the STRING network, and then clustered using STRING-EXP, plotting

the mean CI probability of the genes of CI-enriched clusters (TRUE) with non-CI

enriched clusters (FALSE).

56

CHAPTER 3: MODGENEPLEXUS: A MODULAR NETWORK-BASED APPROACH

FOR GENE CLASSIFICATION IMPROVES POST-OMICS AND POST-GWAS

HYPOTHESIS GENERATION FOR DISEASE GENES AND MECHANISMS

Abstract

Network-based gene classification is a method to computationally predict associations

of genes to cellular pathways, complex traits, and diseases. Established methods like

label propagation and GenePlexus have demonstrated success in using genome-wide

networks to classify known genes and uncovering novel associations for various

biological datasets. However, directly applying this network-based approach to complex

diseases has proven challenging because, unlike genes related to pathways and

phenotypes, complex disease genes are not localized in gene networks within a shared

neighborhood. This is confirmed by the fact that building a single network-based ML

classifier for the disease as a whole leads to poor gene discovery performance. We

propose a novel network-based ML approach, ModGenePlexus, to address this

challenge. ModGenePlexus works in two stages. First, using semi-supervised learning,

it decomposes the large disease gene list into coherent ‘modules’ that each contain a

subset of original disease genes and some new candidate genes tightly connected to

each other in the underlying network. Second, using supervised learning,

ModGenePlexus trains one ML classifier per disease module to learn network patterns

unique to that module and predict additional novel genes related to that module. We

applied ModGenePlexus to large disease gene lists derived from several

transcriptomics studies and GWASs. Using systematic and rigorous evaluations, we

demonstrate that ModGenePlexus yields improved performance and facilitates a more

nuanced and biologically specific interpretation of the known and novel disease genes.

Additionally, we show that, by using networks to provide biological context,

ModGenePlexus is capable of using even genes with nominal p-values to improve

prediction performance. Lastly, we demonstrate using type-2 diabetes that

ModGenePlexus allows for the enrichment of unique and relevant biology that using the

experiment result as a whole would miss. These findings underscore the advantages of

ModGenePlexus over conventional methods that consider the entire disease gene set

as a single unit, as it further refines the geneset and thereby expands the applicability of

57

network-based ML to post-GWAS or post-omics hypothesis-generation, especially for

answering questions about real world data–elucidating functional, cellular, and

phenotypic convergence of disease genes.

Introduction

Gene classification is the task of computationally predicting associations of genes to

cellular pathways, complex traits, and diseases. In chapter 1 we showed that numerous

methods have been created for the purpose of allowing researchers to take genes and

predict additional genes likely to be associated with the user list1–8. These methods are

possible because of large publicly available data collections, including ontologies of

biological terms and molecular interaction networks. Two methods of note include label

propagation9 methods, which are a semi-supervised approach that propagates known

genes in a network, and supervised learning methods, which characterize positive gene

labels and negative non-related genes by seeing how these two sets of labeled genes

are related within a network. GenePlexus1,10 is a recent network-based supervised

learning approach to gene classification validated through extensive benchmarking1 that

outperforms label propagation for diverse genesets of cellular pathways, diseases, and

traits. It uses a guilt-by-association approach for gene classification – which assumes

that genes strongly connected to each other in a network are likely to be involved in

similar underlying pathways and functions. GenePlexus' purpose is to prioritize genes

for lab study by predicting the association of every gene in the genome/network to an

input gene list. GenePlexus returns a prediction for every gene in the genome that

represents how well a gene is connected to that original input geneset. However,

GenePlexus has not undergone robust evaluation with extremely large genesets, a

property characteristic of many real world -omics studies. Additionally, it has been

demonstrated that disease and complex trait datasets with lower edge densities of

genes within each geneset, such as DisGeNET11–13 and GWAS, demonstrate inferior

performance compared to Gene Ontology14 Biological Process (GOBP) for gene

classification. (Figure 3.1-3). Figure 3.2 additionally shows GenePlexus performance is

correlated with the overall size of the genesets – where larger size inputs give worse

performance. GenePlexus ran using sets from experimental data such as CREEDS

differential expression data shows very poor performance (Figure 3.1-2) relative to

58

Figure 3.1: Ridge plot demonstrating distribution of network edge density of genesets

within each GSC. Function (red) and disease (blue) genesets are more well connected

in meaningful neighborhoods in STRING relative to GWAS/MGI (black) and perform

better with GenePlexus relative to CREEDS differential expression experimental data

(tan).

Figure 3.2: Plotting GenePlexus performance with the number of genes in a geneset.

The different colored dots represent geneset types. Functional genesets

59

Figure 3.2 (cont’d)

(red) are GOBP, curated disease genesets (blue) are DisGeNET, GWAS and MGI

genesets are (black), and CREEDS (tan) are differential expression results for diseases.

Figure 3.3: Plotting GenePlexus performance with the edge density of the genesets

within the STRING network. Performance is negatively correlated with geneset size but

it positively correlates with the edge density. Different geneset sources also have

different performance. Notably, CREEDS datasets are relatively large and have

relatively poor performance even as network edge density increases.

60

Figure 3.4: t-SNE of STRING network node embeddings, with genes of sample

datasets from different sources plotted. Different genesets have different measures of

how well genes are densely mapped to neighborhoods in the network. Functional

datasets (red) tend to have genes that map to neighborhoods better than disease (blue)

or MGI/GWAS (black).

61

Figure 3.5: Ridge plot demonstrating distribution of performance. X-axis of AUPRC

refers to “log2(auPRC/prior)”, with each GSC. Function (red) and disease (blue)

genesets are more well connected in meaningful neighborhoods in STRING relative to

GWAS/MGI (black) and perform better with GenePlexus relative to experimental data

(tan).

other GSCs tested in the original study. Performance of GenePlexus models can be

poor for two primary reasons. The first being that there is incomplete data at both the

network level and the disease/trait-gene annotation levels – which is exacerbated the

more complex a trait is. The second primary concern is that complex disease/trait genes

are often not localized in the same neighborhood. Rather, a complex trait geneset is

made up of multiple gene subsets with varying degrees of localization. Figure 3.4

shows representations of average sized genesets within GO14,15, DisGeNET11,13, MGI16,

and GWAS. GO and DisGeNET have high edge densities (Figure 3.1, 3.3) relative to

MGI and GWAS, and in these examples the representative GO term is highly localized

to one cluster, while the other three genesets have genes scattered across the network

embeddings. In other words, GenePlexus tends to perform better on smaller genesets

than larger ones, and for genesets that localize in the network – like GO – rather than

genesets that are less localized in network neighborhoods like GWAS (Figure 3.3-5).

62

Experimental results add an additional problem that needs to be dealt with, where gene

results will not only have unclear network patterns and understudied genes, but also a

noticeable level of noise – false positives – compared to defined annotations compiled

through multiple sources like GO and DisGeNET. The original GenePlexus application

assumes all user provided genes are biologically meaningful and does nothing to filter

low quality genes to improve model performance.

GenePlexus and disease modularity

As mentioned in chapter 1, the complexity of human diseases, involving hundreds or

thousands of genes, exacerbates incomplete annotations and non-localized genes.

Disease gene lists exhibit modularity17 within biological networks where subsets of

genes collaborate with each other, forming modules that are not uniformly or densely

connected with every true disease gene. This modularity indicates that complex

diseases have multiple and distinct biological processes, phenotypes, and causal

mechanisms that are relevant. Essentially, disease gene lists exhibit modularity within

biological networks and are very large, which means they are not initially suited for

GenePlexus. The additional implication for GenePlexus results is that genes ranked

higher in the prediction list are those connected to many input genes in the initial set.

Conversely, genes connected to only a small subset are likely to have lower

probabilities and may be overlooked by the user. Potentially meaningful processes in a

disease are difficult to discover when considering the disease as a whole due to

statistical power not because they are unimportant, but because the processes have

less true gene annotations relative to other disease processes.

ModGeneplexus: Using GenePlexus with biologically relevant subsets

In this study, we propose ModGenePlexus (Figure 3.6), an extension of GenePlexus

that uses subsets of the initial input geneset for model creation and prediction – rather

than the entire geneset at once. Recognizing that large genesets encompass multiple

distinct biological processes, we aim to enhance results by determining and predicting

on these smaller, biologically meaningful subsets. Specifically, ModGenePlexus initially

clusters genes into modules (termed gene modules) using a functional gene-gene

interaction network and subsequently runs GenePlexus on each discovered gene

module. Each model makes a prediction for every gene whether that gene is associated

63

with the model’s respective gene module. The results of each genes’ predictions across

modules are then aggregated. If the gene has a very high probability in one of the

models, the max probability is used. Otherwise the average probability score across

modules is used for that gene. When discovering gene modules, ModGenePlexus

denoises the initial geneset by excluding genes that lack meaningful network connection

to other disease genes, and propagates the initial geneset by finding new genes that

were not initially labeled as disease genes – but have dense network connections to

disease gene enriched modules. These methods are crucial for datasets that have not

gone through any processing and are expected to have false positives or less confident

results. We validate ModGenePlexus through showing ModGenePlexus’s performance

with large, real-world experimental datasets. First, we ran a simulation where we

created large genesets that contain modules known to be biologically meaningful and

robust to validate utilizing multi-model prediction and aggregation of the results into a

final prediction. Second, we improved neutral/negative gene label classification in

GenePlexus by determining negative genes utilizing gene modules – rather than with all

positive labeled genes at once. Third, we show significant performance improvements in

real-world differential expression data from the CREEDS database, encompassing

manually extracted gene expression profiles from GEO18 for gene/drug knockouts and

disease patients. We also implemented ModGenePlexus for GWAS experiments, ,

using MAGMA gene prioritization results and varying p-value thresholds to show

performance differences. Next, we apply ModGenePlexus to the same MAGMA results

to determine if training models for prioritized genes that meet a less stringent p-value

threshold improves performance when prediction for high-threshold, more confidently

associated genes. Our analysis revealed that employing models incorporating more

genes (less stringent MAGMA thresholds) leads to improved classification of stringent

test genesets. We then visualize and investigate how network edge density and geneset

size statistics affect performance of ModGenePlexus. Finally, we show using

ModGenePlexus with real experimental results for type-2 diabetes highlights unique

enriched processes and biology that using the whole experiment would not recover, and

we further discuss the implications of using ModGenePlexus and modules in general for

interpreting human disease. Our findings demonstrate that module expansion enables

64

more robust gene classification of large-scale -omics genesets and real experimental

results – facilitating the identification of mechanistically important genes in many diverse

contexts.

Defining terms and models used in this chapter

This chapter utilizes multiple different models and notation that has particular meaning

in the ModGenePlexus pipeline. Figure 3.6 shows the entire pipeline for

ModGenePlexus. However, multiple types of models are created for purposes of

comparison and validating ModGenePlexus performance which tests various

components of ModGenePlexus, We additionally create models for the original

GenePlexus implementation. We provide definitions here:

● AllAssign: Creating a single model for all genes originally assigned to the

geneset. This is also referred to as “GenePlexus” because this model is running

GenePlexus normally.

Figure 3.6: A schematic for modGenePlexus. A functional network (STRING) is used to

discover gene modules based on a user input list utilizing the DOMINO software. This is

a semi-supervised technique which will remove genes from the initial user input list -

and discover novel genes in user geneset-enriched clusters. Nodes with a dashed

border are those discovered through semi-supervised learning. GenePlexus trains a

logistic regression model for each individual cluster containing original seed genes and

those found through the label propagation. Supervised learning is conducted on each

65

Figure 3.6 (cont’d)

individual module for genome-wide gene classification. The gene predictions across

clusters are finally aggregated into a final prediction that will typically contain originally

positive labels from each individual cluster, and novel predictions.

● AllClus: Creating a single model trained for all genes assigned to a cluster.

● MaxAvgProb: Multiple models are created, one model for each cluster. Also

referred to as ModGenePlexus.

● noprop: A model where after cluster assignment, additional propagated genes

are not used in training. “noprop” will appear as a prefix to either AllClus or

MaxAvgProb – such as noprop_AllClus or noprop_MaxAvgProb

● domino: A model where after cluster assignment, additional propagated genes

are not used in training. “domino” will appear as a prefix to either AllClus or

MaxAvgProb – such as domino_AllClus or domino_MaxAvgProb.

● If either AllClus or MaxAvgProb models are not labeled with a prefix, it is

utilizing propagated genes (assume a prefix of domino_).

Terms and their definitions can be used interchangeably depending on context. For

example, running a single model for all genes can be referred to as either GenePlexus

or AllAssign.

Methods

Compiling and using the STRING Network

Version 10 of the STRING19 network was used in this project. This is the version of

STRING that also underlies GenePlexus. Specifically, we used a version of STRING

where we only kept edges that have an edge weight greater than 0.70. This was done

to use only confident edges for cluster assignment of genesets. STRING was chosen as

the method for this project over alternatives such as BioGRID20,21 because (i) the

amount of biological annotations integrated in STRING – including protein-protein

interactions, conservation data, sequence homology, etc. – is more comprehensive, and

(ii) STRING performs the best with GenePlexus compared to alternatives and the goal

with ModGenePlexus is to show it outperforms GenePlexus under rigid evaluation.

66

Compiling CREEDS differential expression datasets

Genesets were compiled from CRowd Extracted Expression of Differential Signatures

(CREEDS)22. We utilized manual signatures for single gene perturbations, single drug

perturbations, and diseases. Manual signatures correspond to those where gene

expression profiles from NCBI Gene Expression Omnibus (GEO)18 were obtained

through manual validation rather than through an automated process. Genesets with at

least 100 genes were used, and both genes that were up-regulated and down-regulated

were included in the same positive label set for GenePlexus models. We utilize

CREEDS datasets covering three biological domains: Gene knockout, human disease,

and drug expression profiles, where 972, 311, and 234 genesets respectively were run

with ModGenePlexus.

Compiling MAGMA gene prioritization results

MAGMA23 gene prioritization scores for summary GWAS results were compiled from

GWAS Atlas24. We created three GSCs utilizing these gene predictions based on

thresholds of

𝑝  < 1 × 10

−2
,

𝑝  < 1 × 10

, and

𝑝  < 1 × 10

. Sets with at least 100

−5

−8

genes were used, and the number of sets for each threshold were 671, 32, and 15

respectively. Multiple thresholds were used because for most MAGMA gene scores,

using very strict thresholds will not give very large genesets – which is the primary use

case of ModGenePlexus. While it is common to use very strict thresholds with GWAS

data due to multiple correction, it has been shown that genes with more nominal

significance values can have biological meaning25. Additionally, when analyzing how

ModGenePlexus utilizing looser thresholds can predict strict threshold test genes, we

added two more thresholds of

𝑝  < 5 × 10

−2

and

𝑝  < 1 × 10

. Because there were

−1

only 32 genesets that met size criteria with a threshold of

𝑝  < 1 × 10

, the same 32

−5

GWAS were used across the looser thresholds for this analysis specifically.

Evaluation metric for measuring model performance

In this chapter, we present results of performance across all models with the auPRC

where it is normalized with the prior and the ratio is log-transformed.

67

The metric log2(auPRC/prior) is defined by:

𝑎𝑢𝑃𝑅𝐶  =  

∑  (𝑅𝑒𝑐𝑎𝑙𝑙
𝑛

  −  𝑅𝑒𝑐𝑎𝑙𝑙
𝑛

)𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛

𝑛−1

𝑛

𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛  =  𝑇𝑃/(𝑇𝑃 + 𝐹𝑃)

𝑅𝑒𝑐𝑎𝑙𝑙  =  𝑇𝑃/(𝑇𝑃 + 𝐹𝑁)

𝑝𝑟𝑖𝑜𝑟  = 𝑃

𝑃+𝑁   =  

𝑇𝑃 + 𝐹𝑁
(𝑇𝑃+𝐹𝑁)+(𝐹𝑃+𝑇𝑁)

Where TP is true positives, FN is false negatives, FP is false positives, and TN is true

negatives. P and N are the ground truth labels. The curve is constructed using multiple

thresholds based on the GenePlexus predictions, and at each threshold the true

positives, false positives, true negatives, and false negatives are calculated. These

values are used to calculate precision and recall at each threshold and construct the

precision recall curve. This metric was chosen because gene classification is highly

imbalanced, where there are many more negatives than positives. Methods like auROC

and accuracy are unsuited for this problem26. Additionally, optimizing for this metric

controls for false positives27. This is vital since the most important goal in gene

classification is to make sure the top candidate genes in particular have few errors.

Determining pairs of GOBP far away in a network

For Gene Ontology14 Biological Processes (GOBPs) A and B, we utilize the influence

matrix to determine if genes in GOBP A are far away from genes in GOBP B relative to

each geneset’s distance to random genes in the network. The values of a

diffusion-based representation of the network – an influence matrix – are used as

scores. To determine how far GOBP A is from GOBP B, two z-scores are calculated:

𝐴

𝐵

𝑠𝑢𝑟𝑐𝑒

𝑡𝑎𝑟𝑔𝑒𝑡

 − 𝑚𝑒𝑎𝑛(𝐴

𝑅𝑎𝑛𝑑𝑜𝑚)

𝑠𝑜𝑢𝑟𝑐𝑒
𝑅𝑎𝑛𝑑𝑜𝑚)

𝑠𝑑(𝐴

𝑠𝑜𝑢𝑟𝑐𝑒

𝐵

𝑠𝑢𝑟𝑐𝑒

and

𝐴

𝑡𝑎𝑟𝑔𝑒𝑡

 − 𝑚𝑒𝑎𝑛(𝐵

𝑅𝑎𝑛𝑑𝑜𝑚)

𝑠𝑜𝑢𝑟𝑐𝑒
𝑅𝑎𝑛𝑑𝑜𝑚)

𝑠𝑑(𝐵

𝑠𝑜𝑢𝑟𝑐𝑒

where

𝐴

𝐵

𝑠𝑜𝑢𝑟𝑐𝑒

𝑡𝑎𝑟𝑔𝑒𝑡

refers to influence matrix values for genes in GOBP A where A

genes are the start within the influence matrix, and

𝐵

are the genes of GOBP B

𝑡𝑎𝑟𝑔𝑒𝑡

being targeted as the destination.

𝐵

𝐴

𝑠𝑜𝑢𝑟𝑐𝑒

𝑡𝑎𝑟𝑔𝑒𝑡

refers to influence matrix values for

68

 
genes in GOBP B where genes in B are the start, and

𝐴

are the genes of GOBP A

𝑡𝑎𝑟𝑔𝑒𝑡

being targeted.

𝐴

𝑅𝑎𝑛𝑑𝑜𝑚

𝑠𝑜𝑢𝑟𝑐𝑒

is a list of the mean scores of random sampled genes

being targeted from A, the size of the random geneset being the size of GOBP B.

𝐵

𝑅𝑎𝑛𝑑𝑜𝑚

𝑠𝑜𝑢𝑟𝑐𝑒

is a list of the mean scores of random sampled genes being targeted by

B, the size of the random geneset being the size of GOBP A. This method is utilizing the

central limit theorem to determine if the mean connection of one GOBP to another is

greater than the connection to random genes. Meaning, we take the average of the

mean distances across a large number of randomly sampled genesets, and calculate

z-scores based on these averages. The mean of these two z-scores is then taken. To

interpret the results, the z-scores that are negative are those GOBP genesets less

connected to one another within a network relative to the genesets connections to many

randomly sampled genesets. Simulated traits can contain up to 10 GOBPs if all of the

GOBPs have a negative z-score to one another and have no gene intersections.

GenePlexus was run on every GOBP term and every simulated trait.

Positive gene evaluation in a study-biased holdout split

Evaluations are created using a study-biased holdout. This evaluation was chosen

because it is more stringent and reflective of the real world task of novel gene discovery.

Understudied genes are less likely to have robust network connections compared to

well-studied ones. For a given geneset, genes are put into that trait’s training set if they

are in the top ⅔ of genes mentioned in pubmed abstracts. Genes that are in the bottom

⅓ of pubmed abstract mentions are classified as test genes. All models implemented in

this project use this initial positive test holdout for evaluation.

Negative gene assignment and using negative genes in a study-biased holdout

Negative genes were identified using the algorithm implemented in PyGenePlexus,

which intersects the user-input genelist with a geneset collection to identify genes that,

while not originally part of the user list, are part of sets in the collection that exhibit

significant overlap with the user list. More specifically, after finding the union of all genes

annotated to a gene set collection, the seed genes of the relevant geneset are removed

and negative genes are those not annotated to any geneset that significantly overlaps (

𝑝  < 5 × 10

; one-sided fisher’s exact test) with the given seed genes. To ensure

−2

69

robustness, this geneset collection was narrowed down to only include genes within the

positive training universe, excluding genes that are understudied. The neutrals

discovered through this process were not utilized as negatives for model creation or

testing. For assignment to training and test bins, given that all negative genes are well

studied they were assigned to training and test bins through random assignment. This

assignment maintains the same ratio of negative train genes to negative test genes as

that of the positive train genes to positive test gene ratio for the respective geneset. For

evaluation, the negative test set used is a set of genes that are negative in both

AllAssign and in methods that utilize modules in some way. Genes that are in this

negative test split will never be considered for training in any model.

Determining modules and performing semi-supervised gene classification

We utilized the DOMINO28,29 method for module detection using the STRING network.

Briefly, DOMINO first uses Louvain clustering30 to split a network into initial slices. A

hypergeometric test is run with these slices to determine if there is at least some

enrichment with user input genes

𝐹𝐷𝑅  ≤ 0. 3

. Each enriched slice has a single

smaller, sub-component extracted through an iterative process with the Prize Collecting

Steiner Tree algorithm31. The goal of this algorithm is to find a subnetwork of genes that

maximizes the sum of prizes that nodes in the current iterative subnetwork give while

minimizing the penalties of edges connected to nodes not in the subnetwork. The prizes

of active nodes are obtained using influence propagation – meaning genes well

connected to active nodes will have a higher prize associated with it. Starting with

user-defined genes in the slice as a subnetwork, each iteration adds a new non-user

gene to the subnetwork until

∑𝑣∈𝑇𝑝 (𝑣)-∑𝑒∈𝑇𝑐 (𝑒)

– where v is a node,

𝑝(𝑣) 

is the prize

of a node, e is an edge, and

𝑐(𝑒)

is the cost of edge e – is maximized. These obtained

sub-slices are further refined using the Newman-Girvan algorithm32, where edges are

iteratively removed using the betweenness centrality metric. Each iteration, a modularity

score is computed on the new sub-graph, with the stopping criterion being

𝑙𝑜𝑔 (#𝑜 𝑓𝑛 𝑜 𝑑 𝑒 𝑠𝑖 𝑛𝑠 𝑢 𝑏-𝑠 𝑙 𝑖 𝑐 𝑒)
𝑙𝑜𝑔 (#𝑜 𝑓𝑛 𝑜 𝑑 𝑒 𝑠𝑖 𝑛𝑛 𝑒 𝑡 𝑤 𝑜 𝑟 𝑘)  ≤𝑀.

Lastly, final modules are determined by each sub-slice passing

a stricter hypergeometric test using Bonferroni correction (

𝑞 < 0. 05)

. For

ModGenePlexus, the importance of the DOMINO method is that it is a way to discover

gene modules where genes are related in the network, and contains genes that while

70

not part of the original set, were discovered utilizing user gene labels and network

topology. Similarly, genes that are not part of a significantly enriched module aren’t

used. This is because the genes are not well connected to other disease genes in the

network. These genes are considered false positives in the context of this project, as

there is little reason to think they will contribute to network-based gene classification

and their relationship to the disease is unclear since they aren’t well related to other

disease genes.

Refined neutral selection using cluster information

Individual cluster gene assignments were passed into a previously described algorithm

(see methods, section “Negative gene assignment and using negative genes in a

study-biased holdout”). For a trait, each relevant cluster was intersected with the

geneset collection to discover neutral genes for that cluster – rather than the trait as a

whole. If a gene is considered neutral in at least one cluster, then it is used as a neutral

for all models for each module. Additionally, negative assignments for AllClus models

were obtained using this method. The source code of PyGenePlexus was directly

modified to use this new method to determine neutrals and negative gene assignments.

Types of created models and aggregation of module predictions

We utilized multiple types of splits for evaluation or for running ModGenePlexus and its

variations. The first is creating one model for all genes originally annotated to the

disease. This is running the set with GenePlexus normally – called “AllAssign”. A

second model type is utilizing all genes that were assigned to a cluster in the DOMINO

method. This is “AllClus”, where only one model is created for all modules at once. In a

normal DOMINO run, this will include a subset of the original geneset, and those genes

propagated. ModGenePlexus is when models are created for each obtained module

and the predictions are aggregated. The aggregation is created by finding two types of

genes. If a gene has a prediction where

𝑃𝑟𝑜𝑏 > 0. 80

in any module, the max score

across modules is used in the aggregation. For all other genes, the average module

probability score across modules is the final score. This is the complete

ModGenePlexus method and is also known as “MaxAvgProb”. Lastly, two additional

model ways to filter genesets can be run with AllClus and MaxAvgProb. For models that

have a prefix “domino_”, the propagated genes from DOMINO are used. If it has the

71

prefix “noprop_”, then the only propagated genes included are those that are test genes

of the geneset in question.

Running GenePlexus and ModGenePlexus

GenePlexus and ModGenePlexus was run using a modified version of

PyGenePlexus10,33. In this, we ran all models with the STRING network and with the

adjacency matrix settings. These choices were made due to STRING and the adjacency

matrix having superior performance relative to other networks and to allow for each

gene to be considered a feature within the model – which embeddings would not allow.

Compiling GWAS for strict threshold gene prediction

The strict threshold was chosen to be

𝑝  < 1 × 10

−5

because

𝑝  < 1 × 10

−8

did not

have enough GWAS that had MAGMA predicted geneset sizes of at least 100. The

looser thresholds used were

𝑝  < 1 × 10

−2
,

𝑝  < 5 × 10

, and

𝑝  < 1 × 10

. Only

−2

−1

traits that had at least 100 genes with a threshold of

𝑝  < 1 × 10

−5

were considered,

thus 32 GWAS were used for all thresholds and compared in this analysis.

Calculating edge density

Edge density,

𝐷

, for a geneset,

𝐺

, is given by:

𝐷 =  ∑

 𝑊

 /(|𝑇| * (|𝑇| − 1)/2)

{(𝑢,𝑣) ∈𝐺}

𝑢𝑣

where

𝑊

is the edge weight between genes

𝑢𝑣

𝑢

and . This measures how connected

𝑣

the genes in a geneset are within itself.

GOBP enrichment of genesets

GOBP enrichment was determined using the ClusterProfiler34 software package in R.

The GSC was subsetted to include processes that had at least 10 genes and a

maximum of 500 genes. Enrichment was run for the AllAssign geneset and for each

module of type-2 diabetes.

Results

Module expansion replicates performance in combined GOBP simulated traits

To validate our hypothesis that module expansion gives superior performance to regular

GenePlexus for large genesets, we implemented a simulation where we create large

genesets with known biologically meaningful subsets. An initial difficulty of designing

and validating the proposed methodology for ModGenePlexus is that there is no gold

standard of modules that we can use for diseases to test the method35. As such, we

72

chose to design a method that creates large genesets that have truly biological

meaningful datasets, being modules, underlying them. Using GOBPs, we used a

network to discover pairs of GOBPs that are far away from each other in the context of

a biological network (see methods). Individual GOBPs are treated as clusters of this

disease. GOBPs were additionally combined into simulated traits only if there were no

genes overlapping between them. This assures that there is no consideration of fuzzy

cluster assignment and that the GOBPs should have distinct network signals. We found

that our module expansion method ModGenePlexus recreated the performance of

running GenePlexus normally on the entire geneset at once using the evaluation metric

log2(auPRC/prior) (Figure 3.7-8). We additionally see that models run with the larger,

combined genesets have much lower performance compared to running on each term

individually. Given the equivalent overall performance of our method with the original

geneplexus implementation, we next looked into how different categories of genes,

whether they are positive or negative, ranked across the genome-wide predictions. We

can see that positive test genes of the relevant term tend to have higher ranks than

positive genes of the other combined terms, which themselves have higher rankings

than negative test genes (Figure 3.9). This is evidence that true negative genes will

cluster at the bottom of genome wide rankings, while genes that are possibly relevant

can fall anywhere in the genome ranking. Lastly, we hypothesized with ModGenePlexus

we will more easily discover genes that are highly ranked at a module level, but are a

lower rank in the full disease set. We show that genes are most likely to be ranked

highest in the relevant GOBP term the gene originates from – and the genes that

GenePlexus gives very high predictions to are nearly always higher ranked in the

relevant term – rather than being ranked higher in the irrelevant one (Figure 3.10). This

experiment shows that ModGenePlexus can recreate the performance of GenePlexus,

and that the individual modules are able to rank genes higher relative to a larger set

comprising multiple distinct modules. However, this simulation does not do anything to

remove genes that are potentially false positive, as with GOBP we are assuming these

genesets are quite robust and validated when compared to finding functional modules of

real diseases.

73

Figure 3.7: Simulation results for models built from individual GOBP terms (Term), a

single model built from combining the terms (Com), and module expansion of each term

individually (Module Expansion). Module expansion recreates the performance of

creating one model for the entire combined GOBP list.

74

Figure 3.8: log2(auPRC/prior) of simulated trait GenePlexus performance for the

Combined simulated trait mode (x-axis) and for ModGenePlexus (y-axis).

75

Figure 3.9: The ranks of gene predictions are meaningful across the GenePlexus

ranking. The Y-axis is the genome wide rankings scaled from 0-1, where genes closer to

one have higher ranks in the GenePlexus prediction output. Genes that are positive test

genes (PosTest) – those that have true association – are ranked highly in the genome

wide results. Positive test genes of other GOBPs (OtherPos) but are not negative in a

GOBP have ranks that encompass the entire GenePlexus prediction output. Those that

are negative test genes (NegTest) tend to have low ranks.

76

Figure 3.10: Genes that are highly ranked in GenePlexus results of the term the gene is

associated with will have a higher rank in those predictions than in unrelated terms the

gene is not annotated to. Genes typically have the highest rank in model output

corresponding to the term they are positive labels in. If the gene was not highly

predicted in the term it corresponds to, then the instances of a gene being predicted

higher in an incorrect term increases as the corresponding term rank decreases. X-axis

represents the genes being binned in equal frequency bins based on their ranks in

GenePlexus predictions.

77

Improving negative and neutral selection of GenePlexus label assignment using

module assignments

Using modules offers an additional benefit by letting us improve how the model

determines neutral and negative labels for genes. If diseases are interpreted as a

collection of smaller subsets of related genes, then using the entire disease genelist to

determine neutrals is problematic. GenePlexus determines neutral genes by taking the

positive genes (the user input genelist) and performing a hypergeometric test with

genes in a GSC like DisGeNET or GOBP. If there is significant overlap between the

positive labels and a geneset in this collection, then the genes in that set that are

non-positive become neutral genes, meaning they aren’t used as negative labels when

training the model (see methods). However, these genesets are unlikely to overlap

significantly with the entire disease as a whole, so relatively few genes are made neutral

for large user genelists (Figure 3.11). Given that modules discovered are filled with

biologically related genes, we evaluated whether calculating negatives on a per-module

basis where using neutrals discovered in any module, rather than finding neutrals using

the geneset as a whole, improves performance. For larger traits of combined GOBP,

there is strong correlation between the number of combined terms and the number of

discovered neutrals using our new method (Figure 3.11), but not when determining

neutrals from using the entire simulated trait geneset at once. We modified the

PyGenePlexus source code to utilize our new neutral selection method, and found that

using neutrals found on a per-module basis gives better results for all different models

(Figure 3.12-13). In other words, using other meaningful biological datasets to discover

neutrals, rather than only the positive gene list, improves GenePlexus and

ModGenePlexus performance. Additionally, there is significant improvement in

simulated traits irrespective of how many GOBP terms were combined (Figure 3.13).

This is especially interesting given that the more modules a trait has, the more neutrals

are discovered. This suggests that having a lower number of negatives can improve

results when the GSC and biological data is used at the module level. Since complex

diseases have multiple modules, this method is integrated in PyGenePlexus when doing

neutral selection and when using either ModGenePlexus or any other model that utilizes

module assignments.

78

Figure 3.11: The number of neutrals discovered when performing a hypergeometic test

with the entire disease list at once or on a per module basis. The more meaningful

modules there are within one of the traits, the more neutrals are discovered. Even

though simulated traits with 10 terms are significantly larger than those containing 2, the

number of neutrals across simulated traits does not change significantly.

79

Figure 3.12: Simulation results when using our per-module neutral selection method vs.

GenePlexus’s. For each model – GOBP term, the Combined terms, and Module

expansion of the terms – utilizing neutrals that are obtained using the per-module

method has improvement. The more modules there are, the more drastic the

improvement.

80

Figure 3.13: Wilcoxon test that shows significance of result improvement of using the

per-module neutral selection method for the combined (GenePlexus) and module

expansion (ModGenePlexus) outputs. When at least 2 terms are put into a simulated

trait, the performance increase with using per-module neutral selection is significantly

improved at a threshold of

𝑝  < 1 × 10

 (***)

. This performance increase is true for

−8

both the “Combined” trait (GenePlexus on the whole geneset at once) and for

ModGenePlexus (creating multiple models for each term in the simulated trait).

Module expansion gives significantly improved results with differential

expression data

ModGenePlexus was designed with real-world experimental results in mind. Our first

goal was to be able to show that ModGenePlexus can improve gene classification for

differential expression datasets under stringent conditions. Differential expression

datasets are an ideal use case because they involve hundreds or thousands of gene

81

associations and are noisy36,37. We used CREEDS manually extracted differential

expression datasets for diseases, gene knockouts, and drug signatures. After filtering

for geneset sizes and experiment number, we ended up with 311, 972, and 234

genesets, respectively, suitable for creating study-biased holdout splits (see methods).

These traits were chosen based on having a large initial input size of at least 100 genes

– as we want to have traits that capture the etiology of complex diseases – and having

at least 10 genes that are understudied in literature to use as positive test genes in the

evaluation. For CREEDS disease datasets, ModGenePlexus outperformed a single

model for all genes – AllAssign – with the log2(auPRC/prior) evaluation statistic (Figure

3.14). Similar results were seen for the CREEDS gene knockout and drug profile

datasets (Figure A3.1). Next, we created genesets for model AllClus, which creates a

single model for all genes that were assigned in a cluster using the DOMINO module

discovery algorithm. This geneset includes genes that were part of the initial experiment

set that were assigned to a cluster, and propagated genes discovered in the network.

ModGenePlexus also does better than this AllClus method for multiple datasets (Figure

3.15). AllClus is a useful metric because these genesets have the false positive genes

removed, so a comparison to ModGenePlexus shows that creating multiple models for

each module gives an additional significant benefit to the gene classification results. We

also tested models where we remove all propagated genes that were not part of the

disease’s test set to determine whether the improved performance came from simply

propagating test genes in the semi-supervised DOMINO module identification, or if the

novel propagated genes and supervised learning in GenePlexus add anything to

performance. Using all the propagated genes – not just discovered disease test genes –

improves performances for the CREEDS disease dataset and the CREEDS drug and

gene datasets (Figures 3.16). Lastly, we visualize results for an AllClus version of the

non-propagated cluster assignments displayed in Figure 3.17, displaying all discussed

model types at once. These results show that ModGenePlexus – where models are

created for each module and uses the propagated genes found through

semi-supervised learning – performed best. In other words, using all aspects

ModGenePlexus is better than only parts of the method. We ran the Wilcoxon

signed-rank test on the log2(auPRC/prior) results for each disease to test if

82

ModGenePlexus results were significantly better than other simpler models. For the

CREEDS disease, gene, and drug datasets, and for MAGMA gene predictions of

GWASs with prediction threshold

𝑝  < 1 × 10

−2

,ModGenePlexus performs better than

AllAssign in both its propagated and non-propagated forms (Figure 3.18). Additionally,

propagated and non-propagated AllClus methods perform better for the three CREEDS

datasets but not on any of the MAGMA results compared to AllAssign. The performance

of ModGenePlexus is better than the propagated version of AllClus for all three

CREEDS datasets and for MAGMA gene predictions of GWASs with prediction

threshold

𝑝  < 1 × 10

−2

(Figure 3.19). ModGenePlexus and the propagated version of

AllClus is significantly better for the CREEDS datasets compared to the non-propagated

modGenePlexus (Figure 3.20). Figure 3.21-22, shows the correlation of results with the

number of understudied genes in the CREEDS disease dataset. For Figure 3.21,

AllAssign performance is correlated with test geneset size, in line with observations

from the original GenePlexus paper, and in Figure 3.22 we show there is negative

correlation with the number of test genes and performance for all model types, where

traits with more test genes have worse performance. Notably, the disparity between

AllAssign and ModGenePlexus methods performance gets more notable for larger test

set sizes, indicating that while ModGenePlexus still performs relatively worse with larger

datasets, it does better than original GenePlexus. Overall, we demonstrate that

ModGenePlexus is beneficial for most experimental results through each of its important

additions. ModGenePlexus and AllClus perform well, indicating that removing genes

with minimal network connections and adding new genes from propagation significantly

improve results. Adding new genes from the propagation improves the supervised

learning predictions. Lastly, creating models for each module rather than a single

module at once improves overall performance of genome-wide gene rankings even

when having to do an additional step of aggregating the predictions. Notably however,

the GWAS results where genes were assigned using p-value thresholds had mixed

results, raising questions about if modules can be used to improve GWAS gene

prioritization and classification. Figure A3.1-4 contains results showing model

performance for each CREEDS and MAGMA geneset collection.

83

 
Figure 3.14: Comparing GenePlexus and ModGenePlexus performance. GenePlexus

(blue) is creating one model for all genes in the experimental result. ModGenePlexus is

creating

84

Figure 3.14 (cont’d)

multiple models for each module discovered through DOMINO and aggregating gene

predictions across each model.

Figure 3.15: Comparing performance between GenePlexus, ModGenePlexus, and

AllClus. AllClus (green) creates a single model for all genes that were assigned to a

disease-gene enriched cluster.

85

Figure 3.16: Comparing model performance of AllAssign, ModGenePlexus, and a

version of ModGenePlexus that does not use additional propagated genes from

DOMINO (dark purple).

86

Figure 3.17: Comparing model performance with the inclusion of AllClus for both

propagated (light green) and non-propagated (dark green) genelists.

87

Figure 3.18: Wilcox test demonstrating if results of models are significantly better than

GenePlexus (AllAssign) models for all genesets. The cells are annotated if the model is

significantly better to AllAssign for the row’s geneset. One star corresponds to a

significance threshold of

𝑝  < 1 × 10

, two stars a threshold of

𝑝  < 1 × 10

, and

−3

−5

three stars a threshold of

𝑝  < 1 × 10

−8
.

88

Figure 3.19: Wilcox test demonstrating if results of models are significantly better than

the propagated version of AllClus models for all genesets. The cells are annotated if the

model is significantly better to domino_AllClus for the row’s geneset. One star

corresponds to a significance threshold of

𝑝  < 1 × 10

, two stars a threshold of

−3

𝑝  < 1 × 10

, and three stars a threshold of

−5

𝑝  < 1 × 10

−8
.

89

Figure 3.20: Wilcox test demonstrating if results of models are significantly better than

the non-propagated version of modGenePlexus models for all genesets. The cells are

annotated if the model is significantly better to noprop_MaxAvgProb for the row’s

geneset. One star corresponds to a significance threshold of

𝑝  < 1 × 10

, two stars a

−3

threshold of

𝑝  < 1 × 10

, and three stars a threshold of

−5

𝑝  < 1 × 10

−8
.

90

Figure 3.21: GenePlexus performance for CREEDS disease datasets is correlated with

size. Performance of GenePlexus for AllAssign is negatively correlated with the number

of understudied genes that were part of the experimental set.

91

Figure 3.22: Comparing model performance of each model type with the number of

understudied genes for CREEDS Disease experiments. Module expansion methods

show more improvement when there are a larger number of test genes. The annotation

“n=X” refers to the number of CREEDS disease genesets in the bin.AllAssign (blue),

AllClus (green) where light green is the propagated version, and ModGenePlexus

(purple) where light purple is using propagated genes.

ModGenePlexus allows for better prediction of stringent GWAS data

The previous results show that ModGenePlexus gives superior performance for

CREEDS datasets, but the results for GWAS data were more mixed. They were better

when choosing genes from MAGMA gene prioritization results using a threshold of

𝑝  < 1 × 10

, but not the stricter thresholds. GWAS results are challenging to work

−2

with due to needing correction for multiple hypothesis testing at both the summary

statistic level and with gene prioritization algorithms like MAGMA. Notably, one of the

strict thresholds for MAGMA genesets of

𝑝  < 1 × 10

−5

performed well in GenePlexus

– outperforming the looser threshold and the CREEDs datasets (Figure 3.23). The

log2(auPRC/prior) of the MAGMA methods are often above 1.0 for thresholds

𝑝  < 1 × 10

−2

and

𝑝  < 1 × 10

, indicating fairly good model performance, whereas

−5

92

CREEDS data models perform worse than random prediction using GenePlexus.

Crucially, ModGenePlexus gives improved performance for the looser threshold of

𝑝  < 1 × 10

−2

(Figure 3.18) for MAGMA genesets. This geneset collection actually had

the most significant improvement over the AllClus models (Figure 3.19). We

hypothesized that discovering clusters with only the stringent threshold worsened the

quality of the input data for downstream supervised learning. Stringent thresholding is

done to remove false positive results, but has the disadvantage of increasing the

number of false negatives. However, the semi-supervised module discovery can be

used instead to remove bad hits and false positives, where biological data is used to

determine relevance rather than a naive p-value decision threshold. We tested if

utilizing genes that passed looser thresholds in the MAGMA results could be used in the

ModGenePlexus pipeline to improve predictions of test genes in a more stringent set. In

other words, could utilizing genes that meet a threshold of

𝑝  < 1 × 10

−2

help classify

understudied genes with

𝑝  < 1 × 10

. A

−5

𝑝  < 1 × 10

−5

was chosen rather than a

stricter

𝑝  < 1 × 10

−8

due to limitations in the number of available genesets, as very

few MAGMA results had hundreds of genes predicted using the strictest threshold (see

methods). These models were created and performance was evaluated for all GWAS

that met geneset size requirements(Figure 3.24), and we visualize those GWAS traits

which had notably poor performance in AllAssign models - defined as having

log2(auPRC/prior) of below 2.0 (Figure 3.25). Figure 3.24-25 are visualized separately

due to the observation that some of the GWAS did genuinely well with GenePlexus –

and we wanted to see if ModGenePlexus is helpful for those datasets that did poorly.

There was no significant performance increase for all GWAS, but poorly performing

GWAS showed significant improvement (Figure 3.26). Interestingly, both

ModGenePlexus utilizing the

𝑝  < 1 × 10

−2

MAGMA genes and using an AllClus

version of this model gives significantly improved performance. In Figure 3.27, we

added additional looser thresholds of

𝑝  < 5 × 10

−2

and

𝑝  < 1 × 10

−1

to see their

impact on ModGenePlexus performance and if ModGenePlexus continues to improve

classification at looser thresholds. We see that while the ModGenePlexus performance

increase does indeed level off, the performance still does increase with these very loose

93

thresholds. In contrast, AllAssign models performed worse with looser thresholds. Plots

including the GWAS that had an log2(auPRC/prior) > 2 originally are in Figure 3.28.

These results demonstrate that using networks, propagation, and module discovery

provides biological context to low-powered GWAS results, allowing recovery of

previously defined false negatives for novel discovery of genes and can be a tool for

post-GWAS analysis. Notably, this analysis provides further evidence that

ModGenePlexus is most beneficial for genesets that perform notably poorly in normal

GenePlexus.

94

Figure 3.23: Comparing GenePlexus results of each geneset. The MAGMA genesets

obtained from using thresholds of

𝑝  < 1 × 10

−2

and

𝑝  < 1 × 10

−2

perform notably

better than CREEDS datasets. The strictest threshold also performs notably poorly in

GenePlexus.

95

Figure 3.24: Performance of using models trained on nominal threshold genes when

evaluated on stringent test sets for all GWAS. The GWAS here include those that

perform well in GenePlexus with an log2(auPRC) > 2.0.

96

Figure 3.25: Performance of using models trained on nominal threshold genes when

evaluated on stringent test sets for all GWAS log2(auPRC/prior) < 2. ModGenePlexus

shows significant improvement for genesets that performed poorly in GenePlexus.

97

Figure 3.26: Wilcox test demonstrating if results of models are significantly better than

the non-propagated version of modGenePlexus models for all genesets. The cells are

annotated if the model is significantly better to domino_AllClus for the row’s geneset.

One star corresponds to a significance threshold of

𝑝  < 1 × 10

, two stars a

−3

threshold of

𝑝  < 1 × 10

, and three stars a threshold of

−5

𝑝  < 1 × 10

−8
.

98

Figure 3.27: Model performance for models trained on various loose thresholds on the

stringent test set for GWAS where the log2(auPRC < 2). In addition to

99

Figure 3.27 (cont’d)

training on

𝑝  < 1 × 10

, additional models were trained using thresholds of

−2

𝑝  < 5 × 10

−2

and

𝑝  < 1 × 10

. For GenePlexus, using additional genes discovered

−1

in looser thresholds makes performance worse, while ModGenePlexus performs better

when including these genes.

Figure 3.28: Model performance for models trained on various loose thresholds on the

stringent test set for all GWAS. In addition to training on

𝑝  < 1 × 10

, additional

−2

models were trained using thresholds of

𝑝  < 5 × 10

−2

and

𝑝  < 1 × 10

−1
.

100

Figure 3.28 (cont’d)

For GWAS that performed well in GenePlexus, there is less of a relationship between

adding more genes or not.

Network edge density of genes shows net benefit of module discovery and

propagation before supervised learning

In Figure 3.3, we showed that GenePlexus performance is correlated with edge density

of the input geneset. A motivation for using modules for gene classification is to

leverage smaller subsets of biologically significant genes localized to network

neighborhoods, which means that genes in modules are more densely connected

relative to the disease as a whole. Figure 3.29 shows that for complex diseases from

the CREEDS disease set, the average module edge density is higher for the larger

diseases. This suggests the bigger the geneset, the better module discovery is at

finding dense subsets of meaningful genes. For each model type, we calculated the

network edge density (see methods) of the genesets used as inputs for the respective

supervised learning models. For methods that use multiple cluster models, edge density

is represented by either the average or max edge density across the trait modules

depending on the figure. Figures 3.30-32 shows that non-propagated clusters have

higher average edge densities than propagated ones. Similarly, the non-propagated

AllClus has higher edge density than the propagated AllClus (Figure 3.33). AllAssign,

as expected, has the lowest edge density. This means that discovering modules does

indeed modify the genesets to be more dense for GenePlexus input. Figure 3.34 shows

a possible cause of this ranking of edge densities across the genesets – which is that

the noprop genesets are notably smaller than the propagated ones. Because DOMINO

propagates genes without the goal of maximizing edge density (see methods), the

additional genes have meaning in terms of a modularity statistic and for other network

properties, but this lowers the overall cluster densities. Despite the lower densities,

using propagated genes in clusters give superior results for ModGenePlexus (Figure

3.35). We plotted the correlation of average edge density across clusters for propagated

and non-propagated cluster genesets with performance (Figure 3.36). There is no

correlation between performance and average cluster edge density for propagated

(Figure 3.38, Pearson Correlation; -0.03), or non-propagated genesets (Figure 3.39,

101

Pearson Correlation; -0.09), even though a positive relationship between edge density

and performance exists for the experimental disease differential expression results with

the whole CREEDS disease geneset considered (Figure 3.4). Using ModGenePlexus

with smaller and denser clusters yields superior results compared to GenePlexus, but

traits with more dense clusters tend to be bigger, and (Mod)GenePlexus performance

has negative correlation with overall trait size (Figures 3.29, 3.34-35). This could

explain the lack of correlation between edge density of clusters with performance, as

the bigger traits lead to more dense inputs in ModGenePlexus. This could also explain

why ModGenePlexus/GenePlexus performance disparity gets more pronounced for

larger inputs.

We further tested if there is a relationship with the max cluster density for each disease.

Figure 3.37 shows the relationship between max cluster density of CREEDS diseases

with the total set size for propagated and non-propagated clusters, with similar results to

the average cluster edge density.. Correlations for propagated and non-propagated

genesets’ max density values with performance are in Figures 3.38-39 (Pearson; -0.13

and -0.10, respectively). We see that like with the average density across clusters, there

is little/no correlation with performance and the max cluster density for either geneset

type. Overall we demonstrated that geneset size impacts how dense the average

discovered module will be, explaining the more significant improvement of

ModGenePlexus for large traits (Figure 3.22) relative to normal GenePlexus. While

edge density is no longer a directly good predictor of overall performance, it helps

explain when ModGenePlexus improves relative to GenePlexus for large-scale data that

has hundreds or thousands of gene annotations.

102

Figure 3.29: Plotting the relationship of trait size and the average edge density of

obtained clusters with DOMINO. There is a positive correlation between size and

average cluster edge densities.

103

Figure 3.30: Histogram of edge densities for genesets used in models. AllAssign have

the smallest edge densities. For the genesets domino_ and noprop_, the average

cluster edge density of each geneset is plotted.

104

Figure 3.31: Ridge plot of densities across genesets. For the genesets domino_ and

noprop_, the average cluster edge density of each geneset is plotted.

105

Figure 3.32: Contrasting the average cluster edge density of propagated and

non-propagated clusters. For most traits, the average edge density of discovered

clusters is higher in the non-propagated versions.

106

Figure 3.33: Contrasting the propagated and non-propagated versions of AllClus. For

most traits, the non-propagated version has a higher edge density.

107

Figure 3.34: Ridge plot of size distributions for each geneset, subsetted to only include

in AllAssign < 1,000 genes for readability. The order of the ridge plots is based on edge

density distribution.

108

Figure 3.35: Ridge plot of model performance for each geneset, order of the ridge plots

is based on edge density distribution.

109

Figure 3.36: Relationship between the average cluster edge density and performance

for propagated (domino) and non-propagated (noprop). There is no correlated between

the average cluster density and performance for either geneset.

110

Figure 3.37: Relationship between the max cluster edge density and performance for

propagated (domino) and non-propagated (noprop). There is no correlated between the

max cluster density and performance for either geneset.

111

Figure 3.38: Correlation matrix of performance and average/max cluster edge density

statistics for propagated genesets.

112

Figure 3.39: Correlation matrix of performance and average/max cluster edge density

statistics for non-propagated genesets.

Modules contain unique and specific GOBP information

To understand how ModGenePlexus can allow biologists to better understand their gene

list of interest, we performed a case study for type 2 diabetes. This is a list of

differentially expressed genes from the CREEDS database and contains 3,837 genes.

The gene set was run in the ModGenePlexus pipeline which removed 2,584 genes and

resulted in 14 clusters, ranging in size from 54 to 152 genes (Figure 3.40) after

DOMINO propagation. We additionally ran GenePlexus on the full set in AllAssign. Both

AllAssign and the cluster gene lists were passed into the enrichment software in R

package ClusterProfiler34 (see methods). We ran enrichment with GOBPs, looking at

terms with at least 10 genes and a max size of 500 genes.

113

A key feature of ModGenePlexus is that in addition to improving gene classification on

large, noisy gene sets, the gene level predictions are attached to a specific cluster

before the aggregation step. Doing enrichment at the cluster level can give a more

refined view of the biology behind the initial experimental gene set. To determine if more

specific or more general biology is being discovered, we looked at the number of genes

annotated to the GOBPs that are enriched. This assumption is based on the ontology

being propagated up the ontology, where more general ontology go terms contain the

gene annotations of any term below it in the ontology. Aside from cluster 10, AllAssign

enriches for larger GO terms, which are substantially larger when compared to most of

the clusters (Figure 3.41). Additionally, since both the enriched GOBPs and the cluster

sizes are smaller, the statistical power of the enrichment method is increased and gives

more significant q-values than AllAssign (Figure 3.42). For AllAssign we showed in

Figure 3.43 that the percentage of the AllAssign gene list inside of enriched GOBPs is

quite small in a given GOBP. For this analysis, we see that cluster 10 has a large

percentage of its genes in the enriched GOBP terms, which also explains why cluster

10 enriches for even larger genesets tha AllAssign in Figure 3.41.

One concern with ModGenePlexus is that it can drop many genes during the DOMINO

module discovery and propagation process. We demonstrate in this chapter that this

step does improve gene classification results and justify that the genes being lost have

poor network connections, but losing this many genes could mean a lot of potentially

important biology is being lost. To demonstrate if this concern is valid, we counted how

many enriched GOBPs are found in AllAssign and across the clusters (Figure 3.44).

For AllAssign, the number of GOBPs enriched in the gene list were simply counted as

is. For ModGenePlexus, we collected the enriched GOBPs found across all clusters and

counted the number of unique terms. Interestingly, the number of terms enriched in

ModGenePlexus is greater than for AllAssign, 331 to 222, respectively. In addition there

is an overlap between the two sets of 79, which is around 35% of the AllAssign enriched

terms. We then tested if the enriched terms in both methods were capturing biology at

different scales. This was done by compiling the set sizes of the enriched terms unique

to both AllAssign and ModGenePlexus, as well as the terms enriched in both (Figure

3.45). We see that the largest GOBPs on average are found by both methods. This is

114

followed closely by the terms found only by AllAssign, and the terms found by

ModGenePlexus were substantially smaller. This suggests that ModGenePlexus finds

more specific terms and biology. In sum, this demonstrates that ModGenePlexus can

allow researchers to understand their geneset of interest in a more nuanced way,

allowing for further studies at the molecular level.

Specific Biology of Type-2 Diabetes is unraveled with ModGenePlexus

The analyses done indicate that for general metrics of enrichment results that

ModGenePlexus may implicate more specific biology for diseases. To provide further

evidence, we take the top 20 enriched terms for AllAssign of type-2 diabetes (Figure

3.46). There were 222 total enriched terms in AllAssign, and in the top 20 there are 2

relatively specific terms in muscle filament sliding and actin-myosin filament sliding. The

other terms are quite general, including neutrophil activation, muscle system process,

and response to oxygen levels. Looking at two clusters, 5 and 3, we see that they are

enriched for smaller GOBP and see the more specific biology that is recovered. In

cluster 5, 19 total terms were enriched with only 2 also being enriched in AllAssign,

where these 2 are the largest terms in cluster 5 (Figure 3.47). It is enriched for cellular

processes that relate to copper ions47–50 and the mitochondrial membrane51–54, and both

have been enriched for type-2 diabetes. In cluster 3, there were a total of 43 enriched

terms of which 4 are also enriched in AllAssign and are very general (proton

transmembrane transport, ATP metabolic processes, macroautophagy, and

mitochondrial transport). Cluster 3 has biology known to be related to type-2 diabetes

(Figure 3.48). Response to insulin38–41 and iron/transferring processes42–45 are well

documented to be associated with diabetes. Interestingly, many enriched terms are also

related to cytidine processes. A literature search only found one recent study46 that

showed using whole blood metabolomics that cytidine could be a marker for type-2

diabetes, suggesting that the cytidine process in type-2 diabetes might be important yet

understudied.

In cluster 10, we saw enrichment for larger genesets, and it recovers a number of

processes also seen in AllAssign (Figure 3.49). Six out of eleven terms enriched in

cluster 10 are enriched in AllAssign. However, the terms are generally more enriched in

115

cluster 10, an example being translational initiation, which has a q-value of 1.13e-30 in

AllAssign, but a much lower value of 1.27e-99 in cluster 10.

Figure 3.40: The number of genes in each propagated domino cluster and in AllAssign.

These propagated assignments were used for enrichment analysis as inputs.

116

Figure 3.41: The distribution of enriched GOBP term sizes for each cluster and

AllAssign.

117

Figure 3.42: The -log10(q-value) for enriched GOBP in each cluster and AllAssign.

118

Figure 3.43: Distribution of the percent of the input gene list that is in enriched GOBPs

for each cluster and AllAssign.

119

Figure 3.44: Counting the GOBPs that are enriched in the clusters and AllAssign, and

the overlap between them. UniqueDomino refers to unique GOBPS enriched across all

propagated clusters.

120

Figure 3.45: The number of genes annotated to each enriched GOBP in

ModGenePlexus, AllAssign, and in both methods (intersect).

121

Figure 3.46: Top 20 enriched GOBP terms in AllAssign. The total number of enriched

terms is 222, but we visualize only the top 20 here. Significance is shown in q-value,

Count is the intersection of the input gene list and GOBP, and TermSize is the number

of GOBP-gene annotations. The total number of genes in type-2 diabetes for AllAssign

is 3,837.

122

Figure 3.47: Top 20 enriched GOBP terms in Cluster 5. Significance of the term in

cluster 5 is shown in q-value, Count is the intersection of the input gene list and GOBP,

and TermSize is the number of GOBP-gene annotations. AAQvalues is the q-value in

AllAssign, where if the cell is blank then the term was not enriched in AllAssign. 98

genes were used as input for enrichment.

123

Figure 3.48: Top 20 enriched GOBP terms in Cluster 3. Significance of the term in

cluster 3 is shown in q-value, Count is the intersection of the input gene list and GOBP,

and TermSize is the number of GOBP-gene annotations. AAQvalues is the q-value in

AllAssign, where if the cell is blank then the term was not enriched in AllAssign. 101

genes were used as input for enrichment.

124

Figure 3.49: Top 20 enriched GOBP terms in Cluster 10. Significance of the term in

cluster 10 is shown in q-value, Count is the intersection of the input gene list and GOBP,

and TermSize is the number of GOBP-gene annotations. AAQvalues is the q-value in

AllAssign, where if the cell is blank then the term was not enriched in AllAssign. 80

genes were used as input for enrichment.

Discussion

Human diseases can be very complex and could have hundreds of true gene

annotations, making it common for experiments like GWAS and differential expression

to implicate numerous loci or genes. The goal is to interpret these results biologically,

explaining how genes or loci work, relate with one another, or relate to a disease. A

common pipeline is to take experimental results and visualize them in a network using

software. A typical pipeline involves visualizing experimental results in a network using

software, identifying modules within the network, and performing enrichment analysis

for biological processes or phenotypes. Researchers then select interesting modules for

further study. This standard pipeline raises questions about module definition and

quality, which are determined through the relevance and quality of genes implicated in

the initial experiment. Our project aimed to improve input in standard computational and

bioinformatics methods by improving and expanding GenePlexus in a general-purpose

framework that works across multiple experiments for diverse diseases and traits..

125

GenePlexus is a method aimed to recover false negatives and add genes to genesets

that relate to biological entities like diseases, but does nothing on its own to remove

false positive genes. ModGenePlexus is a powerful method because it recognizes that

not only are genesets incomplete, but that genesets are filled with either false positives,

noise, or genes that are so understudied as to not have confident association to many

other genes in the genome. We integrated multiple well validated methods – DOMINO,

propagation, GenePlexus/supervised learning, and genome-wide networks – for

geneset refinement, keeping genes related within external biological data and networks.

ModGenePlexus and using gene modules improves gene classification in multiple

unique and distinct ways. For example, training on loose threshold MAGMA genes to

predict highly significant but understudied genes gave good results with

ModGenePlexus, but very poor results with GenePlexus. Rather than using a stringent

threshold to decide what GWAS loci or genes to use, using networks to give biological

context to determine genes with biological meaning is a powerful tool to predict and

classify genes significant and thus more likely to be mechanistically relevant.

Working around gold standard limitations and module definitions

An open question is how to define modules in a robust way. As there is no good gold

standard that can be used to confidently define modules for diseases of interest, our

goal was not to determine final gene modules underlying a disease, but show that if

modules are defined simply as a subset of biologically related genes, they can be used

to make better predictions for the disease disease holistically. We demonstrated

extensively that gene classification is improved for a variety of datasets in

ModGenePlexus under different conditions. Additionally, running GenePlexus for

individual modules also uncovers unique, specific processes not found when enriching

the whole disease. We argue that the importance of these enrichments is not the

module is a final product in itself, but that the modules can be used as tools to gain

insight into novel biological insights for disease.

The simulated traits shows ModGenePlexus matches GenePlexus performance

In this project, we began by creating a study where if GOBPs are treated as modules,

and if we combine distinct GOBPs into simulated traits, then ModGenePlexus would

recreate performance of running GenePlexus on the large geneset (Figure 3.7). This

126

result was initially surprising as we expected that if a trait had multiple distinct modules,

that this would substantially hurt GenePlexus performance, and that ModGenePlexus

would show noticeable improvement. However, GenePlexus proved to be a robust

method. The highly multidimensional nature of the network allows the linear regression

model used by GenePlexus to consider boundaries effectively even when there are

multiple distinct modules in the network. Therefore, we applied module information to

improve performance for particular genesets or in other aspects of the pipeline, and we

presented many ways where ModGenePlexus is a general improvement for large-scale

experimental genesets and gives meaningful and novel biological associations..

Using module assignments improves negative gene discovery

The robustness of GenePlexus motivated us to consider other ways to use modules to

improve model performance by improving the neutral and negative gene label selection

process. Interpreting large complex diseases as collections of meaningful gene subsets,

rather than one large set where all genes are equal, raises an issue of doing geneset

overlap with other meaningful sets. In a network, possible disease genes are not

necessarily connected to every single meaningful subset within a disease. Genes that

are connected to a single module in a disease, but are not connected in any significant

way to other modules, could be given a negative label because from the perspective the

“disease as a whole” (all disease genes), it does not have dense connections to most of

them and is classified as a negative. We argue that if this gene is well connected to a

module of the disease, that negative label is inappropriate, and using modules to

determine neutral labels prevents this mislabeling. This method increases runtime, but

Figure 3.12 shows that it dramatically improves results of the simulation for all model

types. Notably, the more modules a simulated trait had, the better the new method

performed, which is relevant since real-world diseases have many underlying modules.

This approach enhances GenePlexus by increasing the statistical power of

hypergeometric tests to refine negative label determination to better leverage the

biological data in GO and DisGeNET.

ModGenePlexus allows for better gene classification of experimental data

The discussed simulation has an advantage in that the GOBP “modules” underneath

the traits do not have many, if any, false positives that need to be removed. This

127

motivated our focus on experimental datasets, which were not used in GenePlexus due

to their large size and known false positives. GenePlexus genesets were originally

chosen for their good annotation, in sharp contrast to the realities of initial experimental

results. We chose DOMINOfor module discovery not just because it is a good and

benchmarked module discovery method, but because it finds more genes using network

connections and removes genes with minimal network relationships. Using

ModGenePlexus, we show that the ability to remove “bad” genes significantly improves

GenePlexus. Figure 3.19 shows that only the propagated version ModGenePlexus

outperforms propagated AllClus – meaning AllClus is also a direct improvement over

GenePlexus. Figure 3.20 shows that using the recovered false negative genes through

propagation improves GenePlexus results further. Both removing false positives and

recovering some false negatives heavily improves GenePlexus results and both

additions are essential to have consistently better results across geneset collections of

experimental data. The more high quality gene annotations, the better GenePlexus can

do in classifying genes. We implemented and combined multiple computational

methods to refine genesets that outperforms GenePlexus in multiple ways.

Training on nominally associated genes better predicts more significant genes

As mentioned previously, the mixed results for ModGenePlexus improving GWAS

motivated us to train on looser threshold genes for stringent data. We saw that

ModGenePlexus improves the loose threshold genesets, and we show in Figure 3.29

that bigger traits are likely to have denser clusters discovered. Training on only highly

stringent results may mean we get poor clusters, which would explain why

ModGenePlexus does poorly. In using the looser threshold genes, we see noticeable

improvement in those GWAS that performed poorly in GenePlexus (Figure 3.24-25).

Ultimately, ModGenePlexus can recreate performance when GenePlexus does well, but

improves GenePlexus when it fails. Interestingly, loosening the threshold to an

extremely loose degree continues to improve ModGenePlexus results but worsens them

for GenePlexus. The improvement of ModGenePlexus has a trade-off – it better

recovers truly relevant genes, but given the large number of genes added it is quite

likely that the modules become quite big and possibly broad in terms of biological

interpretation. This raises the question of the utility of big modules versus small

128

modules. Small modules are more useful in terms of looking at enrichment results, but

we argue that this is evidence for big modules being useful for computational methods.

The goal should be to use modules to improve disease interpretation, as such we argue

that using large genesets from GWAS analysis can be useful depending on the goal of

the post-GWAS computational methods being employed. We provide evidence that for

gene classification of mechanistically important genes, large modules are a useful tool

when they are coherent in the network.

Edge densities of modules and performance goes against expectations of the

original GenePlexus study

When analyzing the edge densities of the clusters and inputs used for ModGenePlexus,

it was initially surprising to see that the propagated clusters are less dense than

non-propagated ones. There are multiple reasons why this could be happening. One

explanation is that DOMINO is not using edge density to build its modules, but is

making sure the modularity of the subgraph is higher than the ratio of genes in the

subgraph relative to the entire network. Modularity is associated with edge density, but

genesets can be modular and non-densely connected in the network. The question

arises why utilizing these propagated genes significantly improves results of

ModGenePlexus. A likely reason is because the propagated genes have network

relationships to the understudied, held out genes of the geneset. Understudied genes

are those that are less likely to have robust or ubiquitous network connections, but

should in actuality be connected to at least some of the disease genes. They are not

connected, however, due to incomplete information about either the gene or within the

network itself. This reasoning is likely valid because the very reason the understudied

genes are discovered in the first place is because in some way they are

disease-associated, and to be disease-associated they must work together with other

disease genes. In other words, using DOMINO to “fill in” modules with other

non-disease genes with direct network connection may make the module less dense,

but utilizing less dense network connections is what better recovers understudied genes

that are also not densely connected to the gene modules. The relationship between

edge density and geneset size is also interesting, particularly where larger genesets

tend to have higher cluster densities on average. Finding these higher density clusters

129

was a major motivation for this project, and we can see that the biggest genesets are

the ones with the most drastic improvements in ModGenePlexus. These results and the

original evaluations in GenePlexus show that while edge density is a useful statistic to

improve GenePlexus results, other genes in the network are truly relevant and

additional methods such as ModGenePlexus are needed to find genes in spite of their

less dense connections to most disease genes.

ModGenePlexus finds unique GOBP predictions compared to GenePlexus

Lastly, we show off for a specific disease unique gene predictions and enrichment

results. Doing enrichment at a module level allows for increased statistical power for

common computational enrichment methods. Large, whole diseases are composed of

numerous processes and phenotypes. Those processes that are most likely to be at the

top of enrichment results are broad, ubiquitous processes. This is because they often

involve large numbers of genes, with many of these genes appearing across disease

lists. In addition, specific process gene lists are so small that when a disease has many

hundreds of genes, it is extremely difficult to get a significant enrichment result. With

modules, because they are a small subset of the disease, it is possible to find significant

enrichment between modules and more specific processes. We demonstrated that

using ModGenePlexus removes genes in the propagation process, but this actually

finds additional biology when doing enrichment at the cluster level. ModGenePlexus is

able to find significant biology in the clusters because it utilizes a network created from

biological data to determine meaningful subsets, in this case the STRING network.

STRING integrates vast amounts of data, ranging from physical interactions,

co-expression data, annotation databases, to build the network. This is why the clusters

discovered have meaningful biology, and we leverage this network to improve gene

classification of a researcher’s genelist of interest.

Specific biology of type-2 diabetes revealed interesting pathways in the modules

We demonstrated that the modules created from type-2 diabetes differential expression

results led to interesting biological enrichment relative to the gene list as a whole. An

interesting discovery was cluster 10, which we saw had different properties compared to

other clusters in Figures 3.41-43. It was enriched for many processes that were

enriched in AllAssign. It seems that this cluster was a fairly general cluster, having

130

biology that is enriched with many of the more general GOBP terms. Clusters 3 and 5

had specific processes that are relevant to type-2 diabetes that were not discovered by

AllAssign. ModGenePlexus discovered both general and specific processes that are

enriched for disease genes. How the implicated general and specific processes affect

one another in the context of disease manifestation, patient phenotypes and outcome,

and molecular pathways are questions that need to be answered in the context of each

specific complex disease.

Conclusion: ModGenePlexus gives GenePlexus direct application in hypothesis

generation for experiments

Overall, we demonstrate that ModGenePlexus is a superior method to GenePlexus for

classifying the genes of experimental results. This means modules are a computational

tool to improve interpretation of diseases as a whole, and this information is what will

allow for better targeting of mechanistically important genes discussed in chapter 4.

Future Applications

An important goal is to get ModGenePlexus working in the current PyGenePlexus

python implementation and Webserver. This runs into multiple practical and technical

issues due to the additional memory and computational resources needed. For use by a

researcher, it's important to predict whether ModGenePlexus or GenePlexus should be

used. ModGenePlexus performs better for most genesets we tested in this project, but

not for all – even within the same domain. Having a quantitative way to predict which

method should be used is an important task to make the software usable on the server.

131

REFERENCES

1. Liu, R., Mancuso, C. A., Yannakopoulos, A., Johnson, K. A. & Krishnan, A.

Supervised-learning is an accurate method for network-based gene classification.
Bioinformatics (2020) doi:10.1093/bioinformatics/btaa150.

2. Lee, I., Blom, U. M., Wang, P. I., Shim, J. E. & Marcotte, E. M. Prioritizing candidate

disease genes by network-based boosting of genome-wide association data.
Genome Res. 21, 1109–1121 (2011).

3. Krishnan, A. et al. Genome-wide prediction and functional characterization of the
genetic basis of autism spectrum disorder. Nat Neurosci 19, 1454–1462 (2016).

4. Wang, X., Gulbahce, N. & Yu, H. Network-based methods for human disease gene

prediction. Brief Funct Genomics 10, 280–293 (2011).

5. Mancuso, C. A., Johnson, K. A., Liu, R. & Krishnan, A. Joint representation of

molecular networks from multiple species improves gene classification. PLoS Comput
Biol 20, e1011773 (2024).

6. Leiserson, M. et al. Pan-Cancer Network Analysis Identifies Combinations of Rare
Somatic Mutations across Pathways and Protein Complexes. Nature Genetics 47,
106–114 (2015).

7. Guan, Y., Ackert-Bicknell, C. L., Kell, B., Troyanskaya, O. G. & Hibbs, M. A.
Functional Genomics Complements Quantitative Genetics in Identifying
Disease-Gene Associations. PLOS Computational Biology 6, e1000991 (2010).

8. Warde-Farley, D. et al. The GeneMANIA prediction server: biological network

integration for gene prioritization and predicting gene function. Nucleic Acids Res 38,
W214–W220 (2010).

9. Šubelj, L. Label Propagation for Clustering. in Advances in Network Clustering and
Blockmodeling (eds. Doreian, P., Batagelj, V. & Ferligoj, A.) 121–150 (Wiley, 2019).
doi:10.1002/9781119483298.ch5.

10.Mancuso, C. A., Liu, R. & Krishnan, A. PyGenePlexus: A Python package for gene
discovery using network-based machine learning. 2022.07.02.498552 Preprint at
https://doi.org/10.1101/2022.07.02.498552 (2022).

11.Piñero, J. et al. The DisGeNET knowledge platform for disease genomics: 2019

update. Nucleic Acids Research 48, D845–D855 (2020).

12.Piñero, J. et al. DisGeNET: a comprehensive platform integrating information on

human disease-associated genes and variants. Nucleic Acids Res 45, D833–D839
(2017).

13.Piñero, J. et al. DisGeNET: a discovery platform for the dynamical exploration of

human diseases and their genes. Database (Oxford) 2015, (2015).

132

14.The Gene Ontology Consortium et al. The Gene Ontology knowledgebase in 2023.

GENETICS 224, iyad031 (2023).

15.Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene

Ontology Consortium. Nat. Genet. 25, 25–29 (2000).

16.Law, M. & Shaw, D. R. Mouse Genome Informatics (MGI) Is the International
Resource for Information on the Laboratory Mouse. in Eukaryotic Genomic
Databases (ed. Kollmar, M.) vol. 1757 141–161 (Springer New York, New York, NY,
2018).

17.Gustafsson, M. et al. Modules, networks and systems medicine for understanding

disease and aiding diagnosis. Genome Med 6, 82 (2014).

18.Barrett, T. et al. NCBI GEO: Archive for Functional Genomics Data Sets—Update.

Nucleic Acids Research 41, 991–95 (2013).

19.Szklarczyk, D. et al. STRING v10: protein–protein interaction networks, integrated

over the tree of life. Nucleic Acids Res 43, D447–D452 (2015).

20.Stark, C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids

Research 34, D535–D539 (2006).

21.Oughtred, R. et al. The BioGRID database: A comprehensive biomedical resource of
curated protein, genetic, and chemical interactions. Protein Sci 30, 187–200 (2021).

22.Wang, Z. et al. Extraction and Analysis of Signatures from the Gene Expression

Omnibus by the Crowd. Nature Communications 7, 12846 (2016).

23.Leeuw, C. A. de, Mooij, J. M., Heskes, T. & Posthuma, D. MAGMA: Generalized
Gene-Set Analysis of GWAS Data. PLOS Computational Biology 11, e1004219
(2015).

24.Watanabe, K. et al. A global overview of pleiotropy and genetic architecture in

complex traits. Nat Genet 51, 1339–1348 (2019).

25.Chen, Z., Boehnke, M., Wen, X. & Mukherjee, B. Revisiting the genome-wide

significance threshold for common variant GWAS. G3 Genes|Genomes|Genetics 11,
jkaa056 (2021).

26.Saito, T. & Rehmsmeier, M. The precision-recall plot is more informative than the

ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 10,
e0118432 (2015).

27.Davis, J. & Goadrich, M. The relationship between Precision-Recall and ROC

curves. in Proceedings of the 23rd international conference on Machine learning  -
ICML ’06 233–240 (ACM Press, Pittsburgh, Pennsylvania, 2006).
doi:10.1145/1143844.1143874.

28.Levi, H., Elkon, R. & Shamir, R. DOMINO: A Novel Algorithm for Network-Based

133

Identification of Active Modules with Reduced Rate of False Calls.
http://biorxiv.org/lookup/doi/10.1101/2020.03.10.984963 (2020)
doi:10.1101/2020.03.10.984963.

29.Levi, H., Elkon, R. & Shamir, R. DOMINO: a network‐based active module

identification algorithm with reduced rate of false calls. Molecular Systems Biology
17, e9593 (2021).

30.Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of

communities in large networks. J. Stat. Mech. 2008, P10008 (2008).

31.Johnson, D. S., Minkoff, M. & Phillips, S. The prize collecting Steiner tree problem:
theory and practice. SODA ’00: Proceedings of the eleventh annual ACM-SIAM
symposium on Discrete algorithms 760–769 (2000).

32.Girvan, M. & Newman, M. E. J. Community structure in social and biological

networks. Proc. Natl. Acad. Sci. U.S.A. 99, 7821–7826 (2002).

33.Mancuso, C. A., Liu, R. & Krishnan, A. PyGenePlexus: a Python package for gene

discovery using network-based machine learning. Bioinformatics 39, btad064 (2023).

34.Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: an R package for comparing

biological themes among gene clusters. OMICS 16, 284–287 (2012).

35.The DREAM Module Identification Challenge Consortium et al. Assessment of

network module identification across complex diseases. Nat Methods 16, 843–852
(2019).

36.Raser, J. M. & O’Shea, E. K. Noise in gene expression: origins, consequences, and

control. Science 309, 2010–2013 (2005).

37.Eling, N., Morgan, M. D. & Marioni, J. C. Challenges in measuring and
understanding biological noise. Nat Rev Genet 20, 536–548 (2019).

38.Boucher, J., Kleinridders, A. & Kahn, C. R. Insulin receptor signaling in normal and

insulin-resistant states. Cold Spring Harb Perspect Biol 6, a009191 (2014).

39.Brännmark, C. et al. Insulin signaling in type 2 diabetes: experimental and modeling
analyses reveal mechanisms of insulin resistance in human adipocytes. J Biol Chem
288, 9867–9880 (2013).

40.Dall’Agnese, A. et al. The dynamic clustering of insulin receptor underlies its
signaling and is disrupted in insulin resistance. Nat Commun 13, 7522 (2022).

41.Batista, T. M., Haider, N. & Kahn, C. R. Defining the underlying defect in insulin

action in type 2 diabetes. Diabetologia 64, 994–1006 (2021).

42.Feng, J. et al. Association of Body Iron Metabolism with Type 2 Diabetes Mellitus in
Chinese Women of Childbearing Age: Results from the China Adult Chronic Disease
and Nutrition Surveillance (2015). Nutrients 15, 1935 (2023).

134

43.Miao, R. et al. Iron metabolism and ferroptosis in type 2 diabetes mellitus and

complications: mechanisms and therapeutic opportunities. Cell Death Dis 14, 186
(2023).

44.Ma, Y., Cai, J., Wang, Y., Liu, J. & Fu, S. Non-Enzymatic Glycation of Transferrin and

Diabetes Mellitus. Diabetes Metab Syndr Obes 14, 2539–2548 (2021).

45.Liu, J., Li, Q., Yang, Y. & Ma, L. Iron metabolism and type 2 diabetes mellitus: A
meta-analysis and systematic review. J Diabetes Investig 11, 946–955 (2020).

46.Teruya, T., Sunagawa, S., Mori, A., Masuzaki, H. & Yanagida, M. Markers for obese
and non-obese Type 2 diabetes identified using whole blood metabolomics. Sci Rep
13, 2460 (2023).

47.Tanaka, A. et al. Role of Copper Ion in the Pathogenesis of Type 2 Diabetes. Endocr

J 56, 699–706 (2009).

48.Gembillo, G. et al. Potential Role of Copper in Diabetes and Diabetic Kidney

Disease. Metabolites 13, 17 (2022).

49.Gong, D. et al. A copper(II)-selective chelator ameliorates diabetes-evoked renal

fibrosis and albuminuria, and suppresses pathogenic TGF-β activation in the kidneys
of rats used as a model of diabetes. Diabetologia 51, 1741–1751 (2008).

50.Gong, D. et al. Quantitative proteomic profiling identifies new renal targets of
copper(II)‐selective chelation in the reversal of diabetic nephropathy in rats.
Proteomics 9, 4309–4320 (2009).

51.Rovira-Llopis, S. et al. Mitochondrial dynamics in type 2 diabetes: Pathophysiological

implications. Redox Biol 11, 637–645 (2017).

52.Prasun, P. Role of mitochondria in pathogenesis of type 2 diabetes mellitus. J

Diabetes Metab Disord 19, 2017–2022 (2020).

53.Pinti, M. V. et al. Mitochondrial dysfunction in type 2 diabetes mellitus: an
organ-based analysis. American Journal of Physiology-Endocrinology and
Metabolism 316, E268–E285 (2019).

54.Patti, M.-E. & Corvera, S. The Role of Mitochondria in the Pathogenesis of Type 2

Diabetes. Endocrine Reviews 31, 364–395 (2010).

135

APPENDIX A3: MODGENEPLEXUS

Figure A3.1: Comparing GenePlexus and ModGenePlexus performance for CREEDS

drug and gene sets, and for the three magma thresholds.

136

Figure A3.2: Comparing performance between GenePlexus, ModGenePlexus, and

AllClus for CREEDS drug and gene sets, and for the three magma thresholds.

137

Figure A3.3: Comparing model performance of AllAssign, noprop_ModGenePlexus and

ModGenePlexus for CREEDS drug and gene sets, and for the three magma thresholds.

138

Figure A3.4: Comparing all model types for CREEDS drug and gene sets, and for the

three magma thresholds.

139

CHAPTER 4: DISCOVERING CORE AND PERIPHERAL GENES USING A

NETWORK-BASED OMNIGENIC MODEL AND TRANSLATING FINDINGS ACROSS

SPECIES

Introduction

Complex diseases manifest through numerous interactions of genetic factors. The

genetic interpretation of disease can be conducted at the variant, gene, process, or

phenotype levels. Experiments at one of these levels can be integrated with other

biological data using computational methods to gain a holistic view of the disease at all

levels of biological organization. GWAS have revolutionized our understanding of the

etiology of complex diseases and highlighted why discovering robust, consistently

successful treatments is challenging. This method is highly useful, not only for its

genetic findings but also because sequencing patient genomes has become

exceedingly affordable. A flaw of GWAS has emerged, ironically caused by the high

scale of complexity that GWAS has been essential in revealing. For very complex

diseases, a "naive" GWAS cannot keep up due to a lack of statistical power and the

sheer number of implicated loci found in large populations. As GWAS sample size has

increased, the number of loci that meet nominal p-values has increased, but the effect

size of implicated SNPs has decreased1,2. Establishing the biological mechanisms of

these many small-effect SNPs is challenging, as their influence on the trait is thought to

be indirect, involving downstream functions. Classic post-GWAS methods, such as fine

mapping, also struggle when the effect sizes are small. The key question is, given that

GWAS results show few SNPs with large effect sizes, what is the best way to interpret

and utilize data from these studies in discovering mechanisms for complex human

disease?

The observation that low effect size SNPs are likely to affect phenotypes and other

genes through indirect effects is one of the primary motivations of the omnigenic

model2,3. The omnigenic model proposes that there are two categories of genes – core

and peripheral – that work together in biological networks to explain disease states or

manifestation. In other words, diseases can be interpreted as a subnetwork in the

human genome. The model's prefix "omni-" is because since all genes will be in a

genome wide network, and all genes have at least some degree of connection,

140

theoretically all genes could have an influence on disease state. This interpretation of

complex disease is not unique to the modern era. Ronald Fisher’s infinitesimal model

was motivated by reconciling ideas of Mendelian inheritance with quantitative (complex)

traits like height and diseases. Fisher’s model proposed that individual loci which were

inherited according to Mendel’s laws have a small effect by themselves that is either

difficult to measure or cannot be, but the cumulative effect of the inherited loci explains

how a trait manifests or is phenotypically observed. The implication is that measuring

the causal effect of a loci is extremely challenging. Fisher noted in his time, that while

plant and animal breeders had the capability to select for particular complex traits, this

owed little to genetic and statistical analysis4 due to not being able to explain the causal

loci. In the modern era, we have proposed and determined mechanistic explanations of

particular genes in complex traits, but disease-gene annotations are still incomplete and

are very challenging to add to. The unique proposal of the omnigenic model is that

networks explain why there are so many genes associated with complex traits, where

peripheral disease genes are associated through their interactions and effects on core

disease genes that are individual mechanistically interpretable entities in the complex

trait.

Figure 4.1: Starting with disease level data, it is possible to break it up into smaller,

distinct, and biologically meaningful subsets. These subsets are used to define

biological context. Understanding how pathways, phenotypes, traits, and context are

141

Figure 4.1 (cont’d)

annotated to a disease and relate to one another is crucial in mechanistically explaining

the disease as a whole.

The observations from both the early and modern eras of genetics show why we need

to use other forms of biological data to interpret genotypic data like GWAS. Alone it is

very difficult to explain why loci are involved in a human disease. Therefore, post-GWAS

analysis aims to provide evidence for downstream pathways and intermediary molecular

traits and phenotypes through which genes contribute to diseases5–8 (Figure 4.1).

Biological networks are powerful tools because they provide context about genetic

relationships by integrating multiple types of biological data. Additionally, we discussed

in chapter 1 and show in chapter 2 and 3 that complex diseases are modular within

biological networks, and the genes in each module will correspond to different

underlying traits, pathways, and phenotypes. Leveraging this disease modularity in

networks can be a way to interpret and discover possible underlying phenotypes, but

first we must answer (i) how do we determine disease genes using GWAS data, (ii) how

do we find and utilize disease gene modules and (iii) how do we discover core genes

that are directly involved in the phenotype/trait disruption that leads to disease?

142

Figure 4.2: A representation of the multi-species network created for GenePlexusZoo. It

is a joint representation of molecular networks of humans and five common model

organisms (mouse, fish, fly, worm, and yeast) are combined using pre-calculated

orthologous groups across these species for the genes at a genome-wide level.

Linking implicated variants or genes to higher-level annotations like pathways is crucial

for understanding biological mechanisms. Evaluating candidate genes and pathway

explanations often starts with model organisms, which have been essential in advancing

our understanding of human disease, organ systems, genes, and many other biological

concepts9–12. Model organism databases are another resource that has grown massively

in recent years13, further incentivizing their use in computational studies. Methods that

allow the transfer of genetic information and biological pathways from one species to

another are key in discovering model organisms genes and pathways that are relevant

in highly complex human disease. Recently, our group released a modification to

GenePlexus termed GenePlexusZoo14. It utilizes a multi-species network (Fig 4.2;

including human, mouse, fish, fly, worm, and yeast) where a user can input genes from

one of six species and retrieve results for the same species as the input set, or results

143

translated into one of the remaining five species. GenePlexusZoo not only

demonstrated improved gene classification within a single species by leveraging the

evolutionary relationships contained in the multi-species network, this multi-species

network allows for cross-species correspondence of GOBPs and phenotypes to be

discovered that would not have detected if evaluating relationships through a naive

overlap of orthologous genes alone (Figure A4.1, see methods) . In chapter 2, we

demonstrated that it is possible to learn meaningful phenotypes underlying modules for

different human diseases, and showed how implicated inflammation phenotypes are

seen across them from using gene predictions from GenePlexus. With GenePlexusZoo,

we can leverage using multi-specie gene edges to discover genes relevant to model

organisms, which can be tested for enrichment with phenotypes. It will be useful to

implement similar methods at the module level and for cross-species analysis. If

meaningful knowledge transfer occurs on a module level, we can determine which

model organisms are important for particular and more specific phenotypes, rather than

the disease as a whole.

In chapters 2 and 3, we have shown that using networks to give context to GWAS and

highly complex human disease data helps to discover functional genes and how genes

in large, experimental genesets work together in biological processes. We propose that

using the network provided in GenePlexusZoo will provide additional benefit in

interpreting and validating functionally relevant genes to complex disease. We propose

using a modified version of ModGenePlexus (as discussed in Chapter 3) integrated with

GenePlexusZoo to reveal biological insights of human disease in a multi-species and

omnigenic framework. We begin by using MAGMA results to obtain an initial list of seed

genes measured from GWAS for 20 complex human diseases and traits. We then apply

DOMINO to discover disease gene-enriched modules, performing semi-supervised

learning to recover false negatives and remove poorly evidenced hits. ModGenePlexus

is then run for each module, and a final disease gene prediction list is aggregated based

on the top predictions across all modules. Next, we use the predicted disease genes to

define and predict a disease module, directly used to predict core and peripheral genes

for the human GWAS. Our approach for classifying these core and peripheral genes

exclusively uses network relationships – avoiding assumptions based on prior disease

144

knowledge. We further investigate various biological traits of these core genes – such

as conservation, cross-disease relationships, and tissue specificity – to learn how

network central genes behave. The disease genes are then translated to model species

using the GenePlexusZoo framework, including determining "core genes" in other

species that best reflect the human biology of the disease. Lastly, we investigate and

attempt to validate the use of human disease gene modules for knowledge transfer by

assessing if predicted model organism genes are related to human genetics of the

GWAS. The power of these methods is that it can take disease level data, break it into

parts, and then build those parts back up into a final, singular disease module for a trait

for interpretation and discovery in a holistic way through computational methods with

limited biological assumptions (Figure 4.3).

145

Figure 4.3: A pipeline for our methods and analysis. We start with a list of disease

genes measured from genotypic data with GWAS for 20 diseases and traits. We then

use ModGenePlexus to perform (semi-) and supervised learning for gene classification.

The genes that have a high prediction are considered disease genes and are a

predicted disease module for the underlying GWAS. To categorize genes, we use an

omnigenic framework and interpretation where genes are either core – genes likely to

have direct effect on disease manifestation – or peripheral whose annotation to the

disease is because of their network connection on core genes. Genes are labeled core

if they have the highest betweenness centrality values of the genes within the specific

disease module. We return results for GenePlexusZoo predicted genes for model

146

Figure 4.3 (cont’d)

organisms, and we investigate if our classified human core genes appear as orthologs

in non-human gene predictions. We additionally utilize large amounts of external

biological data to categorize both core genes and module assignments at phenotype,

tissue, cell levels and within different model-organisms.

Methods

GenePlexusZoo attributes used for creating models

GenePlexusZoo14 is an extension of GenePlexus that utilizes a multi-species network.

Networks from STRING15 were obtained for species Homo sapiens, Mus Musculus,

Danio rerio, Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces

cerevisiae. The genes across networks were connected using orthology information

from the eggNOG16 database, which is a genome-scale definition of orthologous groups

for many species. These cross-species edges were weighted using a function of a

genes’ within-species node degree and how many cross-species edges the gene has.

The features used within GenePlexusZoo are derived from a low-dimensional

representation of the cross-species network, generated using the PecanPy17 software,

which is a fast and scalable implementation of the node embedding method node2vec18.

Discovering gene and disease modules

We used the DOMINO19 software as described in chapter 3 in the same way as for

ModGenePlexus. Genesets were passed in and we removed genes that fell into

modules smaller than 10 genes, and added genes propagated to the original set. In

addition to gene modules, we aggregate GenePlexus predictions across these modules

to predict a disease module containing all genes predicted by GenenPlexus. When

GenePlexusZoo returns gene classification results, it calculates a z-score that indicates

how high the probability of a gene prediction is relative to the rest of the predicted genes

of a single species. To aggregate gene predictions for a species of interest, we took all

gene predictions in each module where the

𝑍 > 5. 0

. We use the term “disease module”

for continuity based on a discussion in chapter 1 about types of modules found in

biological networks. In this chapter, a “disease module” can refer to a complex trait as

well depending on whether the GWAS was conducted for a complex trait like height, or

a disease like atrial fibrillation.

147

Compiling MAGMA gene prioritization results

MAGMA20 gene prioritization scores for summary GWAS results were compiled from

gwasATLAS21. We created a GSC utilizing these gene predictions based on thresholds

of

𝑝  < 1 × 10

. We then clustered these genes using DOMINO. We manually chose

−5

20 GWAS that are diverse in terms of phenotypes, including both diseases and complex

traits. A threshold of

𝑝  < 1 × 10

−5

was chosen because we wanted to use genes that

met a strict threshold, however stricter thresholds would not give enough genes for

ModGenePlexus, and chapter 3 shows that using looser threshold genes can improve

prediction of genes that would be highly significant in GWAS results. The 20 GWAS that

were chosen are: age at menarche, age at menopause, age related macular

degeneration, alcohol dependence, atrial fibrillation, celiac disease, crohn’s disease,

educational attainment, height, inflammatory bowel disease, primary biliary cirrhosis,

primary sclerosing cholangitis, rheumatoid arthritis, two systemic lupus erythematosus

studies, type-1 diabetes, type-2 diabetes, ulcerative colitis, and vitiligo.

Running GenePlexusZoo and ModGenePlexus

GenePlexusZoo was run with each individual module for the disease. If there are N

modules, then N models are created. Only human genes from the module assignments

are used, and we return predictions for every species in the multi-species network. For

each species, each gene has N predictions, one for each model. ModGenePlexus was

run using all options – which include utilizing propagated genes from DOMINO, and

creating a model for each cluster.

Finding enriched phenotypes and GOBPs using GenePlexusZoo

GenePlexusZoo utilizes GSCs of Monarch22,23 phenotypes and Gene ontology biological

processes (GOBP)24,25, where each term in a GSC is annotated with a set of genes

known to be associated with the term The GenePlexusZoo software contains files of

model weights, where each set of model weights comes from models trained on each

phenotype and GOBP in the GSCs. To find enriched phenotypes or GOBPs for a GWAS

gene module, the model weights for the model trained using genes from a given module

are compared to each phenotype/GOBP model weight vector by calculating cosine

similarity. These cosine similarities are then normalized into z-scores to indicate how

high the cosine similarity is relative to all other predicted genesets in the collection.

148

Relative to an over representation hypergeometric test, GenePlexusZoo utilizes the

model weights of all genes in the genome for comparing phenotypes, acting much like a

network-based gene enrichment method26. In this project, we use this as a way to

predict non-human phenotypes for our human GWAS without needing to convert the

human geneset to orthologs and then doing an enrichment test.

Determining human core and peripheral genes

Core and peripheral genes are predicted using the networks within the joint

multi-species network and the predicted disease module. For human genes, the human

version of string with edge weights was obtained. The network was subsetted to a

disease-specific subnetwork which includes all genes that had a

𝑍 > 5. 0

. We applied

the betweenness centrality metric using NetworkX27, defined by equation:

(𝑣) =
𝑐
𝐵

∑ σ(𝑠,𝑡 | 𝑣)
σ(𝑠,𝑡)
𝑠,𝑡 ∈𝑉

Where V is the set of nodes,

σ(𝑠, 𝑡)

is the shortest

(𝑠, 𝑡)

-paths, and

σ(𝑠, 𝑡 | 𝑣)

is the

number of those paths that pass through node v other than

𝑠, 𝑡

. If

𝑠 = 𝑡,  σ(𝑠, 𝑡) = 1,  

and if

𝑣  ∈ 𝑠, 𝑡, σ(𝑠, 𝑡|𝑣) = 0

. The equation that NetworkX uses is based on an algorithm

first implemented by Brandes28. Betweenness centrality is a metric that quantifies how

often nodes are in the shortest path between any pairs of nodes. For every node, the

shortest paths are calculated from that node to every other node in the graph. In this

iterative process, what is recorded is how often a node is within the shortest path for

each pair of nodes. Higher betweenness centrality values mean the node is centrally

connected to the graph as a whole. In our context, we use betweenness centrality to

predict core genes because based on the omnigenic definition of core and peripheral

complex disease genes, core genes are a relatively small set that are influenced by a

large amount of peripheral genes through network connection. After running the

betweenness centrality algorithm, the values for each gene are scaled using Z-score

normalization, where high z-scores indicate betweenness centrality values that are high

relative to the rest of the genes in the network. Genes with

𝑍 > 2

were chosen as

predicted core genes for the trait.

149

Predicting genes for other species and a model-organism disease module

As described in Methods 4.1, the networks of multiple species are connected in this

joint representation through use of the eggNOG database and orthology information.

This allows the prediction of other species genes based on the human input data from

MAGMA. Thus for each GWAS trait, we are able to obtain the top ranked genes for

each species. To compile a model-organism disease module, we used the same

threshold of

𝑍 > 5. 0

for choosing genes that were highly predicted for our trait.

Compiling one-to-one human orthologs

Although GenePlexusZoo makes use of many-to-many ortholog information, to

determine correspondence of core genes across species we restrict ourselves to

one-to-one orthologs. We obtained these ortholog relationships from BioMart29.

Discovering if core genes are enriched for being constrained

We determined if genes were constrained using data from gnomAD30–32. A gene being

constrained indicates that there is strong selection against mutation of the gene.

Specifically, we utilized the Loss-of-function Observed / expected upper bound fraction

(LOEUF) score. A gene with a low score is one that has selection against predicted

loss-of-function variation, while a high ratio means the gene does not have much

selection against mutations that inactivate the gene.

Discovering core gene tissue specificity enrichment

Tissue specificity data was collected from the CONE33 method. Using tissue expression

data from GTEx34, z-scores were calculated that indicate a geneset’s tissue specificity.

We calculated z-scores utilizing an average z-score method and a max z-score method,

where the average z-score is based on the average score of genes across tissues, and

max z-score is based on the max score of genes across tissues. Another score used is

the tau value35,36, where a small value means the gene is broadly expressed across

tissues and a high value means it is specific.

Calculating module tissue and cell marker gene overlap

We used data from Jensen’s TISSUE database37 and cell marker data from CellSTAR38.

For each dataset, we took genesets with at least 10 genes and calculated enrichment

using a hypergeometric test between GWAS genes and each set. This test was for

150

every module across the GWAS.

Definition of disease modules for humans compared to model organisms

In this project, we discover disease modules for each GWAS for five species. However,

the interpretation of the GWAS within human vs non-human species needs nuance. The

GWAS come from human studies only, meaning the genes used as input are always

human. This means that the diseases and traits are defined for humans. When we

compile human gene predictions and map those genes to a network, this is what we

term a predicted disease/trait module – composed of all genes thought to be possibly

relevant to the disease based on network connections to genotypic data from GWAS.

When we compile a module of the predicted genes of other species, it should not be

thought of as the same “disease module” within another species. It should be

interpreted as human knowledge transferred into the gene space of other organisms

that may have similar underlying biology to the human disease module. Similarly, when

the betweenness centrality method is run for these modules, these should not be

thought of as “core genes” to the human disease, but as genes with high network

centrality within the module. The distinction of interpretation between human/model

organism data from the “same” methods is vital because while the diseases and traits

are interpretable in humans, it is much less so in other species. The reason why we

create modules and discover “core genes” for other species is because, while not

equivalent, the information discovered is directly relevant to human genetics. What we

are actually displaying are genetically similar non-human gene subnetworks discovered

directly using human disease data. Our goal is to discover human core genes and see if

they have meaning in other organisms. For ease of reading, if a gene is called a “mouse

core gene”, that phrase is used based on the method and interpretation within the

module – we are not making claims that this gene is a core gene for the respective

disease in mice.

Results

Discovering core genes (Defining betweenness centrality and number of core

genes compared to peripheral)

To define core genes for each of the 20 GWAS, we began with running an integration of

ModGenePlexus and GenePlexusZoo (see methods). Models were created for each

151

gene module discovered using DOMINO, and GenePlexusZoo provides a multi-species

network to improve classification. All GWAS used are from humans, and we first

classified human genes. Every module has an initial training set determined from

DOMINO and predicted genes within that module (Figure 4.4). The predictions from

GenePlexus were compiled, and every gene with a prediction of z-score > 5 are

compiled into a final disease module. This disease module is defined as containing

every gene relevant to any of the determined gene modules (Figure 4.5-6). Core genes

that are highly connected to other genes in the disease module are found using the

betweenness centrality calculation (see methods). Genes with a high betweenness

centrality are considered core, while the majority of genes across each GWAS are

peripheral (Figure 4.7). Our method utilizes the omnigenic model to choose a small set

of core genes that are defined using the context of all other genes within the disease

module, including those predicted using GenePlexusZoo and ModGenePlexus. Thus,

genes are core in the context of their network connections, rather than using

assumptions about disease biology outside of the GWAS experiment. In Figure 4.8, we

show that core genes are enriched for genes used in training (either implicated in

GWAS or propagated) compared to those genes predicted from GenePlexus only. We

see that the original experiment is enriched with network-central genes, but the network

still finds those not measured in the experiment39. Sample results for other GWAS are in

Figure A4.2-5.

Core genes discovered for atrial fibrillation are mechanistically meaningful

We predicted core genes using GWAS data and networks, rather than using specific

knowledge about a disease to first define mechanistically important pathways, and find

important gene annotations. It is still important to give biological context to our

predictions to see if there is known relevant biology to the traits. In Figure 4.9, we list all

predicted core genes for atrial fibrillation along with the knowledge source that predicted

their annotation. Six genes came from the initial GWAS study, five genes came from

propagation in the network, and seven came from GenePlexus predictions. All 18

predicted core genes have literature support for being related to atrial fibrillation or

cardiovascular disorders in general. WASL has been linked to cardiovascular disease40,

and is in pathways enriched with known atrial fibrillation genes41. FXR1 is associated

152

with gap junction remodeling when upregulated, and this pathway is a feature of heart

disease involving arrhythmia42. FMR1 targets in cardiac muscle43 and fragile-X carriers

have higher rates of cardiovascular issues44. CFL2 is directly implicated to atrial

fibrillation in a recent GWAS study45 and another recent study measured it as a possible

drug target46. H3-7 is a putative histone gene, and histone modification has been

implicated in zebrafish studies for atrial fibrillation by affecting cardiac contractile

function47. CALML6 has been predicted to be associated with coronary artery disease48

and it negatively regulates the NF-κB signaling pathway source49, a pathway which has

higher activity in atrial fibrillation patients50. UBA3 is involved in the ubiquitin mediated

proteolysis pathway, which is important in atrial fibrillation51. It additionally is involved in

Neddylation, a pathway which controls misfolded proteins and has been linked to

numerous heart issues52,53. Some of the other predicted human core genes will be

discussed later in the results section when discussing the human core gene relationship

to model organisms. We see that the predicted human core genes are highly implicated

in relevant organs and processes for atrial fibrillation, thus it is now worth discussing

other properties of core genes across GWAS gene modules and across each GWAS.

153

Figure 4.4: Visualizing the gene modules for atrial fibrillation. Each module with at least

10 genes was run with GenePlexusZoo and ModGenePlexus. The genes displayed

here are those that had a z-score > 5.0. Original training genes (tan) are those that were

implicated by GWAS or found in DOMINO propagation, and newly predicted genes

(green) were predicted by GenePlexusZoo.

154

Figure 4.5: A visualization of a predicted disease module for atrial fibrillation. This

module contains three types of genes, those implicated by the GWAS (green), those

discovered by DOMINO (yellow) and those predicted by GenePlexus (purple). Nodes

that are bigger are the predicted core genes. Core genes are come from all 3 types of

gene associations and they are distributed across multiple sub-modules within the

disease module.

155

Figure 4.6: Plotting the disease module for atrial fibrillation across the embeddings of

STRING in a t-SNE. Within the functional network of STRING, the disease module is

not all near each other in a two-dimensional representation. Rather, the disease module

is made up of multiple submodules (the gene modules) that are combined to represent

a final disease module made up of all genes. Training and predicted genes can appear

next to each other in the same modules, and both the training and predicted genes

contain core genes.

156

Figure 4.7: How many core and peripheral genes there are in each GWAS. The

omnigenic model states that the number of core disease genes is small relative to the

number of peripheral disease genes. Our method finds that a small minority of genes

(around 10-30 for each GWAS) are predicted to be core, while the rest of the hundreds

of implicated genes are peripheral.

157

Figure 4.8: Box-plot showing that training genes have more core genes than

GenePlexus predictions. The percent of core genes (y-axis) of the training genesets

(green) is higher than in the GenePlexus predictions genesets (red). The initial

biological data is enriched with core genes from a network perspective, and

GenePlexus finds more peripheral genes.

158

Figure 4.9: The 18 human genes predicted to be core for atrial fibrillation. Knowledge

source (left column) for core gene (right column) indicates if the gene initially came

from the GWAS, network propagation, or GenePlexus.

159

Core genes are distributed across disease gene enriched modules

A vague part of the definition of core genes in the omnigenic model is in how many

processes, phenotypes, or mechanisms they are associated with. Whether the core

genes for complex traits are one coherent set or not is an essential question for

applying the omnigenic model for real world disease. We decided to investigate whether

core genes are seen across different modules within the GWAS. In Figure 4.10, we

show how the core genes that were part of the original training data were distributed

across modules. We see that many modules within a GWAS have at least one gene,

with Figure 4.11 showing that most of the modules have at least one core gene when

considering the original module assignments. Notably, we do see some GWAS that

have modules highly populated with core genes relative to others. Vitiligo module 1 has

5 core genes when the other modules have 3 in total, and Age at Menopause has only 2

modules of its 3 modules initially containing core genes, with 4 out of 5 core genes

falling in module 2. Core genes are determined only after the disease module is made,

and in Figures 4.12-13 we show how each module’s predicted genes from GenePlexus

end up being classified as core in our method. With the predictions, 10/20 GWAS have

core genes in all of its modules, and only 5/20 have more than 1 module without any

core genes. For most GWAS, we see that core genes will appear across nearly all found

gene modules, showing that core genes are not all contained in one coherent module.

However, there will be some modules that are relatively enriched with core genes. To

answer what it means for a module to be enriched with core genes, we can see if core

genes are more likely to be module-specific within a GWAS, or shared across GWAS.

160

Figure 4.10: Counting the number of core genes across modules for each GWAS of

training genes only. Core genes fall across modules of the GWAS. This means that the

core genes are not one set of genes in a singular module, but are part of distinct

processes.

161

Figure 4.11: Bar plots showing the number of modules with core genes. Core genes

are in multiple modules for each GWAS, and for each GWAS at least half of all modules

have at least one core gene.

162

Figure 4.12: Counting the number of core genes across modules for each GWAS of

training and prediction genes. Module models predict genes that are core as well.

Relative to training, more modules have at least one core gene after including

GenePlexuspredictions, showing GenePlexus finds genes in the network important for

modules.

163

Figure 4.13: Bar plots showing the number of modules with core genes when including

predicted core genes. Core genes are in multiple modules for each GWAS, and once

predicted genes are considered most modules have at least one core gene in each

GWAS.

164

Figure 4.14: Displaying GWAS genes of atrial fibrillation within a t-SNE of the human

STRING network, with each module displayed. The larger nodes are the predicted core

genes, and each gene corresponds to a module (see legend). GWAS core genes are

distributed across modules for a trait,and the modules with multiple core genes have

those genes as neighbors in the network.

Core genes are most often module specific (but can appear as core or peripheral

in other modules)

Our utilization of GenePlexusZoo means that each gene can potentially be predicted in

each module. We have seen that when considering the original GWAS hits, those hits

predicted to be core are distributed across modules. In the context of our predicted

disease module, we asked if the core genes are predicted by multiple GenePlexus

modules, i.e. have a Z > 5 for multiple modules. For most GWAS, a relatively small

165

number of genes are predicted by multiple modules (Figure 4.15). Ten of the twenty

GWAS – type 1 diabetes, both systemic lupus erythematosus GWAS, rheumatoid

arthritis, primary sclerosing cholangitis, primary biliary cirrhosis, height, crohn’s disease,

celiac disease, and alcohol dependence – have core genes that were highly predicted

by multiple modules. Fifteen GWAS had at least one peripheral gene discovered in

multiple modules, all except age at menopause and menarche, age related macular

degeneration, educational attainment, and schizophrenia. Gene predictions are very

module specific, as seen in Figures 4.16-17 for atrial fibrillation and height respectively.

Height has the most shared genes across modules at 48, but even so the majority of the

731 genes are predicted in only one module. Overall genes are very module specific,

and this provides further evidence that core genes are not one coherent set of genes

but are distributed across multiple important processes that underlie complex traits and

disease.

Figure 4.15: Disease gene predictions across modules for each GWAS. The blue bars

refer to the total number of core genes. The green bars refer to those core genes

that are predicted (z-score > 5) in multiple GenePlexus models across multiple modules.

166

Figure 4.15 (cont’d)

Core genes are typically module-specific and will not be predicted by other modules.

The red bar refers to any gene that is predicted in multiple modules, and this varies

depending on the GWAS.

Figure 4.16: Showing the top ranked gene predictions for each atrial fibrillation

modules. Gene predictions are highly clustered and most genes have a high prediction

in only one module.

167

Figure 4.17: Showing the top ranked gene predictions for each height module. Gene

predictions are highly clustered and most genes have a high prediction in only one

module.

Predicted Core and peripheral genes are often disease specific

Every gene has a prediction score for each GWAS geneset being investigated. This

makes it possible to compare gene scores of all genes across each GWAS. We first

show for all genes that had a z-score > 5 for at least one GWAS, those genes z-scores

for every other GWAS (Figure 4.18). The high gene prediction scores are highly

clustered, and most genes are associated with only a single GWAS. This result is

similar to the previous section where gene predictions as a whole cluster across

modules. Next, we show for every gene predicted to be core in at least one GWAS the

168

prediction scores across all GWAS (Figure 4.19). Once again, there is clustering

behavior here where most core genes are predicted highly in only a small number of

GWAS. However, there are some core genes that have high scores in multiple GWAS.

We decided to expand this analysis and see how many GWAS human core genes tend

to fall in, while also showing how that knowledge transfers across species using

orthologs between the human core genes and other species predicted genes.

Figure 4.18: Plotting the prediction scores for all human genes that had a z-score>5 for

at least one GWAS. Most genes are predicted highly in only one or a small number of

GWAS.

Figure 4.19: Plotting the prediction scores for genes that were predicted to be core for

at least one GWAS. The core genes tend to be predicted in only one GWAS, with a

minority being implicated in multiple GWAS.

169

Discovering orthologs across species

GenePlexusZoo allows users to make predictions for any species within the network, no

matter the original species of the input. With this capability, we wanted to investigate

how core and peripheral gene predictions would transfer to other species. To interpret

the highly predicted genes of other species, we used one-to-one ortholog data from

BioMart (see methods). One-to-one orthologs have an advantage due to ease of

interpretation. Figure 4.20 shows how many one-to-one orthologs of human genes

were discovered using the entire human genome, the human genes from the predicted

disease module for each GWAS, and the predicted human core genes. As would be

expected, zebrafish, mouse, have the most orthologs to human genes, with 7652,

16064, total orthologs respectively. Yeast and worm have noticeably less orthologs at

only 722 and 1530 respectively. All predicted genes across all GWAS were mapped to

orthologs, where for zebrafish there are 1826 orthologs, for yeast 139 orthologs, for

worm 278 orthologs, and mouse 3866 orthologs. The relatively small minority of human

core genes in these GWAS also contain orthologs to model organisms. Across all

GWAS, we found for zebrafish 86 human core genes orthologs, for yeast 15 orthologs,

for worm 20 orthologs, and for mouse 197 orthologs. This data is essential for

interpreting how human core genes are relevant within model organisms.

170

Figure 4.20: Displaying the number of orthologs discovered for each species. The total

number of one-to-one orthologs discovered for all human genes (blue), the number of

orthologs for genes implicated in GWAS for all diseases (green), and the number of

orthologs for human core genes (red).

171

Core genes are GWAS-specific across species

It is interesting to know if core genes are likely to be seen across GWAS. Ideally core

genes are those that are mechanistically important in a specific disease context. It is

unclear whether we should expect core genes to appear across multiple diseases, so

we wanted to test for two different things. First, we see for the human core genes, how

many GWAS they are core in. Second, we map these human core genes to other

species using orthologs. We then see how often the human core gene’s ortholog is core

within that species. These results are displayed in Figure 4.21, where each histogram

corresponds to each species (human, mouse, zebrafish, worm, yeast). We see that

most human core genes are only core in a singular GWAS, the GWAS it is core in.

However, some core genes are implicated across multiple GWAS – where they can

appear as core in up to 8 GWAS for human and zebrafish, and 7 for mouse. Next, since

a gene is defined as core in the context of specific diseases, it is possible that it can be

truly annotated to other diseases, but be peripheral instead. Figure 4.22 displays these

results, and we see that core genes are involved in many more GWAS if we consider

their peripheral status as well. Some of the core genes appear in nearly all of the GWAS

for humans in 19 out of the total 20 GWAS, and appear in other 10 GWAS as orthologs

in mouse and zebrafish. We also predicted the core genes in each species (see

methods) and show how they are distributed across the GWAS. The same trend in

humans is seen in the other species; most species’ core genes are core in a single

GWAS (Figure 4.23). Lastly, we demonstrate that core genes are at higher incidence

for being implicated across multiple GWAS (Figure 4.24) relative to peripheral genes.

These results show that while most genes are only core in specific contexts, they can

appear in other GWAS both as core genes and as peripheral and are more likely to

appear across multiple GWAS than peripheral genes.

172

Figure 4.21: Histograms showing whether human core genes – or their respective

orthologs – are labeled as core across GWAS for each species. For most species, the

human core genes/orthologs are seen in only one GWAS. However, worm and yeast

rarely has implicated human core gene orthologs predicted.

173

Figure 4.22: Histograms showing whether human core genes – or their respective

orthologs – are labeled as core or peripheral across GWAS for each species. A core

gene can be peripheral in other traits because it is discovered to be in the disease

module, but is not core in both GWAS. Core genes are more likely to appear

cross-GWAS as peripheral.

174

Figure 4.23: Histograms showing whether a species's core genes are labeled as core

across GWAS for each species. For other species, the same pattern appears as in

human core genes where most core genes are disease-specific.

175

Figure 4.24: Bar chart showing percent of core and peripheral genes that are implicated

in multiple GWAS. Human core genes are more likely to be seen in multiple GWAS than

peripheral genes are.

Modules give distinct gene and phenotype predictions

We have already demonstrated that core and peripheral genes are likely to be module

and disease specific within humans. Figure 4.25 contrasts human, and mouse genes

predicted to be associated with the human GWAS genes using GenePlexusZoo. Both

species show very modular behavior for the top predicted genes. The same is true in

Zebrafish (Figure 4.26), worm and yeast (Figure A4.6). Figure 4.27 compares model

weights of the embedding feature vectors for human and mouse by looking at the top

feature vectors across modules, where the top 10 feature vectors by absolute value for

each module are displayed. We can see there is modular behavior here as well, with

some embedding vectors having noticeably large model weights for some modules but

not in others. The same observation is true for zebrafish (Figure 4.28) and worm and

176

yeast (Figure A4.7). Lastly, we utilized GenePlexusZoo to find the top three enriched

phenotypes and GOBPs in Figure 4.29 based on these module model weights and

comparing them to model weights built for the GSCs obtained from GO and Monarch

(see methods). Firstly, there is again clustering behavior at the module level within the

GWAS for atrial fibrillation when comparing and contrasting enrichment of GOBPs and

phenotypes for each species. We also see multiple meaningful GOBPs and phenotypes

within the modules. Using module 2 as a test case, we see it has directly implicated

pathways in cardiac muscle tissue morphogenesis54,55 and myofibril assembly56. Some

of the Monarch phenotypes implicated in module 2 include ST-segment depression,

which was implicated in a recent study where during atrial fibrilation rhythm,

ST-segment depression is associated with subsequent heart failure risk for afflicted

patients57. Septal hypertrophy was also an enriched human phenotype from Monarch,

and this is a predictor for patients that start with the disease hypertrophic

cardiomyopathy to eventually also have atrial fibrillation58. Some of the enriched

GOBPS in module 2 for mice includes sex differentiation and male sex differentiation

specifically. This is interesting as there are many sex differences in atrial fibrillation in

terms of symptoms and severity59,60. A study61 was done in mice specifically to

determine if sex differences in intracellular Ca2+ homeostatsis in atrial myocytes might

explain increased incidence of atrial fibrillation in males, and this pathway was found to

have major sex differences in mice. It's also involved in cell types like myocytes also

implicated in atrial fibrillation62. Overall we see multiple meaningful predicted human and

mouse GOBPs for atrial fibrillation within this module, and for the monarch human

phenotypes that are backed up well in the literature. The GOBP and phenotype

enrichment comparisons in zebrafish, worm, and yeast are in Figures A4.8-10.

177

Figure 4.25: Heatmaps showing highly predicted genes in human and mouse across

modules for atrial fibrillation. The genes are not the same across heatmaps, as the one

on the left are human gene predictions across modules and the one on the right are

mouse gene predictions for the modules of atrial fibrillation. The gene scores are highly

clustered across modules for both species.

Figure 4.26: Heatmaps showing highly predicted genes in human and zebrafish across

modules for atrial fibrillation. The genes are not the same across heatmaps, as the one

on the left are human gene predictions across modules and the one on the right are

zebrafish gene predictions for the modules of atrial fibrillation. The gene scores

178

Figure 4.26 (cont’d)

are highly clustered across modules for both species. The color annotation bar for the

heatmaps clarify the module order.

Figure 4.27: Heatmaps showing model weights of feature embedding dimensions in

human and mouse across modules for atrial fibrillation. There is clustering behavior

between the feature vectors and the modules within the GWAS. Meaning that feature

embedding dimension vectors are distinct across modules for both human and mouse.

The color annotation bar for the heatmaps clarify the module order.

179

Figure 4.28: Heatmaps showing model weights of feature embedding dimensions in

human and zebrafish across modules for atrial fibrillation. There is clustering behavior

between the feature vectors and the modules within the GWAS. Meaning that feature

embedding dimension vectors are distinct across modules for both human and

zebrafish. The color annotation bar for the heatmaps clarify the module order.

180

Figure 4.29: Comparing human GOBP and Monarch phenotype predictions across

modules with mouse GOBP and phenotype predictions. There is clustering behavior for

both species between GOBP and phenotype predictions, and we see multiple relevant

GOBPs for atrial fibrillation in humans and mouse.

Analyzing core gene and significant GWAS gene biology

Significant GWAS loci tend to be near genes that are mechanistically important39 and

that have functional annotations using many diverse methods and data mentioned in

chapter 1. Additionally, GWAS has revealed loci that are in highly conserved genomic

regions63,64 that are near human orthologs65, with evidence suggesting GWAS finds

them at higher rates when compared to e-QTL studies66. This implies that non-coding

181

regions implicated in GWAS are more likely to have functional annotation than specific

e-QTL studies. In Pritchard et. al39, they indeed show that the top hits of three GWAS

complex traits do correspond to “known” core genes – genes involved in known

implicated pathways for the traits. Given that GWAS results are enriched with

mechanistically important genes, we contrast comparing the significant results

discovered from MAGMA to nominal p-values, with our core genes to peripheral genes,

if GWAS experimental results or our predicted disease module are enriched with

biological statistics.

We tested if the core genes are more enriched for genes in terms of their age,

constraint, in the amount of one-to-one orthologs contained in Figures 4.30-31. We

decided to test for two sets of core genes in this analysis, one being the usual definition

of core genes where the z-score of the betweenness centrality (BC) values is above 2,

and one where it is simply positive (Z-score > 0). In both sets we see a notable number

of GWAS (9 and 13 sets respectively) that have a significant number of constrained

genes relative to the small betweenness centrality values. In contrast, only 4 of the

GWAS significant genes are enriched for constrained genes relative to nominal

p-values. For comparing the number of orthologs, the set where the z-score of BC

values above 0 has many sets enriched for containing orthologs, implying that orthologs

heavily populate those genes with high network connections in the disease module.

Only 1 GWAS set had its significant genes enriched for orthologs, and only 1 of the

normal core gene set (Z-score > 2) did.

There is rarely enrichment for age in either the GWAS significant genes or in either core

gene set. It is unclear whether it should be expected that significant GWAS genes or

mechanistically important genes are more likely to be old or young. It is commonly

assumed that essential genes are more likely to be old because they correspond to

conserved processes across species. However, there is conflicting evidence that young

genes can quickly become essential in important pathways67,68 and that they are

essential at the same rate as old genes in knockout studies69. Additionally, just because

a gene is old does not mean that function has been conserved – as functional

divergence has been observed over time70. We lastly tested if core genes are more

likely to have high tissue expression using z-scores obtained from GTEx (see

182

methods). We found that while some GWAS significant gene sets were enriched for

tissue specific genes, the core genes were not. This result was surprising, so we

expanded to using multiple metrics to determine tissue specificity (Figure 4.32) to see if

any core gene sets were tissue specific. Using the tau scores and average z-score,

there are some sets that have tissue specific genes, but only for 2 or 3 GWAS

depending on the metric.These results show that using network-defined core genes

rather than only top GWAS results better represents some expected properties of the

mechanistically important genes within a trait – particularly in their selection pressure

and correspondence to orthologs in other species. It is worth investigating further tissue

and cell enrichments to see if there is any meaning within the modules as a whole.

Figure 4.30: Heatmap showing enrichment of predicted core genes – defined here as

the zscore of the betweenness centrality (bc) values being above 2 –

contrasted with peripheral genes, and significant GWAS genes contrasted with

non-significant GWAS genes that were predicted to be in the module. For disease

genes, we show whether core genes and significant MAGMA GWAS hits are enriched

183

Figure 4.30 (cont’d)

for biological annotations of gene age, conservation, the number of 1-1 orthologs to

other species, and tissue specificity.

Figure 4.31: Heatmap showing enrichment of genes with a z-score of betweenness

centrality values (bc) above 0, contrasted with genes with a negative z-score. For

disease genes, we show whether core genes significant MAGMA GWAS hits are

enriched for biological annotations of gene age, conservation, the number of 1-1

orthologs to other species, and tissue specificity.

184

Figure 4.32: Heatmap displaying enrichment of core genes for tissue specific genes for

each GWAS. The values are -log10(p-value) of the results of a Mann-Whitney U test

comparing the tissue specificity scores of core vs peripheral genes. *** indicates a

𝑝𝑣𝑎𝑙  <  0. 05

. Tau score and z-score were taken from GTEX for every gene/tissue pair.

Both a maximum and averaging aggregation strategy were used when determining a

final tissue specificity score for a gene. Each column uses a different metric to

determine if core genes are more tissue specific than peripheral genes for a given

GWAS.

Enriched tissues across modules for atrial fibrillation shows module-specific

enrichment

We next ran enrichment for tissues and cell types from Jensen TISSUES and CellSTAR

databases (see methods). This was done for each module. For atrial fibrillation

specifically, we see in Figure 4.33 that there is a mixture of module specificity and

shared tissues/cells across modules from the TISSUES database. Modules 2 and 3

share enrichment results for tissues related to muscle, cardiovascular systems, and

heart, vital systems for explaining atrial fibrillation manifestation and phenotypes71.

Module 4 is uniquely enriched for lymphocytes and leukocytes and inflammation is

known to be relevant to atrial fibrillation62,72,73. Leukocytes are immune system cells that

are a predisposing factor for atrial fibrillation74, where an increased cell count means

185

there is inflammation which affects atrial fibrillation severity. The neutrophil-lymphocyte

ratio is an inflammatory biomarker that has been used in predicting outcomes for atrial

fibrillation75–78. Module 4 is additionally enriched for adipose and fat tissues, and specific

fat tissues such as epicardial fat – a visceral fat that is near the heart – have been

implicated in atrial fibrillation manifestation79–81. For enrichment with CellStar in Figure

4.34, we see less clustering behavior across modules but still have meaningful

biological results, which is not surprising given that cell types can often appear in

related but distinct tissues82. Module 2 relates to two different types of cardiac muscle

cells (cardiac muscle cell and CL:0000746 both are types of cardiac muscle cells),

along with smooth muscle cells and myocytes. Myocytes are cells in muscles, and

myocytes such as cardiomyocytes and other atrial myocytes are implicated in atrial

fibrillation83,84. Multiple of the modules are enriched for relevant tissue, cell, and organ

systems for atrial fibrillation.

Figure 4.33: Enrichment of atrial fibrillation modules for the Jensen TISSUE database.

There is clustering behavior for the modules of enrichment for the TISSUE database.

186

Figure 4.34: Enrichment of atrial fibrillation modules for the CellStar database.

Enrichment was performed for the module genesets and cell marker genes.

One-to-one orthologs of human core genes in other species reveals biologically

meaningful genes at the human and model organism levels

We have seen in previous sections that some diseases contain a relatively large

number of orthologs to human core genes in some species (mouse, zebrafish)

compared to others (worm, yeast). A reason this could be the case is that model

organisms are not relevant to all parts of a human complex disease, but to only certain

biological concepts like molecular pathways that are involved. We investigated whether

human core genes are predicted as orthologs in model organisms. In Figure 4.35-36,

we see the number of atrial fibrillation human core genes whos’ orthologs are either

core or peripheral in each model organism. There are a total of 13 unique of the total 18

(Figure 4.9) human core genes of atrial fibrillation that are orthologs in other species.

Some genes only appear a single time, and others appear across multiple species

(Figure 4.37). Figure 4.38 shows from what source – the initial GWAS, propagation, or

from GenePlexus – the core genes were discovered from. A recent GWAS study45

predicted candidate genes of TTN, CFL2, CASQ2, FBXO32, and FXR2 for human atrial

fibrillation. We predicted these genes as human core genes, and in addition their

187

orthologs are calculated to have a high BC value among the predicted mouse genes,

with the exception of FXR2 which is core in human but it’s ortholog is peripheral among

mouse genes. FBXO32 and CASQ2 orthologs are core in Zebrafish, while CFL2 and

FXR2 orthologs are peripheral. Multiple other studies implicate these genes85–88 and

have been used in animal models for heart diseases89–91. We predict the FXR2

orthologs as peripheral in mouse. A recent study has predicted FXR2 predicted this

gene as a target of miR-10a using mouse studies, a pathway which has been implicated

in atrial structure remodeling92. For Zebrafish, atrogin-1 (FBXO32) has been implicated

directly with heart failure when deficient93. This observation is important as atrogin-1 is a

muscle-specific E3 ubiquitin ligase that is involved in protein degradation and autophagy

in zebrafish. The gene SKP1 is interesting as it’s orthologs were predicted as relevant in

three of the model organisms – mouse, zebrafish, and yeast. SKP1 is an assembly

factor of a family of E3 ligases in mammals, and has been implicated in regulating the

switch between protein secretion to autophagy94. RYR2 is a gene predicted from

GenePlexusZoo and has been implicated in multiple very recent studies for atrial

fibrillation95. Notably, we predict this gene’s ortholog as being core in mouse, and mouse

studies have shown that this gene is relevant in atrial fibrillation. RyR2-mediated Ca2+

triggers paroxysmal atrial fibrillation. This shows that in mice, mutation is RyR2 is

directly implicated in mouse atrial fibrillation. This gene is the subject of numerous

studies in humans as well, showing that mutation leads to atrial fibrillation and other

atrial issues96–98. This brief overview has shown direct evidence of the biology behind

our predicted core genes for atrial fibrillation for multiple genes. These genes spanned

the sources they came from, whether from the initial GWAS data or from gene

classification, and have direct evidence of being involved with atrial fibrillation across

multiple studies, including human GWAS and model organism experiments. In Figures

4.39-4.42., we plot how the orthologs fall within the disease modules of the other

species. This overview shows that network relationships within other species allow for

isolation of genes that have important relationships to the disease within the organism

and are orthologous to important human disease genes. Additionally, it allows the

discovery of genes like RYR2 that are directly and mechanistically explainable for atrial

fibrillation – which is important for defining human disease core genes.

188

Figure 4.35: Count of how many human core genes are within model organism disease

modules, and whether they are core or peripheral within that organism.

189

Figure 4.36: Showing the human core genes that have orthologs in other species for

atrial fibrillation and how many times they are core/peripheral in five model organisms.

We indicated whether the core gene is core/peripheral in the context of the other

species it is in, and how many other diseases it is an ortholog for. There are 13 human

core genes with orthologs in the model organisms.

190

Figure 4.37: Displaying all human core genes with orthologs in each species. Some

genes only appear in a single species such as ELAVL2 in Zebrafish, while others like

SKP1 appear in all species.

191

Figure 4.38: Showing the source for each human core gene predicted in other species

of how it was implicated with atrial fibrillation. Core genes (right column) come from

each source (left column), where GWAS is the original GWAS from GWAS Atlas,

Propagate are genes from DOMINO propagation, and GenePlexus means the genes

were predicted with the GenePlexus model.

192

Figure 4.39: The mouse genes predicted from transferred human disease genes, with

human core genes highlighted. Many of the genes that are core within the mouse

module are also human core genes.

193

Figure 4.40: The zebrafish genes predicted from transferred human disease genes,

with human core genes highlighted. Human core genes are both core and peripheral in

the zebrafish module.

194

Figure 4.41: The worm genes predicted from transferred human disease genes, with

human core genes highlighted. The worm module contains no human core gene

orthologs.

195

Figure 4.42: The yeast genes predicted from transferred human disease genes, with

human core genes highlighted. The yeast module has a single human core gene that is

peripheral within the yeast genes.

196

Discussion

The contribution of this study is an implementation of a method that hypothesizes core

genes not based on prior knowledge of gene importance, but exclusively through

networks and experimental data. We also show the ability to directly use networks to

answer if human core genes are relevant in model organisms through GenePlexusZoo.

We demonstrated this by focusing on atrial fibrillation across multiple biological levels,

including discovering a disease module from which we predicted core genes that were

shown to be biologically meaningful in heart disease and often specifically atrial

fibrillation (Figure 4.9), contextualizing gene modules by discovering GOBPs and

phenotypes related to each module that are important for atrial fibrillation (Figure 4.29),

and in showing core genes that have relevant orthologs across species, with literature

support showing how these genes can be used in model organism studies to provide

insight into atrial fibrillation mechanisms (Figure 4.37). We also show how core genes

tend to be core only in specific contexts, in this case specific diseases or traits (Figure

4.21-23), and are enriched for containing orthologs and highly constrained genes

(Figure 4.30-4.31). We created a pipeline for analyzing experimental results of any

complex trait or disease by performing gene classification using ModGenePlexus and

GenePlexusZoo, discovering meaningful biological information about the initial geneset

at the gene and module levels, and then transfering that knowledge meaningfully across

species.

ModGenePlexus and GenePlexusZoo independently improve hypothesis

generation

ModGenePlexus and GenePlexusZoo are two additions to GenePlexus that improve

gene classification results as a whole. We demonstrate the improvement of results

ModGenePlexus provides in chapter 3, and also demonstrate that it leads to discovering

more underlying biology of particular diseases like type-2 diabetes. GenePlexusZoo

also improves gene classification through utilizing the multi-species network. However,

improving gene classification is only one of the main motivations for using these

methods or for combining them like we did in this chapter. ModGenePlexus allows large

genesets, like GWAS, to be usable in gene classification and be split into multiple,

meaningful subsets of gene modules. These gene modules are what allows us to

197

discover biology about the trait that using all genes as a whole would not.

GenePlexusZoo is powerful in allowing human genes to be transferred into any other

species the user desires. This allows for the discovery of model organism genes that

are relevant to the human trait or disease of interest. When we combine these two

aspects, we are able to recover multiple meaningful genes and biological processes

across species (Figure 4.29) using atrial fibrillation as an example. The importance of

combining these two methods is that they allow us to interpret specific modules in other

species’ networks, allowing for a more refined way of generating hypotheses for model

organism experiments that explain a specific aspect of the disease biology. When

designing model organism studies where the goal is to learn more about human

diseases, there are three main challenges. First is that model organisms do not

perfectly capture every aspect of the human disease, but rather may be a model for

particular mechanisms and molecular pathways. Using module relationships can allow

the discovery and isolation of important orthologs. The second challenge is that when

running parallel analyses of the same trait in GWAS between humans and model

organisms, it is likely that orthologous genes will not be discovered99. Third, transferring

ortholog knowledge across species through knockout studies of orthologous pairs does

not necessarily give rise to the same observed phenotypes100. This is because

orthologous genes do not necessarily have the same level of importance due to the

different polygenic architecture underlying each species101–103. Since ortholog

connections make it difficult to map function on its own, it is vital to use other data to

give context on whether an ortholog is a suitable candidate for predicting a particular

human function. Combining ModGenePlexus and GenePlexusZoo lets us use networks

to solve this problem, as we integrate large amounts of information from the datasets

used to construct the STRING networks of each species to interpret human genes in

biologically meaningful gene modules. One more additional motivation of using

GenePlexusZoo is that it has implemented a method that allows for finding enriched

GOBP and Monarch phenotypes using pre-calculated model weights, rather than using

a traditional enrichment software like ClusterProfiler, which we used in chapter 3. This

method is powerful because ongoing work in our group has shown the GenePlexus

framework can provide network-based gene set enrichment, achieving similar

198

performance to the widely used overrepresentation method. Additionally, the

GenePlexus software offers a way to seamlessly obtain enrichments in another species

without the use of converting gene lists using orthology. An example is seen in the

Bardet-Biedl Syndrome where GenePlexusZoo uncovered enriched GOBPs that are

biologically relevant, but could not be found from doing ortholog overlap in an ORA

because there are no direct one-to-one orthologs implicated. Rather, network

connections between the two species were used to provide biological insight.

Defining human core genes based on the disease module rather than at the gene

module level

Multiple decisions had to be made to decide an ideal definition for core genes. We have

already discussed why we used betweenness centrality to define core genes within the

predicted disease module. This definition is useful as it does not require assumptions of

gene knowledge about any particular disease – meaning it works generally across

genesets – and because we are using an entire disease module, it also allows us to find

peripheral disease genes, as those disease genes that are not core must be peripheral

in an omnigenic framework. One question was whether it would be better to define core

genes at the module level, rather than at the disease level. This was motivated through

seeing that core genes are not one meaningful biological set in themselves, but are split

across modules for each GWAS (Figure 4.10-13). This was decided against for multiple

reasons. The primary reason is that modules themselves are difficult to define

biologically and work together to give rise to disease manifestations. Interpreting what a

core gene is in this context would be very challenging and is a leap from thinking about

complex traits omnigenically. It is useful to use modules to give biological context to a

disease and discover new biological relationships, but defining them as a biological

entity itself is very challenging. As such, we chose to use our method to define the

omnigenic mapping of the genes holistically. We take a whole disease, break it into

modules, and then use those modules as tools for superior gene classification. Those

gene predictions are then aggregated into a final list that we predict is the disease

module – all genes that could be considered relevant based on genome-wide network

connections – and defines core and peripheral genes in the context of the entire

complex trait. It is valid to interpret disease genes all at once because they are all

199

involved in the disease, and as such do have relevance to one another in biological

systems. Good core genes need to be determined from that observation because the

scale of complexity is what motivates core genes being a concept in the first place.

Human core genes are not one meaningful biological set

Traditional discussions of core genes, such as in the original paper introducing the

omnigenic model2, mention that there could be multiple “core pathways” involved with a

complex disease. However, it's unclear how the core genes would be distributed across

a disease module or a genome-wide network, and what the implications of the answer

would be. In this study, we show for the 20 GWAS that the core genes are divided

amongst the discovered gene modules (Figure 4.10-13). This means that core genes

must be considered in terms of how they work in distinct processes or phenotypes. Core

genes being seen across modules raises questions for how to best determine causal

loci and genetic relationships across diseases. Pritchard et. al. mention that an

assumption behind methods like mendelian randomization is that pleiotropy between

traits that are not causally related is rare105. If there is intersection of core genes across

multiple, not related diseases, then those diseases may have similar underlying

modules due to the presence of similar genes, and thus have a form of pleiotropy where

mutations on the same genes affect multiple, unrelated complex diseases. In chapter 2

we showed that complex diseases can share modules that are enriched for more

general pathways like inflammation. It will be vital to implement methods like that in

omnigenically interpreting disease and core gene relationships.

The omnigenic model can provide context to GWAS results

We discussed in the introduction for this chapter that as GWAS get bigger, the number

of significant loci discovered increases, but the effect size of loci that were previously

implicated in smaller studies are decreasing over time. This observation shows the

necessity of post-GWAS analysis, as traditional methods have assumptions in place

that will no longer work well with the genotype level data. This is the motivation behind

the omnigenic model and behind the method we implemented. We integrated the

GWAS hits with vast amounts of biological data through the multi-species network to

predict genes that the GWAS missed and provide biological context using GOBPs and

phenotypes across species. The GWAS genes successfully discovered other

200

mechanistically important genes for atrial fibrillation. The omnigenic model is not just a

useful framework for thinking about the scale of complexity of human complex diseases

and traits, but we provided evidence that the concept of core genes has validity within

biological networks.

Future directions

A major future direction for this project will involve doing phenotype interpretation at the

cross-species level. We are implementing a double blind study where we asked

graduate students to determine if a set of model organism phenotypes were relevant to

a human disease and a set of human phenotypes. For each module we randomly

provided positive module examples (where the GOBP or phenotype was determined to

be related to the module using GenePlexusZoo), positive disease examples (where the

GOBP or phenotype was determined to be related to a different module for the same

disease) and negative disease examples (where the GOBP or phenotype was

determined to be related to a different disease). This study will help us determine how

relevant our annotated model organism phenotypes are to the GWAS and how

biologically distinct modules are within a GWAS. A second direction to take this project

is to focus on cross-GWAS comparison of core genes. We have shown that the core

genes are meaningful for the GWAS, but many insights into the omnigenic model can

be obtained by seeing how core genes relate across traits and within the network. This

can answer questions in the original paper, such as how pleiotropy could affect studies

of causality and how distinct traits really are from each other in terms of implicated loci.

We began investigating this by seeing that core genes tend to be unique within GWAS

and in seeing similar patterns where core genes are distributed across modules for

each GWAS.

201

REFERENCES

1. Crouch, D. J. M. & Bodmer, W. F. Polygenic inheritance, GWAS, polygenic risk

scores, and the search for functional variants. Proc. Natl. Acad. Sci. U.S.A. 117,
18924–18933 (2020).

2. Boyle, E. A., Li, Y. I. & Pritchard, J. K. An Expanded View of Complex Traits: From

Polygenic to Omnigenic. Cell 169, 1177–1186 (2017).

3.

4.

Liu, X., Li, Y. I. & Pritchard, J. K. Trans Effects on Gene Expression Can Drive
Omnigenic Inheritance. Cell 177, 1022-1034.e6 (2019).

Fisher, R. A., Immer, F. R. & Tedin, O. THE GENETICAL INTERPRETATION OF
STATISTICS OF THE THIRD DEGREE IN THE STUDY OF QUANTITATIVE
INHERITANCE. Genetics 17, 107–124 (1932).

5. Van Der Sijde, M. R., Ng, A. & Fu, J. Systems genetics: From GWAS to disease

pathways. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease 1842,
1903–1909 (2014).

6.

Zhu, Z. et al. Causal associations between risk factors and common diseases
inferred from GWAS summary data. Nat Commun 9, 224 (2018).

7. Kao, P. Y. P., Leung, K. H., Chan, L. W. C., Yip, S. P. & Yap, M. K. H. Pathway
analysis of complex diseases for GWAS, extending to consider rare variants,
multi-omics and interactions. Biochimica et Biophysica Acta (BBA) - General
Subjects 1861, 335–353 (2017).

8.

Lamparter, D., Marbach, D., Rueedi, R., Kutalik, Z. & Bergmann, S. Fast and
Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary
Statistics. PLoS Comput Biol 12, e1004714 (2016).

9. McGary, K. L. et al. Systematic discovery of nonobvious human disease models
through orthologous phenotypes. Proc. Natl. Acad. Sci. U.S.A. 107, 6544–6549
(2010).

10. Li, Z. et al. Understanding autism spectrum disorders with animal models:

applications, insights, and perspectives. Zoological Research 42, 800–823 (2021).

11. Alghamdi, S. M., Schofield, P. N. & Hoehndorf, R. Contribution of model organism
phenotypes to the computational identification of human disease genes. Disease
Models & Mechanisms 15, dmm049441 (2022).

12. Ganz, J., Melancon, E. & Eisen, J. S. Zebrafish as a model for understanding

enteric nervous system interactions in the developing intestinal tract. Methods Cell
Biol 134, 139–164 (2016).

13. Howe, D. G. et al. Model organism data evolving in support of translational

medicine. Lab Anim 47, 277–289 (2018).

14. Mancuso, C. A., Johnson, K. A., Liu, R. & Krishnan, A. Joint representation of

202

molecular networks from multiple species improves gene classification. PLoS
Comput Biol 20, e1011773 (2024).

15. Szklarczyk, D. et al. STRING v11: protein–protein association networks with

increased coverage, supporting functional discovery in genome-wide experimental
datasets. Nucleic Acids Research 47, D607–D613 (2019).

16. Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and

phylogenetically annotated orthology resource based on 5090 organisms and 2502
viruses. Nucleic Acids Res 47, D309–D314 (2019).

17. Liu, R. & Krishnan, A. PecanPy: a fast, efficient and parallelized Python
implementation of node2vec. Bioinformatics 37, 3377–3379 (2021).

18. Grover, A. & Leskovec, J. node2vec: Scalable Feature Learning for Networks. in
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining - KDD ’16 855–864 (ACM Press, San Francisco,
California, USA, 2016). doi:10.1145/2939672.2939754.

19. Levi, H., Elkon, R. & Shamir, R. DOMINO: A Novel Algorithm for Network-Based

Identification of Active Modules with Reduced Rate of False Calls.
http://biorxiv.org/lookup/doi/10.1101/2020.03.10.984963 (2020)
doi:10.1101/2020.03.10.984963.

20. Leeuw, C. A. de, Mooij, J. M., Heskes, T. & Posthuma, D. MAGMA: Generalized
Gene-Set Analysis of GWAS Data. PLOS Computational Biology 11, e1004219
(2015).

21. Tian, D. et al. GWAS Atlas: a curated resource of genome-wide variant-trait

associations in plants and animals. Nucleic Acids Res 48, D927–D932 (2020).

22. Shefchek, K. A. et al. The Monarch Initiative in 2019: an integrative data and

analytic platform connecting phenotypes to genotypes across species. Nucleic
Acids Research 48, D704–D715 (2020).

23. Mungall, C. J. et al. The Monarch Initiative: an integrative data and analytic

platform connecting phenotypes to genotypes across species. Nucleic Acids
Research 45, D712–D722 (2017).

24. The Gene Ontology Consortium et al. The Gene Ontology knowledgebase in 2023.

GENETICS 224, iyad031 (2023).

25. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat Genet

25, 25–29 (2000).

26. Geistlinger, L. et al. Towards a gold standard for benchmarking gene set
enrichment analysis. bioRxiv 674267 (2019) doi:10.1101/674267.

27. Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics,

and function using NetworkX. Proceedings of the 7th Python in Science

203

Conference (SciPy2008) 11–15 (2008).

28. Brandes, U. A faster algorithm for betweenness centrality*. The Journal of

Mathematical Sociology 25, 163–177 (2001).

29. Smedley, D. et al. BioMart – biological queries made easy. BMC Genomics 10, 22

(2009).

30. Chen, S. et al. A genomic mutational constraint map using variation in 76,156

human genomes. Nature 625, 92–100 (2024).

31. Karczewski, K. J. et al. The mutational constraint spectrum quantified from

variation in 141,456 humans. Nature 581, 434–443 (2020).

32. Exome Aggregation Consortium et al. Analysis of protein-coding genetic variation

in 60,706 humans. Nature 536, 285–291 (2016).

33. Liu, R., Yuan, H., Johnson, K. A. & Krishnan, A. CONE: COntext-specific Network

Embedding via Contextualized Graph Attention. Preprint at
https://doi.org/10.1101/2023.10.21.563390 (2023).

34. GTEx Consortium et al. A gene-based association method for mapping traits using

reference transcriptome data. Nat Genet 47, 1091–1098 (2015).

35. Kryuchkova-Mostacci, N. & Robinson-Rechavi, M. A benchmark of gene
expression tissue-specificity metrics. Brief Bioinform bbw008 (2016)
doi:10.1093/bib/bbw008.

36. Yanai, I. et al. Genome-wide midrange transcription profiles reveal expression level
relationships in human tissue specification. Bioinformatics 21, 650–659 (2005).

37. Palasca, O., Santos, A., Stolte, C., Gorodkin, J. & Jensen, L. J. TISSUES 2.0: an

integrative web resource on mammalian tissue expression. Database 2018, (2018).

38. Zhang, Y. et al. CellSTAR: a comprehensive resource for single-cell transcriptomic

annotation. Nucleic Acids Research 52, D859–D870 (2024).

39. Sinnott-Armstrong, N., Naqvi, S., Rivas, M. & Pritchard, J. K. GWAS of three

molecular traits highlights core genes and pathways alongside a highly polygenic
background. eLife 10, e58615 (2021).

40. Domingo-Relloso, A. et al. Arsenic Exposure, Blood DNA Methylation, and

Cardiovascular Disease. Circulation Research 131, (2022).

41. Liu, H. et al. Atrial fibrillation alters the microRNA expression profiles of the left atria

of patients with mitral stenosis. BMC Cardiovasc Disord 14, 10 (2014).

42. Chu, M. et al. Increased Cardiac Arrhythmogenesis Associated With Gap Junction
Remodeling With Upregulation of RNA-Binding Protein FXR1. Circulation 137,
605–618 (2018).

204

43. Whitman, S. A. et al. Desmoplakin and Talin2 Are Novel mRNA Targets of Fragile

X–Related Protein-1 in Cardiac Muscle. Circulation Research 109, 262–271 (2011).

44. Tassanakijpanich, N., Cohen, J., Cohen, R., Srivatsa, U. N. & Hagerman, R. J.

Cardiovascular Problems in the Fragile X Premutation. Front. Genet. 11, 586910
(2020).

45. Wass, S. Y. et al. Novel functional atrial fibrillation risk genes and pathways
identified from coexpression analyses in human left atria. Heart Rhythm 20,
1219–1226 (2023).

46. Ning, Z. et al. Novel Drug Targets for Atrial Fibrillation Identified Through Mendelian
Randomization Analysis of the Blood Proteome. Cardiovasc Drugs Ther (2023)
doi:10.1007/s10557-023-07467-8.

47. Cai, W. et al. asb5a/asb5b Double Knockout Affects Zebrafish Cardiac Contractile

Function. IJMS 24, 16364 (2023).

48. Shapiro, D., Lee, K., Asmussen, J., Bourquard, T. & Lichtarge, O. Evolutionary
Action–Machine Learning Model Identifies Candidate Genes Associated With
Early‐Onset Coronary Artery Disease. JAHA 12, e029103 (2023).

49. Sheng, C. et al. CALML6 Controls TAK1 Ubiquitination and Confers Protection

against Acute Inflammation. The Journal of Immunology 204, 3008–3018 (2020).

50. Qu, Y.-C. et al. Activated nuclear factor-κB and increased tumor necrosis factor-α in

atrial tissue of atrial fibrillation. Scandinavian Cardiovascular Journal 43, 292–297
(2009).

51. Liu, Y. et al. Identification of atrial fibrillation-associated lncRNAs and exploration of
their functions based on WGCNA and ceRNA network analyses. gpb 40, 289–305
(2021).

52. Zou, J. et al. Neddylation mediates ventricular chamber maturation through
repression of Hippo signaling. Proc. Natl. Acad. Sci. U.S.A. 115, (2018).

53. Maejima, Y. & Sadoshima, J. SUMOylation: A Novel Protein Quality Control

Modifier in the Heart. Circulation Research 115, 686–689 (2014).

54. Martins, I. L. F. et al. Reviewing Atrial Fibrillation Pathophysiology from a Network

Medicine Perspective: The Relevance of Structural Remodeling, Inflammation, and
the Immune System. Life (Basel) 13, 1364 (2023).

55. Zou, R. et al. Analysis of Genes Involved in Persistent Atrial Fibrillation:

Comparisons of ‘Trigger’ and ‘Substrate’ Differences. Cell Physiol Biochem 47,
1299–1309 (2018).

56. Li, C. Y., Zhang, J. R., Hu, W. N. & Li, S. N. Atrial fibrosis underlying atrial

fibrillation (Review). Int J Mol Med 47, 9 (2021).

205

57. Kawaji, T. et al. Clinical significance of ST-segment depression during atrial

fibrillation rhythm for subsequent heart failure events. European Heart Journal
Open 3, oead060 (2023).

58. Park, K.-M. et al. Atrial Fibrillation in Hypertrophic Cardiomyopathy: Is the Extent of

Septal Hypertrophy Important? PLoS One 11, e0156410 (2016).

59. Westerman, S. & Wenger, N. Gender Differences in Atrial Fibrillation: A Review of

Epidemiology, Management, and Outcomes. Curr Cardiol Rev 15, 136–144 (2019).

60. Siddiqi, H. K. et al. Sex Differences in Atrial Fibrillation Risk: The VITAL Rhythm

Study. JAMA Cardiol 7, 1027 (2022).

61. Thibault, S., Long, V. & Fiset, C. Higher Na+-Ca2+ Exchanger Function and

Triggered Activity Contribute to Male Predisposition to Atrial Fibrillation. IJMS 23,
10724 (2022).

62. Dobrev, D., Heijman, J., Hiram, R., Li, N. & Nattel, S. Inflammatory signalling in

atrial cardiomyocytes: a novel unifying principle in atrial fibrillation pathophysiology.
Nat Rev Cardiol 20, 145–167 (2023).

63. Wangler, M. F., Hu, Y. & Shulman, J. M. Drosophila and genome-wide association
studies: a review and resource for the functional dissection of human complex
traits. Disease Models & Mechanisms 10, 77–88 (2017).

64. Duncan, L. E., Ostacher, M. & Ballon, J. How genome-wide association studies

(GWAS) made traditional candidate gene studies obsolete.
Neuropsychopharmacol. 44, 1518–1523 (2019).

65. Madelaine, R. et al. A screen for deeply conserved non-coding GWAS SNPs

uncovers a MIR-9-2 functional mutation associated to retinal vasculature defects in
human. Nucleic Acids Research 46, 3517–3531 (2018).

66. Mostafavi, H., Spence, J. P., Naqvi, S. & Pritchard, J. K. Systematic differences in
discovery of genetic effects on gene expression and complex traits. Nat Genet 55,
1866–1875 (2023).

67. Chen, S., Zhang, Y. E. & Long, M. New Genes in Drosophila Quickly Become

Essential. Science 330, 1682–1685 (2010).

68. Flintoft, L. Young genes are essential too. Nat Rev Genet 12, 79–79 (2011).

69. Chen, S., Krinsky, B. H. & Long, M. New genes as drivers of phenotypic evolution.

Nat Rev Genet 14, 645–660 (2013).

70. Andersson, D. I., Jerlström-Hultqvist, J. & Näsvall, J. Evolution of new functions de
novo and from preexisting genes. Cold Spring Harb Perspect Biol 7, a017996
(2015).

71.

Iwasaki, Y., Nishida, K., Kato, T. & Nattel, S. Atrial Fibrillation Pathophysiology:

206

Implications for Management. Circulation 124, 2264–2274 (2011).

72. Nso, N., Bookani, K. R., Metzl, M. & Radparvar, F. Role of inflammation in atrial
fibrillation: A comprehensive review of current knowledge. Journal of Arrhythmia
37, 1–10 (2021).

73. Zhou, X. & Dudley, S. C. Evidence for Inflammation as a Driver of Atrial Fibrillation.

Front. Cardiovasc. Med. 7, 62 (2020).

74. Lessomo, F. Y. N., Fan, Q., Wang, Z.-Q. & Mukuka, C. The relationship between

leukocyte to albumin ratio and atrial fibrillation severity. BMC Cardiovasc Disord 23,
67 (2023).

75. Chen, Y. et al. Association between circulating leukocytes and arrhythmias:

Mendelian randomization analysis in immuno-cardiac electrophysiology. Front.
Immunol. 14, 1041591 (2023).

76. Paquissi, F. C. The Predictive Role of Inflammatory Biomarkers in Atrial Fibrillation
as Seen through Neutrophil-Lymphocyte Ratio Mirror. Journal of Biomarkers 2016,
1–14 (2016).

77. Ertaş, G. et al. Neutrophil/lymphocyte ratio is associated with thromboembolic
stroke in patients with non-valvular atrial fibrillation. Journal of the Neurological
Sciences 324, 49–52 (2013).

78. Gibson, P. H. et al. Usefulness of Neutrophil/Lymphocyte Ratio As Predictor of

New-Onset Atrial Fibrillation After Coronary Artery Bypass Grafting. The American
Journal of Cardiology 105, 186–191 (2010).

79. Conte, M. et al. Epicardial Adipose Tissue and Cardiac Arrhythmias: Focus on

Atrial Fibrillation. Front Cardiovasc Med 9, 932262 (2022).

80. Chahine, Y. et al. Epicardial adipose tissue is associated with left atrial volume and
fibrosis in patients with atrial fibrillation. Front. Cardiovasc. Med. 9, 1045730
(2022).

81. Wong, C. X. et al. Associations of Epicardial, Abdominal, and Overall Adiposity

With Atrial Fibrillation. Circ: Arrhythmia and Electrophysiology 9, e004378 (2016).

82. Karlsson, M. et al. A single–cell type transcriptomics map of human tissues. Sci.

Adv. 7, eabh2169 (2021).

83. Weil, B. R. & Ozcan, C. Cardiomyocyte Remodeling in Atrial Fibrillation and

Hibernating Myocardium: Shared Pathophysiologic Traits Identify Novel Treatment
Strategies? Biomed Res Int 2015, 587361 (2015).

84. Hao, H. et al. Atrial myocyte-derived exosomal microRNA contributes to atrial

fibrosis in atrial fibrillation. J Transl Med 20, 407 (2022).

85. Patel, K. K. et al. Genomic approaches to identify and investigate genes associated

207

with atrial fibrillation and heart failure susceptibility. Hum Genomics 17, 47 (2023).

86. Faggioni, M. et al. Suppression of Spontaneous Ca Elevations Prevents Atrial

Fibrillation in Calsequestrin 2-Null Hearts. Circ: Arrhythmia and Electrophysiology
7, 313–320 (2014).

87. Nakano, Y. Genome and atrial fibrillation. J Arrhythm 39, 303–309 (2023).

88. Roselli, C. et al. Multi-ethnic genome-wide association study for atrial fibrillation.

Nat Genet 50, 1225–1233 (2018).

89. Faggioni, M. & Knollmann, B. C. Calsequestrin 2 and arrhythmias. American

Journal of Physiology-Heart and Circulatory Physiology 302, H1250–H1260 (2012).

90. Zhang, J.-C. et al. Calcium-Mediated Oscillation in Membrane Potentials and

Atrial-Triggered Activity in Atrial Cells of Casq2R33Q/R33Q Mutation Mice. Front
Physiol 9, 1447 (2018).

91. Kalyanasundaram, A. et al. Functional consequences of stably expressing a

mutant calsequestrin (CASQ2 D307H ) in the CASQ2 null background. American
Journal of Physiology-Heart and Circulatory Physiology 302, H253–H261 (2012).

92. Li, P.-F. et al. Modulation of miR-10a-mediated TGF-β1/Smads signaling affects
atrial fibrillation-induced cardiac fibrosis and cardiac fibroblast proliferation.
Bioscience Reports 39, BSR20181931 (2019).

93. Bühler, A. et al. Atrogin-1 Deficiency Leads to Myopathy and Heart Failure in

Zebrafish. Int J Mol Sci 17, 187 (2016).

94. Li, J. et al. A noncanonical function of SKP1 regulates the switch between
autophagy and unconventional secretion. Sci Adv 9, eadh1134 (2023).

95. Boehm, B. M., Gaa, J., Hoppmann, P., Martens, E. & Westphal, D. S. The Role of

RYR2 in Atrial Fibrillation. Case Rep Cardiol 2023, 6555998 (2023).

96. Di Pino, A., Caruso, E., Costanzo, L. & Guccione, P. A novel RyR2 mutation in a

2-year-old baby presenting with atrial fibrillation, atrial flutter, and atrial ectopic
tachycardia. Heart Rhythm 11, 1480–1483 (2014).

97. Yoneda, Z. T. et al. Early-Onset Atrial Fibrillation and the Prevalence of Rare

Variants in Cardiomyopathy and Arrhythmia Genes. JAMA Cardiol 6, 1371 (2021).

98. Gillis, A. M. & Dobrev, D. Targeting the RyR2 to Prevent Atrial Fibrillation. Circ:

Arrhythmia and Electrophysiology 15, (2022).

99. Wright, S. N. et al. Genome-wide association studies of human and rat BMI

converge on synapse, epigenome, and hormone signaling networks. Cell Rep 42,
112873 (2023).

100. Liao, B.-Y. & Zhang, J. Null mutations in human and mouse orthologs frequently
result in different phenotypes. Proc Natl Acad Sci U S A 105, 6987–6992 (2008).

208

101. Beura, L. K. et al. Normalizing the environment recapitulates adult human immune

traits in laboratory mice. Nature 532, 512–516 (2016).

102. Mestas, J. & Hughes, C. C. W. Of mice and not men: differences between mouse

and human immunology. J Immunol 172, 2731–2738 (2004).

103. Stamboulian, M., Guerrero, R. F., Hahn, M. W. & Radivojac, P. The ortholog

conjecture revisited: the value of orthologs and paralogs in function prediction.
Bioinformatics 36, i219–i226 (2020).

104. Chen, H. et al. New insights on human essential genes based on integrated

analysis and the construction of the HEGIAP web-based platform. Brief Bioinform
21, 1397–1410 (2020).

105. Wagner, G. P. & Zhang, J. The pleiotropic structure of the genotype–phenotype

map: the evolvability of complex organisms. Nat Rev Genet 12, 204–213 (2011).

209

APPENDIX A4: CORE GENES

Figure A4.1: The ten most enriched biological processes associated with the top genes

in each species predicted to be related to Bardet-Biedel Syndrome (BBS): This figure

show results from the original GenePlexusZoo paper14. A classifier trained using human

BBS genes was used to predict the BBS-related genes in model organisms within the

multi-species network. In this graph, nodes represent the ten most enriched biological

processes of each species and are colored by species they are identified in. Edges

represent semantically similar processes. Biological process nodes with thick borders

(and bolded labels) represent those processes in which at least one species in which

none of the annotated genes are orthologous to any human BBS gene. The process

was instead implicated through network connections.

210

Figure A4.2: Genes in gene modules for celiac disease. Tan nodes are the original

training genes used in GenePlexusZoo, and green nodes are predicted genes from

GenePlexusZoo.

211

Figure A4.3: Predicted disease module for GWAS age related macular degeneration.

212

Figure A4.4: A disease module for GWAS systemic lupus erythematosus.

213

Figure A4.5: Displaying GWAS genes of celiac disease within a t-SNE of the human

STRING network, with each module displayed. The larger nodes are the predicted core

genes, and each gene corresponds to a module (see legend). GWAS core genes are

distributed across modules for a trait,and the modules with multiple core genes have

those genes as neighbors in the network.

214

Figure A4.6: Gene predictions for yeast (top) and worm (bottom) for atrial fibrillation.

215

Figure A4.7: Comparing model weights for yeast (top) and worm (bottom) across

modules for atrial fibrillation.

216

Figure A4.8: GOBP and phenotype enrichments for zebrafish for atrial fibrillation across

modules.

217

Figure A4.9: GOBP and phenotype enrichments for yeast for atrial fibrillation across

modules.

218

Figure A4.10: GOBP and phenotype enrichments for worm for atrial fibrillation across

modules.

219

CHAPTER 5: SUMMARY, LIMITATIONS, REFLECTION, AND FUTURE DIRECTIONS

Summary

A primary goal of this dissertation was to propose methods to unravel the complexity of

human disease by providing insights into the validity of interpreting diseases as multiple

distinct subsets. These projects show how applying our methods to experimental data

that is very complex and noisy can lead to discovery of relevant biology. These results

are crucial because we know that diseases are modular and composed of multiple

subsets of genes that contribute to distinct aspects of disease biology, but explaining

how this biology applies to human disease remains extremely challenging. The

discussed projects are united by using network biology to analyze complex human

disease and traits to create general-purpose methods that provide biological insight

across many different diseases. In chapter 2 we validate that we can use modules to

discover relevant chronic inflammation genes, processes, and drug repurposing

candidates for multiple common human complex diseases which are known to have

inflammatory responses, but for which the genetic basis and mechanism had not been

unraveled. By comparing modules across diseases we can learn which diseases

manifest inflammatory phenotypes in a similar manner. The next projects in the

dissertation delve deeply into aspects of the pipeline used in chapter 2. In chapter 3, we

implemented a new method to utilize module information to improve disease gene

classification by refinement of an initial geneset to remove false positives and reclassify

false negatives. This method allows for easier discovery of enriched, biologically

relevant phenotypes and processes with increased statistical power. Using modules, we

uncovered biology relating to experimental datasets that were not found when

considering the dataset as a whole. Lastly, in chapter 4 we demonstrated a pipeline

where genotype-derived GWAS data for diseases can be broken up into smaller

network-driven modules to improve performance of gene classification, then aggregated

back into a final predicted disease module. This technique -- validated and justified in

chapter 3 – allowed us to use the omnigenic framework to explain the biology of

complex disease through classifying functional relevant core genes, relevant tissues

and cell types, relevant enriched GOBPs, and genes that have important orthologs in

220

model organisms. Our method for classifying core genes is general-purpose and does

not require assumptions about gene knowledge of specific diseases. These core genes

are functionally relevant, and definitionally have many relationships to other disease

genes. Chapter 2 has been published and the code released, while the code for chapter

3 is soon to be ready for release after publication. We implemented general-purpose

methods that work with large amounts of diverse gene sets spanning multiple types of

experiments for multiple complex diseases and traits. Interpreting diseases in the

contexts of modules improves the performance of common computational methods, and

is key for discovering how mechanistically relevant genes contribute to disease from

highly heterogeneous data.

Limitations

This dissertation makes strides in implementing and improving general-purpose

computational methods and delving deep into considering how to best learn more about

disease biology. There are notable limitations that we had to work around to obtain

meaningful results. Broadly, a clear issue is the limited knowledge of complex disease

biology in the form of gold standards. Two examples of this limitation are for modules

and for core/peripheral genes in an omnigenic framework. We have justified our use of

modules for computational methods and biological insight despite there being no ground

truth. True modules for disease are undiscovered because the question of what a

module is is controversial in the first place. There is no correct answer on what is an

ideal sized module, how many processes it would be enriched for, what types of

phenotypes or how many, and many more biological questions. Despite their ambiguity,

we showed that modules are useful tools for the methods we implemented. We are not

claiming that the modules used for each geneset are the final modules that would ever

be discovered for any disease. Modules will always change as both geneset

annotations and networks advance as more information is discovered, but this is

beneficial as it will make our methods even more useful when more biological data is

annotated.

A second place where gold standards are highly limited is in core and peripheral genes.

No genesets of these exist for disease. In fact, one of the critiques of the omnigenic

model is that the biology behind core genes is not clearly articulated1. In chapter 4,

221

since there is no gold standard, we had to choose our own definition of core genes. First

we derive a disease module by implementing ModGenePlexus to predict genes that are

likely missing and adding these to our original disease gene module. Then, we define

core/peripheral genes using this module, where the genes with the highest

betweenness centrality values are considered core genes. This definition utilizes only

network connections to determine what genes are core, with the idea being that there

will be a small subset of genes that are highly connected to the rest of the hundreds of

disease genes and that these highly connected genes are more likely to be directly

involved in an important disease mechanism. These genes will have the highest

betweenness centrality because they are the most connected to the other disease

genes. This was done to create a general purpose method for predicting core genes

that can be investigated, rather than relying on prior assumptions about biology to

define potential core genes.

Reflection: Discovering chronic inflammation processes for complex disease

In chapter 2, we asked the question of whether we can isolate inflammation processes

for complex diseases that are known to have a relationship to inflammation, but the

biology behind that relationship is unclear. We introduce inflammation as an

endophenotype for complex disease, meaning it is an intermediate phenotype that

underlies disease. We are not the first to consider inflammation as an endophenotype in

general or for specific complex diseases2–5, and we consider the concept invaluable

because it is known that multiple distinct phenotypes underlie complex disease.

Answering how inflammation underlies each complex disease we investigated in

chapter 2 would not only give a genetic explanation for the inflammation, but also could

allow the development of hypotheses of whether inflammation is a good indicator of a

disease being manifested in the first place. The motivation for the drug repurposing

method is to find drugs that target the specific inflammation component of complex

diseases. We validate that this is possible, but the usefulness depends on whether

inflammation is considered a highly important process underneath the disease. For

non-autoimmune disorders, these answers are still unknown. We put a lot of thought

into how to relate chronic inflammation to disease, and this is where the idea of using

network-based gene classification came from. We originally tried using DIAMOnD6, but

222

this failed because we would get very large genesets that did not mean anything. Using

GenePlexus allowed for a refinement of the genelist to one that passed our designed

permutation tests for finding biological associations between actual discovered disease

clusters and inflammation. This chapter resulted in the completion of a general pipeline

that answered specific questions about inflammation in complex disease. We showed

that using modules is valid in discovering meaningful biological information about a

particular phenotype and predicted drugs that targeted that phenotype at the module

level. The future projects in this dissertation modify and improve aspects of this pipeline.

Reflection: Implementing ModGenePlexus as an extension of GenePlexus

In chapter 2 we used a simple process for discovering disease clusters. We subsetted

the network to include disease genes and clustered this disease-specific subnetwork

using the Leiden Clustering algorithm7. Two considerations of this method are (i) we

cluster after expanding the entire disease list, and (ii) this algorithm is run on a disease

subnetwork, which means it does not take the rest of the genome into account. A major

theme of this dissertation is that diseases are made up of multiple meaningful subsets

of genes where not all genes interact equally together. Because of this, running

GenePlexus on an entire disease geneset at once is questionable. If we want to gain

insight into the inflammation component of a specific disease, why would it make sense

to expand on other processes that have nothing to do with inflammation? This is the

original motivation behind ModGenePlexus. Implementing this method ran into

numerous issues. This can be seen in the simulation where fake traits were created

from multiple GOBP. ModGenePlexus here re-created the performance of simply

creating one model at once. This result was initially surprising but is ultimately a

reflection of the fact that GenePlexus is a very good method optimized for predicting

well-annotated genesets, and improving a good method is a challenging task. These

challenges are what led us to integrate multiple computational methods that all work

together to improve results. Most importantly, this was our motivation for focusing on

real-world experimental data for which GenePlexus performance is typically worse than

on well-annotated genesets. The primary goal of the GenePlexus software is to make it

more usable for tasks scientists actually perform, like downstream analysis of

experimentally determined genesets. ModGenePlexus directly improves GenePlexus

223

through additional geneset refinement. For the analysis where we train on nominal

p-value genes of MAGMA predictions and evaluate on stringent p-value understudied

genes, this was born out of the observation that the stringent genesets performed poorly

with ModGenePlexus. ModGenePlexus does well the more complex and large the initial

input is. This is because larger genesets allow for more dense and complete clusters to

be discovered. This is an ideal result – where our method improves for data that

performs particularly poorly.

Reflection: Interpreting diseases omnigenically through geneset refinement of

noisy experimental data and geneset refinement

GWAS data is useful because it has become relatively straightforward and cheap to

sequence participants. However, an increased number of studies with increased sample

sizes has created many problems for interpretation and disease etiology. Our reliance

on biological networks in these projects, and our interest in the vast complexity of many

human traits means the omnigenic model is a very important concept to us. We

considered for a long time how to create a general-purpose method that can predict

core genes across multiple diseases. Rather, we leveraged the network and utilized it

directly in defining core genes as those that are very important in the context of all

disease gene annotations. The methods we implemented in this dissertation were all

essential to be able to do this final chapter. We used GenePlexus to do gene

classification on the entire genome, ModGenePlexus for improved performance on large

scale experimental datasets like GWAS, and modules to investigate how core genes

relate to underlying disease biology. Integrating GenePlexusZoo introduces the ability to

transfer this knowledge to other species and discover meaningful orthologs that can be

used in hypothesis generation to explain relevant aspects of disease biology . Atrial

fibrillation is the perfect complex disease to use as a test case as it is very complex,

common, has hundreds of gene annotations, and is involved in multiple diverse tissue

and cell types. We provide ample evidence that our method for calculating core genes

finds genes relevant to disease function.

224

Overall reflection: Leveraging complexity and gene relationships is what will

uncover new breakthroughs in understanding human complex trait etiology

A generation ago the scale of complexity of human traits was still unknown8. GWAS is

what revealed that the number of relevant loci was not in the dozens, but much larger9.

Many methods are based on trying to isolate highly significant, large effect size genes

and focusing on them, rather than focusing on using the underlying polygenic

architecture because of the challenges it presents. However, we show in this

dissertation that utilizing large amounts of genes is useful in multiple computational

contexts. We find relevant inflammation phenotypes for diseases where inflammation is

harder to define when considering the hundreds of gene annotations, we perform gene

classification better for datasets containing thousands of genes – uncovering truly

relevant biology, and we discover functionally relevant core genes. Genome-wide

networks are a tool built from countless studies and datasets of vast amounts of biology,

and the gene relationships are meaningful. By leveraging this data, we can provide

context to hundreds of relevant genes recovered in experimental datasets. We saw how

actually using less significant genes led to better prediction of highly significant genes in

GWAS in chapter 3, and in chapter 4 we show that using GenePlexus to create a

disease module, rather than using only the very few top predictions for hypothesis

generation, allowed us to recover functionally relevant core genes and define them as

core within a network. These methods and the positive results only make sense with the

knowledge that diseases are highly complex and contain hundreds of true gene

annotations. Understanding this polygenic environment will be vital in the future for

being able to interpret mutations that are very rare or are novel. Networks explain how

unique mutations can cause the same disease across patients, because a module

within a network is being disrupted. The technology to sequence people is here, but it is

determining how a disease maps in biological networks that will explain disease

genetics broadly and for specific individuals.

Future directions

In chapter 2, we would like to expand on the work done to investigate endophenotypes

further. Specifically, we would like to use GWAS of molecular, directly measurable traits

that are “proxies'' for higher level complex disease that is difficult to diagnose genetically

225

and phenotypically. A classic example of a disease with clinically useful

endophenotypes is schizophrenia, where an example is prepulse inhibition, the

inhibition of the startle reflex in response to weak prestimulus10. A vital question is

whether molecular traits that underlie inflammation have significant enrichment with

complex diseases with inflammation components. The challenge of using the

inflammasome or other large entities like the thrombosome and fibrosome as

endophenotypes5 is that it is unclear how to use that knowledge in clinical settings, as

there are multiple biomarkers for inflammation11. One way to do this proposal is to use

GWAS Atlas as a source because it contains many molecular phenotype summary

statistics. We can determine which phenotypes relate to chronic inflammation through

either gene overlap or GenePlexus, and then see if these phenotypes also relate to

complex disease using methods proposed in chapter 2-4, integrating what we have

learned about how to utilize the polygenic architecture of diseases to our advantage.

For chapter 3, the primary future directions for this project are to integrate it into the

GenePlexus webserver12. This is a technical challenge in itself given the space

requirements needed to create many models for a disease rather than a single one. A

biological direction to take this project is to directly compare the results of multiple

GWAS that were conducted on the same disease, and see how modules and results

differ. This would be especially interesting as we could quantify how modules change

over time in terms of gene assignments and size when comparing older GWAS to newer

ones.

For chapter 4, the primary goal is to do cross-GWAS analysis at the core gene level and

validate making model organism phenotype predictions at the module level. These

questions are vital in terms of being able to answer and state implications about the

omnigenic model in the context of networks for human disease, and in being able to

transfer this knowledge to other organisms using our methods. This knowledge transfer

is important for generating hypotheses for further biological research.

226

REFERENCES

1. Sinnott-Armstrong, N., Naqvi, S., Rivas, M. & Pritchard, J. K. GWAS of three

molecular traits highlights core genes and pathways alongside a highly polygenic
background. eLife 10, e58615 (2021).

2. Barber, R. et al. P3‐267: An inflammatory endophenotype of Alzheimer’s disease.

Alzheimer’s &amp; Dementia 6, (2010).

3. Guglielmo, R., Miskowiak, K. W. & Hasler, G. Evaluating endophenotypes for bipolar

disorder. Int J Bipolar Disord 9, 17 (2021).

4. Genkel, V. V. & Shaposhnik, I. I. Conceptualization of Heterogeneity of Chronic

Diseases and Atherosclerosis as a Pathway to Precision Medicine: Endophenotype,
Endotype, and Residual Cardiovascular Risk. Int J Chronic Dis 2020, 5950813
(2020).

5. Ghiassian, S. D. et al. Endophenotype Network Models: Common Core of Complex

Diseases. Sci Rep 6, 27414 (2016).

6. Ghiassian, S. D., Menche, J. & Barabási, A.-L. A DIseAse MOdule Detection

(DIAMOnD) Algorithm Derived from a Systematic Analysis of Connectivity Patterns
of Disease Proteins in the Human Interactome. PLoS Comput Biol 11, e1004120
(2015).

7. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing

well-connected communities. Sci Rep 9, 5233 (2019).

8. Baxter, R. C., Martin, J. L. & Beniac, V. A. High molecular weight insulin-like growth
factor binding protein complex. Purification and properties of the acid-labile subunit
from human serum. J Biol Chem 264, 11843–11848 (1989).

9. Claussnitzer, M. et al. A brief history of human disease genetics. Nature 577,

179–189 (2020).

10. Greenwood, T. A., Shutes-David, A. & Tsuang, D. W. Endophenotypes in

Schizophrenia: Digging Deeper to Identify Genetic Mechanisms. J Psychiatr Brain
Sci 4, e190005 (2019).

11. Liu, C. H. et al. Biomarkers of chronic inflammation in disease development and
prevention: challenges and opportunities. Nat Immunol 18, 1175–1180 (2017).

12. Mancuso, C. A. et al. GenePlexus: a web-server for gene discovery using

network-based machine learning. Nucleic Acids Research 50, W358–W366 (2022).

227

CHAPTER 6: INTEGRATING GENE PRIORITIZATION METHODS FOR SUMMARY

GWAS STATISTICS

Predicting (prioritizing) which genes are targeted by SNPs in GWAS is a difficult

challenge because most discovered SNPs fall within non-coding regions of the

genome1. Multiple methods have been implemented to perform the task. In chapter 1

we discuss methods that utilize physical distance of implicated SNPs to genes in the

genome. These are methods such as MAGMA2 and Pascal3. Chapter 1 additionally

discusses that these methods are quite good at finding the genes most SNPs target.

However, they are quite naive and other methods have been implemented that integrate

other biological data to better predict genes certain SNPs target.

One such method is known as Transcriptome Wide Association Studies (TWAS). Briefly,

what makes TWAS unique is that it uses gene expression data – specifically eQTL

associations – to prioritize the genes that SNPs target. If a SNP is associated with

expression of a particular gene, and if that SNP is implicated in a GWAS, then the gene

it is a eQTL for is the gene that should be associated, not simply the genes nearby. A

primary example of a TWAS method is the PrediXcan family4–6. PrediXcan uses

transcriptome data sets from GTEx7,8 to build models that relate SNPs to gene

expression within each tissue. Each tissue in GTEx had a model built for it because

gene expression is different depending on biological context. Next, these models are

used to associate the genotypes of participants in the GWAS cohort to predict

expression for those individuals. The expression is predicted for the specific tissue

model used. Lastly, this predicted expression is associated with GWAS case/controls. If

a SNP is seen in the cases quite often, then the model chooses targets based on how

that SNP is related to gene expression data.

A third method for prioritization is to integrate Hi-C data to discover interactions. TWAS

studies use expression data to relate implicated SNPs to genes, and Hi-C has a similar

motivation by investigating how genes in enhancer regions interact with promoter

regions of genes. When the DNA is unwrapped in the nucleus, the DNA is moving

around in the cell environment, and this can lead to interaction between regions of DNA

that can be far away from each other. This is how proteins bound to enhancer regions

can target proteins that are far away from a kilobase perspective, but interact in the

228

3-Dimensional cell environment. Using Hi-C data with GWAS data means that if an

implicated SNP falls within an enhancer region, the regions on the genome the

enhancer are known to interact with are considered as targets for those SNPs. Multiple

methods have been implemented that integrate Hi-C data9,10.

At the start of this PhD program, the first project that I worked on was not any of the

chapters presented in this dissertation. It was to build a method that works across many

GWAS summary statistics to improve gene prioritization from SNP data using multiple

types of data and methodology including locality base-methods, TWAS, a novel

implementation to integrate Hi-C data in prioritization, and a novel network-based

method to integrate the predictions. Specifically, we designed a project to create a

general purpose method that would, for each GWAS, perform gene prioritization using a

physical distance method (Pascal), a TWAS method (PrediXcan), and a new

implementation for integrating Hi-C data to prioritize summary GWAS statistics from UK

Biobank11.

The first implementation for this project was modification to the Pascal method that

would integrate Hi-C data indicating enhancer/target regions within the genome. We

modified the Java source code for Pascal to integrate and use this data. The window for

considering genes was defined as the target regions of enhancer SNPs. If they targeted

gene promoters, these genes were considered for prioritization. This method was

termed 3DPascal. This method was implemented and results were compiled, but there

were numerous difficulties. A major challenge with this method is that it also requires

tissue and cell specific data, as gene regulation and 3D contacts are context specific5,12.

We ran 3DPascal for every GWAS, for 21 Hi-C datasets of tissues and cell lines13. A

major bottleneck here was computational scalability, as running this in parallel for

thousands of GWAS would take days.

In 2019 and 2020, the software H-MAGMA9 was released and then published.

H-MAGMA modified the original MAGMA method for gene prioritization based on

physical distance to integrate Hi-C information. A motivation of 3DPascal was to create

a user friendly method for doing this by modifying a well validated gene prioritization

method with open source code, and H-MAGMA accomplishes that.

229

After the publication of this method, we focused on integrating results of different

prioritization methods into one final prediction score for the gene. We wanted to use

networks to accomplish this. Nodes in a network would be weighted by how many gene

prioritization modalities implicated them, and we would do a random walk to see how

often genes are in the paths between other implicated nodes. We first investigated how

often modalities predicted genes, noticing that if genes were ever predicted, 3DPascal

would often find them (Figure 6.1). This could mean that there are a lot of false positive

predictions in 3DPascal and would further motivate using a network to provide context

in which predicted genes are good. While implementing this method, a new study14 was

released that integrates 13 different methods to link SNPs to genes, including physical

distance, enhancer gene linking, Hi-C data and scATAC-seq data. They show increased

performance for prioritization with their method compared to other prioritization

aggregation methods, and do analysis on dozens of GWAS from UK Biobank. In other

words, this method and study was an expanded version of what we were planning to

release.

While these projects ultimately did not finish, it was still foundational in influencing our

views of complex disease and traits in humans. Gene prioritization is a challenge with

GWAS because of the polygenic architecture complex traits have. It is not obvious how

to relate non-coding variants to genes, and to improve results when multiple types of

biological data are being integrated. Giving context to entities like SNPs and genes

allows for the discovery of novel biology. The idea to utilize biological networks as a

centerpiece in our work also came about with trying to integrate gene prioritization

scores across methods. The idea is that if gene predictions are meaningful, they should

be well connected within a biological network. This was the first project that made us

consider the nuances in integrating experimental results like GWAS with a network to

denoise and better refine the results. Using a network specifically to remove false

positives from our 3DPascal motivation is very analogous to our using a network to

remove false positives of experimental genesets in chapter 3. This project was where

that idea initially came from.

230

Additionally, many of the methods used in these first projects were directly used in

chapter 2-4 of this dissertation. In chapter 2, we utilized the non-disease GWAS from

UK Biobank to compare them to complex disease gene classification predictions, and if

diseases had more enrichment to inflammation relative to traits we know are not related.

In chapters 2-4, we used gene prioritization results for summary statistics to do some

methods or to analyze method validity with genotypic data. In chapter 2, we used

Pascal3, which uses physical distance of SNPs to genes to predict which genes are

relevant based on the significant SNPs found in the GWAS study. MAGMA is a similar

method that we used for chapters 3 and 4.

Figure 6.1: The percentage of gene hits for GWAS studies by each modality or

combination of modalities. Notably, 3DPascal (hicgenes) hits most of the genes that are

predicted by any method, and Pascal and TWAS only predict a small percentage of

genes because 3DPascal predicts so many. This was our motivation for using networks

to integrate the results, as it would hopefully remove false positive 3DPascal results that

do not relate well to the other predictions.

231

REFERENCES

1. Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide
association loci for human diseases and traits. Proc. Natl. Acad. Sci. U.S.A. 106,
9362–9367 (2009).

2. Leeuw, C. A. de, Mooij, J. M., Heskes, T. & Posthuma, D. MAGMA: Generalized
Gene-Set Analysis of GWAS Data. PLOS Computational Biology 11, e1004219
(2015).

3. Lamparter, D., Marbach, D., Rueedi, R., Kutalik, Z. & Bergmann, S. Fast and

Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary
Statistics. PLoS Comput Biol 12, e1004714 (2016).

4. Gamazon, E. R. et al. A gene-based association method for mapping traits using

reference transcriptome data. Nat Genet 47, 1091–1098 (2015).

5. Barbeira, A. N. et al. Exploring the phenotypic consequences of tissue specific gene
expression variation inferred from GWAS summary statistics. Nat Commun 9, 1825
(2018).

6. Barbeira, A. N. et al. Integrating predicted transcriptome from multiple tissues

improves association detection. PLoS Genet 15, e1007889 (2019).

7. GTEx GWAS Working Group et al. Exploiting the GTEx resources to decipher the

mechanisms at GWAS loci. Genome Biol 22, 49 (2021).

8. Lonsdale, J. et al. The Genotype-Tissue Expression (GTEx) project. Nat Genet 45,

580–585 (2013).

9. Sey, N. Y. A. et al. A computational tool (H-MAGMA) for improved prediction of

brain-disorder risk genes by incorporating brain chromatin interaction profiles. Nat
Neurosci 23, 583–593 (2020).

10. Watanabe, K., Taskesen, E., van Bochoven, A. & Posthuma, D. Functional mapping
and annotation of genetic associations with FUMA. Nat Commun 8, 1826 (2017).

11. Sudlow, C. et al. UK Biobank: An Open Access Resource for Identifying the Causes
of a Wide Range of Complex Diseases of Middle and Old Age. PLOS Medicine 12,
(2015).

12. Pal, K., Forcato, M. & Ferrari, F. Hi-C analysis: from data generation to integration.

Biophys Rev 11, 67–78 (2019).

13. Schmitt, A. D. et al. A Compendium of Chromatin Contact Maps Reveals Spatially

Active Regions in the Human Genome. Cell Reports 17, 2042–2059 (2016).

14. Gazal, S. et al. Combining SNP-to-gene linking strategies to identify disease genes

and assess disease omnigenicity. Nat Genet 54, 827–836 (2022).

232