NETWORK BIOLOGY POWERED BY GRAPH DEEP LEARNING: UNDERSTANDING THE MOLECULAR BASIS OF HUMAN GENETICS

By

Renming Liu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science & Engineering—Doctor of Philosophy

2024

ABSTRACT

Graph learning is revolutionizing the 25-year-old field of network biology by advancing our ability to gain novel molecular insights into biological functions and complex diseases using genome-scale molecular interaction networks. Earlier computational network biology methods are limited in their accuracy, scalability, and biological context-specificity. This thesis develops several effective and efficient graph learning methods to further our ability to uncover the multifaceted, complex, and context-specific roles of human genes. In living cells, biochemical and biomolecular entities, like proteins, RNA, and DNA, intricately interact to exert and regulate biological functions and express molecular traits. This interconnected nature renders the network biology perspective, which models intermolecular relationships as a graph, quintessential to understanding genes' biological roles. Graph learning, a young field evolving rapidly thanks to its versatility and broad applicability, aims to build powerful models that unravel complex interaction patterns to accurately classify node labels, opening exciting opportunities to understand human genetics. This thesis contributes toward a deeper understanding of human genetics using gene interaction networks by developing algorithms, methods, applications, benchmarks, and software. Specifically, through a comprehensive study using various biological networks and gene classification tasks, I first demonstrate that building machine learning models using the network adjacency matrix systematically achieves more accurate gene classifications than traditional network propagation-based approaches.
Despite its superior performance, the proposed approach does not scale well to large and dense networks, hindering its applicability. To resolve this limitation, I next develop several effective and efficient graph representation learning methods and software by carefully considering the data characteristics of biological networks and the context-specific nature of biological systems. Finally, I consolidate a comprehensive resource for graph deep learning-based gene classification, allowing future developments to be easily built on top of the work done in this thesis.

Copyright by
RENMING LIU
2024

To my parents, for their unwavering love, support, and belief in me, without which my achievements would linger as mere shadows of dreams.

ACKNOWLEDGEMENTS

I am deeply grateful to my advisor, Dr. Arjun Krishnan, for his immense support and mentorship throughout my Ph.D. journey, training me to think critically and articulate my scientific research clearly, unconditionally supporting all of my curiosity and passion, and always believing in me. Arjun taught me to think like a scientist, an invaluable skill that has profoundly shaped my research approach. He introduced me to the fascinating world of machine learning and guided me through it, igniting a passion that became the cornerstone of my dissertation. Arjun has offered extraordinary guidance during all of my critical milestones, such as preparing for conference talks, the comprehensive exam, the dissertation defense, and the job search process. I am genuinely grateful for the numerous insightful discussions we have had, during which he generously shared his experience and advice. These conversations have been instrumental in shaping my professional development across various stages. I am also extremely thankful to Dr. Matthew J. Hirn (Matt), who co-advised me with Arjun for the first three years of my Ph.D. and taught me the solid mathematical foundations of modern deep learning.
Matt always explains the intimidatingly complicated mathematical terms and theories behind deep learning in the most straightforward and intuitively understandable manner. Learning these theories from Matt was a joyful experience, as weird as it may sound. His way of mathematical thinking has forever influenced mine. The gratitude I feel towards both Arjun and Matt is beyond words. Their dedication and steadfast support have been pivotal in my Ph.D. journey. I feel extraordinarily fortunate to have had them as my mentors, and I know I would not be where I am today without their guidance and encouragement. I want to express my heartfelt appreciation to my Ph.D. committee members, Dr. Christina Chan, Dr. Jiliang Tang (JT), and Dr. Yuying Xie, for their insightful and constructive suggestions that significantly shaped my dissertation. I am especially thankful to JT for his extraordinary support during the crucial final stages of my Ph.D., providing thoughtful feedback on my preparations for my job search and dissertation defense. I am grateful to have been additionally trained by JT from a computer scientist's perspective. JT's guidance has broadened my scientific perspective and skills, leaving a lasting impact on my professional development. I also want to thank Yuying for guiding me and introducing me to the exciting fields of multi-omics and spatial single-cell analysis. I feel deeply blessed to have been mentored by such dedicated and brilliant minds. I am incredibly fortunate to have been a part of several supportive and inclusive labs, including Arjun's Krishnan Lab, Matt's CEDAR Team, and JT's DSE Lab, all of which have provided healthy working environments to develop my research and skills. I am greatly thankful to every colleague in each of these labs. I am especially grateful for the help from and discussions with Dr. Christopher A. Mancuso (Chris) and Dr. Kayla A. Johnson from the Krishnan Lab.
Chris has been a fantastic mentor who provided tremendous hands-on guidance to kick-start my research at the very beginning of my Ph.D., and he has continued to do so ever since. Kayla has been unwaveringly supportive of everyone in the lab, going out of her way to assist anyone in need and offering help wherever she can. I am incredibly thankful for her responsive feedback on my writing countless times. I am also grateful for the many enriching and inspiring discussions about cell biology, biomedical technologies, and human complex diseases with Kayla, Dr. Stephanie L. Hickey, Alex McKim, and Hao Yuan from the Krishnan Lab. In addition, I would like to thank Chris, Kayla, Hao, Anna Yannakopoulos, and Keenan Manpearl in the Krishnan Lab for contributing to the exciting projects we worked on. I am deeply thankful to all my colleagues for their help and support, which have greatly enhanced my personal and professional growth. I am immensely grateful to have had the privilege to work with many brilliant collaborators. It has been a great pleasure working on various projects related to graph deep learning with folks from Mila, particularly Semih Cantürk, Dr. Ladislav Rampášek, Dr. Dominique Beaini, and Dr. Guy Wolf. I was extremely blessed to be able to work closely with Ladislav and Semih on cutting-edge research problems in graph learning. I also want to thank my collaborators in the DANCE team led by JT, including Wenzhuo Tang, Jiayuan Ding, Hongzhi Wen, and Xinnan Dai, for working hard together on pioneering work on deep learning methods for single-cell analysis. Working with such talented people has been an amazing experience, and I look forward to continuing these collaborations in the coming years. I want to thank each and every one of my friends for being there for me, supporting me, and sharing countless unforgettable moments.
Thank you, Xin Gu, Yi Duan, Qiuyuan Gao, Yucong Qin, and Liyi Chen, for sharing an incredible part of my life when I was young and dedicated to music in a choir and a symphony orchestra. I could not be more grateful to have met and become close friends with Zhengyi Lian, Ziyang Li, Qing He, Haoming Zheng, Youqing Chen, Zhaoming Li, Zhenbo Fang, Shiqi Chen, and Yanchao Pan. Each of you has enriched my life immeasurably. Lastly, I would like to thank all my closest friends during my Ph.D. who have been invaluable to me, including Jiaxin Yang, Tianyu Yang, Wenzhuo Tang, Kaiqi Yang, Hanbing Wang, Yaxin Li, Haitao Mao, and Kai Guo. I am truly blessed to have known all my friends in my life. Finally, I must extend my deepest gratitude to my Mom, Dad, and the rest of my family for their unconditional love, unwavering support, and steadfast belief in me. I am especially grateful to my Mom, who has wholeheartedly supported every decision I have made and consistently gone above and beyond to help me whenever needed. The depth of appreciation and indebtedness I feel towards my parents is immeasurable and light-years beyond what words can adequately express. Thank you, Mom, Dad, and all my family members, for everything you have done for me. You have truly shaped the person I am today.

TABLE OF CONTENTS

CHAPTER 1  INTRODUCTION
  1.1  Background
  1.2  Dissertation contributions
  1.3  Dissertation structure

CHAPTER 2  SUPERVISED LEARNING IS AN ACCURATE METHOD FOR NETWORK-BASED GENE CLASSIFICATION
  2.1  Introduction
  2.2  Methods
  2.3  Results and discussions
  2.4  Conclusion
  2.5  Extensions

CHAPTER 3  PECANPY: A FAST, EFFICIENT AND PARALLELIZED PYTHON IMPLEMENTATION OF NODE2VEC
  3.1  Introduction
  3.2  PecanPy implementation and optimization
  3.3  Benchmarking computational efficiency and quality of embeddings
  3.4  Results and discussions
  3.5  Conclusion

CHAPTER 4  ACCURATELY MODELING BIASED RANDOM WALKS ON WEIGHTED NETWORKS USING NODE2VEC+
  4.1  Introduction
  4.2  Methods
  4.3  Experiments
  4.4  Conclusion

CHAPTER 5  CONE: CONTEXT-SPECIFIC NETWORK EMBEDDING VIA CONTEXTUALIZED GRAPH ATTENTION
  5.1  Introduction
  5.2  Preliminaries
  5.3  Methods
  5.4  Experiment setup
  5.5  Results and discussions
  5.6  Conclusion

CHAPTER 6  OPEN BIOMEDICAL NETWORK BENCHMARK: A PYTHON TOOLKIT FOR BENCHMARKING DATASETS WITH BIOMEDICAL NETWORKS
  6.1  Introduction
  6.2  OBNB system description
  6.3  Benchmarking experiment setup
  6.4  Results and discussions
  6.5  Conclusion

CHAPTER 7  CONCLUSION
  7.1  Summary
  7.2  Reflections and limitations
  7.3  Future directions

BIBLIOGRAPHY

APPENDIX A  GENEPLEXUS

APPENDIX B  PECANPY

APPENDIX C  OBNB

CHAPTER 1
INTRODUCTION

1.1 Background

Grasping the precise functional roles of human genes and the contexts in which they operate is pivotal for uncovering the molecular basis of human genetics. This understanding not only facilitates the connection between genotype and phenotype [109] but is also instrumental in developing treatment strategies for complex diseases [19, 261].
Despite the recent landmark achievement of the complete, telomere-to-telomere assembly of the human genome in 2022 [186], our knowledge of every human gene remains greatly incomplete, as experimentally annotating and validating gene functions is exceedingly time-consuming and resource-intensive [130, 11]. As of January 2024, only about half of the 20K human protein-coding genes are functionally annotated with high confidence (based on Gene Ontology annotations, http://release.geneontology.org/2023-11-15/annotations/goa_human.gaf.gz, for GO biological processes with EXP or HTP evidence codes). This knowledge gap is further exacerbated by the fact that the functional roles of genes and proteins inherently depend on their biological contexts, such as specific tissues and cell types [93, 80]. Developing accurate predictive models to annotate genes' biological roles, and making these resources and tools easily accessible to biomedical researchers, is vital to closing this gap.

Learning about genes through the lens of network biology

Biological systems are astonishingly complex. Different biological entities, such as proteins, RNA, and DNA, intricately interact to carry out specific functions or traits. At the cellular level, collections of proteins and RNA bind together to form functional units, such as the pre-initiation complex and ribosomes, that carry out vital processes. At the genomic level, regulatory elements, like promoters and enhancers, interact with transcription factors and other regulatory proteins to precisely control the expression of one or several genes. Cascading these gene regulatory circuits leads to the highly complex mechanisms underlying cellular differentiation [214], cell fate determination [233, 134], and diseases [168]. Network biology is a 25-year-old field that aims to understand genetics using these diverse and intricate interactions, which are formulated into molecular interaction networks [298, 19, 21, 241, 261].
In these networks (more commonly referred to as graphs in mathematical terms), nodes represent genes or proteins and edges represent the functional associations between them. Since the introduction of network biology, much progress has been made toward deepening our understanding of genes by analyzing the structural and proximal characteristics of molecular interaction networks. On the one hand, these networks provide rich structural priors that correlate with genes' functional roles. For example, essential genes tend to have a larger number of interacting partners than non-essential genes [77, 110]. One reasonable hypothesis is that essential genes are more evolutionarily conserved and that newly emerged genes are more likely to interact with high-degree genes [22]. Furthermore, cancer genes usually have higher connectivity than non-cancer genes [114], and "bottleneckness" is a good indicator of gene essentiality [289]. On the other hand, the notion of proximity in these networks characterizes the relationships between groups of genes. Genes or proteins participating in the same biological process or pathway tend to co-express or physically interact [19, 148]. Similarly, genes associated with a particular disease are more likely to reside in a distinct neighborhood of the molecular interaction network than expected by chance. These neighborhoods are commonly referred to as "disease modules" and have shed light on the intricate interplay between cellular processes and diseases [19]. Moreover, the proximity between two disease modules in the network indicates their phenotypic and symptom similarity [172]. These proximal characteristics have thus given rise to numerous early studies implementing the guilt-by-association (GBA) principle [188] via label propagation [126, 51].
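To make the GBA-via-label-propagation idea concrete, the following is a minimal sketch of label propagation by random walk with restart on a made-up toy network; it is an illustration of the general principle only, not the specific algorithm of any of the studies cited above, and the restart probability and network are invented for demonstration.

```python
def label_propagation(adj, seeds, restart=0.5, n_iter=50):
    """Propagate seed gene labels over a network by random walk with restart.

    adj: adjacency matrix as a list of lists; seeds: 0/1 list of known genes.
    Returns one score per gene; higher means closer to the seed genes.
    """
    n = len(adj)
    deg = [sum(col) for col in zip(*adj)]  # column sums (node degrees)
    p0 = [s / sum(seeds) for s in seeds]   # restart distribution over seeds
    p = p0[:]
    for _ in range(n_iter):
        # One diffusion step over the column-normalized transition matrix,
        # mixed with a restart back to the seed genes.
        spread = [sum(adj[i][j] * p[j] / deg[j] for j in range(n)) for i in range(n)]
        p = [(1 - restart) * spread[i] + restart * p0[i] for i in range(n)]
    return p

# Toy network: genes 0-2 form a triangle; gene 3 hangs off gene 2.
adj = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
scores = label_propagation(adj, seeds=[1, 1, 0, 0])
# Unlabeled gene 2, sitting inside the seeds' neighborhood, outranks
# the peripheral gene 3.
```

Because the transition matrix is column-stochastic, the scores remain a probability distribution over genes, and ranking unlabeled genes by score implements the guilt-by-association prioritization described above.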
Applications of these approaches have shown remarkable success in identifying cancerous pathways [138], annotating disease genes [253, 126], reprioritizing GWAS hits [136], and predicting drug binding [285].

Graph learning offers immense opportunities for learning on biological networks

Graph learning aims to build numerical representations and predictive models using graphs as the computational backbone [88, 34, 33]. These methods have demonstrated effectiveness in solving various computational problems where the data can be naturally formulated as graphs, such as predicting the chemical properties of a molecule, categorizing papers in a citation network, and identifying friendships in a social network [295, 279]. The fundamental principle behind these graph learning approaches is that the graph structure and connections are closely linked to the characteristics of the graphs or their nodes. Graph learning methods fall into two main categories: embeddings, which focus on translating graph structure into vector-space representations, and graph neural networks (GNNs), which extend neural network models to operate on graphs directly. Both approaches play pivotal roles in extracting meaningful insights from complex graph-structured data. The early development of network embedding was led by DeepWalk [202] in 2014, which extends the seminal work in natural language processing, word2vec [174, 173], to networks. DeepWalk operates by generating random walks on the network and treating the walk sequences as a text "corpus". The word2vec algorithm then optimizes the resulting low-dimensional space based on this corpus. Following DeepWalk, many other random walk-based network embedding methods were introduced [83, 61, 217, 2, 182]. node2vec [83] is a notable extension of DeepWalk. It performs second-order random walks that can search over the graph more flexibly, resembling the Breadth-First Search (BFS) or Depth-First Search (DFS) strategies.
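The second-order biasing at the heart of node2vec can be sketched as follows for an unweighted toy graph. This is a simplified illustration of the return parameter p and in-out parameter q, not the reference node2vec implementation (which precomputes alias tables and handles edge weights); the graph and parameter values are invented for demonstration.

```python
import random

def node2vec_step(prev, curr, nbrs, p=1.0, q=1.0):
    """One step of node2vec's second-order biased walk (unweighted sketch).

    Small p keeps the walk local (BFS-like); small q pushes it outward
    (DFS-like). Candidates are weighted by their distance to `prev`.
    """
    weights = []
    for nxt in nbrs[curr]:
        if nxt == prev:          # distance 0: stepping back to prev
            weights.append(1.0 / p)
        elif nxt in nbrs[prev]:  # distance 1: common neighbor of prev
            weights.append(1.0)
        else:                    # distance 2: moving away from prev
            weights.append(1.0 / q)
    return random.choices(nbrs[curr], weights=weights)[0]

def node2vec_walk(start, length, nbrs, p=1.0, q=1.0):
    walk = [start, random.choice(nbrs[start])]
    while len(walk) < length:
        walk.append(node2vec_step(walk[-2], walk[-1], nbrs, p, q))
    return walk

# Toy graph: triangle 0-1-2 with a pendant gene 3 attached to gene 2.
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walk = node2vec_walk(0, length=10, nbrs=nbrs, p=0.25, q=4.0)  # BFS-leaning
```

Many such walks, fed to word2vec as sentences, yield the node embeddings; tuning (p, q) interpolates between structurally local and exploratory corpora.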
Such searching flexibility has shown practical advantages in downstream prediction tasks. Since its introduction in 2016, node2vec has been particularly popular in computational biology [183, 291] and has been used for essential protein prediction [292, 266] and various disease gene prediction tasks [15, 16, 161, 286, 200]. In 2013, Bruna and coworkers [36] pioneered work on GNNs by drawing parallels between convolution in Euclidean space, as on grids, and the graph Laplacian. This spectral formulation is effective but computationally expensive due to the full eigendecomposition of the graph Laplacian. Defferrard and colleagues [57] later proposed an approximation via Chebyshev polynomials of the graph Laplacian to greatly reduce the number of filter parameters. In 2016, Kipf and Welling [124] further simplified this via a first-order approximation, leading to a simple yet effective graph convolution module. Additionally, this formulation has a spatial interpretation: aggregating neighboring nodes' representations, followed by a transformation. This two-step recipe was later generalized by Gilmer and coworkers [74], formalizing the Message Passing Neural Network (MPNN). To date, many popular GNN architectures, such as GraphSAGE [89], GAT [257], GIN [282], and many more [74, 295, 279], fall into the MPNN framework. GNNs have been applied to solve various real-world problems due to their exceptional performance [295, 279]. This advent of graph learning techniques presents transformative potential for network biology, opening doors to novel discoveries and completing our knowledge about human genes by building accurate predictive models on the underlying molecular interaction networks.
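The Kipf-Welling first-order graph convolution mentioned above can be written compactly as H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W). A minimal dependency-free sketch of a single such layer, on an invented three-node toy graph, is:

```python
import math

def matmul(A, B):
    """Plain matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gcn_layer(adj, H, W):
    """One graph convolution layer following Kipf & Welling's first-order rule:
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

    Each node's new representation mixes its own features with its neighbors',
    then applies a shared linear transform and a nonlinearity.
    """
    n = len(adj)
    a_hat = [[adj[i][j] + (i == j) for j in range(n)] for i in range(n)]  # self-loops
    d = [1.0 / math.sqrt(sum(row)) for row in a_hat]                      # D^{-1/2}
    a_norm = [[d[i] * a_hat[i][j] * d[j] for j in range(n)] for i in range(n)]
    out = matmul(matmul(a_norm, H), W)
    return [[max(x, 0.0) for x in row] for row in out]                    # ReLU

# Toy example: a 3-node path graph with 2-d node features and a 2x2 weight matrix.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W = [[1.0, -1.0], [1.0, 1.0]]
H1 = gcn_layer(adj, H, W)
```

The spatial reading is visible in the code: `a_norm @ H` is the neighborhood aggregation and `@ W` the transformation, which is exactly the two-step message passing recipe MPNN later generalized. Note that the structurally equivalent end nodes 0 and 2 receive identical embeddings.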
Despite these opportunities, tremendous gaps remain in effectively and efficiently applying these graph learning paradigms to elucidate genes' biological roles. It is unclear (1) how to formulate the problem of gene function and disease annotation as graph learning tasks, (2) whether applying these advanced methods will achieve superior results compared to traditional label propagation methods in network biology, given the scarce labels and unique characteristics of biological networks, and (3) whether graph learning approaches are scalable enough to be applied to genome-scale molecular interaction networks. These considerations highlight the need for further research and refinement in applying graph learning techniques to advance our understanding of human genes.

1.2 Dissertation contributions

This dissertation bridges the gap between advanced graph learning approaches and understanding human genetics through network biology, ultimately leading to a deeper understanding of key genetic factors influencing human health and disease. We present comprehensive studies and in-depth investigations seeking optimal ways to predict genes' biological roles using genome-scale gene interaction networks. Across all projects, we make the code and data publicly available to ensure easy reproducibility and adaptation of our work. When applicable, we package the developed methods or resources into Python packages and register them on PyPI, further enabling our work to be applied to broader areas of biomedical research. We summarize the primary contributions of this dissertation as follows.

• Traditional network biology leverages label propagation to expand gene annotations by diffusing the relatively limited known annotations for a particular function, disease, or trait over the gene interaction network. More recently, some studies have presented ideas for supervised learning using the network adjacency matrix as the feature set [128].
However, it was unclear whether one approach is systematically better than the other and whether supervised learning can leverage the underlying network structure as label propagation naturally does. In GenePlexus [153] (Chapter 2), we provide conclusive evidence, using extensive gene classification tasks and gene interaction networks, that supervised learning consistently outperforms label propagation, and that it does so by accounting for the underlying network connectivity. Finally, we highlight the opportunities and the need for effective and efficient network embedding methods to further enhance supervised learning on gene interaction networks.

• Despite the promising opportunities of using network embedding, particularly node2vec [83], to build efficient supervised learning models for gene classification, significant computational burdens remain. Specifically, the original node2vec implementations failed to embed existing genome-wide interaction networks with dense connections. In Chapter 3, we develop a highly optimized node2vec implementation, called PecanPy [151], that enables node2vec to embed large-scale (>100K nodes) and dense (>100M edges) gene interaction networks. Remarkably, through benchmarking on networks with diverse sizes and densities, we show that PecanPy runs up to an order of magnitude faster with an order of magnitude less memory than the original node2vec implementations. These improvements are achieved by meticulously optimizing the graph data structure, parallelism, and random walk strategies. We package PecanPy into a user-friendly Python package that can be easily installed from the standard Python Package Index (PyPI) repository.

• Upon comprehensive evaluation of node2vec embeddings using our PecanPy implementation on diverse genome-wide gene interaction networks, we identify a critical shortcoming of node2vec in handling weighted graphs, especially when they are dense.
Resolving this limitation is critical for network biology, as many existing gene interaction networks are dense and weighted, either by construction or as a result of integrating gene interaction evidence from multiple sources. In Chapter 4, we propose a natural extension of node2vec, called node2vec+ [150], which adequately accounts for edge weights when walking on a weighted graph. We demonstrate through systematic evaluations on synthetic and real-world network data, including several dense, weighted gene interaction networks, that node2vec+ significantly outperforms node2vec, even surpassing more powerful graph neural network methods.

• Our PecanPy and node2vec+ enable fast, efficient, and effective embedding of dense and weighted genome-scale gene interaction networks to predict gene-disease associations. However, the exact cellular roles of a gene or protein depend highly on its precise biological and cellular conditions. This biological context-specificity motivates the development of a gene interaction network embedding method that can induce biological contexts, such as tissues, cell types, and diseases. In Chapter 5, we bridge this gap by presenting our contextualized graph attention model, CONE [155], for learning biological context-aware gene embeddings. CONE is a versatile approach that can contextualize the embeddings of any gene interaction network given biological contexts in the form of gene sets, such as tissue-specific genes. Through various disease gene classification and therapeutic target prediction tasks, we show that CONE (1) successfully leverages biological contextual information to identify disease-related genes and therapeutic targets and (2) sheds light on the relevant biological contexts for particular diseases.

• The previous chapters establish the foundation for unveiling genes' precise biological roles using genome-scale gene interaction networks.
The core technical components enabling this are powerful graph deep learning (GDL) techniques for representing and learning on graphs. Like network biology, the field of GDL is evolving rapidly, with more powerful and expressive models developed every year. This fast-paced development necessitates a comprehensive resource for the streamlined adaptation of the work in previous chapters, so that network-based gene annotation applications can continually improve by adopting more up-to-date architectures and by designing specialized architectures that take advantage of the unique data characteristics of biological networks, such as density (Chapter 4) and context-specificity (Chapter 5). In Chapter 6, we present such a resource as a benchmarking dataset Python package, obnb [152], along with a systematic benchmarking study of current state-of-the-art GDL methods spanning graph diffusion, graph embedding, and graph neural networks. obnb is designed to be compatible with the two most popular graph deep learning frameworks in Python, namely PyTorch Geometric (https://www.pyg.org/) and Deep Graph Library (https://www.dgl.ai/), making it straightforward for the GDL community to develop further on top of it. Our analyses (1) reveal that the efficacy of GDL methods is tied to the corrected homophily ratio of the gene interaction network, and (2) point to promising future directions, such as integrating structural or sequence information from pre-trained genomics and protein foundation models.

1.3 Dissertation structure

We organize the rest of this dissertation as follows. Chapter 2 formally sets up the network-based gene classification problem and a rigorous evaluation framework, showing that supervised learning systematically outperforms traditional label propagation-based approaches and opening up opportunities for supervised learning with learned network embeddings.
Chapter 3 and Chapter 4 then dive into technical contributions that significantly improve (1) the computational efficiency and (2) the quality of network embeddings for gene classification. Chapter 5 takes the capability of gene interaction network embeddings one step further by developing a versatile approach to induce biological context-specificity. Chapter 6 organizes a comprehensive resource for benchmarking network-based gene classification, paving the way for the future development of novel, biologically inspired graph deep learning architectures for gene classification. Finally, we conclude this dissertation and discuss its broader impact and promising future directions in Chapter 7.

CHAPTER 2
SUPERVISED LEARNING IS AN ACCURATE METHOD FOR NETWORK-BASED GENE CLASSIFICATION

Assigning every human gene to specific functions, diseases, and traits is a grand challenge in modern genetics. Computational methods leveraging molecular interaction networks, such as supervised learning and label propagation, are vital to addressing this challenge. Despite being a popular machine learning technique across fields, supervised learning has been applied in only a few network-based studies for predicting pathway-, phenotype-, or disease-associated genes. It is unknown how supervised learning performs broadly across different networks and diverse gene classification tasks, and how it compares to label propagation, the widely benchmarked canonical approach for this problem. Here, we present a comprehensive benchmarking study of supervised learning for network-based gene classification. We use stringent evaluation schemes to evaluate this approach and classic label propagation techniques on hundreds of diverse prediction tasks and multiple networks.
Our results demonstrate that supervised learning on a gene's full network connectivity outperforms label propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label propagation's appeal of naturally using network topology. We further show that supervised learning on the full network is superior to learning on node embeddings, an increasingly popular approach for concisely representing network connectivity. These results show that supervised learning is an accurate approach for prioritizing genes associated with diverse functions, diseases, and traits and should be considered a staple of network-based gene classification workflows. The code to reproduce the experiments and analyses in this study is available on GitHub: https://github.com/krishnanlab/GenePlexus.

2.1 Introduction

In the post-genomic era, a grand challenge is to characterize every gene across the genome in terms of the cellular pathways it participates in and the multifactorial traits and diseases it is associated with. Computationally predicting the association of genes with pathways, traits, or diseases – the task termed here as gene classification – has been critical to this quest, helping prioritize candidates for experimental verification and shedding light on poorly characterized genes [231, 198, 213, 224, 207, 112]. Key to the success of these methods has been the steady accumulation of large amounts of publicly available data collections, such as curated databases of genes and their various attributes [242, 144, 119, 206, 37, 276, 281, 38], controlled vocabularies of biological terms organized into ontologies [49, 14, 229, 237], high-throughput functional genomic assays [67, 137, 17], and molecular interaction networks [245, 142, 240, 80, 100].
While protein sequence and 3D structure are remarkably informative about the corresponding gene’s molecular function [10, 273, 213, 112], the pathways or phenotypes that a gene might participate in depend significantly on the other genes that it works with in a context-dependent manner. Molecular interaction networks – graphs with genes or proteins as nodes and their physical or functional relationships as edges – are powerful models for capturing the functional neighborhood of genes on a whole-genome scale [256, 120]. These networks are often constructed by aggregating multiple sources of information about gene interactions in a context-specific manner [108, 80]. Therefore, unsurprisingly, several studies have taken advantage of these graphs to perform network-based gene classification [272, 126, 253]. The canonical principle of network-based gene classification is guilt-by-association, the notion that proteins and genes that are strongly connected in the network are likely to perform the same functions and, hence, participate in similar higher-level attributes such as phenotypes and diseases [268]. Instead of solely aggregating local information from direct neighbors [230], this principle is better realized by propagating pathway or disease labels across the network to capture global patterns, achieving state-of-the-art results [272, 180, 253, 126, 51, 138, 293, 190, 256]. These global approaches belong to a class of methods referred to here as label propagation. Distinct from label propagation is another class of methods for gene classification that relies on the idea that network patterns characteristic of genes associated with a specific phenotype or pathway can be captured using supervised machine learning [24, 133, 80, 129, 84, 192].
While this class of methods – referred to here as supervised learning – has yielded promising results in several applications, how it performs broadly across different types of networks and diverse gene classification tasks is unknown. Consequently, supervised learning is used far less than label propagation for network-based gene classification. This study aims to perform a comprehensive, systematic benchmarking of supervised learning (SL) for network-based gene classification across several genome-wide molecular networks and hundreds of diverse prediction tasks using meaningful evaluation schemes. Within this rigorous framework, we compare supervised learning to a widely used, classic label propagation (LP) technique, testing both the original (adjacency matrix) and a diffusion-based representation of the network (influence matrix). This combination results in four methods (listed with their earliest known references): label propagation on the adjacency matrix (LP-A) [230], label propagation on the influence matrix (LP-I) [190], supervised learning on the adjacency matrix (SL-A) [24], and supervised learning on the influence matrix (SL-I) [133]. Additionally, we evaluate the performance of supervised learning using node embeddings as features, as the use of node embeddings is burgeoning in network biology. Our results demonstrate that SL outperforms LP for gene-function, -disease, and -trait prediction. We also observe that SL captures local network properties as efficiently as LP, where both methods achieve more accurate predictions for gene sets that are more tightly clustered in the network. Lastly, we show that SL using the full network connectivity is superior to using low-dimensional node embeddings as features, which, in turn, is competitive with LP.
2.2 Methods

We chose a diverse set of undirected, human gene interaction networks based on criteria laid out in [100] (Figure 2.1): (1) whether the network was constructed using high- or low-throughput data, (2) the type of interactions the network was constructed from, and (3) whether annotations were directly incorporated in constructing the network. We used versions of the networks released before 2017 so as not to bias the temporal holdout evaluations. Unless otherwise noted, we used all edge weights, and all networks’ nodes were mapped to Entrez gene IDs using the MyGene.info database [276, 281]. If an original node ID mapped to multiple Entrez IDs, we added edges between all possible mappings. The networks used in this study are BioGRID [240], the full STRING network [245] as well as the subset with just experimental support (referred to as STRING-EXP in this study), InBioMap [142], and the tissue-naïve network from GIANT [80], referred to as GIANT-TN in this study. These networks cover a wide size range, with the number of nodes ranging from 14K to 25K and the number of edges ranging from 141K to 38M. More information on the networks can be found in Appendix A.1.1.

2.2.1 Network representations

We consider three distinct representations of molecular networks: the adjacency matrix, an influence matrix, and node embeddings. We describe each representation in the rest of this section.

Adjacency matrix. Let 𝐺 = (𝑉, 𝐸, 𝑤) denote an undirected molecular network, where 𝑉 is the set of vertices (genes), 𝐸 is the set of edges (associations between genes), and 𝑤 : 𝐸 → R is the edge weight function that indicates the strengths of the associations. 𝐺 can be represented as a weighted adjacency matrix M ∈ R^{|𝑉|×|𝑉|}, where

M_{i,j} = \begin{cases} w(v_i, v_j) & \text{if } (v_i, v_j) \in E \\ 0 & \text{otherwise} \end{cases}   (2.1)

Influence matrix. 𝐺 can also be represented as an influence matrix F ∈ R^{|𝑉|×|𝑉|}, which captures both the local and global structure of the network.
F is obtained using a random walk with restart transformation kernel [138],

F = \alpha \left( I - (1 - \alpha) W_D \right)^{-1}   (2.2)

where 𝛼 ∈ (0, 1) is the restart parameter, D is the diagonal degree matrix whose diagonal elements are given by D_{i,i} = \sum_j M_{i,j}, and W_D = M D^{-1} is the column-normalized adjacency matrix. We use a restart parameter of 0.85 in this study, as suggested by a previous evaluation [100], which also shows optimal performance in our experiments (Figure A.1).

Node embeddings. 𝐺 can additionally be transformed into a low-dimensional representation through the process of node embedding. In this study we used the node2vec algorithm [83], which extends ideas from the word2vec algorithm [174, 173] in natural language processing to graphs. The objective of node2vec is to find a low-dimensional representation of the adjacency matrix, E ∈ R^{|𝑉|×𝑑}, where the hidden dimension 𝑑 ≪ |𝑉|. This is done by optimizing the following log-probability objective function:

L = \sum_{v \in V} \log \Pr\left( N_S(v) \mid E(v) \right)   (2.3)

where N_S(𝑣) is the network neighborhood of node 𝑣 generated through a sampling strategy 𝑆, and E(𝑣) ∈ R^𝑑 is the feature vector of node 𝑣. In node2vec, the sampling strategy is based on random walks that are controlled using two parameters 𝑝 and 𝑞, in which a high value of 𝑞 keeps the walk local (a breadth-first search), and a high value of 𝑝 encourages outward exploration (a depth-first search). The values of 𝑝 and 𝑞 were both set to 0.1 for every network in this study. Detailed node2vec hyperparameter tuning information can be found in Appendix A.1.2.

2.2.2 Prediction methods

We compared the prediction performance of four specific methods across two classes: label propagation (LP) and supervised learning (SL).

Label propagation. LP methods are the most widely used methods in network-based gene classification and achieve state-of-the-art results [126, 51].
In this study, we considered two LP methods: label propagation on the adjacency matrix (LP-A) and label propagation on the influence matrix (LP-I). First, we constructed a binary vector of ground-truth labels, 𝑥 ∈ R^{|𝑉|×1}, where 𝑥_𝑖 = 1 if gene 𝑖 is a positively labeled gene in the training set, and 0 otherwise. In LP-A, we constructed a score vector 𝑠 ∈ R^{|𝑉|×1} denoting the predictions,

s = M x   (2.4)

Thus, the predicted score for a gene using LP-A is equal to the sum of the weights of the edges between the gene and its direct, positively labeled network neighbors. In LP-I, the score vector 𝑠 is generated analogously, with M replaced by F, the influence matrix (Equation 2.2). The prediction performance is evaluated by taking both positive and negative examples into account. In both LP-A and LP-I, only positive examples in the training set are used to calculate the score vector 𝑠, reflecting how label propagation is typically used in practice [126, 51, 205]. We found that accounting for negative examples in the propagation step does not change the LP prediction performance, and thus we use the simpler approach of only propagating positive examples in the main text (Appendix A.2.1).

Supervised learning. Supervised learning (SL) can be used for network-based gene classification by using each gene’s network neighborhood as its feature vector, along with gene labels, in a classification algorithm. Here, we use logistic regression with ℓ2 regularization as the SL classification algorithm, a linear model that aims to minimize the following cost function:

L = \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left( \exp\left( -y_i (X_i w + b) \right) + 1 \right)   (2.5)

where 𝑤 ∈ R^𝑚 is the vector of weights for a model with 𝑚 features, 𝐶 determines the regularization strength, 𝑛 is the number of examples, 𝑦 is the vector of ground-truth labels, X ∈ R^{𝑛×𝑚} is the data matrix, and 𝑏 is the intercept.
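As a concrete illustration, the four methods above can be sketched on a hypothetical toy network (all sizes, edges, and labels below are made up for demonstration; scikit-learn's `LogisticRegression` stands in for the ℓ2-regularized classifier of Equation 2.5):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy network: a weighted, undirected adjacency matrix M (Eq 2.1).
n = 30
M = np.triu((rng.random((n, n)) < 0.15).astype(float), k=1)
for i in range(n - 1):
    M[i, i + 1] = 1.0  # a path through all nodes keeps every degree > 0
M = M + M.T

# Influence matrix (Eq 2.2): random walk with restart, alpha = 0.85.
alpha = 0.85
deg = M.sum(axis=0)
W = M / deg  # column-normalized adjacency, W_D = M D^{-1}
F = alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * W)

# Binary label vector x: say genes 0-4 are the positive training examples.
x = np.zeros(n)
x[:5] = 1.0

s_lpa = M @ x  # LP-A (Eq 2.4): summed edge weights to positive neighbors
s_lpi = F @ x  # LP-I: the same propagation on the influence matrix

# SL-A: each gene's row of M is its feature vector; an l2-regularized
# logistic regression (Eq 2.5, C = 1.0) is fit on the labeled genes.
train = np.arange(20)  # pretend the first 20 genes are labeled
clf = LogisticRegression(C=1.0).fit(M[train], x[train])
s_sla = clf.predict_proba(M)[:, 1]  # prediction probability for every gene
# SL-I and SL-E are identical except rows of F or the embedding matrix are used.
```

Because W_D is column-stochastic, the columns of F each sum to one, so LP-I distributes each positive gene's unit of label mass over the whole network, with most of it staying near that gene.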
After training a model using the labeled genes in the training set, the learned model weights are used to classify the genes in the testing set, returning a prediction probability for these genes that is bounded between 0 and 1. The regularization parameter 𝐶 was set to 1.0 for all models in this study. We use scikit-learn’s [196] logistic regression implementation to perform model training and prediction. In this study, three different network-based gene-level feature vectors are used to train three different SL classifiers: the rows of the adjacency matrix (SL-A), the rows of the influence matrix (SL-I), and the rows of the node embedding matrix (SL-E). Model selection and hyperparameter tuning are described in detail in Appendix A.1.2.

2.2.3 Gene set collections

We curate a number of gene set collections to test predictions on a diverse set of tasks: function, disease, and trait (Figure 2.1). Function prediction is defined as predicting genes associated with biological processes that are part of the Gene Ontology (referred to here as GOBP) [49, 14], obtained from MyGene.info [276, 281], and pathways from the Kyoto Encyclopedia of Genes and Genomes [119], referred to as KEGGBP since disease-related pathways were removed from the original KEGG annotations in the Molecular Signatures Database [144, 242].

Figure 2.1: Workflow for the gene classification pipeline. Four methods are compared: supervised learning on the adjacency matrix (SL-A), supervised learning on the influence matrix (SL-I), label propagation on the adjacency matrix (LP-A), and label propagation on the influence matrix (LP-I). Model performance on a variety of gene classification tasks is evaluated over a number of different molecular networks, validation schemes, and evaluation metrics. Additionally, the performance of supervised learning using node embeddings as features (SL-E) is evaluated (not shown in this figure).

Disease prediction
is defined as predicting genes associated with diseases in the DisGeNET database [206]. Annotations from this database were divided into two separate gene set collections: those that were manually curated (referred to as DisGeNet in this study) and those derived using the BeFree text-mining tool (referred to as BeFree in this study). Trait prediction is defined as predicting genes linked to human traits from genome-wide association studies (GWAS), curated from a community challenge [47], and mammalian phenotypes (annotated to human genes) from the Mouse Genome Informatics (MGI) database [37].

Gene set preprocessing. Each of these six gene set collections contains about a hundred to tens of thousands of gene sets that vary widely in specificity and redundancy. Therefore, each collection is preprocessed to ensure that its final set of prediction tasks is specific, largely non-overlapping, and not driven by multi-attribute genes. First, if gene sets in a collection corresponded to terms in an ontology (e.g., biological processes in the GOBP collection), annotations were propagated along the ontology structure to obtain a complete set of annotations for all gene sets. Second, we remove gene sets whose number of annotated genes is above a certain threshold, and then compare the remaining gene sets to each other in order to remove gene sets that highly overlap with other gene sets in the collection, resulting in a set of specific, non-redundant gene sets.
Finally, individual genes that appear in more than ten of the remaining gene sets in a collection are removed from all the gene sets in that collection, eliminating multi-attribute (e.g., multi-functional) genes that are potentially easy to predict [73]. More details on the gene set preprocessing and gene set attributes are provided in Appendix A.1.3.

Selecting positive and negative examples. Within each gene set collection, the genes annotated to a given gene set are designated as its positive examples. The SL methods additionally require a set of negative genes for each gene set for training; both SL and LP methods require a set of negative genes for each gene set for testing. A set of negative genes is generated by (a) finding the union of all genes annotated to all gene sets in the collection, (b) removing genes annotated to the given gene set, and (c) removing genes annotated to any gene set in the collection that significantly overlaps with the given gene set (p-value < 0.05 based on the one-sided Fisher’s exact test).

2.2.4 Validation schemes

We perform extensive and rigorous evaluations based on three validation schemes: temporal holdout, study-bias holdout, and five-fold cross-validation (5FCV). We briefly describe each validation scheme below and defer details to Appendix A.1.4.

Temporal holdout. Within a gene set collection, the genes that only had an annotation to any gene set in the collection after January 1st, 2017 are considered test genes, and all other genes are considered training genes. Temporal holdout is the most stringent evaluation scheme for gene classification, as it mimics the practical scenario of using current knowledge to predict the future and is the preferred evaluation method in the CAFA challenges [213, 112]. Since the Gene Ontology is the only source with clear date stamps for all its annotations, temporal holdout only applies to the GOBP gene set collection.
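The negative-example selection procedure described above can be sketched as follows (a hypothetical helper operating on a toy collection of gene sets; SciPy's one-sided Fisher's exact test stands in for the overlap test, and all gene IDs are made up):

```python
from itertools import chain
from scipy.stats import fisher_exact

def select_negatives(target, collection, p_cutoff=0.05):
    """Hypothetical helper implementing steps (a)-(c) above.
    `collection` maps gene set names to sets of gene IDs."""
    # (a) union of all genes annotated to any gene set in the collection
    universe = set(chain.from_iterable(collection.values()))
    pos = collection[target]
    # (b) remove genes annotated to the given gene set
    negatives = universe - pos
    # (c) remove genes from any significantly overlapping gene set
    for name, genes in collection.items():
        if name == target:
            continue
        table = [
            [len(pos & genes), len(pos - genes)],
            [len(genes - pos), len(universe - pos - genes)],
        ]
        _, p = fisher_exact(table, alternative="greater")
        if p < p_cutoff:
            negatives -= genes
    return negatives

# Toy collection: "B" overlaps "A" heavily, "C" is disjoint from "A".
sets = {"A": set(range(10)), "B": set(range(1, 11)), "C": set(range(50, 60))}
print(sorted(select_negatives("A", sets)))  # only C's genes survive
```

In the toy example, set B shares nine of A's ten genes, so its members are excluded from A's negatives, while the disjoint set C supplies the usable negatives.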
Study-bias holdout. For study-bias holdout, genes are first ranked by the number of PubMed articles they are mentioned in, obtained from [35]. The top two-thirds of the most-mentioned genes are considered training genes, and the remaining least-mentioned genes are used for testing. Study-bias holdout mimics the real-world situation of learning from well-characterized genes to predict novel un(der)-characterized genes.

Five-fold cross-validation. The last validation scheme is the traditional five-fold cross-validation, where the genes are split into five equal folds in a stratified manner, such that the proportion of genes in the positive and negative classes is preserved in each fold. In all these schemes, only gene sets with at least ten positive genes in both the training and test sets are considered.

2.2.5 Evaluation metrics

In this study, we considered three evaluation metrics: the area under the precision-recall curve (auPRC), the precision of the top K-ranked predictions (P@topK), and, for completeness, the area under the receiver operating characteristic curve (auROC). For P@topK, we set K equal to the number of positives in the testing set. Since the standard auPRC and P@topK scores are influenced by the prior probability of finding a positive example, computed as the proportion of positives to the total of positives and negatives, we express both metrics as the log2 fold change of the original metric over the prior. More details on the evaluation metrics are described in Appendix A.1.5.

2.3 Results and discussion

We systematically compare the performance of four gene classification methods (Figure 2.1): supervised learning on the adjacency matrix (SL-A), supervised learning on the influence matrix (SL-I), label propagation on the adjacency matrix (LP-A), and label propagation on the influence matrix (LP-I).
We choose six gene set collections that represent three prominent gene classification tasks: gene-function (GOBP, KEGGBP), gene-disease (DisGeNet, BeFree), and gene-trait (GWAS, MGI) prediction. We use three different validation schemes: temporal holdout (train on genes annotated before 2017 and test on genes annotated in 2017 or later; only done for GOBP as it has clear timestamps), holdout based on study bias (train on well-studied genes and predict on less-studied genes), and the traditional five-fold cross-validation (5FCV). Temporal holdout and study-bias holdout validation schemes are presented in the main text as they are more stringent and reflective of real-world tasks compared to 5FCV [116]. To ascertain the robustness of the relative performance of the methods to the underlying network, we choose five different genome-scale molecular networks that differ in their content and construction (Appendix A.1.1). To be consistent with the temporal holdout evaluation and avoid data leakage, all networks used throughout this study are the latest versions released before 2017. We present evaluation results based on the area under the precision-recall curve (auPRC) in the main text, and results based on the precision at top K (P@topK) and the area under the ROC curve (auROC) in Appendix A.2, all of which lead to consistent conclusions.

2.3.1 SL consistently outperforms LP across diverse gene classification tasks and gene interaction networks

SL methods rank higher than LP methods on average. Our first analysis directly compares all four prediction methods against each other for each gene set in a given collection. For each gene set collection–network combination, we rank the four methods per gene set (based on auPRC) using the standard competition ranking and calculate each method’s average rank across all the gene sets in the collection (Figure 2.2).
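The prior-adjusted auPRC metric and the per-gene-set competition ranking described above can be sketched as follows (all scores are synthetic; `rankdata` with `method="min"` reproduces standard competition, i.e. "1224", ranking):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)

# Hypothetical test split: 20 positive genes out of 200, with one score
# vector per method (SL scores nudged upward on the positives).
y_true = np.zeros(200)
y_true[:20] = 1.0
methods = ["SL-A", "SL-I", "LP-I", "LP-A"]
boost = {"SL-A": 0.5, "SL-I": 0.4, "LP-I": 0.2, "LP-A": 0.0}
scores = {m: rng.random(200) + boost[m] * y_true for m in methods}

prior = y_true.mean()  # chance of drawing a positive at random

# auPRC expressed as the log2 fold change over the prior
log2_auprc = {m: np.log2(average_precision_score(y_true, s) / prior)
              for m, s in scores.items()}

# Standard competition ranking of the four methods for this "gene set"
# (rank 1 = best; tied methods share the smallest rank).
ranks = rankdata([-log2_auprc[m] for m in methods], method="min")
```

A log2 fold change of 0 means the method performs no better than randomly ordering the genes; repeating this ranking over every gene set in a collection and averaging yields the per-method average ranks plotted in Figure 2.2.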
For function prediction, SL-A is the top-performing method by a wide margin (particularly clear based on GOBP temporal holdout), with SL-I being the second-best method. For disease and trait prediction, SL-A and SL-I still outperform LP-I, but to a lesser extent. In all cases, LP-A is the worst-performing method. The large performance difference between the SL and LP methods in the GOBP temporal holdout validation is noteworthy since temporal holdout is the most stringent validation scheme and the one employed in community challenges such as CAFA [213, 112].

Figure 2.2: Average rank across the four methods. Each point in each boxplot represents the average rank for a gene set collection–network combination, obtained based on ranking the four methods in terms of performance for each gene set in a gene set collection using the standard competition ranking. (A) Functional prediction tasks using GOBP temporal holdout, (B) functional prediction tasks using study-bias holdout for GOBP and KEGGBP, and (C) disease and trait prediction tasks using study-bias holdout for DisGeNet, BeFree, GWAS, and MGI. The results are shown for auPRC, where different colors represent different networks and different marker styles represent the different gene set collections. SL methods outperform LP methods for all prediction tasks.

Figure 2.3: Testing for a statistically significant difference between SL and LP methods. (A) A key for interpreting the analysis. For each network–gene set combination, each method is compared to the two methods from the other class (i.e., SL-A vs LP-I, SL-A vs LP-A, SL-I vs LP-I, SL-I vs LP-A). If a method was found to be significantly better than both methods from the other class (Wilcoxon signed-rank test with an FDR threshold of 0.05), the cell is annotated with that method. If both methods in that class were found to be significantly better than the two methods in the other class, the cell is annotated in bold with just the class. The color scale represents the fraction of gene sets that were higher for the SL methods across all four comparisons. The first column uses GOBP temporal holdout, whereas the remaining six columns use study-bias holdout. (B) SL methods show a statistically significant improvement over LP methods, especially for function prediction.

SL outperforms LP on a significant portion of tasks. Following the observation that SL methods outperform LP methods based on relative ranking, we use a non-parametric paired test (Wilcoxon signed-rank test) to statistically assess the difference between specific pairs of methods (Figure 2.3A). For each gene set collection–network combination, we compare the two methods in one class to the two methods in the other class (i.e., we compare SL-A to LP-A, SL-A to LP-I, SL-I to LP-A, and SL-I to LP-I). Each comparison yields a p-value along with the number of gene sets in the collection where one method outperforms the other.
After correcting the four p-values for multiple hypothesis testing [26], if a method from one class independently outperforms both methods from the other class (in terms of the number of winning gene sets), and if both adjusted p-values are < 0.05, we consider that method to perform significantly better than the entire other class. In addition, we track the percentage of times the SL methods outperform the LP methods across all four comparisons within a gene set collection–network combination. The results show that, for function prediction, SL is almost always significantly better than LP when considering auPRC (Figure 2.3B). Based on temporal holdout on GOBP, both SL-A and SL-I always significantly outperform both LP methods. Based on the study-bias holdout evaluation, in the ten function prediction gene set collection–network combinations using GOBP and KEGGBP, SL-A is a significantly better method eight times (80%), and SL-I is a significantly better method six times (60%). Neither LP-I nor LP-A ever significantly outperforms the SL models. The performances of SL and LP are more comparable for disease and trait prediction, but SL methods still perform better in a larger fraction of gene sets. For the 20 disease and trait gene set collection–network combinations, SL-I is a significantly better method eight times (40%), SL-A is a significantly better method six times (30%), LP-I is a significantly better method once (5%), and LP-A is never a significantly better method.

SL outperforms LP by notable effect sizes. To visually inspect not only the relative performance of all four methods but also how well the models perform in an absolute sense, we examine boxplots of the auPRC values for every gene set collection–network combination (Figure 2.4). The first notable observation is that, regardless of the method, function prediction tasks show much better performance than disease and trait prediction tasks (Figure 2.4B).
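The paired-testing procedure above can be sketched as follows (all per-gene-set auPRC values are synthetic; the Benjamini-Hochberg correction is hand-rolled for clarity rather than taken from a statistics package):

```python
import numpy as np
from scipy.stats import wilcoxon

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (hand-rolled for clarity)."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    scaled = p[order] * len(p) / (np.arange(len(p)) + 1)
    adj = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity
    out = np.empty_like(p)
    out[order] = np.minimum(adj, 1.0)
    return out

rng = np.random.default_rng(2)

# Synthetic per-gene-set auPRC values for one collection-network combination
# (100 gene sets; SL methods shifted upward to mimic the observed trend).
auprc = {"SL-A": rng.random(100) + 0.20, "SL-I": rng.random(100) + 0.15,
         "LP-I": rng.random(100), "LP-A": rng.random(100) - 0.05}

pvals, wins = [], {}
for sl in ["SL-A", "SL-I"]:
    for lp in ["LP-I", "LP-A"]:
        _, p = wilcoxon(auprc[sl], auprc[lp])  # paired, per gene set
        pvals.append(p)
        wins[(sl, lp)] = int(np.sum(auprc[sl] > auprc[lp]))

adj = bh_adjust(pvals)  # correct the four p-values for multiple testing
significant = adj < 0.05
```

A method is then declared significantly better than the other class only when it both wins more gene sets and has an adjusted p-value below 0.05 in each of its two comparisons.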
Based on temporal holdout for function prediction (GOBPtmp), SL-A is the top-performing model based on the highest median performance for every network. Additionally, for all networks except STRING-EXP, SL-I is the second-best-performing model. For the ten combinations of five networks with GOBP and KEGGBP, the top method based on the highest median performance is an SL method all but once, with SL-A being the top model seven times (70%), SL-I being the top model twice (20%; GOBP and KEGGBP on GIANT-TN), and LP-A being the top model once (10%; KEGGBP on STRING-EXP). As noted earlier, for disease and trait prediction, SL and LP methods have more comparable performance. Of the 20 gene set collection–network combinations, each of SL-A, SL-I, LP-I, and LP-A is the top method based on median performance five (25%), ten (50%), four (20%), and one (5%) times, respectively.

Figure 2.4: Boxplots of performance across all gene set collection–network combinations. (A) The performance for each individual gene set collection–network combination is compared across the four methods: SL-A (red), SL-I (light red), LP-I (blue), and LP-A (light blue). The methods are ranked by median value with the highest-scoring method on the left. Results show SL methods outperform LP methods, especially for function prediction. (B) Each point in the plot is the median value from one of the boxplots in A. This shows that both SL and LP methods perform better for function prediction compared to disease/trait prediction.
Although the boxplots in Figure 2.4 give an idea of effect sizes, to quantify them further, we compute the ratios of auPRC values across all gene sets (Figure A.8). For each gene set, we calculate the ratio of auPRC values, find the percent increase or decrease, and then take the median value for every gene set collection–network combination. The results show that SL-A has a substantial effect size when compared to LP-I for function prediction (53% for temporal holdout and 19% for study-bias holdout). In addition, for all prediction tasks, the effect size between the SL methods and LP-I is equal to or greater than that between LP-I and LP-A; since LP-I is widely considered a much better model than LP-A, the comparison between LP-I and LP-A can be viewed as a baseline effect size.

2.3.2 Performance of SL correlates with network properties, similar to LP

Among the two classes of network-based models – SL and LP – it is intuitively clear how LP directly uses network connections to propagate information from the positively labeled nodes to other nodes close in the network. On the other hand, while SL is an accurate method for gene classification, it has not been studied whether SL’s performance is tied to any traditional notion of network connectivity. To shed light on this, we investigate the performance of SL-A and LP-I as a function of three different properties of individual gene sets in a collection: the number of annotated genes, edge density (a measure of how tightly connected the gene set is within itself), and segregation (a measure of how isolated the gene set is from the rest of the network). Detailed information on how the gene set and network properties are calculated is provided in Appendix A.1.3.

Figure 2.5: Performance vs network/gene set properties. SL-A (A-C) is able to capture network information as efficiently as LP-I (D-F) for the STRING network. There is no correlation between the number of genes in the gene set and performance (A, D), but there is a strong correlation between performance and the edge density (B, E) as well as segregation (C, F). The different colored dots represent function gene sets (red; GOBP and KEGGBP), disease gene sets (blue; DisGeNet and BeFree), and trait gene sets (black; GWAS and MGI). The vertical line is the 95% confidence interval. Similar trends can be seen for the other networks (Figure A.9).

While the performance of neither SL-A nor LP-I has a strong association with the size of the gene set, the performance of SL-A has a strong positive correlation with both the edge density and segregation of the gene set (Figure 2.5, left column). The observed trends are highly consistent for LP-I (Figure 2.5, right column). For visual clarity, Figure 2.5 presents results for just the STRING network, but similar results are seen in the other networks as well (Figure A.9). Thus, the performances of SL and LP across networks and types of prediction tasks are highly correlated with the local network clustering of the genes of interest. This result substantiates SL as an approach that can accurately predict gene attributes by taking advantage of local network connectivity. The above observation also explains why SL and LP methods are, in general, more accurate for function prediction than disease and trait prediction (Figure 2.4B). This is because molecular interaction networks are primarily intended, either through curation or reconstruction, to reflect biological relationships between genes/proteins as they pertain to "normal" cellular function.
Consequently, gene sets related to functions are more tightly clustered than those related to diseases and traits in the genome-wide molecular networks used in this study (Figure A.3), leading to lower prediction performance for disease and trait genes.

2.3.3 Network embedding methods demonstrate the potential for efficient and accurate network-based gene classification

As machine learning on node embeddings is gaining popularity for network-based node classification, we compare the top SL and LP methods tested here to this approach. Specifically, we compare LP-I and SL-A to an SL method using embeddings (SL-E) obtained from the node2vec algorithm [83] (Figure 2.6). For function prediction, we observe that SL-E substantially outperforms LP-I. For the GOBP temporal holdout, SL-E is always significantly better than LP-I. For the GOBP and KEGGBP study-bias holdouts, out of the ten gene set collection–network combinations, SL-E is significantly better than LP-I 5 times (50%), whereas the converse is true only once (10%). These patterns nearly reverse for the 20 disease/trait prediction tasks, with LP-I performing significantly better than SL-E 6 times (30%) and SL-E significantly outperforming LP-I 3 times (15%). Overall, SL-E performs comparably to the LP baselines, with clear advantages in function prediction.

Figure 2.6: Performance of SL-E vs. LP-I and SL-A. We compare the performance of supervised learning on the embedding matrix (SL-E) against LP-I and SL-A using a Wilcoxon rank-sum test. The performance metric is auPRC; the color scale represents the fraction of gene sets for which SL-E scored higher (purple indicating that SL-E had a higher fraction of better-performing gene sets than LP-I or SL-A), and an × signifies that the p-value from the Wilcoxon test was below 0.05. A) SL-E is quite competitive with the classic LP-I method. B) SL-A outperforms SL-E in a majority of cases.

On the other hand, the comparison between SL-E and SL-A shows that SL-A demonstrably outperforms SL-E for both function and disease/trait prediction tasks. Among the 30 gene set collection–network combinations, SL-A is the significantly better model 20 times (67%), whereas SL-E comes out on top just once (3%). This shows that although methods using node embeddings are a promising avenue of research, they should be compared to the strong baseline of SL-A when possible. Meanwhile, the inherently low-dimensional nature of network embeddings offers substantial computational advantages over using full network features. Therefore, even though there is currently a notable performance disparity between SL-E and SL-A, there lies valuable potential in further advancing graph embedding techniques, enabling accurate and efficient supervised learning on hundreds of gene classification tasks.

2.4 Conclusion

In this work, we have established that supervised learning systematically outperforms label propagation for network-based gene classification across networks and prediction tasks, including functions, diseases, and traits. In particular, supervised learning, where every gene is its own feature, can capture network information just as well as label propagation. We further demonstrated that supervised learning on the adjacency matrix demonstrably outperforms supervised learning using node embeddings, and we thus strongly recommend that future work on using node embeddings for gene classification draw a comparison to supervised learning on the adjacency matrix.
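As a concrete illustration of the SL-A setup described above, the following sketch trains a classifier whose feature vector for each gene is simply that gene's row of the adjacency matrix, i.e., "every gene is its own feature". The toy graph, the choice of scikit-learn's logistic regression, and the hyperparameters are illustrative assumptions, not the exact pipeline used in this study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy symmetric adjacency matrix: two assortative groups of 20 nodes each.
rng = np.random.default_rng(0)
n = 40
A = np.zeros((n, n))
labels = np.array([1] * (n // 2) + [0] * (n // 2))
for i in range(n):
    for j in range(i + 1, n):
        # Higher edge probability within a group than across groups.
        p_edge = 0.5 if labels[i] == labels[j] else 0.05
        if rng.random() < p_edge:
            A[i, j] = A[j, i] = 1.0

# SL-A: each gene's feature vector is its row of the adjacency matrix.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
scores = cross_val_score(clf, A, labels, cv=5, scoring="roc_auc")
print(scores.mean())  # well above 0.5 on this clustered toy graph
```

Because genes of the same class share neighbors in this toy graph, their adjacency rows are similar, which is exactly the local-connectivity signal SL-A exploits.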
Despite the observed performance discrepancies, the development of node embeddings remains a worthwhile pursuit due to their notable computational efficiency when training gene classification models.

2.5 Extensions

PyGenePlexus [167] packages the functionalities of this work into user-friendly software written in Python (https://github.com/krishnanlab/PyGenePlexus). Given any queried gene set, users can easily build models and generate predictions using SL or LP methods on the preprocessed molecular interaction networks. These predictions indicate how related every gene in the network is to the input gene set. Additionally, PyGenePlexus offers interpretability by highlighting biological processes relevant to the query gene set and returns the network connectivity of the top predicted genes, helping biomedical researchers gain deeper insights into their problems of interest.

GenePlexusWeb [164] takes the usability of GenePlexus one step further by offering the same functionalities as PyGenePlexus on a website (https://www.geneplexus.net/). Once a user uploads their own set of human genes and chooses among a number of different human network representations, GenePlexusWeb returns a table showing how relevant every gene in the network is to the input set. Interpretability is enhanced by visualizing the subnetwork induced by the top predicted genes on the website.

GenePlexusZoo [165] extends GenePlexus to five species beyond human by considering many-to-many orthology information. It casts molecular networks across multiple species into a common embedding space using PecanPy (Chapter 3) and builds SL models on them. This cross-species extension of GenePlexus significantly advances research on human genes by leveraging current knowledge and experimental data from model organisms.
We demonstrate that this multi-species network representation improves both gene classification within a single species and knowledge transfer across species, even in cases where the inter-species correspondence is undetectable based on shared orthologous genes. Thus, GenePlexusZoo effectively leverages the high evolutionary conservation of molecules, functions, and phenotypes across species to discover novel genes associated with diverse biological contexts.

CHAPTER 3

PECANPY: A FAST, EFFICIENT AND PARALLELIZED PYTHON IMPLEMENTATION OF NODE2VEC

Learning low-dimensional representations (embeddings) of nodes in large graphs is critical to applying machine learning on massive biological networks. Node2vec is the most widely used method for node embedding. However, its original Python and C++ implementations scale poorly with network density, failing for dense biological networks with hundreds of millions of edges. We have developed PecanPy, a new Python implementation of node2vec that uses cache-optimized compact graph data structures and precomputation/parallelization to produce fast, high-quality node embeddings for biological networks of all sizes and densities. The PecanPy software and documentation are freely available at https://github.com/krishnanlab/pecanpy, along with the code to reproduce all the benchmarking experiments at https://github.com/krishnanlab/pecanpy_benchmarks.

3.1 Introduction

Large-scale molecular networks are powerful models that capture interactions between biomolecules (genes, proteins, metabolites) on a genome scale [171] and provide a basis for predicting novel associations between individual genes/proteins and various cellular functions, phenotypic traits, and complex diseases [153, 231].
An area of research that has gained rapid adoption in network science across disciplines is learning low-dimensional numerical representations, or embeddings, of nodes in a network, making it easy to leverage machine learning (ML) algorithms to analyze large networks [78, 39, 88]. Since each node's embedding vector concisely captures its network connectivity, node embeddings can be conveniently used as feature vectors in any ML algorithm to learn/predict node properties or links [88]. One of the earlier node embedding methods that continues to show good performance in various node classification tasks, especially on biological networks [153, 291, 183], is a random-walk-based approach called node2vec [83]. Recent studies on network-based gene classification have shown that node2vec achieves the best performance among state-of-the-art embedding methods for gene classification [291] and that embeddings generated by node2vec achieve prediction performance comparable to state-of-the-art label propagation methods [153]. Despite its popularity, the original node2vec software implementations, written in Python or C++, have significant bottlenecks that prevent their seamless application to all current biological networks. First, due to inefficient memory usage and data structures, they do not scale to the large and dense networks produced by integrating several data sources on a genome scale (17–26k nodes and 3–300 million edges) [80, 246]. Second, the embarrassingly parallel computations of transition probabilities and random walk generation are not parallelized in the original software; resolving these issues will enable embedding large and dense biological networks, even large biological knowledge graphs. Finally, the original implementations only support integer-type node identifiers (IDs), making it inconvenient to work with molecular networks typically available in databases where nodes may have non-integer IDs.
Recent work presented in preprints [294] and unpublished code repositories has proposed other implementations of node2vec (Appendix B.1). However, these either do not provide publicly available software or do not present a full analysis of the implementation, including a benchmark that ensures the quality of the resulting embeddings. Here, we present PecanPy, an efficient Python implementation of node2vec that is parallelized, memory efficient, and accelerated using Numba [132] with a cache-optimized data structure (Figure 3.1). We have extensively benchmarked our software using networks from the original study and multiple additional large biological networks to demonstrate both the computational performance and the quality of the node embeddings. In the rest of this paper, we first summarize the optimization and performance of PecanPy and then go into the details of our implementation.

3.2 PecanPy implementation and optimization

The node2vec procedure consists of four stages: loading, preprocessing, walking, and training. Comprehensive evaluations of the four stages (Figures B.1, B.2, B.3, B.4, B.5) show that the training stage takes only 1.2% (median) of the total runtime for the original Python implementation, in contrast to 95.1% for the original C++ implementation. The low fraction of training runtime indicates the efficiency of skip-gram training using the gensim Python package [216]. Therefore, we only optimize the first three stages of the node2vec program. We focus on a Python implementation because Python is currently the most widely used high-level language in machine learning, making our software convenient to use and develop further as part of the community; moreover, the Numba compiler can be used to achieve C++-level performance.

Figure 3.1: Overview of PecanPy for fast and efficient node2vec. PecanPy is a Python implementation of the node2vec algorithm. It can operate in three different modes – PreComp, SparseOTF, and DenseOTF – that are optimized for networks of different sizes and densities: PreComp for networks that are small (≤ 10,000 nodes; any density), SparseOTF for networks that are large and sparse (> 10,000 nodes and ≤ 10% of possible edges, i.e., density ≤ 0.1), and DenseOTF for dense networks (> 10,000 nodes and > 10% of possible edges). These modes take advantage of compact sparse row (CSR) or dense-matrix graph data structures, precomputed transition probabilities (prob.), or 2nd-order transition probabilities computed during walk generation to achieve significant improvements in performance.

Our optimization spans three aspects: (1) we implement computationally and memory-optimized graph data structures with efficient loading strategies, (2) we provide an option to compute transition probabilities on the fly without saving them, leading to significant reductions in memory usage, and (3) we parallelize the computation of transition probabilities and the generation of walks. Finally, as a straightforward but critical improvement for usability, we accept string-type node IDs in addition to the integer type, which was the only ID type accepted by the original implementations. We present these improvements as PecanPy, a new software for parallelized, efficient, and accelerated node2vec in Python (Figure 3.1). PecanPy operates in three different modes – PreComp, SparseOTF, and DenseOTF – each optimized for networks of different sizes and densities (Table 3.1). PreComp precomputes and stores all second-order transition probabilities, as in the original implementations, and is hence more suitable for small and sparse networks. In contrast, SparseOTF and DenseOTF both compute second-order transition probabilities on the fly during walk generation, without saving them. SparseOTF (like PreComp), with its use of a compact sparse row (CSR) matrix representation, is better suited for networks that are large and sparse.
DenseOTF, which uses the full adjacency matrix as the underlying graph data structure, is well suited for dense networks. In the rest of this section, we provide a detailed description of the optimizations in PecanPy.

Table 3.1: Capabilities of different node2vec implementations.

| Implementation | Graph data structure | Precomp. trans. prob. | Parallelized walk | Non-integer IDs |
| Original Python impl. | NetworkX | ✓ | ✗ | ✗ |
| Original C++ impl. | NetworkX | ✓ | ✗ | ✗ |
| nodevectors | NetworkX | ✗ | ✗ | ✓ |
| PecanPy-PreComp | CSR | ✓ | ✓ | ✓ |
| PecanPy-SparseOTF | CSR | ✗ | ✓ | ✓ |
| PecanPy-DenseOTF | Dense matrix | ✗ | ✓ | ✓ |

3.2.1 Efficient graph data structure improves network loading and memory usage

First, we improve the underlying graph data structure, making it more efficient to load and operate on the graph data. The original Python implementation uses NetworkX [87], which implicitly assumes that the input is a multigraph, to handle all network operations via a graph object in the form of nested dictionaries (dict-of-dict-of-dict). The levels of these dictionaries correspond to nodes, neighbors of a node, edge types, and edge weights; thus, an explicit declaration of the edge type is required for every weighted edge. However, node2vec only deals with homogeneous networks, where only one edge type is present; the original Python implementation sets all edge types to "weight" by default. This extra piece of information requires 295 additional bytes of memory for every single edge stored in a NetworkX graph (empty dictionary = 240 bytes, empty string = 49 bytes, single character = 1 byte). It causes not only memory overhead but also computational overhead from reading the "edge type" dictionary, which is irrelevant for homogeneous networks. To address these issues, we implemented a lite graph object as a network loader in the form of a list-of-dict, which assumes the network has only one type of edge.
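A minimal sketch of such a list-of-dict loader (the function name and details here are illustrative, not PecanPy's actual code) also shows how non-integer node IDs can be mapped to array indices:

```python
def load_edgelist(lines, weighted=False):
    """Minimal list-of-dict network loader (illustrative sketch):
    graph[i] is the {neighbor_index: weight} dict of node i."""
    ids = {}     # map raw (possibly non-integer) node IDs to indices
    graph = []
    def idx(raw):
        if raw not in ids:
            ids[raw] = len(graph)
            graph.append({})
        return ids[raw]
    for line in lines:
        parts = line.split()
        u, v = idx(parts[0]), idx(parts[1])
        w = float(parts[2]) if weighted else 1.0
        graph[u][v] = w  # undirected: store the edge in both directions
        graph[v][u] = w
    return ids, graph

ids, graph = load_edgelist(["geneA geneB 0.8", "geneB geneC 0.3"], weighted=True)
print(graph[ids["geneB"]])  # {0: 0.8, 2: 0.3}
```

Unlike NetworkX's dict-of-dict-of-dict, there is no per-edge "edge type" dictionary, so the redundant 295 bytes per edge are avoided.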
As shown in Figure B.5, the loading time and memory usage for list-of-dict (first and second bars in each group) are significantly lower than those for NetworkX. Thus, our lite graph object loads networks with less memory and in less time than NetworkX.

3.2.2 Cache optimization further enhances operations on graphs

Next, we optimize the graph data structure further for better cache utilization during computation. Despite faster loading and lower memory usage, operating on the Python dictionaries of the list-of-dict is still suboptimal for cache utilization. Specifically, the neighboring-edge data, which are often used together to compute transition probabilities, are not physically close to each other in memory. Meanwhile, caches are designed around the principle that physically nearby units of memory are likely to be used together: every time a specific piece of data in memory is accessed, a chunk of neighboring physical memory – called a cache line – is also copied from RAM to cache, and reading from the cache can be up to 100 times faster than reading from RAM. Thus, to fully leverage the spatial locality of cache lines, and inspired by a recent blog post (https://www.singlelunch.com/2019/08/01/700x-faster-node2vec-models-fastest-random-walks-on-a-graph/), we further convert the list-of-dict graph data structure to the compact sparse row (CSR) format, implemented using NumPy arrays [252]. In this way, neighboring-edge data are placed physically close together, improving cache utilization. These optimizations result in speedups in the preprocessing (Figure B.4C) and walk generation (Figure B.4B) steps of PreComp compared to the original Python implementation in the single-core setup. More specifically, on the BlogCatalog and Wikipedia networks, PreComp achieves up to an order-of-magnitude speedup in the preprocessing phase over the original Python implementation.
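The list-of-dict-to-CSR conversion can be sketched as follows (an illustrative toy version, not PecanPy's optimized code); the point is that each node's neighbor IDs and edge weights end up in contiguous NumPy array slices:

```python
import numpy as np

# A small weighted graph as a list-of-dict: index = node, dict = {neighbor: weight}.
graph = [{1: 0.5, 2: 1.0}, {0: 0.5}, {0: 1.0, 3: 2.0}, {2: 2.0}]

# Convert to CSR: indptr[i]:indptr[i+1] delimits node i's entries in the
# flat indices/data arrays, so each node's edges are contiguous in memory.
indptr = np.zeros(len(graph) + 1, dtype=np.uint32)
for i, nbrs in enumerate(graph):
    indptr[i + 1] = indptr[i] + len(nbrs)
indices = np.empty(indptr[-1], dtype=np.uint32)
data = np.empty(indptr[-1], dtype=np.float64)
for i, nbrs in enumerate(graph):
    for k, (j, w) in enumerate(sorted(nbrs.items())):
        indices[indptr[i] + k] = j
        data[indptr[i] + k] = w

# Neighbors of node 2 are one contiguous slice -- no dictionary hopping.
print(indices[indptr[2]:indptr[3]])  # [0 3]
print(data[indptr[2]:indptr[3]])     # [1. 2.]
```

Because a node's neighbors sit in one slice, a transition-probability computation touches consecutive memory addresses and benefits from cache lines.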
Due to the compactness of the CSR graph data structure, memory usage is further reduced compared to the list-of-dict graph data structure, as shown in Figure B.5B. However, CSR's compactness also hinders dynamic construction of sparse matrices, as adding new entries requires reconstructing the whole representation. Therefore, in practice, we first use a list-of-dict to load the network into memory and then convert the full network to CSR, on which all subsequent computations are performed.

3.2.3 Optimization for dense networks

The CSR representation described above is memory-efficient only if the network is relatively sparse, because it stores each non-zero entry of the adjacency matrix using both the index and the weight of the edge. For dense networks, this explicit indexing of edges leads to memory overhead, for the same reasons that redundant edge-type information strains memory, as discussed above. To address this issue, for dense networks like GIANT-TN (25,825 nodes; fully connected and weighted, with 333,452,400 edges), we use a dense NumPy matrix instead of CSR to store the graph data. CSR and dense matrices show similar walk-generation times, as illustrated by a comparison between SparseOTF, which uses CSR, and DenseOTF (Figure B.1B). For dense networks, however, using a dense matrix rather than CSR results in faster network loading and lower memory usage. For dense networks, loading takes up most of the runtime; for example, in the multi-core setup, loading the GIANT-TN network as CSR contributes nearly 80% of the runtime (Figure B.1H). This burden is mostly due to the inefficiency of reading edge-list files as text, line by line. To mitigate the long loading time for dense networks, we implemented an option in our software that lets users first convert the edge-list file to dense-matrix format and save it as a binary NumPy npz file, which can then be loaded as a network in the future.
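The save-once, load-fast idea can be sketched with plain NumPy (the matrix and file name are illustrative stand-ins for a real network):

```python
import os
import tempfile

import numpy as np

# Toy stand-in for a dense weighted adjacency matrix.
rng = np.random.default_rng(0)
n = 500
adj = rng.random((n, n))
adj = (adj + adj.T) / 2  # make the weights symmetric

# One-time conversion: store the matrix in binary .npz form ...
path = os.path.join(tempfile.mkdtemp(), "network.npz")
np.savez(path, adj=adj)

# ... so later runs can load it directly instead of parsing a text
# edge list line by line, which dominates runtime for dense networks.
loaded = np.load(path)["adj"]
assert np.array_equal(loaded, adj)
```

Binary loading reads the array buffer directly, avoiding the per-line parsing and float conversion that make text edge lists slow at this scale.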
As shown in Figure S6A, for sparse networks like PPI, loading networks as CSR is faster than loading them as npz files, but as the density of the network increases, loading networks as npz files becomes the better option. Similar arguments can be made for memory usage (Figure B.5B). For GIANT-TN (Figure B.3D, SparseOTF vs. DenseOTF), loading as npz took only 9 seconds with 6GB of peak memory usage, a 70-fold speedup over CSR, which took more than 10 minutes to load with 39GB of peak memory usage. This density-dependent trade-off raises the question of when exactly a dense matrix is preferable to CSR. A dense matrix requires 8 × |𝑉|² bytes of memory, while CSR requires roughly 12 × |𝐸| bytes (using a 32-bit unsigned integer as the index and a 64-bit floating point number as the data), where |𝑉| is the number of nodes and |𝐸| is the number of edges. Hence, in theory, CSR should be preferred over the corresponding dense matrix for any network with a density below two-thirds. However, since CSR cannot be loaded directly but requires an intermediate conversion through a list-of-dict, the peak memory usage is also affected by the list-of-dict graph data structure. Based on the observations from Figure B.5B, where the fold difference in peak memory usage between CSR and the NumPy matrix is much smaller for other sparse networks like BioGRID, we empirically set the balancing point for network density to around 0.1.

3.2.4 Reducing memory usage via on-the-fly transition probability computation

Next, we detail our approach to reducing memory usage by computing 2nd-order transition probabilities on the fly. The original node2vec implementations precompute and store all 2nd-order transition probabilities in advance, which takes up at least |𝐸|²/|𝑉| space in memory. One solution is to calculate the 2nd-order transition probabilities on the fly (OTF) during walk generation, without saving them.
Another reason for not precomputing 2nd-order transition probabilities is that, as the network becomes larger and denser, most of the precalculated 2nd-order transition probabilities are likely never used during walk generation, wasting not only the space but also the invested computation time. We combine the CSR and dense-matrix representations, each with the OTF strategy, into separate modes in our software, called SparseOTF and DenseOTF, respectively. As an example of the improvement made by the OTF strategy, the DenseOTF implementation can embed a dense network like the ∼26K-node, fully connected, weighted GIANT-TN network with >333M edges in just an hour, with only 6GB of peak memory usage, using a single core. Remarkably, this means that even for an extremely dense and large network like GIANT-TN, the software can be run on a personal computer configured with a reasonable amount of memory (e.g., 16GB). Conversely, both original implementations fail to run even the sparsified version GIANT-TN-c01 (a similar number of nodes but only ∼11% of the edges in GIANT-TN) on a supercomputer configured with 200GB of memory. Moreover, the 1-hour runtime for embedding the GIANT-TN network using DenseOTF on a single core is even shorter than that for embedding the significantly smaller and sparser STRING network (67% of the nodes and 1% of the edges of GIANT-TN) using the original Python implementation with 28 cores, which took 5 hours to finish. For relatively small and sparse graphs where the amount of memory on a computer is sufficient to fit all 2nd-order transition probabilities, precomputation can indeed save walk-generation time by avoiding redundant computation of transition probabilities (Figure B.1B). Hence, we leave the speed-memory trade-off decision to the user by providing both the precomputation (PreComp) and the on-the-fly (SparseOTF or DenseOTF) modes in the PecanPy software.
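The on-the-fly idea can be illustrated with a single biased walk step computed directly from the CSR arrays (a simplified sketch using a plain linear scan and NumPy sampling; PecanPy's actual Numba kernels are more involved). The bias factors are node2vec's usual 1/p for returning to the previous node, 1 for neighbors shared with the previous node, and 1/q otherwise:

```python
import numpy as np

def otf_step(indptr, indices, data, prev, cur, p, q, rng):
    """One 2nd-order node2vec step with biases computed on the fly
    (illustrative sketch, not PecanPy's optimized kernel)."""
    nbrs = indices[indptr[cur]:indptr[cur + 1]]
    wts = data[indptr[cur]:indptr[cur + 1]].astype(np.float64).copy()
    prev_nbrs = set(indices[indptr[prev]:indptr[prev + 1]].tolist())
    for k, x in enumerate(nbrs):
        if x == prev:                # bias 1/p: return to the previous node
            wts[k] /= p
        elif x not in prev_nbrs:     # bias 1/q: move away from the previous node
            wts[k] /= q
        # else: x is also a neighbor of prev -> bias factor 1
    return rng.choice(nbrs, p=wts / wts.sum())

# Tiny triangle-plus-tail graph (edges 0-1, 0-2, 1-2, 2-3) in CSR form.
indptr = np.array([0, 2, 4, 7, 8])
indices = np.array([1, 2, 0, 2, 0, 1, 3, 2])
data = np.ones(8)
rng = np.random.default_rng(0)
nxt = otf_step(indptr, indices, data, prev=0, cur=2, p=1.0, q=1.0, rng=rng)
assert nxt in (0, 1, 3)
```

Nothing is stored between steps: the biased distribution is built, sampled from, and discarded, which is what keeps the OTF modes' memory footprint independent of |𝐸|²/|𝑉|.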
3.2.5 Parallel random walk generation

All three modes of our new node2vec implementation are fully parallelized. As mentioned earlier, the computation of each node's transition probabilities and the generation of random walks are embarrassingly parallel. Specifically, during walk generation, each node is used as the starting point of an independent fixed-length random walk on the graph, which stops early only when a dead end is reached. This process is repeated multiple times, depending on the parameter specified by the user. As each random walk is independent, multiple walks can be performed in parallel. Similarly, the precomputation of 2nd-order transition probabilities depends only on the 1st-order transition probabilities. Hence, in this work, the processes of walk generation and 2nd-order transition probability precomputation are parallelized using Numba.

3.3 Benchmarking computational efficiency and quality of embeddings

3.3.1 Baseline node2vec implementations

We compile a list of existing node2vec implementations (Appendix B.1) and note that, beyond the original Python and C++ implementations, none of the others have attributes likely to improve speed and memory usage over the originals, with the exception of one implementation called nodevectors. Benchmarking results for overall speed and memory usage indicate that nodevectors handled at least one dense network better than the original implementations. However, the node embeddings produced by nodevectors achieve very poor performance in node classification tasks (Figure 3.3). Therefore, we conduct our detailed, stage-by-stage benchmark of PecanPy only against the original Python and C++ implementations.

3.3.2 Benchmarking network data

We use a collection of eight networks for benchmarking the different node2vec implementations.
Table 3.2 shows the summary statistics of these networks, including the number of nodes, the number of edges, and the network density. PPI, BlogCatalog, and Wikipedia are from the original node2vec paper [83] (downloaded from the node2vec webpage, https://snap.stanford.edu/node2vec/). BioGRID [240], STRING [245], and GIANT-TN [80] are molecular interaction networks (downloaded from https://doi.org/10.5281/zenodo.3352323), where nodes are proteins/genes and edges are interactions between them. GIANT-TN-c01 is a sub-network of GIANT-TN in which edges with weight below 0.01 are discarded. SSN200 [135] is a cross-species network of proteins from 200 species (downloaded from https://bioinformatics.cs.vt.edu/~jeffl/supplements/2019-fastsinksource/), with edges representing protein sequence similarities.

Table 3.2: Properties of the diverse networks used in this study.

| Network | Weighted | Num. nodes | Num. edges | Density | File size |
| PPI | ✗ | 3,852 | 38,273 | 5.16E-3 | 707 KB |
| Wikipedia | ✓ | 4,777 | 92,406 | 8.10E-3 | 2.0 MB |
| BlogCatalog | ✗ | 10,312 | 333,983 | 6.28E-3 | 3.2 MB |
| BioGRID | ✗ | 20,558 | 238,474 | 1.13E-3 | 2.5 MB |
| STRING | ✓ | 17,352 | 3,640,737 | 2.42E-2 | 60 MB |
| SSN200 | ✓ | 814,731 | 72,618,574 | 2.19E-4 | 2.0 GB |
| GIANT-TN-c01 | ✓ | 25,689 | 38,904,929 | 1.18E-1 | 1.1 GB |
| GIANT-TN | ✓ | 25,825 | 333,452,400 | 1.00E+0 | 7.2 GB |

3.3.3 Runtime and memory usage

We profile the runtime and memory usage of each implementation on each network. All benchmarking experiments are run on compute units with 28-core Intel Xeon E5-2680 v4 CPUs @ 2.4GHz. Each of the four stages in the node2vec program is timed individually; to profile the original implementations, we add timers to each stage using the built-in timing functions in Python and C++. Memory usage is profiled using the system's GNU time utility, which measures the maximum resident set size (physical memory usage) over the runtime of a program. We profile the runtime and memory usage of each implementation using two different resource configurations.
The primary results presented are based on the multi-core configuration: 28 cores, 200GB of allocated memory, and a 24-hour wall-time limit. We also profile the implementations with a single-core configuration: one core, 32GB of allocated memory, and an 8-hour wall-time limit. The multi-core setup emulates performance on a high-performance computing facility, while the single-core setup emulates the software being run on a personal computer.

3.3.4 Node embedding quality evaluation

Each of the three networks used in the node2vec paper (PPI, BlogCatalog, Wikipedia) contains node labels. We use these node classification tasks to probe the quality of the node embeddings produced by the different implementations. Some of the label sets from the data repository have very few positive examples; for a more rigorous evaluation, we only use label sets containing at least ten positive examples. In total, there are 38 node classes in BlogCatalog, 50 in PPI, and 21 in Wikipedia. For each node class, a one-vs-rest L2-regularized logistic regression model is trained and evaluated through 5-fold cross-validation. Each test fold is evaluated by the area under the receiver operating characteristic curve (auROC) separately, and the mean auROC score across the five folds is reported. This process is repeated ten times for each label set, and the mean of the reported scores is taken as the final evaluation score. Using this procedure, each network-implementation pair yields a list of auROC scores, one per node class in the network. For each prediction task, we perform a Wilcoxon paired test to compare this list of scores to that of the original Python implementation.
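On synthetic data, the per-label evaluation loop and the final paired test can be sketched as follows (the toy embeddings and label generation are illustrative assumptions; the real evaluation uses the actual label sets and repeats each one ten times):

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(300, 16))                        # embeddings, implementation A
emb_b = emb_a + rng.normal(scale=0.01, size=emb_a.shape)  # near-identical implementation B

def mean_auroc(emb, y):
    """Mean auROC of a one-vs-rest L2 logistic regression over 5-fold CV."""
    clf = LogisticRegression(penalty="l2", max_iter=1000)
    return cross_val_score(clf, emb, y, cv=5, scoring="roc_auc").mean()

# One binary label set per "node class"; labels loosely follow feature 0.
scores_a, scores_b = [], []
for _ in range(10):
    y = (emb_a[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
    scores_a.append(mean_auroc(emb_a, y))
    scores_b.append(mean_auroc(emb_b, y))

# Paired Wilcoxon test across label sets: is one implementation's
# embedding quality significantly different from the other's?
stat, pval = wilcoxon(scores_a, scores_b)
print(round(pval, 3))
```

With near-identical embeddings, as here, the paired score differences are tiny and the test is typically non-significant, mirroring the comparison between embeddings of equal quality.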
The resulting statistics are used to determine whether the quality of the node embeddings from a particular implementation is statistically significantly different from that of the original implementation.

3.4 Results and discussions

We comprehensively benchmark PecanPy and the original Python and C++ implementations of node2vec on a collection of eight networks, including three networks from the original node2vec paper [83] and five large biological networks, together spanning a wide range of sizes (approximately 4K to 800K nodes and approximately 38K to 333M edges) and densities (0.02% to 100%; Table 3.2; Figures 3.2, B.6, B.7, and B.8).

3.4.1 PecanPy significantly improves runtime and memory usage for node2vec over the original implementations

Across the board, PecanPy (in one of its three modes) is substantially faster than the original implementations (Figure 3.2). In fact, three large networks (SSN200, GIANT-TN-c01, GIANT-TN) run successfully only with PecanPy's OTF implementations. Other implementations failed to run the GIANT-TN network due to memory limitations that arise from storing 2nd-order transition probabilities. The original software failed to run SSN200 because it does not support non-integer-type node IDs, which are supported in PecanPy. DenseOTF failed for SSN200 since its dense-network design would require more than 5TB of memory to create a double-precision dense matrix for over 800k nodes. However, DenseOTF considerably improves memory usage and speed for large dense networks like GIANT-TN due to its more efficient network loading scheme, i.e., reading a NumPy array file instead of a text (edge list) file (Figure B.3). For relatively small and sparse networks (e.g., BioGRID, BlogCatalog), using PreComp invariably results in faster walk generation and thus an overall shorter runtime (Figure B.3). These results underscore the importance of the three modes of PecanPy. Similarly, in terms of memory usage, one of the modes of PecanPy reduces the maximum resident size by up to two orders of magnitude compared to the original implementations. All these trends are magnified on a single core (Figures B.2, B.6, B.8). Another implementation, nodevectors, which handled at least one dense network better than the original implementations, still performed worse overall than PecanPy (Figure B.9). Together, these results demonstrate that PecanPy is much more computationally and memory efficient than the original node2vec implementations.

Figure 3.2: Summary of runtime and memory of PecanPy and the original implementations of node2vec using multiple cores. The eight networks of varying sizes and densities are along the rows; the software implementations are along the columns. The first heatmap (left) shows the performance of the original Python and C++ software along with the three modes of PecanPy (PreComp, SparseOTF, and DenseOTF). The adjacent two-column heatmap (right) summarizes the performance of the original (best of the Python and C++ versions) and PecanPy (best of PreComp, SparseOTF, and DenseOTF) implementations. Lighter colors correspond to lower runtime in panel A and lower memory usage in panel B. Crossed grey cells indicate that the particular implementation (column) failed to run for a particular network (row).

Figure 3.3: Evaluation of embeddings in node classification tasks. Each group of boxplots corresponds to one of three networks. Individual boxplots in a group correspond to the distribution of auROC scores using node embeddings generated by a specific implementation (different colors).
3.4.2 PecanPy produces node2vec embeddings of the same quality as the original implementations

To ensure the quality of the node embeddings generated by our new implementation, we evaluated their use as feature vectors in node classification tasks using datasets from the original paper, including BlogCatalog, PPI, and Wikipedia. As shown in Figure 3.3, our implementations achieve the same performance as the original Python implementation. On the other hand, the original C++ implementation is significantly worse than the original Python implementation for PPI (Wilcoxon p-value = 1.98e-3) while being significantly better for Wikipedia (Wilcoxon p-value = 2.41e-4). These differences are likely due to the different skip-gram implementations in the C++ and Python versions. Finally, the performance of nodevectors feature vectors is worse than that of all other implementations, potentially due to implementation errors by the author. In summary, PecanPy is an accurate reimplementation of the original node2vec, yielding statistically indistinguishable performance on downstream node classification tasks.

3.5 Conclusion

We have developed an efficient node2vec Python software, PecanPy, with significant improvements in both speed and memory usage. Extensive benchmarks have demonstrated that PecanPy can efficiently generate quality node embeddings for networks at multiple scales, including large (>800k nodes) and dense (fully connected network of 26k nodes) networks that the original implementations failed to process. PecanPy is freely available at https://github.com/krishnanlab/pecanpy, can be easily installed via the pip package-management system (https://pypi.org/project/pecanpy/), and has been confirmed to work on a variety of networks, weighted or unweighted, with a wide range of sizes and densities. Therefore, it can find broad utility beyond biology.
CHAPTER 4

ACCURATELY MODELING BIASED RANDOM WALKS ON WEIGHTED NETWORKS USING NODE2VEC+

Accurately representing biological networks in a low-dimensional space, also known as network embedding, is a critical step in network-based machine learning and is carried out widely using node2vec, an unsupervised method based on biased random walks. However, while many networks, including functional gene interaction networks, are dense, weighted graphs, node2vec is fundamentally limited in its ability to use edge weights during the biased random walk generation process, thus underusing the information in the network. Here, we present node2vec+, a natural extension of node2vec that accounts for edge weights when calculating walk biases and reduces to node2vec in the cases of unweighted graphs or unbiased walks. Using two synthetic datasets, we empirically show that node2vec+ is more robust to additive noise than node2vec in weighted graphs. Then, using genome-scale functional gene networks to solve a wide range of gene function and disease prediction tasks, we demonstrate the superior performance of node2vec+ over node2vec on weighted graphs. Notably, due to the limited amount of training data in the gene classification tasks, graph neural networks such as GCN and GraphSAGE are outperformed by both node2vec and node2vec+. Code to reproduce the benchmarking results is available on GitHub: https://github.com/krishnanlab/node2vecplus_benchmarks.

4.1 Introduction

Graphs and networks naturally appear in many real-world datasets, including social networks and biological networks. The graph structure provides insightful information about the role of each node in the graph, such as protein function in a protein-protein interaction network [153, 128]. To more efficiently and effectively mine information from large-scale graphs with thousands or millions of nodes, several node embedding methods have been developed [53, 89].
Among them, node2vec has been the top choice in bioinformatics due to its superior performance compared to many other methods [16, 291]. However, many biological networks, such as those from [80, 113], are dense and weighted by construction, which we demonstrate to be undesirable conditions for node2vec that can lead to sub-optimal performance. Node2vec [83] is a second-order random walk based embedding method. It is widely used for unsupervised node embedding for various tasks, particularly in computational biology [183], such as gene function prediction [153], disease gene prediction [199, 15], and essential protein prediction [266, 292]. Some recent works built on top of node2vec aim to adapt node2vec to more specific types of networks [270, 251], generalize node2vec to higher dimensions [86], augment node2vec with additional downstream processing [97], or study node2vec theoretically [82, 56, 212]. Nevertheless, none of these follow-up works account for the fact that node2vec is less effective for weighted graphs, where the edge weights reflect the (potentially noisy) similarities between pairs of nodes. This shortcoming is due to the inability of node2vec to differentiate between small and large weights on edges connecting the previous vertex with a potential next vertex in the random walk, which subsequently causes less accurate modeling of the intended walk bias. Meanwhile, another line of recent work on graph neural networks (GNNs) has shown remarkable performance in prediction tasks that involve graph structure, including node classification [33, 279]. Although GNNs and embedding methods like node2vec are related in that they both aim at projecting nodes in the graph to a feature space, two main differences set them apart. First, GNNs typically require labeled data, while embedding methods do not. This label dependency ties the embeddings generated by a GNN to the quality of the labels, which in some cases, like in biological networks, are noisy and scarce.
Second, GNNs typically require node features as input to train, which are not always available. In the absence of given node features, one needs to generate them, and GNN algorithms often use trivial node features such as constant features or node degree features. These two differences give node embedding methods a unique place in node classification, apart from the GNN methods. Here, we propose an improved version of node2vec that is more effective for weighted graphs by taking into account the weight of the edge connecting the previous vertex and the potential next vertex. The proposed method, node2vec+, is a natural extension of node2vec; when the input graph is unweighted, the resulting embeddings of node2vec+ and node2vec are equivalent in expectation. Moreover, when the bias parameters are set to neutral, node2vec+ recovers a first-order random walk, just as node2vec does. Finally, we demonstrate the superior performance of node2vec+ through extensive benchmarking on both synthetic datasets and network-based gene classification datasets using various functional gene interaction networks. Node2vec+ is implemented as part of PecanPy [151] and is available on GitHub: https://github.com/krishnanlab/PecanPy.

4.2 Methods

We start by briefly reviewing the node2vec method. Then we illustrate that node2vec is less effective for weighted graphs due to its inability to identify out edges. Finally, we present a natural extension of node2vec that resolves this issue.

4.2.1 Node2vec overview

In the setting of node embeddings, we are interested in finding a mapping 𝑓 : 𝑉 → ℝ^𝑑 that maps each node 𝑣 ∈ 𝑉 to a 𝑑-dimensional vector so that the mutual proximity between pairs of nodes in the graph is preserved. In particular, a random walk based approach aims to maximize the probability of reconstructing the neighborhood of any node in the graph based on some sampling strategy 𝑆.
Formally, given a graph 𝐺 = (𝑉, 𝐸) (the analysis generalizes to directed and/or weighted graphs), we want to maximize the log probability of reconstructing the sampled neighborhood N_𝑆(𝑣) for each 𝑣 ∈ 𝑉:

    \max_f \sum_{v \in V} \log P\big(N_S(v) \mid f(v)\big)    (4.1)

Under the conditional independence assumption, and the parameterization of the probabilities as softmax-normalized inner products [83, 173], the objective function above simplifies to:

    \max_f \sum_{v \in V} \Big( \sum_{v' \in N_S(v)} \langle f(v'), f(v) \rangle - \log Z_v \Big)    (4.2)

In practice, the partition function Z_v = \sum_{v' \in V} \exp\big(\langle f(v), f(v') \rangle\big) is approximated by negative sampling [174] to save computational time. Given any sampling strategy 𝑆, equation (4.2) can find the corresponding embedding 𝑓, which is achieved in practice by feeding the generated random walks to the skip-gram model with negative sampling [173].

Node2vec devises a second-order random walk as the sampling strategy. Unlike a first-order random walk [202], where the transition probability of moving to the next vertex 𝑣𝑛, denoted P(𝑣𝑛|𝑣𝑐), depends only on the current vertex 𝑣𝑐, a second-order random walk also depends on the previous vertex 𝑣𝑝, with transition probability P(𝑣𝑛|𝑣𝑐, 𝑣𝑝). It does so by applying a bias factor 𝛼𝑝𝑞(𝑣𝑛, 𝑣𝑝) to the edge (𝑣𝑐, 𝑣𝑛) ∈ 𝐸 that connects the current vertex and a potential next vertex. This bias factor is a function of the relation between the previous vertex and the potential next vertex, and is parameterized by the return parameter 𝑝 and the in-out parameter 𝑞.
In this way, the random walk can be generated based on the following transition probabilities:

    P(v_n \mid v_c, v_p) =
    \begin{cases}
    \dfrac{\alpha_{pq}(v_n, v_p)\, w(v_c, v_n)}{\sum_{v \in N(v_c)} \alpha_{pq}(v, v_p)\, w(v_c, v)} & \text{if } (v_c, v_n) \in E \\
    0 & \text{otherwise}
    \end{cases}    (4.3)

where the bias factor is defined as:

    \alpha_{pq}(v_n, v_p) =
    \begin{cases}
    1/p & \text{if } v_p = v_n \\
    1 & \text{if } v_p \ne v_n \text{ and } (v_n, v_p) \in E \\
    1/q & \text{if } v_p \ne v_n \text{ and } (v_n, v_p) \notin E
    \end{cases}    (4.4)

According to this bias factor, node2vec differentiates three types of edges: 1) the return edge, where the potential next vertex is the previous vertex (Figure 4.1a); 2) the out edge, where the potential next vertex is not connected to the previous vertex (Figure 4.1b); and 3) the in edge, where the potential next vertex is connected to the previous vertex (Figure 4.1c). Note that the first-order (or unbiased) random walk can be seen as a special case of the second-order random walk where both the return parameter and the in-out parameter are set to neutral (𝑝 = 1, 𝑞 = 1).

We now turn our attention to weighted networks, where the edge weights are not necessarily zeros or ones. Consider the case where 𝑣𝑛 is connected to 𝑣𝑝, but with a small weight (Figure 4.1d), i.e., (𝑣𝑛, 𝑣𝑝) ∈ 𝐸 and 0 < 𝑤(𝑣𝑛, 𝑣𝑝) ≪ 1. According to the definition of the bias factor, no matter how small 𝑤(𝑣𝑛, 𝑣𝑝) is, (𝑣𝑐, 𝑣𝑛) would always be considered an in edge. Since in this case 𝑣𝑛 and 𝑣𝑝 are barely connected, (𝑣𝑐, 𝑣𝑛) should in fact be considered an out edge. In the extreme case of a fully connected weighted graph, where (𝑣, 𝑣′) ∈ 𝐸 for all 𝑣, 𝑣′ ∈ 𝑉, node2vec completely loses its ability to identify out edges. Thus, node2vec is less effective for weighted networks due to its inability to identify potential out edges where the terminal vertex 𝑣𝑛 is loosely connected to the previous vertex 𝑣𝑝. Next, we propose an extension of node2vec that resolves this issue by taking into account the edge weight 𝑤(𝑣𝑛, 𝑣𝑝) in the bias factor.
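As a concrete illustration, the transition rule of equations (4.3) and (4.4) can be sketched in plain Python. The edge-dictionary data structure and function names below are ours, for illustration only; the actual PecanPy implementation is vectorized and far more efficient.

```python
def node2vec_alpha(prev, nxt, edges, p, q):
    """Bias factor alpha_pq(v_n, v_p) from equation (4.4).

    `edges` is a set of frozensets, one per undirected edge."""
    if nxt == prev:                       # return edge
        return 1.0 / p
    if frozenset((nxt, prev)) in edges:   # in edge: v_n connected to v_p
        return 1.0
    return 1.0 / q                        # out edge


def transition_probs(prev, cur, weights, p, q):
    """Second-order transition distribution P(v_n | v_c, v_p), equation (4.3).

    `weights[(u, v)]` stores the symmetric edge weight w(u, v)."""
    edges = {frozenset(e) for e in weights}
    nbrs = {v for e in weights for v in e if cur in e and v != cur}
    unnorm = {}
    for v in nbrs:
        w = weights.get((cur, v), weights.get((v, cur), 0.0))
        unnorm[v] = node2vec_alpha(prev, v, edges, p, q) * w
    total = sum(unnorm.values())
    return {v: s / total for v, s in unnorm.items()}
```

For instance, with 𝑞 > 1 the out-edge candidates receive less mass than the in-edge candidates, biasing the walk toward the neighborhood of the previous vertex.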
4.2.2 Node2vec+

The main idea of extending node2vec is to identify potential out edges (𝑣𝑐, 𝑣𝑛) ∈ 𝐸 coming from 𝑣𝑝, where 𝑣𝑛 is loosely connected to 𝑣𝑝. Intuitively, we can determine the "looseness" of (𝑣𝑐, 𝑣𝑛) based on some threshold edge weight. However, given that the distribution of edge weights of any given node in the graph is not known a priori, it is hard to come up with a reasonable threshold value for all networks. Instead, we define the looseness of (𝑣𝑐, 𝑣𝑛) based on the edge weight statistics of each node 𝑣:

    \mu(v) = \frac{\sum_{v' \in N(v)} w(v, v')}{|N(v)|}, \quad
    \sigma(v) = \sqrt{\frac{\sum_{v' \in N(v)} \big(w(v, v') - \mu(v)\big)^2}{|N(v)|}}, \quad
    \tilde{w}_\gamma(v, u) = \frac{w(v, u)}{\max\{\mu(v) + \gamma \sigma(v), \epsilon\}}    (4.5)

Formally, we first define w̃𝛾(𝑣, 𝑢), a normalized version of the edge weight 𝑤(𝑣, 𝑢), based on the mean 𝜇(𝑣) and the standard deviation 𝜎(𝑣) of the edge weights connecting 𝑣, as in equation (4.5). In practice, we clip the denominator of w̃𝛾(𝑣, 𝑢) at a small number 𝜖 (1e-6 by default) to prevent division by zero in some cases when 𝛾 is set to be negative. Then, we say 𝑣 ∈ 𝑉 is 𝛾-loosely connected (or simply loosely connected if 𝛾 = 0) to 𝑢 ∈ 𝑉 if w̃𝛾(𝑣, 𝑢) < 1. Intuitively, we would like to treat an edge as being "not connected" if it is "small enough". Finally, an edge (𝑣, 𝑢) is 𝛾-loose if 𝑣 is 𝛾-loosely connected to 𝑢, and otherwise it is 𝛾-tight. Without loss of generality, we consider the case of 𝛾 = 0 in the subsequent sections to simplify the notion of looseness.

Figure 4.1: Illustration of different settings of return and in-out edges. 𝑣𝑝, 𝑣𝑐, and 𝑣𝑛 indicate the previous, current, and next vertices. The solid and dotted lines represent edges with large and small edge weights, respectively. (a-c) are return, out, and in edges considered by node2vec. (d-f) are variations of (c) that arise when taking edge weights into account, which node2vec fails to distinguish from (c).
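The normalization in equation (4.5) can be sketched as follows. This is a minimal illustration over a symmetric edge-weight dictionary (our assumed data structure), not the production implementation.

```python
from math import sqrt


def normalized_weight(v, u, weights, gamma=0.0, eps=1e-6):
    """Normalized edge weight w~_gamma(v, u) from equation (4.5).

    The denominator mu(v) + gamma * sigma(v) is clipped at eps so that
    negative gamma settings cannot cause division by zero."""
    nbr_w = [w for (a, b), w in weights.items() if v in (a, b)]
    mu = sum(nbr_w) / len(nbr_w)    # mean edge weight incident to v
    sigma = sqrt(sum((w - mu) ** 2 for w in nbr_w) / len(nbr_w))
    w_vu = weights.get((v, u), weights.get((u, v), 0.0))
    return w_vu / max(mu + gamma * sigma, eps)
```

An edge with normalized weight below 1 (i.e., below 𝑣's mean edge weight, shifted by 𝛾 standard deviations) is what the text calls 𝛾-loose.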
Based on the definition of looseness of edges, and assuming 𝑣𝑝 ≠ 𝑣𝑛, there are four types of (𝑣𝑐, 𝑣𝑛) edges (see Figure 4.1, (c-f)). Following node2vec, we categorize these edge types into in and out edges. Furthermore, to prevent amplification of noisy connections, we add one more edge type called the noisy edge, which is always suppressed.

Out edge (4.1b, 4.1d) As a direct generalization of node2vec, we consider (𝑣𝑐, 𝑣𝑛) to be an out edge if (𝑣𝑐, 𝑣𝑛) is tight and (𝑣𝑛, 𝑣𝑝) is loose. The in-out parameter 𝑞 then modifies the out edge to differentiate "inward" and "outward" nodes, and subsequently leads to Breadth First Search or Depth First Search like searching strategies [83]. Unlike node2vec, however, we further parameterize the bias factor 𝛼 based on w̃𝛾(𝑣𝑛, 𝑣𝑝). Any choice of monotonic function should work, but we choose to use linear interpolation in this study for simplicity and leave it as future work to explore more sophisticated interpolation functions such as sigmoidal functions. Specifically, for an out edge (𝑣𝑐, 𝑣𝑛), the bias factor is computed as \alpha^\gamma_{pq}(v_p, v_c, v_n) = \frac{1}{q} + \big(1 - \frac{1}{q}\big)\,\tilde{w}_\gamma(v_n, v_p). Thus, the amount of modification to the out edge depends on the level of looseness of (𝑣𝑛, 𝑣𝑝). When 𝑤(𝑣𝑛, 𝑣𝑝) = 0, or equivalently (𝑣𝑛, 𝑣𝑝) ∉ 𝐸, the bias factor for (𝑣𝑐, 𝑣𝑛) is 1/𝑞, the same as that defined in node2vec.

Noisy edge (4.1e) We consider (𝑣𝑐, 𝑣𝑛) to be a noisy edge if both (𝑣𝑐, 𝑣𝑛) and (𝑣𝑛, 𝑣𝑝) are loose. Heuristically, noisy edges are not very informative and thus should be suppressed regardless of the setting of 𝑞 to prevent amplification of noise. Thus, the bias factor for a noisy edge is set to min{1, 1/𝑞}.

In edge (4.1c, 4.1f) Finally, we consider (𝑣𝑐, 𝑣𝑛) to be an in edge if (𝑣𝑛, 𝑣𝑝) is tight, regardless of 𝑤(𝑣𝑐, 𝑣𝑛). The corresponding bias factor is set to neutral, as in node2vec.
Combining the above, the bias factor for node2vec+ is defined as follows:

    \alpha^\gamma_{pq}(v_p, v_c, v_n) =
    \begin{cases}
    1/p & \text{if } v_p = v_n \\
    1 & \text{if } \tilde{w}_\gamma(v_n, v_p) \ge 1 \\
    \min\{1, 1/q\} & \text{if } \tilde{w}_\gamma(v_n, v_p) < 1 \text{ and } \tilde{w}_\gamma(v_c, v_n) < 1 \\
    \frac{1}{q} + \big(1 - \frac{1}{q}\big)\,\tilde{w}_\gamma(v_n, v_p) & \text{if } \tilde{w}_\gamma(v_n, v_p) < 1 \text{ and } \tilde{w}_\gamma(v_c, v_n) \ge 1
    \end{cases}    (4.6)

Note that the last two cases in equation (4.6) include cases where (𝑣𝑛, 𝑣𝑝) ∉ 𝐸. Based on the biased random walk searching strategy using this bias factor, the embedding can be generated accordingly using (4.2). One can verify, by checking equation (4.6), that this is indeed a natural extension of node2vec in the sense that (1) for an unweighted graph, node2vec+ is equivalent to node2vec, and (2) when 𝑝 and 𝑞 are set to 1, node2vec+ recovers a first-order random walk, just as node2vec does. Finally, by design, node2vec+ is able to identify potential out edges that would have been overlooked by node2vec.

4.2.3 Synthetic hierarchical cluster graph construction

In the next section, we first illustrate the effectiveness of node2vec+ using several synthetic graphs. Here, we provide the detailed construction steps for the hierarchical cluster graphs. At a high level, they are created by first sampling in a representation space of the corresponding tree structure and then applying an RBF kernel to obtain pairwise connection scores. In the following, we describe (1) the construction of the tree that represents the hierarchy, (2) the representation of each node in the tree, (3) the construction of the hierarchical cluster graph given the node representations, and (4) the maximal sparsification of the constructed hierarchical cluster graph.

Tree construction We first construct the cluster centroids using a tree structure. A perfect binary tree is a binary tree in which all nodes except the leaf nodes have two children, and all leaf nodes are at the same level.
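Equation (4.6) translates directly into a small case analysis. The sketch below takes the precomputed normalized weights as arguments (an illustrative interface of our choosing, not the PecanPy API):

```python
def node2vecplus_alpha(is_return, wt_np, wt_cn, p, q):
    """Node2vec+ bias factor from equation (4.6), given the precomputed
    normalized weights wt_np = w~(v_n, v_p) and wt_cn = w~(v_c, v_n)."""
    if is_return:                    # v_p == v_n
        return 1.0 / p
    if wt_np >= 1.0:                 # in edge: (v_n, v_p) is tight
        return 1.0
    if wt_cn < 1.0:                  # noisy edge: both edges are loose
        return min(1.0, 1.0 / q)
    # out edge: interpolate between 1/q and 1 by the looseness of (v_n, v_p)
    return 1.0 / q + (1.0 - 1.0 / q) * wt_np
```

On an unweighted graph the normalized weights are exactly 0 or 1, so the function collapses to the node2vec factor of equation (4.4); with 𝑝 = 𝑞 = 1 every case evaluates to 1, recovering a first-order walk.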
This definition generalizes to perfect 𝐾-trees, in which every interior node has 𝐾 children, for 𝐾 ≥ 1. We denote by 𝑇𝐾,𝐿 the perfect 𝐾-tree with maximum level 𝐿. Figure S1 shows the example of 𝑇2,2, a perfect binary tree (or perfect 2-tree) with a maximum level of two.

Representing nodes in the tree A straightforward way to represent the nodes in 𝑇𝐾,𝐿 is one-hot encoding. For a more compact representation, we leave out the indicator for the root node and represent it as all zeros. Thus, the dimension of the indicator array is equal to the total number of nodes in 𝑇𝐾,𝐿 excluding the root node, that is,

    |V(T_{K,L})| - 1 = \sum_{l=1}^{L} K^l    (4.7)

However, if we use plain one-hot encoding, all nodes are equally distanced in the Euclidean space. Instead, we combine the one-hot encoded representations of all the ancestor nodes as the final representation for each node, denoted 𝜙𝑖, 𝑖 = 1, …, |𝑉(𝑇𝐾,𝐿)|. In this way, all sibling nodes are equally distanced, at √2 times the distance from the parent node. Figure S1 shows the example of both the one-hot encoded and the final representations of the grey nodes. Notice the difference between the final representation and the one-hot encoded representation of the grey leaf node.

Hierarchical clusters We draw data points 𝑥 from a Gaussian distribution around each node in the 𝑇𝐾,𝐿 tree:

    x \sim \mathcal{N}(\phi_i, \sigma), \quad i = 1, \ldots, |V(T_{K,L})|    (4.8)

In the case of K3L2, the data points are drawn using 𝑇3,2. The parameter 𝜎, which controls the noisiness of the sampled data points, is set to 0.01 by default. Throughout the study, we fix the number of data points per node in the tree to 30. Finally, we turn the sampled data points into a fully connected weighted graph using the RBF kernel.

Maximal sparsification of K3L2 We apply a global edge threshold to K3L2 by removing all edges below a certain value 𝑡.
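The ancestor-summed representation 𝜙𝑖 described above can be sketched as follows (function name and level-by-level generation scheme are ours, for illustration):

```python
def ktree_representations(K, L):
    """Build the ancestor-summed representations phi_i for a perfect K-tree
    T_{K,L}; the representation dimension is sum_{l=1}^{L} K^l (eq. 4.7).

    Nodes are generated level by level; the root is the all-zeros vector,
    and each child adds its own one-hot indicator to its parent's vector."""
    dim = sum(K ** l for l in range(1, L + 1))
    reps = [[0] * dim]          # root representation: all zeros
    frontier = [reps[0]]
    idx = 0                     # next one-hot coordinate to assign
    for _ in range(L):
        nxt = []
        for parent in frontier:
            for _ in range(K):
                child = list(parent)
                child[idx] = 1  # parent's ancestors + own indicator
                idx += 1
                nxt.append(child)
        reps.extend(nxt)
        frontier = nxt
    return reps
```

Two siblings differ in exactly two coordinates (Euclidean distance √2), while each differs from its parent in one coordinate (distance 1), matching the √2 relation stated above.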
Sweeping over [0.01, 0.9], we find that the maximum global edge threshold that preserves the graph's connectivity is about 0.45 (Figure 4.2). This sparsification drastically reduces the density of the graph from 1.0 (fully connected) down to about 0.1.

Figure 4.2: Network statistics of K3L2 as a function of the sparsifying edge threshold. As the edge threshold increases, the edge density of the graph decreases. Initially, the number of connected components remains one, indicating that the sparsified graph is connected, but the graph becomes disconnected as the edge threshold increases beyond about 0.45.

4.3 Experiments

4.3.1 Synthetic datasets

We start by demonstrating the ability of node2vec+ to identify potential out edges in weighted graphs using a barbell graph and the hierarchical cluster graphs. For simplicity, we fix 𝛾 = 0 for all experiments in this section.

4.3.1.1 Barbell graph

A barbell graph, denoted 𝐵, is constructed by connecting two complete graphs of size 20 with a common bridge node (Figure 4.3a). All edges in 𝐵 are weighted 1. There are three types of nodes in 𝐵: 1) the bridge node; 2) the peripheral nodes that connect the two modules with the bridge node; and 3) the interior nodes of the two modules. By changing the in-out parameter 𝑞, node2vec can place the peripheral nodes closer to either the bridge node or the interior nodes in the embedding space.

Figure 4.3: Barbell graph embedding demonstration. (a) Illustration of the barbell graph, with the three different types of nodes indicated by different marker styles. (b) Embedding of the barbell graph 𝐵 using node2vec. (c-d) Embedding of the noisy barbell graph B̃ using node2vec and node2vec+, respectively. Each of (b-d) contains three different settings of 𝑞: 1, 100, and 0.01.

When 𝑞 is large, node2vec suppresses the out edges, e.g., an edge connecting a peripheral node to the bridge node, coming from an interior node. Consequently, the biased random walks are restricted to the network modules.
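The threshold sweep described above amounts to counting connected components at each candidate threshold. A minimal sketch (our helper names; any graph library would do the same):

```python
def components_after_threshold(weights, t):
    """Number of connected components after removing edges with weight < t."""
    kept = [(u, v) for (u, v), w in weights.items() if w >= t]
    nodes = {v for e in weights for v in e}
    adj = {v: set() for v in nodes}
    for u, v in kept:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), 0
    for s in nodes:
        if s in seen:
            continue
        comps += 1
        stack = [s]
        while stack:            # iterative DFS over the sparsified graph
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            stack.extend(adj[n] - seen)
    return comps


def max_connected_threshold(weights, grid):
    """Largest threshold in `grid` that keeps the graph connected."""
    return max(t for t in grid if components_after_threshold(weights, t) == 1)
```

Applied to K3L2 with a grid over [0.01, 0.9], this procedure would recover the ≈0.45 threshold reported above.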
In this case, the transition from the peripheral nodes to the bridge node becomes less likely compared to a first-order random walk, thus pushing the embeddings of the bridge node and the peripheral nodes away from each other. Conversely, when 𝑞 is small, the transition between the peripheral nodes and the bridge node is encouraged. In this case, the embeddings of the bridge node and the peripheral nodes are pulled together. To see this, we run node2vec with fixed 𝑝 = 1 and three different settings of 𝑞 ∈ {1, 100, 0.01}. Indeed, for 𝑞 = 100, node2vec tightly clusters interior nodes and pushes the bridge node away from the peripheral nodes, and for 𝑞 = 0.01, the peripheral nodes are pushed away from the interior nodes (Figure 4.3b). Since node2vec and node2vec+ are equivalent when the graph is unweighted (see Methods), we omit the visualization of node2vec+ embeddings for 𝐵. Next, we perturb the barbell graph by adding loose edges with edge weights of 0.1, making the graph fully connected. This perturbed barbell graph is denoted B̃. As expected, node2vec fails to make use of the 𝑞 parameter (Figure 4.3c), since none of the edges are identified as out edges. On the other hand, node2vec+ can pick up potential out edges and thus qualitatively recovers the desired outcome (Figure 4.3d). Note that both node2vec and node2vec+ produce similar results for B̃ when 𝑞 = 1. This confirms that node2vec+ and node2vec are equivalent when 𝑝 and 𝑞 are set to neutral, corresponding to embedding with unbiased random walks. Finally, when using non-neutral settings of 𝑞, node2vec+ is able to suppress some noisy edges, resulting in less scattered embeddings of the interior nodes (Figure 4.3d).

4.3.1.2 Hierarchical CLUSTER graph

We use a modified version of the CLUSTER dataset [64] to further demonstrate the advantage of node2vec+ in identifying potential out edges.
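The noisy barbell graph B̃ can be constructed as follows. This is an illustrative sketch: the attachment of exactly one peripheral node per module to the bridge is our assumption from the figure description, and the helper name is hypothetical.

```python
from itertools import combinations


def noisy_barbell(m=20, noise=0.1):
    """Build the perturbed barbell graph: two complete graphs of size m
    joined through a single bridge node, then made fully connected by
    adding loose edges of weight `noise` between all remaining pairs.

    Assumption for illustration: one peripheral node per module attaches
    to the bridge node."""
    left = [f"L{i}" for i in range(m)]
    right = [f"R{i}" for i in range(m)]
    bridge = "B"
    w = {}
    for side in (left, right):
        for u, v in combinations(side, 2):  # complete module, weight 1
            w[(u, v)] = 1.0
    w[(left[0], bridge)] = 1.0              # peripheral-to-bridge edges
    w[(right[0], bridge)] = 1.0
    # additive noise: loose edges make the graph fully connected
    for u, v in combinations(left + right + [bridge], 2):
        w.setdefault((u, v), noise)
    return w
```

Because every node pair now carries a non-zero weight, node2vec sees no out edges at all, which is exactly the failure mode node2vec+ addresses.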
Specifically, the hierarchical cluster graph K3L2 contains 𝐿 = 2 levels (3 including the root level) of clusters, and each parent cluster is associated with 𝐾 = 3 children clusters (Figure 4.4a). There are 30 nodes in each cluster, resulting in a total of 390 nodes. To generate the hierarchical cluster graph, we first generate point clouds via a Gaussian process in a latent space so that the Euclidean distance between two points from two sibling clusters is √2 times the expected Euclidean distance from either of the two points to a point in the parent cluster, which is set to 1. The noisiness of the clusters is controlled by the parameter 𝜎, which is set to 0.1 by default. These data points are then turned into a fully connected weighted graph using an RBF kernel.

Figure 4.4: Hierarchical CLUSTER graph classification task. (a) Illustrations of the K3L2 hierarchical clusters. Left: top-down view of the clusters. Right: adjacency matrix of K3L2; colored brackets indicate the corresponding cluster levels of the nodes. (b) Classification evaluation on K3L2. (c) Classification evaluation on K3L2c45.

We consider two different tasks (Figure 4.4a): (1) cluster classification, identifying the individual cluster identity of each node in the graph, and (2) level classification, identifying the level to which each cluster corresponds. We split the nodes into 10% training and 90% testing and use a multinomial logistic regression model with l2 regularization for prediction. The evaluation process, including the embedding generation, is repeated ten times, and the final results are reported as Macro F1 scores. As shown in Figure 4.4b, the performance of node2vec is not affected by the 𝑞 parameter because the graph is fully connected.
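The sample-then-kernelize construction can be sketched as below. The centroid representations `reps` are taken as input (one vector per tree node, as described in the construction section), and the RBF bandwidth of 1 is our assumption for illustration.

```python
import math
import random


def hierarchical_cluster_graph(reps, n_per=30, sigma=0.1, seed=0):
    """Sample n_per Gaussian points around each centroid (eq. 4.8) and
    connect all points into a fully weighted graph via an RBF kernel
    w(x, y) = exp(-||x - y||^2 / 2); the unit bandwidth is an assumption."""
    rng = random.Random(seed)
    pts, labels = [], []
    for i, phi in enumerate(reps):
        for _ in range(n_per):
            pts.append([c + rng.gauss(0.0, sigma) for c in phi])
            labels.append(i)          # cluster identity of each point
    w = {}
    for a in range(len(pts)):
        for b in range(a + 1, len(pts)):
            d2 = sum((x - y) ** 2 for x, y in zip(pts[a], pts[b]))
            w[(a, b)] = math.exp(-d2 / 2.0)
    return w, labels
```

With the 13 centroids of 𝑇3,2 and 30 points per centroid, this yields the 390-node, fully connected K3L2 graph described above.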
Meanwhile, node2vec+ achieves significantly better performance than node2vec for large 𝑞 settings on both tasks, demonstrating the ability of node2vec+ to identify potential out edges and use this information to perform localized biased random walks. Similar results are observed on several other hierarchical cluster graphs: K3L3, K5L1, and K5L2 (Figure 4.5). On the other hand, one might suspect that the issue with the fully connected graph can be alleviated by sparsifying the graph based on an edge weight threshold. Such an approach is widely adopted as a post-processing step in constructing functional gene interaction networks. Here, we show that even after sparsifying the graph aggressively, node2vec+ still outperforms node2vec. In particular, we sparsify the K3L2 graph using the edge weight threshold 0.45, which is the largest value that keeps the graph connected. We then perform the same evaluation analysis on this sparsified graph, K3L2c45. In this case, node2vec indeed performs significantly better than before the sparsification on both tasks. Nonetheless, node2vec+ achieves even better performance, still out-competing node2vec (Figure 4.4c). Finally, we conduct a fine-grained evaluation analysis, showing that node2vec+ consistently outperforms node2vec under a wide range of conditions, including edge threshold, train-test ratio, and noise level (Figure 4.6).

Figure 4.5: Evaluation of other hierarchical cluster graphs: K3L3, K5L1, and K5L2.

4.3.2 Real-world datasets

Our primary motivation for developing node2vec+ stems from the fact that many functional gene interaction networks are dense and weighted. To systematically evaluate the ability of node2vec+ to embed such biological networks, we consider various challenging gene classification tasks, including gene function and disease gene prediction.
Furthermore, we devise experiments with the previously benchmarked datasets BlogCatalog and Wikipedia [83] and confirm that node2vec+ performs equal to or better than node2vec, depending on whether the network is weighted (Figure 4.7).

Figure 4.6: Fine-grained analysis of K3L2. (a) Changing the sparsification threshold value. (b) Changing the train/test ratio; a larger value means less training data. (c) Changing the noise level during network construction.

Figure 4.7: Multi-label classification benchmarks using BlogCatalog and Wikipedia.

4.3.2.1 Datasets

Human functional gene interaction networks We consider functional gene interaction networks, a broad class of gene interaction networks that are routinely used to capture gene functional relationships.

• STRING [246] is an integrative gene interaction network that combines evidence of protein interactions from various sources, such as text mining and high-throughput experiments.

• HumanBase-global is a tissue-naive version of the HumanBase [80] tissue-specific networks (previously known as GIANT), which are constructed by integrating hundreds of thousands of publicly available gene expression studies, protein-protein interactions, and protein-DNA interactions via a Bayesian approach, calibrated against high-quality known functional gene interactions.

• HumanBaseTop-global is a sparsified version of HumanBase-global that eliminates all edges below the prior of 0.1.

Multi-label gene classification tasks We follow the procedure detailed in [153] to prepare the multi-label gene classification datasets. More specifically, we prepare two collections of gene classification tasks (each called a gene set collection):

• GOBP: Gene function prediction tasks derived from the Biological Process gene sets of the Gene Ontology [49].

• DisGeNET: Disease gene prediction tasks derived from the disease gene sets of the DisGeNET database [206].
After filtering and cleaning up the raw gene set collections, we end up with ∼45 gene function prediction tasks and ∼100 disease gene prediction tasks (Table 4.1). These gene classification tasks are challenging primarily due to the scarcity of labeled examples, with on average 100 and 200 positive examples per task for GOBP and DisGeNET, respectively, relative to the tens of thousands of nodes in the networks. We split the genes into 60% training, 20% validation, and 20% testing according to the level at which they have been studied in the literature (based on the number of PubMed publications associated with each gene). In particular, the top 60% most well-studied genes are used for training; the 20% least-studied genes are used for testing, and the rest are used for validation. For GNNs, we report the test scores at the epoch where the best validation score is achieved.

4.3.2.2 Baseline methods

Node embedding We use node2vec as our primary baseline for node embedding. We exclude several popular node embedding methods, such as DeepWalk [202], LINE [247], and GraRep [40], from our main analysis, as it has been shown previously in various contexts [83, 16, 291] that node2vec performs better in node classification settings.

Graph neural networks We include two popular GNNs, GCN [124] and GraphSAGE [89], in our comparison. Both methods have shown exceptional performance on many node classification tasks, but their performance on the gene classification tasks here remains understudied. For GraphSAGE, we use the full-batch training strategy with mean pooling aggregation, following the Open Graph Benchmark [99].
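The study-level split can be sketched in a few lines. The function name and the gene-to-publication-count mapping are our illustrative assumptions; the actual pipeline follows [153].

```python
def study_level_split(pubmed_counts, train=0.6, valid=0.2):
    """Split genes by study level: the most-published genes go to training,
    the least-published to testing, and the rest to validation.

    `pubmed_counts` maps gene -> number of associated PubMed publications."""
    ranked = sorted(pubmed_counts, key=pubmed_counts.get, reverse=True)
    n = len(ranked)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (ranked[:n_train],                   # 60% most studied
            ranked[n_train:n_train + n_valid],  # middle 20%: validation
            ranked[n_train + n_valid:])         # 20% least studied: test
```

This split deliberately makes the test set harder: predictions must generalize to under-studied genes.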
The basic GNN architecture consists of three main components: (1) the pre-message-passing (pre-mp) layer that maps the initial node features to the hidden dimension, (2) the graph convolution layers, and (3) the post-message-passing (post-mp) layer, or prediction head, that maps the final node embeddings to the prediction values. In addition, the convolution layers have the option to add residual (skip-sum) connections. We initialize the linear (pre-/post-mp) layers using Xavier uniform initialization [76]. Finally, to train the GNNs, we use the standard Adam optimizer [122] with a reduce-learning-rate-on-plateau scheduler, along with dropout and weight decay.

4.3.2.3 Experiment setup

Evaluation metric Following [153], we use log2(auPRC/prior) as our evaluation metric, which represents the log2 fold change of the average precision compared to the prior. This metric is more suitable than other commonly used metrics like auROC, as it corrects for the class imbalance issue that is prevalent in the gene classification tasks here and emphasizes the correctness of top predictions.

Tuning embedding parameters For node2vec and node2vec+, we train a one-vs-rest logistic regression with l2 regularization using the learned embeddings. The embedding parameters, including the dimension, window size, walk length, and number of walks per node, are set to 128, 10, 80, and 10, respectively, by default. We tune the hyperparameters for node2vec (𝑝, 𝑞) and for node2vec+ (𝑝, 𝑞, 𝛾) via grid search using the validation sets. To keep the grid search budgets comparable, we search 𝑝 and 𝑞 over {0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100}² for node2vec (𝑛 = 81); we search 𝑝 and 𝑞 over {0.01, 0.1, 1, 10, 100}², together with 𝛾 ∈ {0, 1, 2}, for node2vec+ (𝑛 = 75).

Tuning GNN parameters For both GNNs, we train one model for each combination of a network and a gene set collection in an end-to-end fashion.
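The two search budgets stated above can be enumerated directly (the grids are taken verbatim from the text; the variable names are ours):

```python
from itertools import product

# Hyperparameter grids from the text: comparable search budgets for the
# two methods (81 combinations for node2vec, 75 for node2vec+).
PQ_N2V = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]
PQ_N2VPLUS = [0.01, 0.1, 1, 10, 100]
GAMMAS = [0, 1, 2]

grid_n2v = list(product(PQ_N2V, PQ_N2V))                      # (p, q) pairs
grid_n2vplus = list(product(PQ_N2VPLUS, PQ_N2VPLUS, GAMMAS))  # (p, q, gamma)
```

Coarsening the 𝑝, 𝑞 grid for node2vec+ is what keeps the extra 𝛾 dimension from inflating its budget (9² = 81 versus 5² × 3 = 75).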
The architectures are fixed to five hidden layers with a hidden dimension of 128. Since the gene interaction networks here do not come with node features, we use constant features for GCN and degree features for GraphSAGE. We use the Adam optimizer [122] to train the GNNs for a maximum of 100,000 epochs. The learning rates are tuned via grid search from 1e-5 to 1e-1 based on the validation performance. The optimal learning rates that result in a decent convergence rate without diverging are 0.01 and 0.0005 for GCN and GraphSAGE, respectively.

Table 4.1: Number of tasks (i.e., gene sets or node classes) for each combination of network and gene set collection. The number in parentheses is the average number of positive examples.

              HumanBase      STRING
GOBP          46 (98.3)      41 (100.0)
DisGeNET      103 (225.9)    97 (221.5)

Figure 4.8: Comparison of different 𝛾 settings in node2vec+. Each dot represents the testing performance (log2(auPRC/prior)) of a specific gene set, with optimally tuned 𝑝 and 𝑞 settings. (a) HumanBase-global. (b) HumanBaseTop-global. (c) STRING.

4.3.2.4 Experimental results

Tuning 𝛾 significantly improves performance for dense graphs The 𝛾 parameter in node2vec+ (Chapter 4.2.2) controls the threshold for distinguishing in edges and out edges. A small or negative valued 𝛾 treats most non-zero edges as out edges. Conversely, a large valued 𝛾 identifies fewer out edges. When the input graph is noisy and dense, assigning a larger 𝛾 (e.g., 1) acts as a stronger denoiser that suppresses spurious out edges. Figure 4.8a compares the gene classification test performance between 𝛾 = 0 and 𝛾 ∈ {1, 2} with optimally tuned 𝑝, 𝑞 using the HumanBase-global network. Higher testing scores are achieved by larger 𝛾 settings, illustrating that, to properly "denoise" the fully connected weighted graph HumanBase-global, we need to increase the noisy edge threshold.
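The trivial node features mentioned above (constant features for GCN, degree features for GraphSAGE) can be built directly from the weighted edge list. This is an illustrative sketch with our own helper name; whether the degree feature is weighted or unweighted in the actual pipeline is an assumption here.

```python
def trivial_node_features(weights, kind="degree"):
    """Build trivial input features for GNNs on a network without node
    attributes: constant features (all ones) or weighted-degree features.

    `weights[(u, v)]` stores the symmetric edge weight w(u, v)."""
    nodes = sorted({v for e in weights for v in e})
    if kind == "constant":
        return {v: [1.0] for v in nodes}
    deg = {v: 0.0 for v in nodes}
    for (u, v), w in weights.items():  # accumulate weighted degree
        deg[u] += w
        deg[v] += w
    return {v: [deg[v]] for v in nodes}
```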
On the contrary, the difference in performance due to the 𝛾 settings is less pronounced for sparser networks like HumanBaseTop-global (Figure 4.8b) and STRING (Figure 4.8c).

GNN methods perform worse than node2vec(+) In all settings, node2vec+ significantly outperforms both GNN methods (Figure 4.9). Notably, for the STRING network, both node2vec and node2vec+ outperform the two GNNs by a large margin. The sub-optimal GNN performance here illustrates that, despite being powerful neural network architectures that can leverage graph structure, GNNs alone cannot learn effectively given a limited number of labeled examples. On the contrary, the embedding processes of node2vec(+) are task-agnostic and can be carried out effectively without labels. These results indicate that gene classification tasks based on gene interaction networks are more effectively solved by unsupervised shallow embedding methods than by GNNs.

Figure 4.9: Gene classification tasks using protein-protein interaction networks. Each panel corresponds to a specific protein-protein interaction network (HumanBase-global, HumanBaseTop-global, and STRING). Each point in a boxplot represents the final test score for a specific task (gene set) in the gene set collection (GOBP or DisGeNET). Starred (*) pairs indicate that the performance between node2vec and node2vec+ is significantly different (Wilcoxon p-value < 0.05).

node2vec+ matches or outperforms node2vec node2vec+ significantly outperforms node2vec (Wilcoxon paired test [274] p-value < 0.05) except for the DisGeNET tasks using the HumanBaseTop-global and STRING networks, in which cases the two methods perform equally well (Figure 4.9). The performance differences are especially pronounced when using the fully connected and noisy HumanBase-global network, demonstrating node2vec+'s ability to learn robust node representations in the presence of noise.
Nevertheless, when the network is less dense (e.g., HumanBaseTop-global), node2vec+ still performs at least as well as node2vec, indicating that node2vec+ is overall a good replacement for node2vec.

4.3.2.5 Tissue-specific functional gene classification

A key feature of functional gene interaction networks constructed from gene expression data is that they capture biological context specificity, such as the tissue-specificity provided by the HumanBase networks. Thus, we further demonstrate the use case of node2vec+ using tissue-specific functional gene classification tasks derived from [297]. After processing, there are 25 tissue-specific functional gene classification tasks, covering 12 different tissues found in the HumanBase database. We follow a similar experimental setup as above, and for each tissue-specific functional gene classification task, we report the following: (1) matched: the prediction performance using the corresponding tissue-specific network; (2) other: the average prediction performance using tissue-specific networks other than the corresponding tissue’s; (3) global: the prediction performance using the tissue-naive network.

Figure 4.10a shows that node2vec+ outperforms node2vec in most scenarios, especially when using the full HumanBase networks. In particular, node2vec+, using the matched tissue-specific full networks for the given functional gene classification tasks, results in significantly better performance than using other (unrelated) tissue-specific networks, as well as the global (tissue-naive) network. On the contrary, node2vec cannot fully utilize the tissue-specific networks, as indicated by the lack of difference in performance between the matched and global networks. We observe similar results using another collection of tissue-specific co-expression networks, GTExCoExp, generated using a benchmarked co-expression network construction workflow by [113] (Figure 4.10b).
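The matched / other / global reporting protocol above can be sketched with a small helper. The score dictionary here is hypothetical example data; `statistics.mean` aggregates the unmatched tissue networks.

```python
import statistics

def tissue_report(task_tissue, scores):
    """Summarize one task's scores into the matched / other / global report.

    scores: hypothetical dict mapping (task_tissue, network) -> test score,
    where `network` is either a tissue-specific network name or "global".
    """
    others = [s for (t, net), s in scores.items()
              if t == task_tissue and net not in (task_tissue, "global")]
    return {"matched": scores[(task_tissue, task_tissue)],
            "other": statistics.mean(others),
            "global": scores[(task_tissue, "global")]}

scores = {("brain", "brain"): 3.2, ("brain", "liver"): 2.1,
          ("brain", "heart"): 2.3, ("brain", "global"): 2.8}
report = tissue_report("brain", scores)
assert report["matched"] == 3.2 and report["global"] == 2.8
assert abs(report["other"] - 2.2) < 1e-9
```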
The main difference is that, in this case, using the global network achieves slightly better performance than using the matched tissue-specific networks.

4.4 Conclusion

We have proposed node2vec+, a natural extension of node2vec that improves the second-order random walk on weighted graphs by accounting for edge weights. We demonstrated that the corresponding node embeddings are improved whenever the in-out walks positively influence the task (meaning that the optimal 𝑞 setting is not 1). In particular, we showed that node2vec+ better identifies potential out edges on weighted graphs than node2vec, using the synthetic barbell graph and hierarchical cluster graph datasets. Evaluations on various challenging gene classification tasks indicated the superiority of node2vec+ over node2vec, and even over more powerful graph neural networks.

Figure 4.10: Tissue-specific functional gene classification performance. Comparison between node2vec and node2vec+ for their tissue-specific GOBP gene annotation capabilities using tissue-specific networks. (a) HumanBase. (b) GTExCoExp.

CHAPTER 5
CONE: CONTEXT-SPECIFIC NETWORK EMBEDDING VIA CONTEXTUALIZED GRAPH ATTENTION

Human gene interaction networks, commonly known as interactomes, encode genes’ functional relationships, which are invaluable knowledge for translational medical research and the mechanistic understanding of complex human diseases. Meanwhile, the advancement of network embedding techniques has inspired recent efforts to identify novel human disease-associated genes using canonical interactome embeddings. However, one pivotal challenge persists: many complex diseases manifest in specific biological contexts, such as tissues or cell types, and many existing interactomes do not encapsulate such information. Here, we propose CONE¹, a versatile approach to generate context-specific embeddings from a context-free interactome.
The core component of CONE consists of a graph attention network with contextual conditioning, and it is trained in a noise-contrastive fashion using contextualized interactome random walks localized around contextual genes. We demonstrate the strong performance of CONE embeddings in identifying disease-associated genes when using known biological contexts associated with the diseases. Furthermore, our approach offers insights into understanding the biological contexts associated with human diseases.

5.1 Introduction

The proper operation of cells depends on the precise coordination and interaction of biological entities, such as genes, RNA, and proteins. As a result, complex human diseases are the ramifications of perturbations to groups of genes that give rise to pathological states [20, 262]. Leveraging this interdependence among biological entities, network-based methods have shown great promise in unveiling human genes’ functions [154] and their associated diseases [129, 287]. Recent approaches achieve this by training machine learning models using network embeddings extracted from the input gene network [291, 154, 264, 263]. However, a crucial limitation remains: many biological network embedding methods do not consider the differences induced by various biological contexts.

¹https://github.com/krishnanlab/cone

The interacting relationships among genes vary across biological contexts, such as tissues, cell types, or disease states. Many human genes operate in a tissue-dependent manner; for example, DMD is preferentially expressed in muscle [63]. The heterogeneity of the specific set of genes expressed in a particular biological context contributes significantly to the phenotypic diversity within an organism, aiding the complex, specialized functions required for its survival [27]. Consequently, the dysfunction of genes ultimately leads to diseases manifesting only in specific tissues.
For example, Mendelian disorders show clear tissue-specific manifestation, and complex diseases have a strong tendency toward tissue selectivity, as seen in neurological disorders and cardiovascular diseases [93]. However, tissue-disease associations are not always straightforward. Apart from the primary affected tissues, diseases may affect seemingly unrelated tissues. One prime example is the high risk of gastrointestinal tract dysfunction observed in patients with Parkinson’s disease, a neurological disorder primarily centered in the brain [90]. The cryptic connections between diseases may be partially explained by shared underlying mechanisms, which can be well characterized by networks [47].

Numerous functional genomics projects generate data comprising diverse types, qualities, and scopes of genes or molecules [189, 245, 121]. To obtain a high-quality and comprehensive network embedding, several methods have been developed to infer a joint network representation by integrating multiple networks [75, 71, 54, 46, 275]. However, a drawback of network integration methods is that the integration process can eliminate the context-specific information in each input network, resulting in a context-naive network. In other words, it may assume the same molecular interactions in the kidney and brain, whereas, in reality, interactions are tissue-specific. To predict a range of tissue-specific gene functions or gene-disease relationships, we must integrate context-specific information into the network. Furthermore, state-of-the-art data integration approaches using graph neural networks [71] may not scale well with the number or size of the networks. Therefore, we require a scalable method that can handle networks of varying sizes and numbers.

Contributions Here, we address the critical need for a versatile and scalable method for generating biological context-specific network embeddings. We summarize our main contributions as follows. 1.
We propose CONE, a versatile contextual network embedding method that takes context definitions in the form of node sets.

2. The proposed method operates a shared graph attention network across all contexts, which is contextualized by conditioning on the raw embeddings. This results in a model that scales practically independently of the number of contexts.

3. Through a series of experiments, we demonstrate the value of injecting various biological contexts to improve disease gene prioritization.

Related work A few studies have explored the idea of contextualizing biological network embeddings using contexts such as tissue or cell type specificity. Notably, OhmNet [297] pioneered tissue-specific gene interaction network embedding by leveraging the hierarchical relationships between different tissue levels and genes. OhmNet learns a multi-layer embedding and operates on the idea that closely related tissues, or layers, should have similar embeddings. However, the original OhmNet method requires a highly specific construction of the hierarchical multi-layer tissue-specific networks, making it hard to readily extend to broader biological contexts. More recently, PINNACLE [140] further expanded the biological contexts into finer-grained definitions based on cell types, using cell-type-expressed genes constructed from the Tabula Sapiens single-cell atlas [50]. However, PINNACLE learns context-specific graph attention modules with independent parameters per context, leading to poor scalability, as each new context requires its own model. CONE, on the other hand, provides a versatile approach that is also scalable with respect to the number of contexts.

5.2 Preliminaries

5.2.1 Network Embedding via Sampling

Let 𝐺 = (𝑉, 𝐸, 𝑤) be a weighted undirected graph with edge weight function 𝑤 : 𝑉 × 𝑉 → R, and denote its corresponding adjacency matrix by M ∈ R^{|𝑉|×|𝑉|}.
A graph embedding method aims to find a mapping 𝑓 : 𝑉 → R^𝑑 that maps each node 𝑣 ∈ 𝑉 to a 𝑑-dimensional embedding space by minimizing the following objective function:

min_{𝑓̂} L_{𝐺,𝑆+,𝑆−}(𝑓̂)    (5.1)

where 𝑆+ and 𝑆− are positive and negative edge sampling functions. In particular, the Singular Value Decomposition (SVD) can be viewed as an instance of equation 5.1, where 𝑆+ and 𝑆− are both uniform samplers over all pairwise entries of the adjacency matrix, and 𝑓 maps to the left (𝑓_𝐿) and right (𝑓_𝑅) embedding representations [1]. The loss function is the squared error between the inner product of the left and right embeddings of the two nodes and the edge weight between them in the graph:

L_SVD = E_{(𝑢,𝑣)∼𝑉×𝑉} (⟨𝑓_𝐿(𝑢), 𝑓_𝑅(𝑣)⟩ − M_{𝑢,𝑣})²    (5.2)

Random Walk Sampling Random walks on graphs have been studied extensively, with many applications spanning social network analysis, information retrieval, and so on. In our framework, the random walk procedure can be seen as a node-pair sampling function. For instance, node2vec [83] with negative sampling can be reformulated in a noise-contrastive fashion [85] as:

L_RW = −E_{(𝑢,𝑣)∼𝑆+(𝐺)} log(𝜎(⟨𝑓(𝑢), 𝑓(𝑣)⟩)) − 𝑘 E_{(𝑢,𝑣)∼𝑆−(𝐺)} log(1 − 𝜎(⟨𝑓(𝑢), 𝑓(𝑣)⟩))    (5.3)

where 𝜎 is the sigmoid function, 𝑘 is the number of negative samples, and the positive sampling is achieved by a sliding window over a second-order biased random walk [83]. We refer to the above as the random walk loss.

5.2.2 Graph Attention Neural Network (GAT)

A graph neural network (GNN) is a special type of neural network architecture that operates on the underlying graph structure. It does so by iteratively aggregating information from each node’s neighborhood and transforming the aggregated representations [279, 295].
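The random walk loss of Equation 5.3 can be sketched in NumPy. This is a toy version that takes pre-sampled node pairs; in the actual pipeline, positives come from sliding windows over second-order biased random walks and negatives from a noise distribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def random_walk_loss(emb, pos_pairs, neg_pairs, k=1):
    """Noise-contrastive random walk loss (cf. Equation 5.3).

    emb:       (|V|, d) embedding matrix f
    pos_pairs: node pairs from the positive sampler S+ (walk co-occurrences)
    neg_pairs: node pairs from the negative sampler S-
    k:         weight of the negative term (number of negative samples)
    """
    pos = np.array([emb[u] @ emb[v] for u, v in pos_pairs])
    neg = np.array([emb[u] @ emb[v] for u, v in neg_pairs])
    return -np.mean(np.log(sigmoid(pos))) - k * np.mean(np.log(1.0 - sigmoid(neg)))

rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(4, 8))
loss = random_walk_loss(emb, [(0, 1), (1, 2)], [(0, 3), (2, 3)])
assert np.isfinite(loss) and loss > 0.0   # both loss terms are strictly positive
```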
In particular, GAT [258, 32] uses an attention mechanism to weight each node’s neighborhood for aggregation, and the (pre-activation and pre-normalization) layer update rule is written as follows:

ℎ′(𝑢) = Σ_{𝑣∈N(𝑢)} 𝛼_{𝑢,𝑣} Wℎ(𝑣)    (5.4)

where 𝛼_{𝑢,𝑣} is the attention score and W ∈ R^{𝑑×𝑑} is a learnable linear transformation. In practice, we use the v2 corrected attention proposed in [32]:

𝛼_{𝑢,𝑣} = softmax_𝑣 (𝑎⊤ LeakyReLU(W[ℎ(𝑢)||ℎ(𝑣)]))    (5.5)

5.3 Methods

We are interested in learning a collection of network embeddings, each specific to a biological context. For example, we can use heart-specific gene embeddings to unravel more tissue-specific genes related to cardiovascular diseases. Contextualizing gene embeddings to biological contexts this way allows us to unveil nuanced relationships between diseases and biological contexts, such as tissues, cell types, and other diseases or traits.

The full pipeline of our approach is depicted in Figure 5.1. At a high level, CONE contains two main components: (1) a GNN decoder and (2) an MLP context encoder. The GNN decoder converts the raw, learnable node embeddings into the final embeddings. The MLP context encoder, on the other hand, projects the context similarity profile that describes the relationships among different contexts (Chapter 5.3.2) into a condition embedding. When added to the raw embeddings, the condition embedding provides high-level contextual semantics, similar to the widely used positional encodings in Transformer models [255]. The embeddings are trained using losses based on random walks on the context-specific subgraphs. We employ a straightforward approach to define a context-specific subgraph as the subgraph induced by the genes relevant to that context. Next, we formally describe our approach.

5.3.1 Contextualized network embeddings

Let C = {𝐶_𝑖}_{𝑖=1,...,𝑛_𝑐} be a collection of 𝑛_𝑐 contexts, where each context 𝐶_𝑖 ⊂ 𝑉 is a subset of nodes that defines the local context.
We aim to learn a collection of embedding functions F = {𝑓_𝐶} by minimizing the loss function

L_tot = L_RW(𝑓_0) + E_{𝐶∼C} L_RW^𝐶(𝑓_𝐶)    (5.6)

where 𝑓_0 is the context-naive embedding that is optimized against the whole network 𝐺, and 𝑓_𝐶 is the context-specific embedding that is optimized against the contextual random walk loss L_RW^𝐶, which samples random walks on the subgraph induced by the context set 𝐶, that is, 𝐺(𝐶) = (𝐶, {(𝑢, 𝑣) ∈ 𝐸 : 𝑢, 𝑣 ∈ 𝐶}, 𝑤). Equation 5.6 aims to simultaneously optimize the global and local contextualized representations of all the nodes 𝑣 in the network.

A naive attempt at obtaining the contextualized embeddings in equation 5.6 would be to learn an independent 𝑓_𝐶 on each corresponding contextual graph 𝐺(𝐶). However, the resulting context-specific embeddings may completely lose the global information of the graph, since each 𝑓_𝐶 operates independently. We provide empirical evidence for this in Chapter 5.5.6.3.

5.3.2 Contextualized GAT

To address the above-mentioned problem, we propose to learn a shared embedding encoding model using GAT, and to contextualize different embeddings by conditioning on the raw embedding matrix. Let 𝑔_𝜃 : R^{|𝑉|×𝑑} → R^{|𝑉|×𝑑} be a GAT network parameterized by 𝜃, and Z ∈ R^{|𝑉|×𝑑} the raw embedding matrix, which is randomly initialized. Drawing parallels from recent work on conditional generation [220], we view context-specific embeddings as generation conditioned on a specific context, and propose to compute the contextualized embedding 𝑓_𝐶 as

𝑓_𝐶^CONE = 𝑔_𝜃(Z + 𝜙(𝐶))    (5.7)

where 𝜙(𝐶) ∈ R^{1×𝑑} is the context condition embedding that defines the context 𝐶. The context-naive embeddings are computed by passing the raw embedding alone through the GAT encoder: 𝑓_0^CONE = 𝑔_𝜃(Z). Finally, to form the full context-specific embedding for downstream evaluation, we concatenate it with the context-naive embedding and then project it down to 𝑑 dimensions via PCA.
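The conditioning in Equation 5.7 is the heart of CONE’s weight sharing; a toy sketch follows, with a fixed random linear-plus-tanh stub standing in for the shared GAT encoder 𝑔_𝜃 and pre-computed vectors standing in for the MLP’s condition embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_ctx, d = 6, 3, 4

Z = rng.normal(size=(n_nodes, d))        # raw, learnable embeddings
G_weight = rng.normal(size=(d, d))       # stub parameters shared by all contexts

def g(X):
    """Stand-in for the shared GAT encoder g_theta (same weights everywhere)."""
    return np.tanh(X @ G_weight)

cond = rng.normal(size=(n_ctx, d))       # stand-in for the MLP condition embeddings

def phi(c):
    return cond[c][None, :]              # broadcast one context vector to all nodes

f_naive = g(Z)                                   # f_0^CONE = g_theta(Z)
f_ctx = [g(Z + phi(c)) for c in range(n_ctx)]    # f_C^CONE = g_theta(Z + phi(C))

assert f_naive.shape == (n_nodes, d)
assert all(f.shape == (n_nodes, d) for f in f_ctx)
assert not np.allclose(f_naive, f_ctx[0])        # conditioning shifts the output
```

Because only the additive condition 𝜙(𝐶) changes between contexts, adding a context costs one extra 𝑑-dimensional vector rather than a new encoder.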
Context condition embedding The context condition embedding serves to provide high-level semantics about each context, and two contexts with highly overlapping sets of nodes should have similar condition embeddings. To that end, we encode the condition embeddings using the context similarity matrix J ∈ R^{𝑛_𝑐×𝑛_𝑐}, constructed by taking the Jaccard index between all pairwise contexts, J_{𝑖,𝑗} = |𝐶_𝑖 ∩ 𝐶_𝑗| / |𝐶_𝑖 ∪ 𝐶_𝑗|. Finally, we use a two-layer Multi-Layer Perceptron (MLP) to project J into the condition embeddings; thus, for each context 𝐶_𝑖, its corresponding condition embedding is computed as 𝜙(𝐶_𝑖) = MLP(J)_{[𝑖,:]}.

Figure 5.1: Overview of CONE embedding collection training and inference.

5.3.3 Training CONE

The loss function defined in equation 5.6 is implemented in practice by alternating between the context-naive random walk loss L_RW(𝑓_0) and the context-specific random walk loss L_RW(𝑓_𝐶) for a randomly drawn context. We train the model for 120 epochs using the AdamW [158] optimizer with a constant learning rate of 0.001 and a weight decay of 0.01. We optimize CONE’s hyperparameters using the main DisGeNET benchmark (Chapter 6.3) based on the averaged validation APOP scores of the context-naive CONE embeddings. The hyperparameter grid is shown in Table 5.1, where the final hyperparameter settings are bolded.

5.3.4 Complexity analysis

As the main module of CONE, GAT has a computational complexity of O(|𝑉|𝑑² + |𝐸|𝑑) [258, 32]. In addition, the context condition embedding encoder 𝜙 scales linearly with respect to the number of contexts, as O(𝑛_𝑐𝑑).
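The context similarity profile J from Chapter 5.3.2 is cheap to compute, consistent with the O(𝑛_𝑐𝑑) cost of 𝜙; a sketch of its construction follows (the subsequent two-layer MLP projection is omitted, and the contexts are toy node sets):

```python
import numpy as np

def jaccard_matrix(contexts):
    """Context similarity profile J with J[i, j] = |C_i ∩ C_j| / |C_i ∪ C_j|."""
    n = len(contexts)
    J = np.zeros((n, n))
    for i, ci in enumerate(contexts):
        for j, cj in enumerate(contexts):
            J[i, j] = len(ci & cj) / len(ci | cj)
    return J

contexts = [{"a", "b", "c"}, {"b", "c", "d"}, {"x", "y"}]
J = jaccard_matrix(contexts)
assert np.allclose(np.diag(J), 1.0)   # every context fully overlaps itself
assert np.isclose(J[0, 1], 0.5)       # |{b, c}| / |{a, b, c, d}| = 2 / 4
assert J[0, 2] == 0.0                 # disjoint contexts share nothing
```

Row 𝑖 of MLP(J) then serves as 𝜙(𝐶_𝑖), so contexts with similar overlap profiles receive similar condition embeddings.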
However, in practice, since 𝑛_𝑐 ≪ |𝐸|, the computational complexity of CONE should be equivalent to that of a single GAT network. This effectively constant scaling with respect to the number of contexts is in stark contrast with the recently proposed method PINNACLE, which scales linearly with the number of contexts due to its implementation of an independent GAT module for each context. We provide empirical evidence for the scalability of CONE in Appendix 5.5.6.1.

Table 5.1: Hyperparameter search grid. Final settings are bolded.

Architecture
  Number of layers: [1, 2, 3]
  Number of heads: [1, 2, 3, 4, 5, 6]
Optimization
  Learning rate: [0.01, 0.005, 0.001, 0.0005, 0.0001]
  Weight decay: [1, 0.5, 0.1, 0.05, 0.001, 0.0005, 0.0001]
Random walk
  Walk length: [40, 80, 120, 160, 200]
  Walks per node: [1, 5, 10, 15, 20]
  Window size: [5, 10, 15, 20, 25]

5.4 Experiment setup

We devise diverse biomedical tasks to evaluate the capability of CONE, against baseline methods, to prioritize genes in the gene interaction network. These are multi-label classification tasks, where the goal is to identify human genes related to certain diseases using the gene network embeddings generated by the models. We conduct our main analysis using the PINPPI network, a combination of the BioGRID [240], Menche [172], and HuRI [160] networks provided by the PINNACLE [140] paper. Specifically, we obtain the raw PINPPI network from the PINNACLE paper², which contains 15,461 nodes and 207,641 edges. We then convert the node IDs from gene symbols to Entrez IDs [163] using the MyGeneInfo query service [277].
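The symbol-to-Entrez conversion step can be sketched as follows, using a small hypothetical mapping table in place of live MyGeneInfo queries; as described next, only exact one-to-one mappings are kept, so a symbol with several candidate IDs, or an ID claimed by several symbols, is dropped:

```python
from collections import Counter

def one_to_one(candidates):
    """Keep only exact one-to-one symbol -> Entrez ID mappings.

    candidates: symbol -> list of candidate Entrez IDs (a hypothetical,
    pre-fetched stand-in for MyGeneInfo query results).
    """
    # Drop symbols mapping to several IDs (one-to-many).
    single = {s: ids[0] for s, ids in candidates.items() if len(ids) == 1}
    # Drop IDs claimed by several symbols (many-to-one).
    hits = Counter(single.values())
    return {s: e for s, e in single.items() if hits[e] == 1}

candidates = {"TP53": ["7157"],                  # clean one-to-one: kept
              "MT1": ["4489", "4501"],           # one-to-many: dropped
              "DUPA": ["999"], "DUPB": ["999"]}  # many-to-one: both dropped
assert one_to_one(candidates) == {"TP53": "7157"}
```

The symbols and IDs above (other than TP53/7157) are illustrative placeholders, not claims about the real mapping.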
We only preserve genes that have an exact one-to-one mapping from gene symbol to Entrez ID. After the above conversion, the final processed Entrez-based PINPPI network contains 15,229 nodes and 206,835 edges. For gene label information, we collect the two therapeutic target tasks (RA and IBD) from PINNACLE. Furthermore, we compile a comprehensive collection of disease-gene annotations from DisGeNET [206], following the processing steps detailed in Chapter 2.2.3. After filtering out diseases with fewer than ten positive genes intersecting the PINPPI network, the final DisGeNET benchmark contains 167 diverse human diseases. For each DisGeNET disease gene prioritization task, we randomly split the positive and negative genes into 6/2/2 train/validation/test sets. The final prediction performance is reported as the average test score across five different random splits. For RA and IBD, we use the pre-defined train-test splits given by PINNACLE. Detailed dataset statistics are provided in Table 5.2.

²https://figshare.com/articles/software/PINNACLE/22708126

Table 5.2: Gene set statistics. The first three gene set collections are used as prediction tasks: DisGeNET [206] (Disease, 167 gene sets), IBD [140, 187] (Therapeutic Target, 1 gene set), and RA [140, 187] (Therapeutic Target, 1 gene set). The remaining three are used as contexts: GTEx [157] (Tissue, 30 gene sets), Tabula Sapiens [50, 140] (Cell type, 156 gene sets), and CREEDS [271] (Disease, 333 gene sets). Gene set size summary statistics (Min., lower quart., med., upper quart., Max.): 10 13 430 1002 204 1122.5 2012 535 21 151 113 1593.5 2751 566 39 189 3091 3051 998 8604 3425 4452.

5.4.1 Baseline methods

node2vec is a strong baseline for network embedding-based gene prioritization, with superior performance on various benchmarks [16]. We also include embeddings generated by a two-layer GAT (v2) network [258, 32], trained in a standard graph autoencoder style [123], as a more direct baseline against CONE.
Moreover, BIONIC [71] and Gemini [275] are two recent approaches that learn an integrated embedding across a collection of networks. We use them to test whether embedding multiple context-specific subgraphs together gives an advantage over embedding a single context-naive network. All baselines and the CONE embeddings are evaluated in an unsupervised setting, where an ℓ2-regularized logistic regression model is trained for each task using embeddings that are learned without access to any label information.

For context-specific network embeddings, we consider the recently proposed method PINNACLE [140], which learns a separate GAT module for each context. We directly use the context-specific embeddings provided by the paper³ to reanalyze the performance under our fair setting. We point out that the PINNACLE context-specific embeddings differ slightly from CONE’s in that PINNACLE only generates embeddings for the context-specific nodes. Conversely, CONE generates embeddings for all nodes, regardless of whether they are specific to the context. This enables us to evaluate all context-specific embeddings fairly across diverse disease gene prioritization tasks. Due to this limitation of PINNACLE, we exclude it from the primary DisGeNET gene prioritization benchmark. We set the embedding dimension to 128 for all models.

³https://figshare.com/articles/software/PINNACLE/22708126

5.4.2 Context-specific gene classification tasks

We primarily consider tissue-specificity for contextualizing the network embeddings. This leads to a natural understanding of the downstream disease gene prediction performance based on the association between tissues (contexts) and diseases (tasks). One widely adopted way of defining tissue- or cell-type-specific genes is by differential gene expression [248], where genes that are expressed significantly more in one context than in others are considered relevant to the given context.
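The differential-expression definition of context genes can be sketched as follows, with a toy expression matrix and the z-score cutoff of one used for the GTEx tissues:

```python
import numpy as np

def context_genes(expr, z_thresh=1.0):
    """Differential-expression contexts: one gene set per tissue.

    expr: (n_genes, n_tissues) toy expression matrix.  A gene joins tissue
    t's context when its z-score across tissues exceeds z_thresh (the GTEx
    processing in this chapter uses a cutoff of one).
    """
    z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
    return [set(np.flatnonzero(z[:, t] > z_thresh)) for t in range(expr.shape[1])]

expr = np.array([[10.0, 1.0, 1.0, 1.0],   # gene 0: high in tissue 0
                 [1.0, 1.0, 1.0, 10.0],   # gene 1: high in tissue 3
                 [4.0, 5.0, 6.0, 5.0]])   # gene 2: mildly high in tissue 2
assert context_genes(expr) == [{0}, set(), {2}, {1}]
```

Each resulting node set can then be passed directly to CONE as a context definition.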
Following this, we first obtain tissue-specific gene expression from the GTEx project [157], and then extract tissue-specific genes by taking genes with z-scores greater than one across tissues. In addition to tissue-specific genes, we also showcase CONE using other biological contexts, including cell types and diseases.

5.4.3 Evaluation metrics

We use the log2 fold-change of the average precision over the prior (APOP) as the main metric for evaluating disease gene prioritization performance [154]. The area under the precision-recall curve, which is closely related to the average precision, is a better choice for evaluating tasks with severe class imbalance [226]. The division by the prior, on the other hand, corrects for the expected performance of a random guesser on tasks with different numbers of positive examples. For the RA and IBD therapeutic target prediction tasks, we follow PINNACLE and report the average precision and recall at rank five (APR@5) in addition to APOP.

Figure 5.2: DisGeNET disease gene prediction performance comparison between node2vec and CONE embeddings. Each point in the box plot corresponds to the prediction test performance of a disease, averaged across five random splits. Different panels show groups of diseases with different numbers of positive genes. For example, the left-most panel contains 31 diseases with at least 10 but fewer than 13 positive genes. ns, *, and ** indicate the significance level of the paired Wilcoxon test between the baseline node2vec and CONE (ns: not significant, *: 0.01 < p-value ≤ 0.05, **: p-value < 0.01).

5.5 Results and discussions

5.5.1 Context-specific embedding improves disease gene prediction performance

We first observe that, overall, picking the best context for each disease achieves a noticeable performance improvement over the context-naive CONE embeddings, as indicated by the good performance of CONE (best) in Figure 5.2.
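The APOP metric used throughout these evaluations can be sketched in NumPy; `average_precision` here is a minimal re-implementation (no tie handling), and the prior is the positive fraction, i.e., the expected precision of a random ranking:

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of precision@k evaluated at the rank of each positive (no ties)."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(y_true)[order]
    precision_at_k = np.cumsum(y) / np.arange(1, len(y) + 1)
    return float(np.sum(precision_at_k * y) / y.sum())

def apop(y_true, scores):
    """log2 fold-change of the average precision over the prior (APOP)."""
    prior = float(np.mean(y_true))   # expected precision of a random ranking
    return float(np.log2(average_precision(y_true, scores) / prior))

y = [1, 1, 0, 0, 0, 0, 0, 0]
perfect = [8, 7, 6, 5, 4, 3, 2, 1]        # ranks both positives at the top
assert average_precision(y, perfect) == 1.0
assert np.isclose(apop(y, perfect), 2.0)  # log2(1.0 / 0.25)
```

An APOP of 0 thus corresponds to random-guess performance, and each additional unit corresponds to a doubling of the average precision over that baseline.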
Moreover, the advantage of using context-specific embeddings is most apparent when the number of positive genes available for a disease is small. This might be attributable to the fact that diseases with more associated genes are more likely to contain ubiquitous, less context-specific genes, consequently reducing the effectiveness of context-specific embeddings. In all cases, we note that either the context-naive or the context-specific CONE embedding consistently matches the performance achieved by the node2vec baseline.

5.5.2 CONE infers biologically relevant contexts for human diseases

Despite the performance improvement due to the CONE contextualized embeddings, it is still unclear whether the biological contexts that led to good performance on a particular disease are associated with that disease. To address this question, we manually inspect six diseases where the connection between tissue and disease manifestation appears readily evident, as shown in Table 5.3. We hypothesize that the CONE embeddings derived from the disease-related tissue should have a higher APOP than the context-naive CONE or node2vec embeddings. Indeed, many of the top contexts found by CONE are biologically meaningful, whether as one of the main affected tissues or as a related tissue. For example, the top-performing contexts for subvalvular aortic stenosis and familial bicuspid aortic valve, both diseases in which there is a problem with the aortic valve (the valve between the heart and the aorta), included the artery context for subvalvular aortic stenosis and the heart context for familial bicuspid aortic valve.
More subtle but biologically relevant top-performing contexts are small intestine and adipose for hypochromic microcytic anemia. The cause of hypochromic microcytic anemia is typically decreased iron reserves in the body, which may be due to decreased dietary iron, poor absorption of iron from the gut, or acute and chronic blood loss [44]. Iron absorption is primarily carried out by cells in the small intestine [209], explaining why it would be a top context for anemia. Obesity, an excessive accumulation of adipose tissue, has also been molecularly linked to iron deficiency in a way that shows the two conditions mutually affect one another [3]. Also of note is that the primary tissue affected by pure red-cell aplasia, blood, was not identified in the top three contexts. However, patients with hepatitis, an inflammation of the liver, sometimes develop pure red-cell aplasia [145, 227]; CONE did manage to highlight this cryptic association between the liver and pure red-cell aplasia. Together, these results imply that CONE can help uncover subtle disease-tissue relationships. Thus, CONE contextualized embeddings not only achieve good prediction performance, but their top-performing contexts also typically show biological relevance.

5.5.3 Context-specific embedding enhances therapeutic target prediction

Inducing biological context information has recently been shown to be beneficial for predicting therapeutic targets in complex diseases such as rheumatoid arthritis (RA) and inflammatory bowel disease (IBD) [140]. Therapeutic target prediction for a particular disease is a binary classification problem that aims to predict whether a particular human gene can be used as a point of intervention for treating that disease.

Table 5.3: Top-performing contexts for selected diseases in the DisGeNET benchmark. Performance is reported as test APOP scores averaged across five random splits. The top contexts are sorted in descending order from left to right.
For example, the Heart context achieved the highest score for Nemaline myopathy.

Task | node2vec | CONE (naive) | CONE (best) | Top contexts
Familial hypertrophic cardiomyopathy | 3.1481 | 2.6405 | 2.9609 | Pancreas, Stomach, Muscle
Nemaline myopathy | 4.9395 | 4.7192 | 5.1911 | Heart, Stomach, Minor Salivary Gland
Subvalvular aortic stenosis | 4.1582 | 3.6587 | 3.9863 | Colon, Artery, Minor Salivary Gland
Hypochromic microcytic anemia | 2.5056 | 1.3827 | 2.8930 | Adipose, Skin, Small Intestine
Familial bicuspid aortic valve | 1.2164 | 3.4816 | 4.1857 | Pancreas, Stomach, Heart
Pure red-cell aplasia | 6.0797 | 5.9410 | 6.1684 | Stomach, Pancreas, Liver

Compared to CONE, PINNACLE [140] takes an alternative approach by constructing a heterogeneous network that introduces different biological contexts as a type of node in the heterogeneous biomedical graph. Furthermore, the PINNACLE model learns an individual set of parameters for each biological context, contrasting with our unified model, which shares the same set of parameters across all biological contexts. We hypothesize that our approach of tying weights induces more regularity from the underlying graph and, as a result, produces better-contextualized embeddings for predicting therapeutic targets.

To test this, we first use the cell-type-specific genes processed by PINNACLE to generate cell-type-specific CONE embeddings. We then follow the original evaluation and measure different contextualized embeddings’ performance using APR@5. We note that the PINNACLE context-specific embeddings only contain embeddings of genes within the context. Conversely, CONE context-specific embeddings are genome-wide, meaning that the embeddings in any context contain the same number of genes, spanning the whole network. Thus, to unify the setting between PINNACLE and CONE, we subset each context-specific CONE embedding to the corresponding context genes.
As shown in Figure 5.3, our CONE embeddings achieve significantly better performance than the PINNACLE embeddings when used in an unsupervised embedding setting, in which the training of the embeddings does not involve node label information for the downstream prediction tasks. Furthermore, the highlighted immune cell contexts show that CONE better prioritizes the relevant cell contexts of IBD and RA, both of which are autoimmune diseases resulting from the malfunction of immune cells. Among the top-performing cell contexts in RA, CONE achieves better performance than PINNACLE. Specifically, CONE reveals contexts that are biologically related to RA. For example, pancreatic acinar cells (rank of APOP=2, rank of APR@5=1) secrete digestive enzymes that are involved in the digestion process within the small intestine. The early activation of these digestive enzymes before they reach the duodenum can trigger the onset of acute pancreatitis [139].

Table 5.4: Top four performing contexts for predicting RA and IBD therapeutic targets. The first block of rows shows the top four cell types with the highest APOP scores for predicting RA and IBD using PINNACLE embeddings; the second block shows those for the CONE embeddings.

Rheumatoid arthritis (RA):
PINNACLE — Large intestine goblet cell (APOP 1.734, APR@5 0.333); Intrahepatic cholangiocyte (1.634, 0.200); Retinal ganglion cell (1.537, 0.200); Lung ciliated cell (1.525, 0.040)
CONE — Retinal ganglion cell (3.186, 0.760); Pancreatic acinar cell (3.158, 1.000); Pancreatic alpha cell (2.855, 0.383); Club cell of prostate epithelium (2.710, 0.383)

Inflammatory bowel disease (IBD):
PINNACLE — Pulmonary ionocyte (APOP 3.145, APR@5 0.333); Pancreatic acinar cell (3.051, 0.250); Lymphatic endothelial cell (2.709, 0.333); Muscle cell (2.672, 0.000)
CONE — Duct epithelial cell (5.977, 1.000); Luminal cell of prostate epithelium (5.954, 1.000); Tracheal goblet cell (5.883, 1.000); Vascular associated smooth muscle cell (4.691, 0.500)
On the other hand, acute pancreatitis is highly associated with RA: clinical studies have shown that RA patients were 2.51 times more likely to develop acute pancreatitis [9]. CONE is thus also able to reveal hidden associations based on cell-type-specific networks. Similarly, CONE performs better than PINNACLE in identifying the top relevant cell contexts related to IBD. CONE picked up duct epithelial cells (rank of APOP=1, rank of APR@5=1) as the top cell context. These cells are integral to the intestinal barrier, serving as the first line of defense against invading microorganisms [219]. However, in IBD patients, the proper function of the intestinal barrier is frequently compromised to varying degrees [12]. Overall, these examples demonstrate the superior power of CONE in predicting therapeutic targets using biologically relevant cell contexts.
5.5.4 CONE leverages diverse biological contexts beyond tissue and cell type
Since CONE takes contextual information in the form of node sets, it can be extended to biological contexts beyond the traditionally used tissue and cell-type contexts [297, 140]. Here, we explore the extensibility of our CONE approach to different biological contexts in two other ways. First, we reevaluate the RA and IBD prediction performance using CONE trained on different disease contexts defined by differentially expressed genes obtained from CREEDS [271].

Figure 5.3: Therapeutic area predictions for RA and IBD. Each point represents the APR@5 score achieved when using a particular cell-type-specific embedding to predict the RA or IBD therapeutic targets. Immune cell contexts are highlighted in orange and pink for CONE and PINNACLE.

Figure 5.4: Sorted disease contexts for the therapeutic area predictions for RA and IBD.

We find that the top-ranked contexts for both RA and IBD are indeed highly relevant disorders (Figure 5.4). For example, psoriasis is one of the top disease contexts related to RA (rank of APOP=3).
A clear connection between these two conditions is psoriatic arthritis, a form of arthritis accompanied by a skin rash, which is a common symptom of psoriasis [284]. This indicates that similar genetic programs are shared by both diseases [194], which CONE reveals using disease context networks. Furthermore, CONE also reveals several connections between RA and some seemingly unrelated diseases, such as heart failure (rank of APOP=115). Notably, a recent study has confirmed that RA patients have a two-fold higher risk of heart failure mortality than those without RA [193]. Similarly, CONE unveils meaningful relationships between IBD and other disease contexts. Cystitis (rank of APOP=3) is one of the top disease contexts identified by CONE. Clinical studies have shown that cystitis, an inflammation of the bladder, leads to a significant increase in the risk of IBD [91, 48, 4].
For example, individuals with interstitial cystitis, a condition involving an inflamed or irritated bladder wall, are 100 times more likely to have IBD [4]. CONE also finds subtle relationships between IBD and neurological complications, exemplified by epilepsy syndrome (rank of APOP=114) and autism spectrum disorder (rank of APOP=112) in the top list [42]. Neurological complications affect 0.25% to 47.50% of IBD patients [69]. These IBD-related neurological complications are associated with neuroinflammation or an increased risk of blood clots in brain veins [177]. Some diseases may have a protective effect on other diseases. CONE identified such a protective relationship between Helicobacter pylori gastrointestinal tract infection (rank of APOP=2) and IBD, since H. pylori infection helps protect against IBD by inducing systemic immune tolerance and suppressing inflammatory responses [290]. Overall, these examples further confirm that CONE can leverage disease contexts to reveal both apparent and cryptic associations between complex diseases. Finally, we rerun the DisGeNET benchmark using a diverse collection of context-specific gene sets, spanning tissues, cell types, and diseases, retrieved from various databases, including CellMarker2.0 [96], the Human Protein Atlas [250], and the TISSUES database [191]. We observe that CONE performs similarly under all collections of contexts tested (Figure 5.5). Together with the fact that CONE performs competitively against baselines and captures biologically meaningful contextual information, we believe that CONE is a versatile and effective approach to scalably generate biological network embeddings conditioned on specific biological contexts.

Figure 5.5: Performance of CONE on the DisGeNET benchmark across context collections.
5.5.5 CONE enables knowledge transfer to unseen contexts
The MLP context encoder used by CONE makes it possible to generate embeddings for contexts that are not observed during training. This can be done by feeding the similarity profile of the new query context against all training contexts to the MLP encoder. To demonstrate the effectiveness of transferring to unseen contexts, we retrain the CONE GTEx tissue-specific embeddings but leave out the Heart context during training. We observe that holding out the Heart context does not significantly affect the disease gene classification performance, with a correlation coefficient > 0.8 against the original performance. Furthermore, we compile two lists of diseases for which the Heart context achieved a top-five performance, for the original and the held-out-Heart versions of CONE. We found a significant overlap between the two lists (hypergeometric p-value < 0.05), with seven common diseases (Table 5.5). One notable example is the familial bicuspid aortic valve, a common congenital heart defect [29]. These results highlight the effective transferability of CONE to unseen contexts, with embedding quality similar to that of contexts seen during training.
5.5.6 Additional experiments
5.5.6.1 Scalability
We empirically demonstrate the scalability of CONE against three other related methods: GAT, BIONIC, and PINNACLE. All these methods use the GAT module as the main encoding component. CONE and GAT both employ a single GAT module, while BIONIC and
Table 5.5: List of diseases for which the Heart context appears among the top five performing contexts in terms of test APOP scores.
Last two columns indicate whether the disease shows up as top five context for the original CONE or the one trained without Heart context. Disease ID Disease Name Original Holdout Heart agammaglobulinemia asbestosis cheilitis chromophobe renal cell carcinoma dysgammaglobulinemia epidermolysis bullosa simplex esophageal squamous cell carcinoma exotropia familial bicuspid aortic valve familial primary hypomagnesemia porencephaly sideroblastic anemia subvalvular aortic stenosis Legg-Calve-Perthes disease MONDO:0005580 MONDO:0001286 MONDO:0007194 MONDO:0018100 MONDO:0017410 MONDO:0015194 MONDO:0006987 MONDO:0007885 MONDO:0009532 Miller-Dieker lissencephaly syndrome MONDO:0015977 MONDO:0016466 MONDO:0002102 MONDO:0017885 MONDO:0001342 MONDO:0017610 MONDO:0003435 microcystic adenoma MONDO:0018943 myofibrillar myopathy MONDO:0007243 Burkitt lymphoma MONDO:0010269 Coats disease abdominal obesity-metabolic syndrome MONDO:0000816 arcus senilis MONDO:0007150 autosomal dominant distal hereditary motor neuropathy MONDO:0015362 central nervous system primitive neuroectodermal neoplasm MONDO:0000640 chronic monocytic leukemia MONDO:0004614 distal arthrogryposis MONDO:0019942 exostosis MONDO:0002181 granular cell cancer MONDO:0003252 intermediate Charcot-Marie-Tooth disease MONDO:0018778 juvenile open angle glaucoma MONDO:0020367 leukoencephalopathy with vanishing white matter MONDO:0011380 MONDO:0021637 low grade glioma MONDO:0002478 mixed germ cell-sex cord-stromal tumor MONDO:0004600 monocytic leukemia nemaline myopathy MONDO:0018958 non-syndromic X-linked intellectual disability MONDO:0019181 osteogenesis imperfecta MONDO:0019019 ovarian endometrioid adenocarcinoma MONDO:0006335 paronychia MONDO:0005898 restless legs syndrome MONDO:0005391 secondary Parkinson disease MONDO:0006966 septooptic dysplasia MONDO:0008428 superficial mycosis MONDO:0024268 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ PINNACLE use individual GAT 
modules for different contexts. We consider two types of scaling experiments: the number of contexts and the context node percentage. For the number of contexts, we fix the number of context-specific nodes to be about 5% of the total number of nodes and vary the number of contexts from 10 to 1,000. Meanwhile, for the context node percentage, we fix the number of contexts to 100 and vary the context node percentage from 1% to 50%. For both sets of experiments, we use a synthetic network with 10,000 nodes generated using the Barabási–Albert model [18]. The network is generated so that the density is approximately 0.01. We report the peak CUDA memory usage (in bytes) of the forward pass for each model using the torch.cuda.max_memory_allocated() function. All experiments are conducted on compute nodes with 5 CPUs, 45 GB of memory, and a Tesla V100 GPU (32 GB). We uniformly set the following hyperparameters across models: 128 dimensions and one layer. Furthermore, BIONIC and PINNACLE require subgraph-batched training; we set the batch size to 2048 for both models. On the other hand, CONE and GAT employ full-batch computation.

Figure 5.6: Model scalability across different contextualization settings. The star indicates the point beyond which the model runs out of memory.

The empirical scalability results are shown in Figure 5.6. We first highlight that CONE shows great scalability, with minimal overhead as more contexts are introduced. Remarkably, CONE's memory consumption is comparable to that of the plain GAT model, which does not take context into account. This aligns well with our complexity analysis (Chapter 5.3.4). Conversely, BIONIC and PINNACLE react drastically to the number of contexts, with BIONIC running out of memory beyond 500 contexts. Furthermore, PINNACLE demonstrates a severe scalability issue with the context node percentage.
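To make the synthetic setup concrete, a Barabási–Albert network with density near the 0.01 used above can be generated with attachment parameter m ≈ 0.01·(n−1)/2, since the model's density is roughly 2m/(n−1). The following is a pure-Python sketch of the preferential-attachment process (the original experiments may have used a library implementation; the smaller n here is only to keep the demo fast):

```python
import random

def barabasi_albert(n, m, seed=0):
    """Barabási–Albert preferential attachment: each new node attaches to m
    existing nodes chosen with probability proportional to their degree.
    Returns an undirected edge set of (u, v) pairs with u < v."""
    rng = random.Random(seed)
    targets = list(range(m))  # initial attachment targets
    repeated = []             # node list weighted by degree
    edges = set()
    for v in range(m, n):
        for u in set(targets):            # dedupe repeated draws
            edges.add((min(u, v), max(u, v)))
            repeated += [u, v]
        targets = [rng.choice(repeated) for _ in range(m)]
    return edges

# With n=2,000 and m=10, the density 2m/(n-1) is close to the 0.01 target
# (the thesis experiments used n=10,000 with a correspondingly larger m).
edges = barabasi_albert(2_000, 10)
density = 2 * len(edges) / (2_000 * 1_999)
```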
In other words, PINNACLE's memory consumption drastically increases as the context subgraphs grow. These results showcase the scalability advantage of CONE's use of a single shared GAT module to decode the embeddings for all contexts.

Figure 5.7: CONE dimensionality reduction effect comparison. Performance comparison between the PCA-reduced CONE (x-axis) and the non-reduced CONE (y-axis) in terms of testing APOP on the DisGeNET disease gene classification benchmark.

5.5.6.2 Effects of PCA dimensionality reduction
The final CONE context-specific embeddings are obtained by first concatenating the context-naive with the context-specific embeddings and then applying PCA to halve the dimensionality. Combining the context-naive and context-specific embeddings gives the final embedding a more comprehensive view of both the global and local (context-specific) semantics. Dimensionality reduction is applied so that the results for the final context-specific embeddings can be fairly compared to the context-naive embeddings. PCA is a common dimensionality reduction technique due to its simplicity and has been used in previous studies, such as Walklets [203], to combine multiple views of the embeddings. One remaining question is how the performance changes before and after applying PCA. Here, we compare the performance between the fully concatenated and PCA-reduced versions of CONE following the DisGeNET benchmarking setting. We observe little performance difference between the two versions of CONE (Figure 5.7). Thus, we set the final CONE to use PCA, as it provides a fairer setting in terms of the number of dimensions while maintaining the same performance.
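The concatenate-then-reduce construction can be sketched as follows, using an SVD-based PCA stand-in with illustrative shapes (this is a generic sketch, not the obnb/CONE implementation):

```python
import numpy as np

def combine_and_reduce(naive_emb, context_emb):
    """Concatenate context-naive and context-specific embeddings, then apply
    PCA (via SVD on the centered matrix) to project the concatenated
    dimensionality back down by half, i.e., to the original embedding size."""
    x = np.concatenate([naive_emb, context_emb], axis=1)  # (n_genes, 2d)
    x = x - x.mean(axis=0, keepdims=True)                 # center features
    u, s, _ = np.linalg.svd(x, full_matrices=False)       # singular values sorted desc.
    d = naive_emb.shape[1]
    return u[:, :d] * s[:d]                               # top-d principal component scores

rng = np.random.default_rng(0)
naive, context = rng.normal(size=(100, 16)), rng.normal(size=(100, 16))
final = combine_and_reduce(naive, context)  # same shape as the naive embedding
```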
Figure 5.8: Performance for different context similarity and context encoding strategies. Each point in the box plot corresponds to the prediction test performance of a disease gene classification task from the DisGeNET benchmark, averaged across five random splits. Different panels show groups of diseases with different numbers of positive genes. * indicates that the performances of the two methods are significantly different (Wilcoxon p-value < 0.05).

5.5.6.3 Ablation studies
In the following, we investigate the effectiveness of the main design choices of CONE, including the context similarity measure and the MLP context encoder. We follow the same experimental settings as in our primary DisGeNET gene classification benchmark (Chapter 6.3).

Table 5.6: Ablation study of context similarity and context encoding strategies using the DisGeNET benchmark. Results are reported as APOP scores averaged across tasks within a group based on the number of positive examples.
Similarity                    Encoder    [10, 13)  [13, 23)  [23, 42)  [42, 173)
Default CONE setting
  Jaccard                     MLP        2.813     3.111     3.116     1.971
Context similarity ablations
  Cosine                      MLP        2.801     3.186     3.048     2.003
  RBF                         MLP        2.805     3.189     3.038     1.957
  Spearman                    MLP        2.705     3.194     3.050     1.966
Context encoder ablation
  –                           Embedding  2.697     3.083     2.911     1.932

(Number of tasks per group: 31, 43, 39, and 38, respectively.)

Context similarity measures. Besides the default Jaccard similarity measure, we consider three other similarity measures: the cosine similarity, the radial basis function, and the Spearman correlation coefficient. Table 5.6 shows that the choice of similarity has marginal effects on the performance, with the default Jaccard similarity consistently achieving better or equivalent performance compared to the other choices. Figure 5.8 further indicates that there are no significant performance differences across the choices of similarity measure according to the paired Wilcoxon test.
Context encoder. A trivial way to encode the context embedding is one-hot encoding, which is equivalent to directly learning an embedding for each context. We call this approach Embedding. We observe that Embedding achieves the lowest average performance across all groups of tasks (Table 5.6).
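For illustration, the Jaccard and cosine context similarities from the ablation above can be sketched on toy gene sets (the context sets below are hypothetical, and the exact vectorization used by CONE may differ):

```python
import math

def jaccard(a, b):
    """Jaccard similarity between two context gene sets (CONE's default):
    |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine_binary(a, b):
    """Cosine similarity of the binary membership vectors of two gene sets;
    for sets, this reduces to |intersection| / sqrt(|a| * |b|)."""
    a, b = set(a), set(b)
    den = math.sqrt(len(a)) * math.sqrt(len(b))
    return len(a & b) / den if den else 0.0

# Hypothetical context gene sets
heart = {"MYH7", "TNNT2", "ACTC1"}
muscle = {"MYH7", "ACTC1", "DES"}

sim_jaccard = jaccard(heart, muscle)        # 2 shared / 4 total = 0.5
sim_cosine = cosine_binary(heart, muscle)   # 2 / sqrt(3 * 3)
```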
In the case of the disease task group [23, 42), the Embedding performance is significantly worse than that of the default CONE (Figure 5.8, Wilcoxon p-value < 0.05).
5.6 Conclusion
We have proposed CONE, a flexible and scalable approach that can inject diverse biological contextual information into gene interaction network embeddings. Our study underscores the efficacy of the CONE method in enhancing the prioritization of genes within the gene interaction network. CONE consistently demonstrated superior performance in various biomedical tasks compared to baseline methods. Crucially, the introduction of context-specific embeddings resulted in significant performance gains, especially when the positive genes for a disease were limited. Moreover, the contexts identified by CONE were found to be biologically relevant, suggesting that the method not only boosts prediction accuracy but also provides biologically meaningful insights. This ability to integrate diverse biological contexts, from tissues and cell types to diseases, positions CONE as a versatile tool that can be used to uncover both explicit and cryptic relationships within biomedical datasets.
Limitations and future directions. Constructing context-specific subnetworks solely based on the subgraph induced by context-specific genes is a reasonable but overly simplistic assumption. In reality, context-specific gene interactions are complicated and encompass diverse mechanisms, ranging from interactions mediated by non-coding RNAs to the influence of epigenetic modifications and signaling pathways. Consequently, a promising advancement would involve carefully constructed context-specific gene interaction networks that account for these intricate nuances, such as HumanBase [80], HIPPIE [5], IID [127], and many others [146, 43, 260].
CHAPTER 6
OPEN BIOMEDICAL NETWORK BENCHMARK: A PYTHON TOOLKIT FOR BENCHMARKING DATASETS WITH BIOMEDICAL NETWORKS
Over the past decades, network biology has been a major driver of computational methods developed to better understand the functional roles of each gene in the human genome in their cellular context. Following the application of traditional semi-supervised and supervised machine learning (ML) techniques, the next wave of advances in network biology will come from leveraging graph neural networks (GNN). However, to test new GNN-based approaches, a systematic and comprehensive benchmarking resource that spans a diverse selection of biomedical networks and gene classification tasks is lacking. Here, we present the Open Biomedical Network Benchmark (OBNB), a collection of node-classification benchmarking datasets derived using networks from 15 sources and tasks that include predicting genes associated with a wide range of functions, traits, and diseases. The accompanying Python package, obnb, contains reusable modules that enable researchers to download source data from public databases or archived versions and set up ML-ready datasets that are compatible with popular GNN frameworks such as PyG and DGL. Our work lays the foundation for novel GNN applications in network biology. obnb will also help network biologists easily set up custom benchmarking datasets for answering new questions of interest and collaboratively engage with graph ML practitioners to enhance our understanding of the human genome. OBNB is released under the MIT license and is freely available on GitHub: https://github.com/krishnanlab/obnb
6.1 Introduction
Life is orchestrated by remarkably complex interactions between biomolecules, such as genes and their products. Network biology [22, 19, 108, 261] has demonstrated remarkable success over the past two decades in systematically uncovering genes' functions and their relations to human traits and diseases.
Accurately identifying genes associated with a particular disease, for example, is a vital step towards understanding the biological mechanisms underlying the condition, which in turn could lead to novel and effective diagnostic and treatment strategies [268, 136]. Early work in network biology focused on network diffusion-type methods based on the guilt-by-association principle [188], which states that genes interacting with each other likely participate in the same biological functions or pathways. The performance of these methods has subsequently been improved by the application of supervised learning [181]. In this trajectory, the next wave of improvements in network biology is likely to emanate from the surge of powerful graph machine learning (ML) techniques such as graph embeddings [78, 39] and graph neural networks [279, 295]. These methods have shown promising results in many graph-structured tasks, such as social networks [99], and have started to attract researchers to apply them to biological tasks [16, 291, 183]. To this end, accelerating the development and application of graph ML methods in network biology is of great importance. However, there is a critical need for standardized benchmarks that allow reliable and reproducible assessment of novel graph ML methods [232, 99]. Recent efforts such as MoleculeNet [278] and Therapeutics Data Commons [102] for molecular and therapeutic property predictions, and Benchmarking GNN [64] and Open Graph Benchmark [99, 98] for more general graph benchmarks, have proven valuable in advancing the field of graph ML by providing carefully constructed benchmarking datasets for applying specialized methods. Meanwhile, such comprehensive benchmarking datasets and systems are currently lacking for network biology. Furthermore, setting up ML-ready datasets for network biology is incredibly tedious.
Some necessary steps include converting gene identifiers (IDs), setting up labeled data from annotated biomedical ontologies, filtering labeled data based on network gene coverage, and constructing realistic data splits mimicking real-world, biologically meaningful scenarios. As a result, despite the remarkable amount of publicly available data for biomedical networks [100, 111] and annotations [235], the only widely available ML-ready datasets for network biology are the PPI dataset from OhmNet [297] and PPA from the Open Graph Benchmark [99].
Contributions. In this work, we address this critical need. Our main contributions are as follows:
1. We present a Python package, obnb, that provides reusable modules for data downloading, processing, and split generation to set up node classification benchmarking datasets using publicly available biomedical networks and gene annotation data. The first release version contains interfaces to networks from 15 sources and annotations from three sources.
2. We present a comprehensive benchmarking study on the OBNB node classification datasets with a wide range of graph ML approaches and tricks to set up the baseline for future comparisons.
3. We analyze the benchmarking results and point out several exciting directions and the potential need for a special class of graph neural networks to tackle the OBNB tasks.
Related work. Several existing Python packages share similar goals with obnb, primarily focusing on establishing biomedical network datasets and facilitating their analyses. In these networks, nodes typically represent genes or their products, such as proteins, while edges represent the functional relationships between them [100], such as physical interactions. PyGNA [68] offers a suite of tools for analyzing and visualizing single or multiple gene sets using biological networks. PyGenePlexus [166] and drug2ways [218] specialize in network-based predictions of genes and drugs.
The OGB [99] platform houses a variety of graph benchmarking datasets, which include a PPI dataset akin to the STRING-GOBP dataset in OBNB (Table C.6). Nonetheless, all the aforementioned packages contain a limited number of biomedical networks and label datasets, if any. obnb, on the other hand, provides an extensive number of biomedical networks and diverse gene set collections to facilitate the systematic evaluation of graph machine learning methods on diverse datasets. In a related domain, PyKEEN [7] provides a vast array of biomedical knowledge graph (KG) datasets and KG embedding methods. There, the main task of interest is link prediction, through which missing knowledge links can be completed. Other notable works for constructing large-scale biomedical KGs and setting up link-prediction benchmarks from them include BioCypher [156] and OpenBioLink [30]. While the tasks associated with KGs [267] are orthogonal to the node classification settings, it is possible to reformulate gene classification problems as KG completion and vice versa [16, 291]. Nevertheless, the advantages and drawbacks of these two approaches are yet to be comprehensively evaluated.

Figure 6.1: Overview of obnb data processing and benchmarking dataset generation. (a) obnb downloads network and label data from public data sources, then processes and combines them into ML-ready benchmarking graph datasets. (b) Generating label data from an annotated ontology through annotation propagation and non-redundant gene set extraction.

6.2 OBNB system description
Making the process of setting up ML-ready network biology benchmarking datasets from publicly available data as effortless as possible is the core mission of obnb. We implement and package a suite of graph (obnb.graph) and label (obnb.label) processing functionalities and couple them with the high-level data object obnb.data to provide a simple interface for users to download and process biological networks and label information.
For example, calling obnb.data.BioGRID("datasets") and obnb.data.DisGeNET("datasets") will download, process, and save the BioGRID network and the DisGeNET label data under the datasets/ directory, from which they can be loaded directly the next time the functions are called. Users can compose a dataset object using the network and label objects, along with the split (Chapter 6.2.3), which can then be used by a model trainer to train and evaluate a particular graph learning method. Alternatively, the composed dataset object can be transformed into data objects for standard GNN frameworks, including PyTorch Geometric (PyG) [70] and Deep Graph Library (DGL) [265]. In the following subsections, we provide details about the processing steps for the biomedical networks (Chapter 6.2.1), the creation of node labels from annotated ontologies (Chapter 6.2.2), and the preparation of data splits (Chapter 6.2.3). Finally, as we are dedicated to creating a valuable resource for the computational biology and graph machine learning communities, we closely follow several open-source package standards, as elaborated in Chapter 6.2.6.
6.2.1 Network
Downloading. Currently, tens of genome-scale human gene interaction network databases are publicly available, each constructed and calibrated with different strategies and sources of interaction evidence [100].
Unlike in many other domains, such as chemoinformatics, where there are only a few ways to construct the graph (e.g., molecules), gene interaction networks can be defined and created in a wide range of manners, all of which capture different aspects of the functional relationships between genes. Some broad gene interaction mining strategies include experimentally captured physical protein interactions [240], gene co-expression [113], genetic interactions [60], and text-mined gene interactions [246]. We leverage the Network Data Exchange (NDEx) [210] to download the biological network data when possible (Figure 6.1a). The obtained CX stream format is then converted into an obnb.graph object for further processing.
Gene ID conversion. In gene interaction networks, each node represents a gene or its gene product, such as a protein. Several standards exist for gene identifiers, and different gene interaction networks might not use the same gene ID. For example, the STRING database [246] uses Ensembl protein IDs [104], PCNet [100] uses HGNC [79], and BioGRID [240] uses Entrez gene IDs [104]. Here, we use the NCBI Entrez gene ID for its advantages, such as supporting species beyond human and being less ambiguous [94]. To convert other types of gene IDs into Entrez genes, we use the MyGene query service [277], which provides the most up-to-date gene ID mapping across tens of gene ID types. Subsequently, we remove any gene for which more than one gene ID maps to the same Entrez gene, as this indicates ambiguity in the current gene identifier annotations. The gene interaction network after gene ID conversion will contain an equal number of or fewer genes, all of which are one-to-one mapped from the original gene IDs to the corresponding Entrez genes.
Connected component. In practice, a small fraction (typically within 1%) of genes may be disconnected from the largest connected component.
The disconnectedness of the network is typically due to missing information about gene interactions from the measurements; the more information a network uses to define the interactions, the denser it is and the less likely it is to contain disconnected genes. Thus, we extract the largest connected component of the gene interaction network by default, as it is more natural to have a single component for the transductive node classification tasks we consider. However, a user can also choose to take the full network without extracting the largest connected component by setting the largest_comp option to False when initializing the data object.
6.2.2 Label
Many biological annotations are organized into abstract concept hierarchies, known as ontologies [23]. Each ontology term (a node in the ontology graph) is associated with a set of genes provided by currently curated knowledge. By the hierarchical nature of the ontologies, if a gene is associated with an ontology term, then it must also be associated with any more general ontology term. Thus, we first propagate gene annotations over the ontology by assigning each gene to all parent terms of its originally annotated terms. This results in a highly redundant gene set collection, where each ontology term necessarily contains a subset of its parent term's genes. We reduce the gene set collection's redundancy by picking a representative and non-redundant subset of the gene sets. The final processed data takes the form Y ∈ {0, 1}^(N×p), where N is the number of genes and p is the number of labels. Below, we provide detailed descriptions of (1) the gene annotation propagation and (2) the gene set filtering steps.
6.2.2.1 Annotated ontology
An ontology is a directed acyclic graph H = (V_H, E_H), where v ∈ V_H is an ontology term. A directed edge (u, v) ∈ E_H indicates that v is a parent of u, that is, v is a more abstract or general concept containing u. Consider B as the set of all genes we are interested in, and let P(B) be the power set of B.
Given gene annotation data, represented as a set of gene–term pairs A = {(b, v)_i}, where b ∈ B is a gene and v ∈ V_H is an ontology term, we represent the raw annotation J_0 : V_H → P(B) as a map from an ontology term to a set of genes, so that J_0(v) = {b : (b, v) ∈ A}. We then propagate the raw annotation J_0(·) over the ontology H into a propagated annotation J(·), where J(v) = ∪_{v′ : ∃ path from v′ to v} J_0(v′), as depicted in Figure 6.1b.
Downloading. All the ontology data is downloaded from the OBO Foundry [235] (Figure 6.1a), which actively maintains tens of large-scale biological ontologies. In the first release, we focus on three different annotations: Gene Ontology (GO) [14], DisGeNET [206], and DISEASES [81]. We naturally split GO into three sub-collections, namely biological process (GOBP), cellular component (GOCC), and molecular function (GOMF), resulting in a total of five label collections.
6.2.2.2 Filtering
The propagated annotations contain highly redundant gene sets, which may bias the benchmarking evaluations toward gene sets that commonly appear. To address this, we adapt the non-redundant gene set selection scheme used in [153]. In brief, we construct a graph of gene sets based on the redundancy between pairs of gene sets and recursively extract the most representative gene set in each connected component (Figure 6.1b). In addition, we provide several other filtering methods, such as filtering gene sets based on their sizes, their number of occurrences, and their existence in the gene interaction network, to help further clean up the gene set collection used for the final evaluation. Advanced users can alter these filtering functionalities to flexibly set up a custom gene set collection that better suits their biological interests, beyond the default filtering steps provided by obnb.
6.2.3 Data splitting
A rigorous data splitting should closely mimic real-world scenarios to provide accurate and unbiased estimations of the models' performance.
One stringent solution is temporal holdout, where the training input and label data are obtained before a specific time point, and the testing data only becomes available afterwards [58, 153]. In practice, this temporal setting is often too restrictive and leaves too few tasks for evaluation [153]. Thus, by default, we consider a closely related but less strict strategy called study-bias holdout.

6/2/2 study-bias holdout  We use the top 60% of the most studied genes with at least one associated label for training. The extent to which a gene is studied is measured by its number of associated PubMed publications retrieved from NCBI (https://www.ncbi.nlm.nih.gov/). The 20% least studied genes are used for testing, and the remaining 20% of genes are used for validation. Notice that by splitting up genes this way, some tasks may end up with very few positive examples in one of the splits. Hence, we remove any task whose minimum number of positive examples across splits falls below a threshold value (10 by default). For completeness, we also provide functionality in obnb to generate random splits.

6.2.4 Dataset construction

The network (Chapter 6.2.1) and label (Chapter 6.2.2) modules provide flexible solutions for processing and setting up datasets. In addition, we provide a default dataset constructor that uses the above modules with reasonable settings to set up a dataset for a particular choice of network and label. The default dataset construction workflow is as follows.

1. Select the network and label (task collection) of interest.
2. Remove genes in the task collection that are not present in the gene interaction network.
3. Remove tasks whose number of positive examples falls below 50 after the gene filtering above.
4. Set up the 6/2/2 study-bias holdout (Chapter 6.2.3).
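The 6/2/2 study-bias holdout used in step 4 can be sketched as follows. This is a simplified, self-contained illustration rather than the obnb implementation; pubmed_counts is a hypothetical gene-to-publication-count mapping:

```python
def study_bias_split(genes, pubmed_counts):
    """Split genes 60/20/20 into train/validation/test by study bias.

    The most-studied 60% of genes (by PubMed publication count) go to
    training, the 20% least-studied genes to testing, and the remaining
    20% to validation.
    """
    # Rank genes from most to least studied
    ranked = sorted(genes, key=lambda g: pubmed_counts[g], reverse=True)
    n = len(ranked)
    n_train, n_val = int(n * 0.6), int(n * 0.2)
    train = ranked[:n_train]
    valid = ranked[n_train:n_train + n_val]
    test = ranked[n_train + n_val:]
    return train, valid, test

genes = [f"g{i}" for i in range(10)]
counts = {g: i for i, g in enumerate(genes)}  # g9 is the most studied
train, valid, test = study_bias_split(genes, counts)
# train holds the 6 most-studied genes; test the 2 least-studied
```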
6.2.5 Example code

The obnb library provides high-level APIs for users to set up benchmarking datasets from diverse selections of gene interaction networks and gene annotations to study the performance of various graph learning methods. Below, we provide a simple code snippet demonstrating how one can set up a full dataset in a single function call. Further, we demonstrate how the constructed dataset can be directly used to evaluate the performance of a simple GNN model. More comprehensive example usage can be found in the obnb README file: https://github.com/krishnanlab/obnb/blob/main/README.md.

    from obnb.dataset import OpenBiomedNetBench

    # Download the current processed archive (set version to "latest" to
    # download data from source directly and process them from scratch)
    dataset = OpenBiomedNetBench(root="datasets", version="current",
                                 graph_name="BioGRID", label_name="DisGeNET",
                                 auto_generate_feature="OneHotLogDeg")

    # Use the built-in GNN trainer to train and evaluate a simple GCN
    gcn_model = GCN(in_channels=1, hidden_channels=64,
                    num_layers=5, out_channels=dataset.num_tasks)
    gcn_trainer = SimpleGNNTrainer(device="cuda", metric_best="apop")
    gcn_results = gcn_trainer.train(gcn_model, dataset)

    # Alternatively, convert the dataset into your favorite GNN framework, e.g., PyG or DGL
    # (or use OpenBiomedNetBenchPyG or OpenBiomedNetBenchDGL to directly instantiate the dataset)
    pyg_data = dataset.to_pyg_data()
    dgl_data = dataset.to_dgl_data()

6.2.6 Community standards and maintenance plans

We follow several community standards to ensure sustainable and maintainable community-wide contributions, which is key to the continual development of a code base. First, we release the code on GitHub (https://github.com/krishnanlab/obnb) under the permissive MIT license. Second, we use Sphinx to build the documentation of the code base and host it on ReadTheDocs.org.
Several quick-start examples are also provided on the GitHub README page. Third, code quality is ensured via testing and code-linting automation using tox, pytest, pre-commit hooks, and GitHub Actions. Fourth, we provide contribution guidelines and a code of conduct on the GitHub page. Finally, as part of our commitment to the community, we have put in place a maintenance plan to address GitHub issues, merge pull requests, and release updates periodically, ensuring the benchmarks remain adaptive to the evolving needs of the community.

6.3 Benchmarking experiment setup

To provide solid baselines for future reference, we benchmark a wide range of graph ML methods on our comprehensive and diverse OBNB datasets, covering 15 gene interaction networks and four gene classification tasks (Appendix C.1). The classification performance is reported as the test average precision over the prior (APOP) score, averaged across five seeds (see Chapter 6.3.6 for the mathematical definition and the motivation for choosing APOP as our main metric).

6.3.1 Datasets

We present the primary results for the combination of two networks (BioGRID and HumanNet) and two gene classification tasks (GOBP and DisGeNET). The two networks are chosen for their distinct characteristics: BioGRID is an unweighted graph whose edges indicate protein interaction evidence from high-throughput experiments [240], whereas HumanNet is a weighted graph whose edges indicate much more diverse types of interactions, such as gene coexpression and associations derived via literature text mining [107]. Meanwhile, the two selected label collections cover two broad classes of gene classification problems, namely gene function prediction (GOBP) and disease gene association prediction (DisGeNET). Statistics for the networks and task collections of the primary datasets are shown in Table 6.1 (see Tables C.5 and C.6 for all dataset statistics). Table 6.1: Main dataset statistics.
The network density is computed as the ratio of existing edges over all possible undirected edges: 2(#edges) / (#nodes × (#nodes − 1)).

(a) Network statistics

              # nodes    # edges      Density
    BioGRID   18,951     1,103,298    0.0030
    HumanNet  17,211     847,104      0.0029

(b) Label set collection statistics

              # tasks    # positives
    DisGeNET  298        198.6 ± 135.0
    GOBP      105        88.4 ± 35.3

6.3.2 Baselines

We consider three general types of methods: (1) label propagation [293], which directly propagates label information over the graph; (2) graph embedding, which first extracts a vectorial representation of each node in the graph and then fits a simple classifier, such as logistic regression, on top; and (3) GNNs, which learn the mapping from each node to its label space end-to-end.

Graph embeddings  We include three distinct featurizations in the main results: using the rows of the adjacency matrix as the node features (Adj) [153], using node2vec embeddings (N2V) [83], and using Laplacian EigenMap embeddings (LapEigMap) [25]. Extended results using SVD, LINE [247], and Walklets [203] are available in the Appendix (Table C.2). Each of these features is coupled with an ℓ2-regularized logistic regression model (LogReg) for downstream prediction.

GNN  We include multiple variations of two GNNs, GCN [124] and GAT [257], in the main results. Extended results for SAGE [89], GIN [282], and GatedGCN [31] can be found in the Appendix (Table C.2). One major challenge in applying GNNs to OBNB datasets is the lack of canonical node features. This is unlike networks in other domains, such as citation networks, where node features are naturally defined using the paper content, such as a bag of words [99]. To tackle this problem, we experiment with a diverse selection of node featurization strategies, including the one-hot encoded log degree of each node, node2vec embeddings, and many others.
We provide detailed descriptions of the 15 different feature construction strategies in Chapter 6.3.3.1, with extended results for GCN and GAT in Table C.1.

Bag of Tricks (BoT)  In addition to optimizing the model architectures, many tricks in model training and feature augmentation have been shown to be key factors for practically good performance in existing benchmarks [99]. Here, we further investigate the usefulness of several popular tricks, including reusing training labels as node features (LabelReuse) [269] and performing post-correction of the predictions at test time via correct and smooth (C&S) [103].

Figure 6.2: GNN model architecture overview.

6.3.3 GNN backbone design

The basic GNN architecture we use contains the following components:

• Feature encoder  Since the genes do not come with canonical features, we need to derive initial node features for the GNN model. The default option is OneHotLogDeg with 32 bins (Chapter 6.3.3.1). The raw features are first processed by a custom batch norm layer that is active during testing in addition to training. The processed features are then projected into the hidden dimension of the model (d = 128 by default) via a linear layer, followed by batch normalization and ReLU activation.

• Message passing layers  We use five message passing layers in the GNN model by default. Each message passing layer contains a convolution block (Chapter 6.3.3.2), followed by normalization, non-linear activation, and dropout. We apply a residual connection by summing the input and the output of the graph convolution block.

• Prediction head and post-processing  Finally, we apply a linear layer to project the hidden representation to the dimension matching the number of tasks, and apply a sigmoid activation to turn the predictions into binary prediction probabilities.
Optionally, we apply correct and smooth (C&S) [103] at test time to refine the predicted probabilities via a two-step label propagation (Chapter 6.3.3.3).

6.3.3.1 Node features

In this section, we provide an overview of the diverse selection of node feature constructions used in our benchmark. All node features are constructed using the whole network as input in a transductive setting, except for the few cases where the node features do not depend on the network structure, such as constant features. By default, every initial node feature is 128-dimensional unless otherwise specified. We start with a collection of simple statistics that can be easily derived.

• Constant uses a one-dimensional trivial feature for every node in the graph.

• RandomNormal samples 32-dimensional features for each node independently from a standard normal distribution.

• OneHotLogDeg (short for LogDeg) first computes the log degree of each node in the graph and then uniformly bins the nodes into one of 32 bins based on their log degree. The one-hot encoded node degree approach has recently been shown to be a strong structure encoder, sometimes resulting in performance superior to using the original node features associated with the graph [52, 149]. Meanwhile, the design choice of using log-uniform grids stems from the scale-free nature of biological networks [6].

• RandomWalkDiag is the landing probability of a node returning to itself after k hops, commonly referred to as the random walk structural encoding (RWSE) [65]. It has been widely used in many graph transformer models due to its expressiveness in capturing graph structure [141].

Next, we consider several node feature options derived directly from the adjacency matrix.

• Adj uses the rows of the adjacency matrix as the node features.
It has been shown previously [153] that logistic regression using the adjacency matrix produces better prediction performance than the commonly used label propagation algorithm for diverse gene classification problems.

• RandProjGaussian and RandProjSparse are random projections of the adjacency matrix using two different but related approaches. We use the scikit-learn [196] implementations (GaussianRandomProjection and SparseRandomProjection) to compute these features.

• SVD uses the left singular vectors of the adjacency matrix as the node features.

• LapEigMap [25] uses the (ℓ2-normalized) eigenvectors of the symmetric normalized graph Laplacian as the node features.

Node embeddings [183] are powerful approaches for extracting vectorial representations of the nodes in a graph and have shown promising results in many biomedical applications [153, 291, 16]. Thus, we include a few popular and well-performing node embedding options in our benchmark.

• LINE1 and LINE2 [247] extract first- and second-order proximity information from the graph to train the underlying embeddings. We use the GraPE [41] implementation to compute the LINE embeddings.

• Node2vec [83] extracts node representations by running word2vec [174] on node sequences sampled from the graph via biased random walks. The biased random walks, in contrast to the earlier DeepWalk [202], which uses an unbiased search, allow the search strategy to be more flexible, mimicking either breadth-first or depth-first search in the random walk phase.

• Walklets [203] is similar to both Node2vec and DeepWalk in that it runs word2vec on random walks sampled from the graph. However, Walklets samples node pairs from the random walks with a specific number of hops, allowing more explicit control of the multiscale relationships between nodes.

Finally, we experiment with options that let the model learn the node features freely.

• Embedding lets the model learn the node features freely.
• AdjEmbBag is similar to Embedding but with an additional aggregation step that sums up the raw embeddings of the central node's neighbors.

6.3.3.2 Graph neural networks

In this section, we summarize the five tested GNN models under the message-passing framework [74]. Let h_i^l be the raw representation of vertex i at layer l, and \tilde{h}_i^l the corresponding processed representation, e.g., after non-linear activation and normalization. The message-passing framework is written as

    h_i^{l+1} = f^l\Big( \bigoplus_{j \in \mathcal{N}_i} \phi_n^l\big(\tilde{h}_j^l\big),\ \phi_s^l\big(\tilde{h}_i^l\big) \Big)    (6.1)

where \bigoplus is the aggregation operator, f is the update function, and \phi_n and \phi_s are the message functions for neighbors and self, respectively. Different GNNs differ in their choices of \bigoplus, f, \phi_n, and \phi_s.

GCN [124] uses a scaled linear transformation as the message function. The scaling factor is computed from the normalized edge weights with self-loops added. The aggregation is done by summing the transformed messages.

    h_i^{l+1} = \sum_{j \in \mathcal{N}_i} \frac{e_{i,j}}{\sqrt{\tilde{d}_i \tilde{d}_j}} \Theta^l \tilde{h}_j^l + \frac{e_{i,i}}{\tilde{d}_i} \Theta^l \tilde{h}_i^l    (6.2)

where \tilde{d}_i = d_i + 1, with d_i = \sum_{j \in \mathcal{N}_i} e_{i,j}, is the degree of vertex i with self-loop added.

SAGE [89] uses two separate linear transformations as the message functions for neighbors and self. The aggregation is done by taking the sum of the neighbors' messages (the original paper uses mean aggregation, but we found that sum aggregation works better in our benchmark) and adding it to the self-message. We additionally use an affine update function to transform the aggregated messages.

    h_i^{l+1} = \Theta_u^l \Big( \Theta_n^l \sum_{j \in \mathcal{N}_i} \tilde{h}_j^l + \Theta_s^l \tilde{h}_i^l \Big) + b_u^l    (6.3)

GIN [282] uses a multi-layer perceptron (MLP) to update the aggregated messages, which are obtained by summing the neighbors' representations and the self-representation from the last layer.

    h_i^{l+1} = \mathrm{MLP}^l \Big( \sum_{j \in \mathcal{N}_i} \tilde{h}_j^l + (1 + \epsilon)\, \tilde{h}_i^l \Big)    (6.4)

GAT [257, 32] uses an attention mechanism to distribute the weights with which the linearly transformed messages are aggregated.
    h_i^{l+1} = \sum_{j \in \mathcal{N}_i} \alpha_{i,j} \Theta_v^l \tilde{h}_j^l + \alpha_{i,i} \Theta_v^l \tilde{h}_i^l    (6.5)

In particular, the original GAT paper [257] formulates the attention scores as

    \alpha_{i,j} = \frac{\exp\big(\mathrm{LeakyReLU}\big((a^l)^\top \Theta_a^l [\tilde{h}_i^l \,\|\, \tilde{h}_j^l]\big)\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}\big((a^l)^\top \Theta_a^l [\tilde{h}_i^l \,\|\, \tilde{h}_k^l]\big)\big)}    (6.6)

However, a more recent work, GATv2 [32], pointed out that the above formulation is fundamentally limited in the types of attention the model can learn, and proposed the following correction, which shows stronger performance.

    \alpha_{i,j} = \frac{\exp\big((a^l)^\top \mathrm{LeakyReLU}\big(\Theta_a^l [\tilde{h}_i^l \,\|\, \tilde{h}_j^l]\big)\big)}{\sum_{k \in \mathcal{N}_i} \exp\big((a^l)^\top \mathrm{LeakyReLU}\big(\Theta_a^l [\tilde{h}_i^l \,\|\, \tilde{h}_k^l]\big)\big)}    (6.7)

We also observe a slight but significant improvement when using GATv2 attention as opposed to the original GAT attention. Thus, all reported GAT results are based on v2 attention.

GatedGCN [31] uses a gating mechanism to process neighbors' messages, with linear transformations as the message functions. The aggregation is done by summing the gated messages from the neighbors and adding them to the self-message.

    h_i^{l+1} = \sum_{j \in \mathcal{N}_i} \eta_{i,j} \odot \Theta_v^l \tilde{h}_j^l + \Theta_s^l \tilde{h}_i^l    (6.8)

where \odot is the elementwise multiplication operator, and the gating coefficients \eta_{i,j} are computed as

    \eta_{i,j} = \mathrm{sigmoid}\big(\Theta_q^l \tilde{h}_i^l + \Theta_k^l \tilde{h}_j^l\big)    (6.9)

6.3.3.3 Network diffusion

Label propagation iteratively propagates label information from the source nodes outward through the network neighborhoods [126, 51, 180]. Let Y ∈ {0, 1}^{n×d} be the (training) label matrix, and M ∈ R^{n×n} the diffusion operator. The propagated information at step t, denoted F_t ∈ R^{n×d}, is defined as

    F_t = \alpha M F_{t-1} + (1 - \alpha) F_0    (6.10)

where F_0 = F is the feature matrix to be propagated, which is the label matrix Y in the case of label propagation.
The propagated feature is computed by repeating the propagation above until convergence:

    \mathrm{PROPAGATE}(F) = \lim_{t \to \infty} F_t    (6.11)

In practice, equation 6.11 is approximated by applying the propagation until the changes are small enough. The original label propagation paper [293] uses the symmetric normalized adjacency matrix D^{-1/2} A D^{-1/2} as the diffusion operator, where A and D are the adjacency matrix and the corresponding diagonal degree matrix. Here, we use the column stochastic matrix D^{-1} A as the diffusion operator to resemble the random walk with restart (RWR) implementation that is more commonly used in the network biology literature [51]. We use a propagation parameter α of 0.1 (equivalent to a restart parameter of 0.9) for label propagation.

C&S implements the idea of using label propagation (with the symmetric normalized adjacency matrix as the diffusion operator) to (1) correct the errors made by the model and (2) smooth out the corrected predictions. Specifically, let E = Y_train − Z be the prediction error matrix, where Z is the matrix of predicted probabilities produced by the model. C&S can be summarized as follows.

1. Propagate the error matrix: Ẽ = PROPAGATE(E).
2. Correct the original prediction with a fixed scale s: Z̃ = Z + s Ẽ.
3. Smooth out the corrected predictions: Z_C&S = PROPAGATE(Z̃).

We use α = 1.0 for the correction step, α = 0.8 for the smoothing step, and a scaling factor of s = 1.5.

6.3.4 Training and hyperparameter details

Table 6.2: Default hyperparameter settings.

    General architecture:
        Hidden dimensions           128
        Number of layers            5
        Activation                  ReLU
        Dropout                     0.2
        Normalization               DiffGroupNorm [296]
    GAT specific:
        Heads                       1
        Attention dropout           0.05
    GIN specific:
        MLP layers                  2
        MLP dimensions              256
        ε                           0.0
    Optimizer (AdamW [158]):
        Learning rate               0.01
        Weight decay                1e-6
        Max epochs                  50,000
        Early stopping patience     500
    Learning rate scheduler (ReduceLROnPlateau):
        Scheduler patience          100
        Scheduler reduce factor     0.5
        Minimum learning rate       1e-5
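The PROPAGATE operator (equations 6.10 and 6.11) and the three C&S steps can be sketched in NumPy as follows. This is a minimal illustration with a toy diffusion operator, not the obnbench implementation; note that with α = 1.0 the correction step reduces to pure diffusion, so the iteration is also capped by max_iter:

```python
import numpy as np

def propagate(F0, M, alpha, tol=1e-6, max_iter=1000):
    """PROPAGATE(F): iterate F_t = alpha * M @ F_{t-1} + (1 - alpha) * F0
    until the update is smaller than tol (equations 6.10-6.11)."""
    F = F0.copy()
    for _ in range(max_iter):
        F_next = alpha * (M @ F) + (1 - alpha) * F0
        if np.abs(F_next - F).max() < tol:
            return F_next
        F = F_next
    return F

def correct_and_smooth(Z, Y_train, M, scale=1.5):
    """C&S post-processing: propagate the training error, correct the
    predictions, then smooth them (alpha values follow the text)."""
    E = Y_train - Z                          # prediction error matrix
    E_prop = propagate(E, M, alpha=1.0)      # (1) propagate the error
    Z_corr = Z + scale * E_prop              # (2) correct with fixed scale s
    return propagate(Z_corr, M, alpha=0.8)   # (3) smooth the corrected predictions

# Toy two-node diffusion operator (doubly stochastic for simplicity)
M = np.full((2, 2), 0.5)
F = propagate(np.array([[1.0], [0.0]]), M, alpha=0.1)   # -> [[0.95], [0.05]]
Z_cs = correct_and_smooth(np.array([[0.8], [0.2]]),
                          np.array([[1.0], [0.0]]), M)  # -> [[0.56], [0.44]]
```

In obnbench, M would be the column stochastic matrix D^{-1}A for label propagation and the symmetric normalized adjacency for C&S, as described above.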
Table 6.3: Hyperparameter search space for GCN and GAT on the primary datasets.

    Number of layers           {1, 2, 3, 4, 5, 6, 7, 8}
    Hidden dimensions          {64, 128, 192, 256}
    Number of heads (GAT)      {1, 2, 3, 4}
    Attention dropout (GAT)    {0.0, 0.05, 0.1}
    Dropout                    {0.1, 0.2, 0.3}
    Normalization              {none, BatchNorm, LayerNorm, PairNorm, DiffGroupNorm}
    Activation                 {ReLU, PReLU, GELU, SELU, ELU}

All hyperparameters are listed in the configuration files for each model in the benchmarking repository (https://github.com/krishnanlab/obnbench/tree/main/conf). We summarize the primary default hyperparameters for the GNNs in Table 6.2. For the logistic regression baselines, we use the SGD optimizer with a constant learning rate of 200, weight decay of 1e-5, and momentum of 0.9.

Fully tuned GNN models  In addition to the baseline GNN experiments, where we used the default settings elaborated above, we also tuned GCN and GAT specifically for the four primary benchmarks presented in the main paper, together with a bag of tricks (BoT). Specifically, we construct node features by concatenating node2vec embeddings (d = 128), OneHotLogDeg encodings (d = 32), and LabelReuse. We also apply correct and smooth (C&S) to the GNN predictions as a post-processing step. To search for the dataset-specific optimal hyperparameters for GCN and GAT on each dataset, we use the Bayesian search optimization strategy provided by Weights & Biases [28], based on the validation APOP scores. The search space is listed in Table 6.3, and the final hyperparameter settings are summarized in Tables 6.4 and 6.5.

Table 6.4: Tuned GCN model on the four primary benchmarks.

                        BioGRID                         HumanNet
                        DisGeNET       GOBP             DisGeNET   GOBP
    Hidden dimension    192            256              192        64
    Number of layers    3              1                4          4
    Activation          SELU           PReLU            GELU       ReLU
    Dropout             0.3            0.3              0.3        0.3
    Normalization       DiffGroupNorm  DiffGroupNorm    –          DiffGroupNorm

Table 6.5: Tuned GAT (v2) model on the four primary benchmarks.
                         BioGRID                    HumanNet
                         DisGeNET   GOBP            DisGeNET       GOBP
    Hidden dimension     128        64              128            64
    Number of layers     4          2               4              8
    Number of heads      1          4               1              1
    Attention dropout    0.05       0.1             0.1            0.1
    Activation           SELU       PReLU           PReLU          GELU
    Dropout              0.2        0.3             0.2            0.1
    Normalization        BatchNorm  DiffGroupNorm   DiffGroupNorm  DiffGroupNorm

6.3.5 Implementation information

The obnbench benchmarking codebase utilizes PyTorch [195] and PyTorch Geometric (PyG) [70] for building deep (graph) neural network models. In addition, we leverage Weights & Biases [28] for tracking training results and Hydra [283] for managing experiment configurations. All benchmarking experiments are run on computational nodes with five CPUs, 45 GB of memory, and a Tesla V100 GPU (32 GB).

6.3.6 Evaluation metric

We use the log2 fold change of the average precision over the prior (APOP) as the primary metric for our benchmarking study. The prior is computed as the ratio between the number of positives and the total number of samples, which corresponds to the probability that a randomly drawn sample is positive. Thus, APOP indicates how much better (> 0) or worse (< 0) a prediction is than random guessing, in terms of log2 fold change. More precisely,

    \mathrm{APOP} = \log_2\left(\frac{\text{Average Precision}}{\text{prior}}\right) = \log_2\left(\frac{\sum_n (R_n - R_{n-1}) P_n}{(\#\ \text{positives}) / (\#\ \text{positives} + \#\ \text{negatives})}\right)    (6.12)

where R_n and P_n are the recall and precision at the nth prediction score threshold. The average precision is related to the area under the precision-recall curve (AUPRC), which has been shown to be more suitable than the area under the receiver operating characteristic curve (AUROC) in the case of class imbalance [226, 238]. Class imbalance is prevalent in our datasets, where each class has only one or two hundred positive examples but thousands of negative examples (Tables 6.1b, C.6).
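Equation 6.12 can be computed directly from a ranked prediction; the following is a minimal NumPy sketch, not the obnb implementation:

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP = sum_n (R_n - R_{n-1}) * P_n over the ranked predictions."""
    order = np.argsort(-y_score, kind="stable")
    y = y_true[order]
    precision = np.cumsum(y) / np.arange(1, len(y) + 1)
    # Each retrieved positive increments recall by 1 / (# positives),
    # so AP is the mean precision at the ranks of the positives.
    return precision[y == 1].sum() / y.sum()

def apop(y_true, y_score):
    """log2 fold change of average precision over the prior (equation 6.12)."""
    prior = y_true.mean()  # probability that a random sample is positive
    return np.log2(average_precision(y_true, y_score) / prior)

y_true = np.array([1, 0, 1, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.1, 0.0, 0.3, 0.2])
score = apop(y_true, scores)
# Both positives ranked on top: AP = 1, prior = 0.25, so APOP = log2(4) = 2
```

A value of 2 means the ranking is four times (2^2) better than random guessing, which is exactly the "two-fold change" interpretation given above.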
Moreover, AUROC is more tolerant of errors made in the top predictions and generally only requires that the global distribution of the predictions be consistent with the true labels. AUPRC and AP, on the other hand, penalize errors in the top predictions more heavily.

Practical relevance  The OBNB benchmarks cover diverse gene prioritization tasks, such as pinpointing relevant genes for a particular disease. This formulation can also be straightforwardly extended to drug recommendation or drug repurposing tasks by considering known drug targets (genes) as positive examples. For such biomedical applications, in practice, only a few of the top predictions made by a predictive method can be experimentally verified due to the high experimental costs. Thus, APOP's emphasis on top predictions aligns well with this practical constraint and encourages methods to make highly accurate top predictions.

6.3.7 Corrected homophily ratio

We propose a novel measure of a node classification task's homophily, called the corrected homophily ratio. This measure extends the traditional definition of homophily [197] to the more general setting of multilabel classification, with an emphasis on class imbalance. Let G = (V, E, w) be a weighted graph (an unweighted graph can be treated as a weighted graph with identical edge weights), with node set V, edge set E, and edge weight function w : V × V → R. Let y_i : V → {0, 1} be the label function for the ith class (or label), for i ∈ {1, ..., p}. We define the set of labeled nodes as those that are part of at least one label, i.e., V_labeled = {v ∈ V | max_{i ∈ 1,...,p} y_i(v) = 1}.
Furthermore, we define the positively and negatively labeled nodes for the ith class as V^{(i+)}_labeled = {v ∈ V | y_i(v) = 1} and V^{(i−)}_labeled = {v ∈ V | y_i(v) = 0}, respectively.

Definition 1 (Node homophily ratio)  The node homophily ratio of the ith class for node v is defined as the fraction of its neighborhood that is positively labeled in the ith class:

    h_i(v) = \frac{|\{u \in \mathcal{N}(v) \mid y_i(u) = 1\}|}{|\mathcal{N}(v)|}    (6.13)

Definition 2 (Positive and negative homophily ratio)  The positive and negative homophily ratios of the ith class (or label) are defined as the fraction of a node's neighborhood that is positively labeled for the ith class, averaged over the positively and negatively labeled nodes, respectively:

    h_{i+} = \frac{1}{|V^{(i+)}_{\text{labeled}}|} \sum_{v \in V^{(i+)}_{\text{labeled}}} h_i(v), \qquad h_{i-} = \frac{1}{|V^{(i-)}_{\text{labeled}}|} \sum_{v \in V^{(i-)}_{\text{labeled}}} h_i(v)    (6.14)

We refer to the positive homophily ratio as the homophily ratio for short. Note that Definition 1 differs slightly from the typical definition of node homophily used in previous works targeting multiclass classification settings [197]. Specifically, (1) we count only positive nodes in the neighborhood, instead of nodes having the same class as the central node, since we are dealing with (multilabel) binary classification; this will also come in handy later for defining the corrected homophily ratio. And (2) we average the node homophily ratios only over the positive node sets, since our datasets have notable class imbalance, where most nodes are negatively labeled. This way, the homophily ratio is less skewed toward the majority of negatively labeled nodes.
Definition 3 (Corrected homophily ratio)  The corrected homophily ratio of the ith class is defined as the log2 fold change of node homophily between positively and negatively labeled nodes for class i:

    \tilde{h}_i = \log_2\left(\frac{h_{i+}}{h_{i-}}\right)    (6.15)

The corrected homophily ratio answers the following question: how much more likely are nodes labeled in class i to be connected with other nodes labeled in class i, compared to nodes not labeled in class i? A positive value of the corrected homophily ratio indicates that positive nodes in class i have a higher likelihood of interacting with other positive nodes of class i than negative nodes do.

6.4 Results and discussions

6.4.1 Traditional ML methods perform comparably to GNNs overall

Table 6.6 shows the overall performance of the selected models on the four primary benchmarking datasets (full baseline results in Table C.2). Surprisingly, even after a rather extensive hyperparameter search and GNN architecture tuning (Chapter 6.3.4), logistic regression using graph-derived features still performs comparably to GNNs. For example, node2vec achieves the best performance for GOBP prediction using both BioGRID and HumanNet. Similar results are obtained using the other networks (Table C.2) and the more standard random splitting strategy (Table C.4). The competitive performance of traditional ML methods relative to GNNs is in stark contrast with results reported in recent benchmarking studies [232, 64, 99], highlighting the unique characteristics of biological networks and the challenges in modeling them. In addition, we demonstrate that the proposed study-bias holdout splitting provides more stringent evaluations than random splitting (Table C.4).

Table 6.6: Main performance summary on the four primary OBNB benchmarking datasets. Performance is evaluated in APOP ↑ aggregated over five random seeds.
Bold indicates the best-performing method within each method class (top group: traditional ML; bottom group: GNN). Green indicates the best-performing method across both classes. See Table C.3 for extended baseline performance references. † recommended BoT; ‡ recommended BoT with fully tuned GNNs (see Section 6.3.4 for tuning details).

                                              BioGRID                            HumanNet
    Model      Features          C&S?  DisGeNET       GOBP           DisGeNET       GOBP
    LabelProp  –                 ✗     3.059 ± 0.000  1.885 ± 0.000  0.931 ± 0.000  3.806 ± 0.000
    LogReg     Adj               ✗     3.053 ± 0.000  2.528 ± 0.000  0.743 ± 0.000  3.964 ± 0.006
    LogReg     N2V               ✗     2.433 ± 0.029  2.571 ± 0.015  0.836 ± 0.029  4.036 ± 0.019
    LogReg     LapEigMap         ✗     2.301 ± 0.000  2.149 ± 0.000  0.864 ± 0.002  3.778 ± 0.001
    GCN        LogDeg            ✗     2.452 ± 0.107  2.022 ± 0.100  0.773 ± 0.035  3.524 ± 0.061
    GCN        LogDeg            ✓     2.777 ± 0.031  2.201 ± 0.035  1.026 ± 0.013  3.743 ± 0.015
    GCN†       N2V+LogDeg+Label  ✓     3.053 ± 0.078  2.411 ± 0.044  1.014 ± 0.020  3.921 ± 0.045
    GCN‡       N2V+LogDeg+Label  ✓     3.116 ± 0.017  2.572 ± 0.066  1.012 ± 0.040  3.812 ± 0.071
    GAT        LogDeg            ✗     2.547 ± 0.207  1.592 ± 0.408  0.552 ± 0.111  3.571 ± 0.159
    GAT        LogDeg            ✓     3.007 ± 0.037  2.227 ± 0.024  1.002 ± 0.018  3.876 ± 0.053
    GAT†       N2V+LogDeg+Label  ✓     3.100 ± 0.031  2.624 ± 0.070  1.037 ± 0.036  3.908 ± 0.086
    GAT‡       N2V+LogDeg+Label  ✓     3.065 ± 0.021  2.562 ± 0.070  1.063 ± 0.023  3.963 ± 0.082

Carefully designed BoT significantly boosts GNN performance  While plain GNNs without BoT yield relatively unsatisfactory performance, a carefully crafted BoT brings significant improvements to GNNs in our benchmark. First, a systematic benchmark of GCN and GAT with 15 different feature constructions reveals that using node2vec as node features consistently achieves top performance across datasets (Figure C.1). Second, reusing training labels as features also elevates overall performance, most notably for GAT (Figure C.2). Third, C&S post-processing universally improves performance for both GNN and logistic regression methods, with, on average, a 0.27 improvement in test APOP scores (Figures C.2, C.3).
In light of these observations, we recommend an optimized BoT as follows: (1) use a combination of LogDeg encodings, node2vec embeddings, and training labels as input node features; (2) apply C&S post-processing to further refine the predictions. Our results indicate that the recommended BoT improves GNN performance by 0.36 test APOP scores on average across datasets.

Figure 6.3: Correlations between methods across tasks in BioGRID-DisGeNET.

                        GAT   GCN   LabelProp  LogReg+Adj  LogReg+LapEigMap  LogReg+Node2vec
    GAT                 1.00  0.61  0.58       0.30        0.41              0.52
    GCN                 0.61  1.00  0.56       0.51        0.54              0.65
    LabelProp           0.58  0.56  1.00       0.32        0.55              0.55
    LogReg+Adj          0.30  0.51  0.32       1.00        0.43              0.54
    LogReg+LapEigMap    0.41  0.54  0.55       0.43        1.00              0.65
    LogReg+Node2vec     0.52  0.65  0.55       0.54        0.65              1.00

6.4.2 Different models have their own strengths

While the overall rankings of the different methods indicate their relative gene classification capabilities, no single method achieves the best results across all tasks. We demonstrate this in Figure 6.3, which shows that the performance of most methods does not correlate strongly across the different tasks of the BioGRID-DisGeNET dataset. For example, while LogReg+Adj performs better than GAT+LogDeg overall on the BioGRID-DisGeNET dataset, GAT+LogDeg achieves significantly better predictions (t-test p-value < 0.01) for a handful of tasks (Table 6.7), such as iris disorder and ocular vascular disorder.

Focusing on the optimal model for one or a few tasks of interest is essential. In practice, experimental biologists often come with only one or a few gene sets and want to either obtain new related genes or reprioritize the genes based on their relevance to the whole gene set [164]. Therefore, understanding the characteristics of the tasks for which a particular model

Table 6.7: Prediction performance discrepancy across tasks. Task-specific model prediction performance difference on the BioGRID-DisGeNET dataset between GAT and LogReg+Adj.
    Task ID          Task Name                                                 GAT    LogReg+Adj  Difference
    MONDO:0002289    Iris disorder                                           2.160         0.149       2.011
    MONDO:0001703    Color-vision disease                                    2.177         0.230       1.946
    MONDO:0001926    Ureteral disorder                                       1.543        -0.336       1.879
    MONDO:0005552    Ocular vascular disease                                 1.507        -0.175       1.682
    MONDO:0018470    Renal agenesis                                          1.273        -0.347       1.620
    . . .
    MONDO:0020018    Cranial malformation                                    0.699         4.487      -3.788
    MONDO:0019743    Nephropathy secondary to a storage or other
                     metabolic disease                                       0.773         4.876      -4.103
    MONDO:0024757    Cardiovascular neoplasm                                 0.316         4.443      -4.127
    MONDO:0045011    Keratinization disease                                  0.264         4.619      -4.355
    MONDO:0015160    Multiple congenital anomalies/dysmorphic
                     syndrome-variable intellectual disability syndrome      0.280         4.828      -4.549

works well over others is the key to making better architectural decisions for a new task of interest and, ultimately, designing new specialized GNNs for network biology.

6.4.3 Homophily is a driving factor for good predictions

Homophily describes the tendency of a node's neighborhood to have labels similar to the node's own. Such effects are prevalent in many real-world graphs, such as citation networks, and have recently been studied extensively in the GNN community as a way to understand what information GNNs can or cannot capture effectively [159, 162]. Intuitively, homophily aligns well with the guilt-by-association principle in network biology, which similarly states that genes that interact with each other are likely functionally related. Despite this clear connection, existing homophily measures, such as the homophily ratio [162], do not straightforwardly apply to the OBNB datasets for two reasons.

1. Most established work on homophily considers multiclass classification tasks, where each node has exactly one class label, in contrast to the multilabel classification tasks in OBNB.

2. Label information in OBNB datasets is extremely sparse: on average, there are only hundreds of positive genes per class out of roughly 20K genes in total.
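To make the second point concrete, the per-class (node-averaged) homophily ratio can be computed as below. Even a class whose positives are all interconnected stays capped near its neighborhood prevalence when labels are sparse. This is an illustrative sketch, not the OBNB implementation:

```python
import numpy as np

def per_class_homophily(A, Y):
    """For each class c, average over the nodes labeled with c of the fraction
    of their neighbors that are also labeled with c. Y is a binary multilabel
    indicator matrix of shape (n_nodes, n_classes)."""
    deg = np.maximum(A.sum(axis=1), 1)
    ratios = []
    for c in range(Y.shape[1]):
        pos = Y[:, c] == 1
        ratios.append(((A[pos] @ Y[:, c]) / deg[pos]).mean())
    return np.array(ratios)

# Illustration: 5 positives out of 20 nodes in a complete graph. All
# positive-positive edges are present, yet the ratio is only 4/19 ~ 0.21,
# which the usual cutoff would call "heterophilic".
A = np.ones((20, 20)) - np.eye(20)
Y = np.zeros((20, 1))
Y[:5, 0] = 1
print(per_class_homophily(A, Y))  # ~0.21
```

With hundreds of positives among ~20K genes, the prevalence cap is far more severe than in this toy example, which is why the raw ratio bottoms out around 0.2 in the benchmark.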
One attempt to resolve the first issue is to compute the average homophily ratio for each class independently. However, due to the label scarcity, the resulting metric cannot be readily interpreted. In particular, the highest homophily ratio observed in our main benchmarking study is about 0.2 (Figure 6.4). Datasets with this value are typically categorized as heterophilic (not homophilic), contradicting the guilt-by-association principle. To address this inconsistency, we propose a corrected version of the homophily ratio that takes label scarcity into account (Chapter 6.3.7). The main idea is to measure homophily as a relative quantity: how much more likely are nodes labeled in class 𝑖 to be connected with other nodes labeled in class 𝑖 than with nodes not in class 𝑖? This measure is directly interpretable as the log fold change of the homophily between positive and negative examples.

Corrected homophily ratio characterizes the prediction performance of graph ML methods

As shown in the bottom two panels of Figure 6.4, the corrected homophily ratio correlates well with prediction performance across different networks and tasks. This positive correlation unveils the fundamental principle that graph ML methods rely on: the underlying graph must provide sufficient contrast between the neighborhoods of the positive and negative examples. Conversely, the top two panels of Figure 6.4 indicate that the standard homophily ratio fails to adequately account for the performance nuances in low-homophily tasks. As a result, OBNB opens up new research opportunities in understanding the effects of homophily in GNNs by introducing a new type of homophily that differs significantly from those traditionally studied [162].
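The relative measure described above can be sketched as follows, under one plausible formalization (log2 fold change of the average positive-neighbor fraction between positive and negative nodes); the exact definition used in Chapter 6.3.7 may differ in its details:

```python
import numpy as np

def corrected_homophily(A, y, eps=1e-12):
    """Log2 fold change of the average fraction of class-positive neighbors
    between positive and negative nodes, for one class given as a binary
    vector y of length n_nodes."""
    deg = np.maximum(A.sum(axis=1), 1)
    frac = (A @ y) / deg            # per-node fraction of positive neighbors
    h_pos = frac[y == 1].mean()     # neighborhood positive-rate of positives
    h_neg = frac[y == 0].mean()     # neighborhood positive-rate of negatives
    return np.log2((h_pos + eps) / (h_neg + eps))

y = np.zeros(20)
y[:5] = 1

# Complete graph: no contrast between positive and negative neighborhoods,
# so the corrected ratio sits near (slightly below) zero.
A_complete = np.ones((20, 20)) - np.eye(20)
print(corrected_homophily(A_complete, y))

# Two disjoint cliques (positives vs. negatives): strong contrast, large ratio.
A_modular = np.zeros((20, 20))
A_modular[:5, :5] = 1
A_modular[5:, 5:] = 1
np.fill_diagonal(A_modular, 0)
print(corrected_homophily(A_modular, y))
```

Unlike the raw ratio, this quantity is unaffected by the prevalence cap: a sparse but well-localized gene set can still score highly because its positives are contrasted against the negative background rather than against an absolute cutoff.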
Graph-based augmentation tricks offer the most advantages on intermediate-homophily tasks

We next investigate the relationship between the corrected homophily ratio and the performance difference between a pair of methods by asking: is one method systematically better than another on a particular range of homophily? The first and second rows of Figure C.4 indicate that GNNs augmented with different graph-based tricks, such as C&S and node2vec features, provide the most apparent advantage at intermediate homophily (corrected homophily ratio of ∼2.5). However, as the homophily increases further, the performance differences diminish, suggesting that all methods perform comparably well once the graph provides clearer clues about the labels. Similar trends are observed for the comparison between GAT+BoT and the baseline label propagation and logistic regression methods (Figure C.4, third row), where GAT+BoT most significantly outperforms the baselines at intermediate homophily.

Figure 6.4: Relationship between the (corrected) homophily ratios of gene classification tasks and the prediction methods' performances. Panels show test APOP versus the standard homophily ratio (top) and the corrected homophily ratio (bottom) for LabelProp and GAT across the BioGRID/HumanNet × DisGeNET/GOBP datasets.

6.4.4 Challenges for the OBNB benchmarks and potential future directions

Our systematic benchmarking study paves the way for numerous subsequent explorations. Below, we outline the primary challenges associated with OBNB and suggest potential areas for future research.

• Challenge 1. Lack of canonical node features Traditional network biology studies rely solely on biological network structures to gain insights [19, 22, 126]. Meanwhile, rich node features, which are present in many existing graph benchmarks, are crucial for the success of GNNs [149]; their absence in biological networks thus poses a challenge for learning with GNNs.
An exciting and promising future direction for obtaining meaningful node features is to leverage the sequential or structural information of the gene product (e.g., protein) using large-scale pre-trained biological language models such as ESM-2 [147].

• Challenge 2. Dense and weighted graphs Many gene interaction networks are dense and weighted by construction, such as coexpression networks [113] or integrated functional interaction networks [246]. In more extreme cases, the weighted networks can be fully connected [80]. The networks used in the OBNB benchmark are orders of magnitude denser than citation networks [232] (Table 6.1a). Thus, scalably and effectively learning from dense, weighted gene interaction networks is an area of research to be further explored in the future [150]. One potential solution would be to first sparsify the original dense network before training GNN models on it, either by straightforwardly applying an edge-weight cutoff threshold or by using more intricate methods such as spectral sparsification [239].

• Challenge 3. Scarce labeled data Curated biological annotations are scarce, posing the challenge of effectively training powerful and expressive models with limited labeled examples. Furthermore, a particular gene can be labeled with more than one biological annotation (multi-label setting), which is considerably more complex than the multi-class settings in popular benchmarking graph datasets such as Cora and CiteSeer [232].
The data scarcity issue naturally invites the use of self-supervised graph learning techniques [280], such as DGI [259], and knowledge transfer via pre-training, such as TxGNN [101]. However, these methods still face the challenge of missing node features (Challenge 1).

• Challenge 4. Low corrected homophily ratio tasks Our results in Figure 6.4 show that tasks with low corrected homophily ratios tend to be more challenging to predict, indicating that the currently tested methods rely primarily on local information in the underlying graph. This naturally opens up opportunities for designing models that (1) better capture long-range dependencies [66] and (2) exploit higher-order structural information [178].

6.5 Conclusion

We have developed obnb, an open-source Python package for rapidly setting up ML-ready benchmarking graph datasets using diverse biomedical networks and gene annotations from publicly available sources. obnb takes care of tedious data (pre-)processing steps so that network biologists can easily set up particular datasets with their desired settings, and graph ML researchers can directly use the ML-ready datasets for model development. We have established a comprehensive set of baseline performances on OBNB using a wide range of graph ML methods for future reference and pointed out potential improvements that could further enhance performance. Together, OBNB will help accelerate the development of advanced graph ML methods in network biology toward furthering our understanding of the complex genetic basis of human traits and diseases.

CHAPTER 7

CONCLUSION

In this chapter, we summarize the primary results from this dissertation, reflect on current limitations, and highlight promising future directions.

7.1 Summary

This dissertation aimed to further our understanding of human genetics using genome-scale gene interaction networks by identifying genes associated with different biological functions, diseases, and traits.
The accurate and complete mapping of these associations is pivotal for developing effective treatment strategies for complex human diseases. My dissertation projects have contributed significantly toward this goal by developing (1) computational frameworks and software for network-based gene classification, (2) efficient and effective algorithms capable of embedding dense genome-scale networks to enable fast and accurate gene classification in low-dimensional spaces, and (3) a biologically context-informed network embedding approach that highlights the context-specificity of complex diseases. These results have demonstrated the immense potential of graph deep learning to shed light on human genetics with remarkable accuracy and context-specificity. The open-source software, data, methods, and algorithms developed in this dissertation lay the foundation for deeper insights into human genetics and can ultimately facilitate the design of effective therapeutic strategies for complex diseases.

7.2 Reflections and limitations

Despite the tremendous progress made in this dissertation toward computational systems that leverage molecular interaction networks to gain insights into human biology, some key limitations remain to be explored. From a computational perspective, challenges persist in (1) analyzing gene sets exhibiting unique characteristics and (2) adapting models to accommodate dynamic changes in the underlying network. On the biological side, the current definition of context-specificity is rather simplistic, potentially obstructing our exploration of the subtle interplay between diseases and their biological contexts. We discuss these limitations in greater detail below.
Lack of inductive embedding capabilities In GenePlexus (Chapter 2), we have demonstrated immense opportunities for using network-based features to classify genes, which led to our extended efforts to develop and deploy software and web services, GenePlexusWeb (Chapter 2.5), to help biomedical researchers explore potentially relevant genes given a query gene set of interest. This online service necessitates highly efficient model training and network feature extraction. PecanPy (Chapter 3) has contributed toward this goal by providing a fast and efficient network embedding implementation. Nevertheless, critical challenges remain in efficiently adjusting existing network embeddings to minor changes in the network. In particular, we currently need to regenerate the network embeddings and subsequently retrain the gene classification models after even the slightest modification of the input gene interaction network, such as adding or removing edges, whether to incorporate newly released data or to account for biological context-specificity. Addressing this limitation would significantly speed up online training and inference. Inductive representation learning [89] aims to tackle this by generating network embeddings inductively from sampled neighborhoods. However, its straightforward application is hindered by the lack of canonical features for genes (Chapter 6.4.4).

Low homophily As we first observed in GenePlexus and later formally analyzed in OBNB via the notion of the corrected homophily ratio (Chapter 6.3.7), network-based approaches struggle to accurately predict genes that are "scattered over the network." More precisely, these are the gene sets whose positive and negative examples do not show enough contrast in their neighborhood compositions.
Interestingly, such low-homophily gene sets are typically related to rare diseases and complex traits, while functional gene sets are typically more localized in the underlying network (Chapter 2.3.2). One explanation is that molecular interaction networks are primarily intended, through curation or construction, to reflect biological relationships between genes and proteins as they pertain to "normal" cellular functions, whereas complex diseases and traits often arise from (partial) perturbations of several biological pathways and modules spanning the genome-scale network [19, 72]. Addressing this computational challenge will greatly enhance our understanding of complex diseases by providing an accurate and complete view of their associated genes. Some recent works have explored solutions to the low-homophily problem in general graph learning [162, 149], but their effectiveness in the biological network domain remains to be established.

Simplistic context-specificity definition Biological context-specificity, such as tissues and cell types, is vital for studying complex human diseases [80, 93, 125]. In CONE (Chapter 5), we stepped in this direction by defining and using biological contextual information as gene sets, such as tissue-specific genes. The core idea was that a pair of genes interact under a particular biological context if (1) they interact in the context-naive setting and (2) both genes are expressed in that context. Despite its efficacy in studying context-specific gene interactions, as demonstrated by CONE and other related works [228, 140], this simplistic approach disregards the intricacies of context-specific gene and protein interactions in real cellular systems, limiting our ability to fully investigate the molecular mechanisms underlying complex diseases in their respective biological contexts.
For example, interactions between two proteins could depend on (1) the presence of another protein, such as scaffold proteins [59] and small ubiquitin-like modifier (SUMO) proteins [92, 169], (2) their subcellular localizations [105, 288], and (3) post-translational modifications [45, 143]. Several works have presented context-specific gene interaction networks constructed computationally [80, 175, 43], experimentally [215], or via hybrid approaches [211]. However, how to efficiently learn contextualized embeddings on them, with versatile applicability to any "context-naive" backbone network as CONE provides, needs to be further explored.

7.3 Future directions

We plan to continuously enhance the GenePlexus tools and services (Chapter 2) to better assist experimental biologists in making insightful discoveries. First, we aim to expand the tutorials and resources for GenePlexus, ensuring it is straightforward for experimental biologists to utilize. These extended efforts will include organizing workshops to present GenePlexus tutorials and showcase case studies. Second, we intend to improve the user experience by enabling users to upload their own lists of gene interactions, which can be superimposed on the existing genome-scale interaction networks. This feature will allow biologists with specialized knowledge of specific gene interactions in particular biological contexts to use this information to make context-specific predictions. We also plan to incorporate a feedback mechanism where users can mark predictions as correct or incorrect. This feedback can serve as crowd-sourced data to further enhance the accuracy of GenePlexus predictions. In addition, the advancements made in this dissertation are not limited to gene classification. In the following, we discuss promising future directions that build on this dissertation by integrating cutting-edge foundation models and extending to critical applications in translational biomedical research.
Integrating structural and sequential information through foundation models The recent surge of foundation models for proteins and genomics represents a groundbreaking shift in how genomics data can be analyzed and understood [115, 147, 185]. These models have shown remarkable success in extracting meaningful representations from proteins and genomic sequences. Integrating them with gene interaction networks will enable many exciting research opportunities. First, it allows one to train inductive models by using the extracted features as raw gene features, paving the way for dynamic updates to network embeddings. Second, using genomic foundation model representations presents the possibility of mitigating the low-homophily challenge by providing complementary information about genes from a modality other than gene interactions. Consequently, embracing this approach could advance our understanding of complex diseases and traits.

Understanding transcriptional and phenotypic effects of genes This dissertation primarily focused on unveiling genes' functional and disease associations. Yet, the transcriptional and phenotypic effects of the dysfunction or dysregulation of every gene (and their combinations) represent a critical, and even more significant, area of study. Recent works, such as GEARS [222], have pioneered this direction. In GEARS, the integration of a specially constructed functional gene interaction network is instrumental to its success in predicting the transcriptional outcomes of combinatorial gene perturbations that were not encountered during its training phase. Future research leveraging biologically contextualized gene interaction network embeddings, such as CONE (Chapter 5), holds immense promise for deepening our understanding of transcriptional dynamics under more physiologically relevant conditions, extending beyond the confines of cell line studies [222].
Such advancements have profound implications for developing personalized therapeutic strategies against complex diseases.

Annotating functions and diseases for long non-coding RNAs Aside from the 20K protein-coding genes in the human genome that we primarily focused on in this dissertation, there has been growing interest in studying long non-coding RNAs (lncRNAs). Accumulating evidence indicates that many lncRNAs play pivotal functional roles, with their dysregulation linked to numerous rare diseases [8, 179]. Yet, our current knowledge of lncRNAs remains much scarcer than that of protein-coding genes – a gap that is further complicated by their incredibly context-specific manifestations and functions [170, 11]. Recent advancements in network biology, particularly through utilizing molecular interaction networks that integrate lncRNA interactions, offer promising solutions [208, 184]. Similar to proteins, lncRNAs follow the guilt-by-association principle: lncRNAs are likely to co-function with their network interaction partners [243]. Therefore, the GDL methods developed in this work, especially the biologically context-specific ones, pave the way toward demystifying lncRNAs and expanding therapeutic options for rare diseases.

Aiding drug repurposing and de novo design for treating complex diseases The precise mapping of gene functions and their associations with diseases and traits marks a pivotal stride toward a complete picture of human biology, presenting opportunities for effective and personalized treatments of human diseases. Several recent studies have highlighted the efficacy of network biology in elucidating disease mechanisms [19, 72] and proposing viable therapeutic strategies [95, 13, 225].
The advancements presented in this dissertation lay the groundwork for future research to develop predictive models using GDL on molecular interaction networks to suggest treatments tailored to individuals, which can be seen as a form of biological context. Additionally, integrating these approaches with genomics [147, 185] and molecular [221, 223] foundation models, as discussed above, will further open exciting avenues for repurposing existing drugs [234] and designing de novo drugs [244, 176], offering promise for tackling rare and complex diseases that currently lack viable treatment options.

BIBLIOGRAPHY

[1] Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alexander A Alemi. Watch your step: Learning node embeddings via graph attention. Advances in neural information processing systems, 31, 2018.

[2] Bijaya Adhikari, Yao Zhang, Naren Ramakrishnan, and B Aditya Prakash. Sub2vec: Feature learning for subgraphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 170–182. Springer, 2018.

[3] Elmar Aigner, Alexandra Feldman, and Christian Datz. Obesity as an emerging risk factor for iron deficiency. Nutrients, 6(9):3587–3600, 2014.

[4] Madhu Alagiri, Sherman Chottiner, Vicki Ratner, Debra Slade, and Philip M Hanno. Interstitial cystitis: unexplained associations with other chronic disease and pain syndromes. Urology, 49(5):52–57, 1997.

[5] Gregorio Alanis-Lobato, Miguel A Andrade-Navarro, and Martin H Schaefer. Hippie v2.0: enhancing meaningfulness and reliability of protein–protein interaction networks. Nucleic acids research, page gkw985, 2016.

[6] Réka Albert. Scale-free networks in cell biology. Journal of cell science, 118(21):4947–4957, 2005.

[7] Mehdi Ali, Max Berrendorf, Charles Tapley Hoyt, Laurent Vermue, Sahand Sharifzadeh, Volker Tresp, and Jens Lehmann. Pykeen 1.0: A python library for training and evaluating knowledge graph embeddings. J. Mach. Learn. Res., 22(82):1–6, 2021.
[8] Shabana A Ali, Mandy J Peffers, Michelle J Ormseth, Igor Jurisica, and Mohit Kapoor. The non-coding rna interactome in joint health and disease. Nature Reviews Rheumatology, 17(11):692–705, 2021.

[9] Motasem Alkhayyat, Mohannad Abou Saleh, Mehnaj Kaur Grewal, Mohammad Abureesh, Emad Mansoor, C Roberto Simons-Linares, Abby Abelson, and Prabhleen Chahal. Pancreatic manifestations in rheumatoid arthritis: a national population-based study. Rheumatology, 60(5):2366–2374, 2021.

[10] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, 1990.

[11] Paulo Amaral, Silvia Carbonell-Sala, Francisco M De La Vega, Tiago Faial, Adam Frankish, Thomas Gingeras, Roderic Guigo, Jennifer L Harrow, Artemis G Hatzigeorgiou, Rory Johnson, et al. The status of the human gene catalogue. Nature, 622(7981):41–47, 2023.

[12] Lena Antoni, Sabine Nuding, Jan Wehkamp, and Eduard F Stange. Intestinal barrier in inflammatory bowel disease. World journal of gastroenterology: WJG, 20(5):1165, 2014.

[13] DK Arrell and Andre Terzic. Network systems biology for drug discovery. Clinical Pharmacology & Therapeutics, 88(1):120–125, 2010.

[14] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25–29, 2000.

[15] Sezin Kircali Ata, Le Ou-Yang, Yuan Fang, Chee-Keong Kwoh, Min Wu, and Xiao-Li Li. Integrating node embeddings and biological annotations for genes to predict disease-gene associations. BMC systems biology, 12(9):31–44, 2018.

[16] Sezin Kircali Ata, Min Wu, Yuan Fang, Le Ou-Yang, Chee Keong Kwoh, and Xiao-Li Li. Recent advances in network-based methods for disease gene prediction. Briefings in bioinformatics, 22(4):bbaa303, 2021.
[17] Awais Athar, Anja Füllgrabe, Nancy George, Haider Iqbal, Laura Huerta, Ahmed Ali, Catherine Snow, Nuno A Fonseca, Robert Petryszak, Irene Papatheodorou, et al. Arrayexpress update–from bulk to single-cell expression data. Nucleic acids research, 47(D1):D711–D715, 2019.

[18] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[19] Albert-László Barabási, Natali Gulbahce, and Joseph Loscalzo. Network medicine: a network-based approach to human disease. Nature reviews genetics, 12(1):56–68, 2011.

[20] Albert-László Barabási, Natali Gulbahce, and Joseph Loscalzo. Network medicine: a network-based approach to human disease. Nature reviews genetics, 12(1):56–68, 2011.

[21] Albert-László Barabási and Zoltán N. Oltvai. Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 5(2):101–113, February 2004.

[22] Albert-Laszlo Barabasi and Zoltan N Oltvai. Network biology: understanding the cell's functional organization. Nature reviews genetics, 5(2):101–113, 2004.

[23] Jonathan BL Bard and Seung Y Rhee. Ontologies in biology: design, applications and future challenges. Nature reviews genetics, 5(3):213–222, 2004.

[24] Zafer Barutcuoglu, Robert E Schapire, and Olga G Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830–836, 2006.

[25] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.

[26] Yoav Benjamini, Abba M Krieger, and Daniel Yekutieli. Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 93(3):491–507, 2006.

[27] Mary Lauren Benton, Abin Abraham, Abigail L LaBella, Patrick Abbot, Antonis Rokas, and John A Capra. The influence of evolutionary history on human health and disease. Nature Reviews Genetics, 22(5):269–283, 2021.

[28] Lukas Biewald. Experiment tracking with weights and biases, 2020.
Software available from wandb.com.

[29] Jonathan JH Bray, Rosie Freer, Alex Pitcher, and Rajesh Kharbanda. Family screening for bicuspid aortic valve and aortic dilatation: a meta-analysis. European Heart Journal, page ehad320, 2023.

[30] Anna Breit, Simon Ott, Asan Agibetov, and Matthias Samwald. Openbiolink: a benchmarking framework for large-scale biomedical link prediction. Bioinformatics, 36(13):4097–4098, 2020.

[31] Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017.

[32] Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph attention networks? In International Conference on Learning Representations, 2021.

[33] Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.

[34] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, July 2017.

[35] Garth R Brown, Vichet Hem, Kenneth S Katz, Michael Ovetsky, Craig Wallin, Olga Ermolaeva, Igor Tolstoy, Tatiana Tatusova, Kim D Pruitt, Donna R Maglott, et al. Gene: a gene-centered information resource at ncbi. Nucleic acids research, 43(D1):D36–D42, 2015.

[36] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[37] Carol J Bult, Judith A Blake, Cynthia L Smith, James A Kadin, and Joel E Richardson. Mouse genome database (mgd) 2019. Nucleic acids research, 47(D1):D801–D806, 2019.

[38] Annalisa Buniello, Jacqueline A L MacArthur, Maria Cerezo, Laura W Harris, James Hayhurst, Cinzia Malangone, Aoife McMahon, Joannella Morales, Edward Mountjoy, Elliot Sollis, et al.
The nhgri-ebi gwas catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic acids research, 47(D1):D1005–D1012, 2019.

[39] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.

[40] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM international on conference on information and knowledge management, pages 891–900, 2015.

[41] Luca Cappelletti, Tommaso Fontana, Elena Casiraghi, Vida Ravanmehr, Tiffany J. Callahan, Carlos Cano, Marcin P. Joachimiak, Christopher J. Mungall, Peter N. Robinson, Justin Reese, and Giorgio Valentini. GRAPE for fast and scalable graph processing and random-walk-based embedding. Nature Computational Science, 3(6):552–568, June 2023.

[42] Giovanni Casella, Gian Eugenio Tontini, Gabrio Bassotti, Luca Pastorelli, Vincenzo Villanacci, Luisa Spina, Vittorio Baldini, and Maurizio Vecchi. Neurological disorders and inflammatory bowel diseases. World Journal of Gastroenterology: WJG, 20(27):8764, 2014.

[43] Junha Cha, Jiwon Yu, Jae-Won Cho, Martin Hemberg, and Insuk Lee. schumannet: a single-cell network analysis platform for the study of cell-type specificity of disease genes. Nucleic acids research, 51(2):e8–e8, 2023.

[44] Hammad S. Chaudhry and Madhukar Reddy Kasarla. Microcytic Hypochromic Anemia. StatPearls Publishing, Treasure Island (FL), 2022.

[45] Nana Cheng, Mingzhu Liu, Wanting Li, BingYue Sun, Dandan Liu, Guoqing Wang, Jingwei Shi, and Lisha Li. Protein post-translational modification in sars-cov-2 and host interaction. Frontiers in Immunology, 13:1068449, 2023.

[46] Hyunghoon Cho, Bonnie Berger, and Jian Peng. Compact integration of multi-network topology for functional analysis of genes. Cell systems, 3(6):540–548, 2016.
[47] Sarvenaz Choobdar, Mehmet E Ahsen, Jake Crawford, Mattia Tomasoni, Tao Fang, David Lamparter, Junyuan Lin, Benjamin Hescott, Xiaozhe Hu, Johnathan Mercer, et al. Assessment of network module identification across complex diseases. Nature methods, 16(9):843–852, 2019.

[48] Doreen E Chung, Lesley K Carr, Linda Sugar, Michelle Hladunewich, and Leslie A Deane. Xanthogranulomatous cystitis associated with inflammatory bowel disease. Canadian Urological Association Journal, 4(4):E91, 2010.

[49] Gene Ontology Consortium. The gene ontology resource: 20 years and still going strong. Nucleic acids research, 47(D1):D330–D338, 2019.

[50] Tabula Sapiens Consortium*, Robert C Jones, Jim Karkanias, Mark A Krasnow, Angela Oliveira Pisco, Stephen R Quake, Julia Salzman, Nir Yosef, Bryan Bulthaup, Phillip Brown, et al. The tabula sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science, 376(6594):eabl4896, 2022.

[51] Lenore Cowen, Trey Ideker, Benjamin J Raphael, and Roded Sharan. Network propagation: a universal amplifier of genetic associations. Nature Reviews Genetics, 18(9):551–562, 2017.

[52] Hejie Cui, Zijie Lu, Pan Li, and Carl Yang. On positional and structural node features for graph neural networks on non-attributed graphs. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 3898–3902, 2022.

[53] Peng Cui, Xiao Wang, Jian Pei, and Wenwu Zhu. A survey on network embedding. IEEE transactions on knowledge and data engineering, 31(5):833–852, 2018.

[54] Patrick Danaher, Pei Wang, and Daniela M Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(2):373–397, 2014.

[55] Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pages 233–240, 2006.
[56] Andrew Davison and Morgane Austern. Asymptotics of network embeddings learned via subsampling. Journal of Machine Learning Research, 24(138):1–120, 2023. [57] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016. [58] Christophe Dessimoz, Nives Škunca, and Paul D Thomas. CAFA and the open world of protein function predictions. Trends in Genetics, 29(11):609–610, 2013. [59] DN Dhanasekaran, K Kashef, CM Lee, H Xu, and EP Reddy. Scaffold proteins of MAP-kinase modules. Oncogene, 26(22):3185–3202, 2007. [60] Scott J Dixon, Michael Costanzo, Anastasia Baryshnikova, Brenda Andrews, Charles Boone, et al. Systematic mapping of genetic interaction networks. Annual Review of Genetics, 43(1):601–625, 2009. [61] Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 135–144, 2017. [62] Kevin Drew, John B Wallingford, and Edward M Marcotte. hu.MAP 2.0: integration of over 15,000 proteomic experiments builds a global compendium of human multiprotein assemblies. Molecular Systems Biology, 17(5):e10016, 2021. [63] Dongsheng Duan, Nathalie Goemans, Shin’ichi Takeda, Eugenio Mercuri, and Annemieke Aartsma-Rus. Duchenne muscular dystrophy. Nature Reviews Disease Primers, 7(1):13, 2021. [64] Vijay Prakash Dwivedi, Chaitanya K Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43):1–48, 2023. [65] Vijay Prakash Dwivedi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson.
Graph neural networks with learnable structural and positional representations. In International Conference on Learning Representations, 2021. [66] Vijay Prakash Dwivedi, Ladislav Rampášek, Michael Galkin, Ali Parviz, Guy Wolf, Anh Tuan Luu, and Dominique Beaini. Long range graph benchmark. Advances in Neural Information Processing Systems, pages 22326–22340, 2022. [67] Ron Edgar, Michael Domrachev, and Alex E Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1):207–210, 2002. [68] Viola Fanfani, Fabio Cassano, and Giovanni Stracquadanio. PyGNA: a unified framework for geneset network analysis. BMC Bioinformatics, 21(1):1–22, 2020. [69] José M Ferro and Miguel Oliveira Santos. Neurology of inflammatory bowel disease. Journal of the Neurological Sciences, 424:117426, 2021. [70] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428, 2019. [71] Duncan T Forster, Sheena C Li, Yoko Yashiroda, Mami Yoshimura, Zhijian Li, Luis Alberto Vega Isuhuaylas, Kaori Itto-Nakama, Daisuke Yamanaka, Yoshikazu Ohya, Hiroyuki Osada, et al. BIONIC: biological network integration using convolutions. Nature Methods, 19(10):1250–1261, 2022. [72] Laura I Furlong. Human diseases through the lens of network biology. Trends in Genetics, 29(3):150–159, 2013. [73] Jesse Gillis and Paul Pavlidis. The impact of multifunctional genes on "guilt by association" analysis. PLoS ONE, 6(2):e17258, 2011. [74] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, Proceedings of Machine Learning Research, 2017. [75] Vladimir Gligorijević, Meet Barot, and Richard Bonneau. deepNF: deep network fusion for protein function prediction.
Bioinformatics, 34(22):3873–3881, 2018. [76] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010. [77] Kwang-Il Goh, Michael E Cusick, David Valle, Barton Childs, Marc Vidal, and Albert-László Barabási. The human disease network. Proceedings of the National Academy of Sciences, 104(21):8685–8690, 2007. [78] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018. [79] Kristian A Gray, Bethan Yates, Ruth L Seal, Mathew W Wright, and Elspeth A Bruford. Genenames.org: the HGNC resources in 2015. Nucleic Acids Research, 43(D1):D1079–D1085, 2015. [80] Casey S Greene, Arjun Krishnan, Aaron K Wong, Emanuela Ricciotti, Rene A Zelaya, Daniel S Himmelstein, Ran Zhang, Boris M Hartmann, Elena Zaslavsky, Stuart C Sealfon, et al. Understanding multicellular function and disease with human tissue-specific networks. Nature Genetics, 47(6):569–576, 2015. [81] Dhouha Grissa, Alexander Junge, Tudor I Oprea, and Lars Juhl Jensen. DISEASES 2.0: a weekly updated database of disease–gene associations from text mining and data integration. Database, 2022, 2022. [82] Martin Grohe. word2vec, node2vec, graph2vec, x2vec: Towards a theory of vector embeddings of structured data. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 1–16, 2020. [83] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 2016. [84] Yuanfang Guan, Cheryl L Ackert-Bicknell, Braden Kell, Olga G Troyanskaya, and Matthew A Hibbs.
Functional genomics complements quantitative genetics in identifying disease-gene associations. PLoS Computational Biology, 6(11):e1000991, 2010. [85] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010. [86] Celia Hacker. k-simplex2vec: a simplicial extension of node2vec. arXiv preprint arXiv:2010.05636, 2020. [87] Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States), 2008. [88] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint, 2017. [89] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017. [90] Myat Noe Han, David I Finkelstein, Rachel M McQuade, and Shanti Diwakarla. Gastrointestinal dysfunction in Parkinson’s disease: current and potential therapeutics. Journal of Personalized Medicine, 12(2):144, 2022. [91] SM Mahmudul Hasan and Baljinder S Salh. Emphysematous cystitis as a potential marker of severe Crohn’s disease. BMC Gastroenterology, 22(1):181, 2022. [92] Ronald T Hay. SUMO: a history of modification. Molecular Cell, 18(1):1–12, 2005. [93] Idan Hekselman and Esti Yeger-Lotem. Mechanisms of tissue and cell-type specificity in heritable traits and diseases. Nature Reviews Genetics, 21(3):137–150, 2020. [94] Daniel Himmelstein, Casey Greene, and Alexander Pico. Using Entrez Gene as our gene vocabulary, February 2015. [95] Andrew L Hopkins.
Network pharmacology: the next paradigm in drug discovery. Nature Chemical Biology, 4(11):682–690, 2008. [96] Congxue Hu, Tengyue Li, Yingqi Xu, Xinxin Zhang, Feng Li, Jing Bai, Jing Chen, Wenqi Jiang, Kaiyue Yang, Qi Ou, et al. CellMarker 2.0: an updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. Nucleic Acids Research, 51(D1):D870–D876, 2023. [97] Fang Hu, Jia Liu, Liuhuan Li, and Jun Liang. Community detection in complex networks using node2vec with spectral clustering. Physica A: Statistical Mechanics and its Applications, 545:123633, 2020. [98] Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. OGB-LSC: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021. [99] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. [100] Justin K Huang, Daniel E Carlin, Michael Ku Yu, Wei Zhang, Jason F Kreisberg, Pablo Tamayo, and Trey Ideker. Systematic evaluation of molecular networks for discovery of disease genes. Cell Systems, 6(4):484–495, 2018. [101] Kexin Huang, Payal Chandak, Qianwen Wang, Shreyas Havaldar, Akhil Vaid, Jure Leskovec, Girish Nadkarni, Benjamin S Glicksberg, Nils Gehlenborg, and Marinka Zitnik. Zero-shot prediction of therapeutic use with geometric deep learning and clinician centered design. medRxiv, pages 2023–03, 2023. [102] Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics Data Commons: Machine learning datasets and tasks for drug discovery and development. Advances in Neural Information Processing Systems, 2021.
[103] Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin R. Benson. Combining label propagation and simple models out-performs graph neural networks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021. [104] T Hubbard, D Andrews, Mario Cáccamo, Graham Cameron, Yuan Chen, M Clamp, Laura Clarke, Guy Coates, Tony Cox, Fiona Cunningham, et al. Ensembl 2005. Nucleic Acids Research, 33(suppl_1):D447–D453, 2005. [105] Mien-Chie Hung and Wolfgang Link. Protein localization in disease and therapy. Journal of Cell Science, 124(20):3381–3392, 2011. [106] Edward L Huttlin, Raphael J Bruckner, Jose Navarrete-Perea, Joe R Cannon, Kurt Baltier, Fana Gebreab, Melanie P Gygi, Alexandra Thornock, Gabriela Zarraga, Stanley Tam, et al. Dual proteome-scale networks reveal cell-specific remodeling of the human interactome. Cell, 184(11):3022–3040, 2021. [107] Sohyun Hwang, Chan Yeong Kim, Sunmo Yang, Eiru Kim, Traver Hart, Edward M Marcotte, and Insuk Lee. HumanNet v2: human gene networks for disease research. Nucleic Acids Research, 47(D1):D573–D580, 2019. [108] Trey Ideker and Roded Sharan. Protein networks in disease. Genome Research, 18(4):644–652, 2008. [109] Elisabetta Indelicato and Sylvia Boesch. From genotype to phenotype: expanding the clinical spectrum of CACNA1A variants in the era of next generation sequencing. Frontiers in Neurology, 12:263, 2021. [110] Hawoong Jeong, Sean P Mason, A-L Barabási, and Zoltan N Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41–42, 2001. [111] Junzhong Ji, Aidong Zhang, Chunnian Liu, Xiaomei Quan, and Zhijun Liu. Survey: Functional module detection from protein-protein interaction networks. IEEE Transactions on Knowledge and Data Engineering, 26(2):261–277, 2012. [112] Yuxiang Jiang, Tal Ronnen Oron, Wyatt T Clark, Asma R Bankapur, Daniel D’Andrea, Rosalba Lepore, Christopher S Funk, Indika Kahanda, Karin M Verspoor, Asa Ben-Hur, et al.
An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biology, 17:1–19, 2016. [113] Kayla A Johnson and Arjun Krishnan. Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data. Genome Biology, 23(1):1–26, 2022. [114] Pall F Jonsson and Paul A Bates. Global topological features of cancer proteins in the human interactome. Bioinformatics, 22(18):2291–2297, 2006. [115] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021. [116] Indika Kahanda, Christopher S Funk, Fahad Ullah, Karin M Verspoor, and Asa Ben-Hur. A close look at protein function prediction evaluation protocols. GigaScience, 4(1):s13742–015, 2015. [117] Atanas Kamburov, Konstantin Pentchev, Hanna Galicka, Christoph Wierling, Hans Lehrach, and Ralf Herwig. ConsensusPathDB: toward a more complete picture of cell biology. Nucleic Acids Research, 39(suppl_1):D712–D717, 2011. [118] Minoru Kanehisa. The KEGG database. In ‘In Silico’ Simulation of Biological Processes: Novartis Foundation Symposium 247, volume 247, pages 91–103. Wiley Online Library, 2002. [119] Minoru Kanehisa, Miho Furumichi, Mao Tanabe, Yoko Sato, and Kanae Morishima. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1):D353–D361, 2017. [120] Ulas Karaoz, TM Murali, Stan Letovsky, Yu Zheng, Chunming Ding, Charles R Cantor, and Simon Kasif. Whole-genome annotation by using evidence integration in functional-linkage networks. Proceedings of the National Academy of Sciences, 101(9):2888–2893, 2004. [121] Chan Yeong Kim, Seungbyn Baek, Junha Cha, Sunmo Yang, Eiru Kim, Edward M Marcotte, Traver Hart, and Insuk Lee.
HumanNet v3: an improved database of human gene networks for disease research. Nucleic Acids Research, 50(D1):D632–D639, 2022. [122] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. [123] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint, 2016. [124] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. [125] Maksim Kitsak, Amitabh Sharma, Jörg Menche, Emre Guney, Susan Dina Ghiassian, Joseph Loscalzo, and Albert-László Barabási. Tissue specificity of human disease module. Scientific Reports, 6(1):35241, 2016. [126] Sebastian Köhler, Sebastian Bauer, Denise Horn, and Peter N Robinson. Walking the interactome for prioritization of candidate disease genes. The American Journal of Human Genetics, 82(4):949–958, 2008. [127] Max Kotlyar, Chiara Pastrello, Nicholas Sheahan, and Igor Jurisica. Integrated Interactions Database: tissue-specific view of the human and model organism interactomes. Nucleic Acids Research, 44(D1):D536–D541, 2016. [128] Arjun Krishnan, Ran Zhang, Victoria Yao, Chandra L Theesfeld, Aaron K Wong, Alicja Tadych, Natalia Volfovsky, Alan Packer, Alex Lash, and Olga G Troyanskaya. Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature Neuroscience, 19(11):1454–1462, 2016. [129] Arjun Krishnan, Ran Zhang, Victoria Yao, Chandra L Theesfeld, Aaron K Wong, Alicja Tadych, Natalia Volfovsky, Alan Packer, Alex Lash, and Olga G Troyanskaya. Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature Neuroscience, 19(11):1454–1462, 2016.
[130] Georg Kustatscher, Tom Collins, Anne-Claude Gingras, Tiannan Guo, Henning Hermjakob, Trey Ideker, Kathryn S Lilley, Emma Lundberg, Edward M Marcotte, Markus Ralser, et al. Understudied proteins: opportunities and challenges for functional proteomics. Nature Methods, 19(7):774–779, 2022. [131] Georg Kustatscher, Piotr Grabowski, Tina A Schrader, Josiah B Passmore, Michael Schrader, and Juri Rappsilber. Co-regulation map of the human proteome enables identification of protein functions. Nature Biotechnology, 37(11):1361–1371, 2019. [132] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pages 1–6, 2015. [133] Gert RG Lanckriet, Minghua Deng, Nello Cristianini, Michael I Jordan, and William Stafford Noble. Kernel-based data fusion and its application to protein function prediction in yeast. In Biocomputing 2004, pages 300–311. World Scientific, 2003. [134] Peter Laslo, Jagan MR Pongubala, David W Lancki, and Harinder Singh. Gene regulatory networks directing myeloid and lymphoid cell fates within the immune system. In Seminars in Immunology, volume 20, pages 228–235. Elsevier, 2008. [135] Jeffrey N Law, Shiv D Kale, and TM Murali. Accurate and efficient gene function prediction using a multi-bacterial network. Bioinformatics, 37(6):800–806, 2021. [136] Insuk Lee, U Martin Blom, Peggy I Wang, Jung Eun Shim, and Edward M Marcotte. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Research, 21(7):1109–1121, 2011. [137] Rasko Leinonen, Hideaki Sugawara, Martin Shumway, and International Nucleotide Sequence Database Collaboration. The Sequence Read Archive. Nucleic Acids Research, 39(suppl_1):D19–D21, 2010. [138] Mark DM Leiserson, Fabio Vandin, Hsin-Ta Wu, Jason R Dobson, Jonathan V Eldridge, Jacob L Thomas, Alexandra Papoutsaki, Younhun Kim, Beifang Niu, Michael McLellan, et al.
Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nature Genetics, 47(2):106–114, 2015. [139] Po Sing Leung and Siu Po Ip. Pancreatic acinar cell: its role in acute pancreatitis. The International Journal of Biochemistry & Cell Biology, 38(7):1024–1030, 2006. [140] Michelle M Li, Yepeng Huang, Marissa Sumathipala, Man Qing Liang, Alberto Valdeolivas, Ashwin N Ananthakrishnan, Katherine Liao, Daniel Marbach, and Marinka Zitnik. Contextualizing protein representations using deep learning on protein networks and single-cell data. bioRxiv, pages 2023–07, 2023. [141] Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. Distance encoding: Design provably more powerful neural networks for graph representation learning. Advances in Neural Information Processing Systems, 33:4465–4478, 2020. [142] Taibo Li, Rasmus Wernersson, Rasmus B Hansen, Heiko Horn, Johnathan Mercer, Greg Slodkowicz, Christopher T Workman, Olga Rigina, Kristoffer Rapacki, Hans H Stærfeldt, et al. A scored human protein–protein interaction network to catalyze genomic interpretation. Nature Methods, 14(1):61–64, 2017. [143] Wei Li, Hong-Lian Li, Jian-Zhi Wang, Rong Liu, and Xiaochuan Wang. Abnormal protein post-translational modifications induces aggregation and abnormal deposition of protein, mediating neurodegenerative diseases. Cell & Bioscience, 14(1):1–14, 2024. [144] Arthur Liberzon, Aravind Subramanian, Reid Pinchback, Helga Thorvaldsdóttir, Pablo Tamayo, and Jill P Mesirov. Molecular Signatures Database (MSigDB) 3.0. Bioinformatics, 27(12):1739–1740, 2011. [145] Pyoung Suk Lim, In Hee Kim, Seong Hun Kim, Seung Ok Lee, and Sang Wook Kim. A case of severe acute hepatitis A complicated with pure red cell aplasia. The Korean Journal of Gastroenterology, 60(3):177–181, 2012. [146] Cui-Xiang Lin, Hong-Dong Li, Chao Deng, Yuanfang Guan, and Jianxin Wang.
TissueNexus: a database of human tissue functional gene networks built with a large compendium of curated RNA-seq data. Nucleic Acids Research, 50(D1):D710–D718, 2022. [147] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023. [148] Li Liu, Qian-Zhong Li, Wen Jin, Hao Lv, and Hao Lin. Revealing gene function and transcription relationship by reconstructing gene-level chromatin interaction. Computational and Structural Biotechnology Journal, 17:195–205, 2019. [149] Renming Liu, Semih Cantürk, Frederik Wenkel, Sarah McGuire, Xinyi Wang, Anna Little, Leslie O’Bray, Michael Perlmutter, Bastian Rieck, Matthew Hirn, et al. Taxonomy of benchmarks in graph representation learning. In Learning on Graphs Conference, pages 6–1. PMLR, 2022. [150] Renming Liu, Matthew Hirn, and Arjun Krishnan. Accurately modeling biased random walks on weighted networks using node2vec+. Bioinformatics, 39(1):btad047, 2023. [151] Renming Liu and Arjun Krishnan. PecanPy: a fast, efficient and parallelized Python implementation of node2vec. Bioinformatics, 37(19):3377–3379, 2021. [152] Renming Liu and Arjun Krishnan. Open Biomedical Network Benchmark: A Python toolkit for benchmarking datasets with biomedical networks. In Machine Learning in Computational Biology, pages 23–59. PMLR, 2024. [153] Renming Liu, Christopher A Mancuso, Anna Yannakopoulos, Kayla A Johnson, and Arjun Krishnan. Supervised learning is an accurate method for network-based gene classification. Bioinformatics, 36(11):3457–3465, 2020. [154] Renming Liu, Christopher A Mancuso, Anna Yannakopoulos, Kayla A Johnson, and Arjun Krishnan. Supervised learning is an accurate method for network-based gene classification. Bioinformatics, 36(11):3457–3465, 2020. [155] Renming Liu, Hao Yuan, Kayla A Johnson, and Arjun Krishnan.
CONE: Context-specific network embedding via contextualized graph attention. bioRxiv, pages 2023–10, 2023. [156] Sebastian Lobentanzer, Patrick Aloy, Jan Baumbach, Balazs Bohar, Vincent J Carey, Pornpimol Charoentong, Katharina Danhauser, Tunca Doğan, Johann Dreo, Ian Dunham, et al. Democratizing knowledge representation with BioCypher. Nature Biotechnology, pages 1–4, 2023. [157] John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad, Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et al. The Genotype-Tissue Expression (GTEx) project. Nature Genetics, 45(6):580–585, 2013. [158] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018. [159] Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, and Doina Precup. Is heterophily a real nightmare for graph neural networks to do node classification? arXiv preprint arXiv:2109.05641, 2021. [160] Katja Luck, Dae-Kyum Kim, Luke Lambourne, Kerstin Spirohn, Bridget E Begg, Wenting Bian, Ruth Brignall, Tiziana Cafarelli, Francisco J Campos-Laborie, Benoit Charloteaux, et al. A reference map of the human binary protein interactome. Nature, 580(7803):402–408, 2020. [161] Ping Luo, Yuanyuan Li, Li-Ping Tian, and Fang-Xiang Wu. Enhancing the prediction of disease–gene associations with multimodal deep learning. Bioinformatics, 35(19):3735–3742, 2019. [162] Yao Ma, Xiaorui Liu, Neil Shah, and Jiliang Tang. Is homophily a necessity for graph neural networks? In International Conference on Learning Representations, 2021. [163] Donna Maglott, Jim Ostell, Kim D Pruitt, and Tatiana Tatusova. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research, 33(suppl_1):D54–D58, 2005. [164] Christopher A Mancuso, Patrick S Bills, Douglas Krum, Jacob Newsted, Renming Liu, and Arjun Krishnan. GenePlexus: a web-server for gene discovery using network-based machine learning.
Nucleic Acids Research, 50(W1):W358–W366, 2022. [165] Christopher A Mancuso, Kayla A Johnson, Renming Liu, and Arjun Krishnan. Joint representation of molecular networks from multiple species improves gene classification. PLOS Computational Biology, 20(1):e1011773, 2024. [166] Christopher A Mancuso, Renming Liu, and Arjun Krishnan. PyGenePlexus: A Python package for gene discovery using network-based machine learning. bioRxiv, 2022. [167] Christopher A Mancuso, Renming Liu, and Arjun Krishnan. PyGenePlexus: a Python package for gene discovery using network-based machine learning. Bioinformatics, 39(2):btad064, 2023. [168] Daniel Marbach, David Lamparter, Gerald Quon, Manolis Kellis, Zoltán Kutalik, and Sven Bergmann. Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases. Nature Methods, 13(4):366–370, 2016. [169] Stéphane Martin, Kevin A Wilkinson, Atsushi Nishimune, and Jeremy M Henley. Emerging extranuclear roles of protein SUMOylation in neuronal function and dysfunction. Nature Reviews Neuroscience, 8(12):948–959, 2007. [170] John S Mattick, Paulo P Amaral, Piero Carninci, Susan Carpenter, Howard Y Chang, Ling-Ling Chen, Runsheng Chen, Caroline Dean, Marcel E Dinger, Katherine A Fitzgerald, et al. Long non-coding RNAs: definitions, functions, challenges and recommendations. Nature Reviews Molecular Cell Biology, 24(6):430–447, 2023. [171] Patrick McGillivray, Declan Clarke, William Meyerson, Jing Zhang, Donghoon Lee, Mengting Gu, Sushant Kumar, Holly Zhou, and Mark Gerstein. Network analysis as a grand unifier in biomedical data science. Annual Review of Biomedical Data Science, 1:153–180, 2018. [172] Jörg Menche, Amitabh Sharma, Maksim Kitsak, Susan Dina Ghiassian, Marc Vidal, Joseph Loscalzo, and Albert-László Barabási. Uncovering disease-disease relationships through the incomplete interactome. Science, 347(6224):1257601, 2015. [173] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.
Efficient estimation of word representations in vector space. arXiv preprint, 2013. [174] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 2013. [175] Shahin Mohammadi, Jose Davila-Velderrain, and Manolis Kellis. Reconstruction of cell-type-specific interactomes at single-cell resolution. Cell Systems, 9(6):559–568, 2019. [176] Michael Moret, Irene Pachon Angona, Leandro Cotos, Shen Yan, Kenneth Atz, Cyrill Brunner, Martin Baumgartner, Francesca Grisoni, and Gisbert Schneider. Leveraging molecular structure and bioactivity with chemical language models for de novo drug design. Nature Communications, 14(1):114, 2023. [177] Germán Morís. Inflammatory bowel disease: an increased risk factor for neurologic complications. World Journal of Gastroenterology: WJG, 20(5):1228, 2014. [178] Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4602–4609, 2019. [179] Deisy Morselli Gysi and Albert-László Barabási. Noncoding RNAs improve the predictive power of network medicine. Proceedings of the National Academy of Sciences, 120(45):e2301342120, 2023. [180] Sara Mostafavi, Debajyoti Ray, David Warde-Farley, Chris Grouios, and Quaid Morris. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biology, 9:1–15, 2008. [181] Giulia Muzio, Leslie O’Bray, and Karsten Borgwardt. Biological network analysis with deep learning. Briefings in Bioinformatics, 22(2):1515–1530, 2021. [182] Annamalai Narayanan, Mahinthan Chandramohan, Lihui Chen, Yang Liu, and Santhoshkumar Saminathan. subgraph2vec: Learning distributed representations of rooted sub-graphs from large graphs.
arXiv preprint arXiv:1606.08928, 2016. [183] Walter Nelson, Marinka Zitnik, Bo Wang, Jure Leskovec, Anna Goldenberg, and Roded Sharan. To embed or not: network embedding as a paradigm in computational biology. Frontiers in Genetics, 10:381, 2019. [184] Kinga Nemeth, Recep Bayraktar, Manuela Ferracin, and George A Calin. Non-coding RNAs in disease: from mechanisms to therapeutics. Nature Reviews Genetics, pages 1–22, 2023. [185] Eric Nguyen, Michael Poli, Matthew G Durrant, Armin W Thomas, Brian Kang, Jeremy Sullivan, Madelena Y Ng, Ashley Lewis, Aman Patel, Aaron Lou, et al. Sequence modeling and design from molecular to genome scale with Evo. bioRxiv, pages 2024–02, 2024. [186] Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V Bzikadze, Alla Mikheenko, Mitchell R Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, et al. The complete sequence of a human genome. Science, 376(6588):44–53, 2022. [187] David Ochoa, Andrew Hercules, Miguel Carmona, Daniel Suveges, Asier Gonzalez-Uriarte, Cinzia Malangone, Alfredo Miranda, Luca Fumis, Denise Carvalho-Silva, Michaela Spitzer, et al. Open Targets Platform: supporting systematic drug–target identification and prioritisation. Nucleic Acids Research, 49(D1):D1302–D1310, 2021. [188] Stephen Oliver. Guilt-by-association goes global. Nature, 403(6770):601–602, 2000. [189] Rose Oughtred, Jennifer Rust, Christie Chang, Bobby-Joe Breitkreutz, Chris Stark, Andrew Willems, Lorrie Boucher, Genie Leung, Nadine Kolas, Frederick Zhang, et al. The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Science, 30(1):187–200, 2021. [190] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999. [191] Oana Palasca, Alberto Santos, Christian Stolte, Jan Gorodkin, and Lars Juhl Jensen.
TISSUES 2.0: an integrative web resource on mammalian tissue expression. Database, 2018:bay003, 2018. [192] Christopher Y Park, Aaron K Wong, Casey S Greene, Jessica Rowland, Yuanfang Guan, Lars A Bongo, Rebecca D Burdine, and Olga G Troyanskaya. Functional knowledge transfer for high-accuracy prediction of under-studied biological processes. PLoS Computational Biology, 9(3):e1002957, 2013. [193] Elizabeth Park, Jan Griffin, and Joan M Bathon. Myocardial dysfunction and heart failure in rheumatoid arthritis. Arthritis & Rheumatology, 74(2):184–199, 2022. [194] Liz Parrish. Psoriasis: symptoms, treatments and its impact on quality of life. British Journal of Community Nursing, 17(11):524–528, 2012. [195] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019. [196] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830, 2011. [197] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-GCN: Geometric graph convolutional networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020. [198] Lourdes Peña-Castillo, Murat Tasan, Chad L Myers, Hyunju Lee, Trupti Joshi, Chao Zhang, Yuanfang Guan, Michele Leone, Andrea Pagnani, Wan Kyu Kim, et al. A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biology, 9:1–19, 2008. [199] J. Peng, J. Guan, and X. Shang. Predicting Parkinson’s disease genes based on node2vec and autoencoder. Frontiers in Genetics, 2019.
[200] Jiajie Peng, Jiaojiao Guan, and Xuequn Shang. Predicting Parkinson’s disease genes based on node2vec and autoencoder. Frontiers in Genetics, 10:226, 2019. [201] Livia Perfetto, Leonardo Briganti, Alberto Calderone, Andrea Cerquone Perpetuini, Marta Iannuccelli, Francesca Langone, Luana Licata, Milica Marinkovic, Anna Mattioni, Theodora Pavlidou, et al. SIGNOR: a database of causal relationships between biological entities. Nucleic Acids Research, 44(D1):D548–D554, 2016. [202] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: online learning of social representations. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA - August 24 - 27, 2014, 2014. [203] Bryan Perozzi, Vivek Kulkarni, Haochen Chen, and Steven Skiena. Don’t walk, skip! Online learning of multi-scale network embeddings. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, pages 258–265, 2017. [204] Emma Persson, Miguel Castresana-Aguirre, Davide Buzzao, Dimitri Guala, and Erik LL Sonnhammer. FunCoup 5: functional association networks in all domains of life, supporting directed links and tissue-specificity. Journal of Molecular Biology, 433(11):166835, 2021. [205] Sergio Picart-Armada, Steven J Barrett, David R Willé, Alexandre Perera-Lluna, Alex Gutteridge, and Benoit H Dessailly. Benchmarking network propagation methods for disease gene identification. PLoS Computational Biology, 15(9):e1007276, 2019. [206] Janet Piñero, Àlex Bravo, Núria Queralt-Rosinach, Alba Gutiérrez-Sacristán, Jordi Deu-Pons, Emilio Centeno, Javier García-García, Ferran Sanz, and Laura I Furlong. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Research, page gkw943, 2016. [207] Rosario M Piro and Ferdinando Di Cunto. Computational approaches to disease-gene prediction: rationale, classification and successes.
The FEBS journal, 279(5):678–696, 2012. [208] Rosario Michael Piro and Annalisa Marsico. Network-based methods and other approaches for predicting lncrna functions and disease associations. Computational Biology of Non-Coding RNA: Methods and Protocols, pages 301–321, 2019. [209] Elif Piskin, Danila Cianciosi, Sukru Gulec, Merve Tomas, and Esra Capanoglu. Iron absorption: factors, limitations, and improvement methods. ACS omega, 7(24):20441–20456, 2022. [210] Dexter Pratt, Jing Chen, David Welker, Ricardo Rivas, Rudolf Pillich, Vladimir Rynkov, Keiichiro Ono, Carol Miello, Lyndon Hicks, Sandor Szalma, et al. Ndex, the network data exchange. Cell systems, 1(4):302–305, 2015. [211] Yue Qin, Edward L Huttlin, Casper F Winsnes, Maya L Gosztyla, Ludivine Wacheul, Marcus R Kelly, Steven M Blue, Fan Zheng, Michael Chen, Leah V Schaffer, et al. A multi-scale map of cell structure fusing protein images and interactions. Nature, 600(7889):536–542, 2021. [212] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In Proceedings of the eleventh ACM international conference on web search and data mining, pages 459–467, 2018. [213] Predrag Radivojac, Wyatt T Clark, Tal Ronnen Oron, Alexandra M Schnoes, Tobias Wittkop, Artem Sokolov, Kiley Graim, Christopher Funk, Karin Verspoor, Asa Ben-Hur, et al. A large-scale evaluation of computational protein function prediction. Nature methods, 10(3):221–227, 2013. [214] Ricardo N Ramirez, Nicole C El-Ali, Mikayla Anne Mager, Dana Wyman, Ana Conesa, and Ali Mortazavi. Dynamic gene regulatory networks of human myeloid differentiation. Cell systems, 4(4):416–429, 2017. [215] Tavis J Reed, Matthew D Tyl, Alicja Tadych, Olga G Troyanskaya, and Ileana M Cristea. Tapioca: a platform for predicting de novo protein–protein interactions in dynamic contexts. Nature Methods, pages 1–13, 2024. [216] Radim Řehůřek and Petr Sojka.
Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en. [217] Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 385–394, 2017. [218] Daniel Rivas-Barragan, Sarah Mubeen, Francesc Guim Bernat, Martin Hofmann-Apitius, and Daniel Domingo-Fernández. Drug2ways: Reasoning over causal paths in biological networks for drug discovery. PLoS computational biology, 16(12):e1008464, 2020. [219] Giulia Roda, Alessandro Sartini, Elisabetta Zambon, Andrea Calafiore, Margherita Marocchi, Alessandra Caponi, Andrea Belluzzi, and Enrico Roda. Intestinal epithelial cells in inflammatory bowel diseases. World journal of gastroenterology: WJG, 16(34):4264, 2010. [220] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. [221] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. Advances in neural information processing systems, 33:12559–12571, 2020. [222] Yusuf Roohani, Kexin Huang, and Jure Leskovec. Predicting transcriptional outcomes of novel multigene perturbations with gears. Nature Biotechnology, pages 1–9, 2023. [223] Jerret Ross, Brian Belgodere, Vijil Chenthamarakshan, Inkit Padhi, Youssef Mroueh, and Payel Das. Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence, 4(12):1256–1264, 2022. [224] Juliana S Bernardes.
A review of protein function prediction under machine learning perspective. Recent patents on biotechnology, 7(2):122–141, 2013. [225] Sepideh Sadegh, James Skelton, Elisa Anastasi, Judith Bernett, David B Blumenthal, Gihanna Galindez, Marisol Salgado-Albarrán, Olga Lazareva, Keith Flanagan, Simon Cockell, et al. Network medicine for disease module identification and drug repurposing with the nedrex platform. Nature Communications, 12(1):6848, 2021. [226] Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3):e0118432, 2015. [227] Akira Sato, Fumiaki Sano, Toshiya Ishii, Kayo Adachi, Ryujirou Negishi, Nobuyuki Matsumoto, and Chiaki Okuse. Pure red cell aplasia associated with autoimmune hepatitis successfully treated with cyclosporine a. Clinical Journal of Gastroenterology, 7:74–78, 2014. [228] Martin H Schaefer, Tiago JS Lopes, Nancy Mah, Jason E Shoemaker, Yukiko Matsuoka, Jean-Fred Fontaine, Caroline Louis-Jeune, Amie J Eisfeld, Gabriele Neumann, Carol Perez-Iratxeta, et al. Adding protein context to the human protein-protein interaction network to reveal meaningful interactions. PLoS computational biology, 9(1):e1002860, 2013. [229] Lynn Marie Schriml, Cesar Arze, Suvarna Nadendla, Yu-Wei Wayne Chang, Mark Mazaitis, Victor Felix, Gang Feng, and Warren Alden Kibbe. Disease ontology: a backbone for disease semantic integration. Nucleic acids research, 40(D1):D940–D946, 2012. [230] Benno Schwikowski, Peter Uetz, and Stanley Fields. A network of protein–protein interactions in yeast. Nature biotechnology, 18(12):1257–1261, 2000. [231] Roded Sharan, Igor Ulitsky, and Ron Shamir. Network-based prediction of protein function. Molecular systems biology, 3(1):88, 2007. [232] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. ArXiv preprint, 2018.
[233] Harinder Singh, Kay L Medina, and Jagan MR Pongubala. Contingent gene regulatory networks and b cell fate specification. Proceedings of the National Academy of Sciences, 102(14):4949–4953, 2005. [234] Rohit Singh, Samuel Sledzieski, Bryan Bryson, Lenore Cowen, and Bonnie Berger. Contrastive learning in protein language space predicts interactions between drugs and protein targets. Proceedings of the National Academy of Sciences, 120(24):e2220778120, 2023. [235] Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J Mungall, et al. The obo foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology, 25(11):1251–1255, 2007. [236] Cynthia L Smith and Janan T Eppig. The mammalian phenotype ontology: enabling robust annotation and comparative analysis. Wiley Interdisciplinary Reviews: Systems Biology and Medicine, 1(3):390–399, 2009. [237] Cynthia L Smith, Carroll-Ann W Goldsmith, and Janan T Eppig. The mammalian phenotype ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome biology, 6:1–9, 2005. [238] Helen R Sofaer, Jennifer A Hoeting, and Catherine S Jarnevich. The area under the precision-recall curve as a performance metric for rare binary events. Methods in Ecology and Evolution, 10(4):565–577, 2019. [239] Daniel A Spielman and Shang-Hua Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981–1025, 2011. [240] Chris Stark, Bobby-Joe Breitkreutz, Teresa Reguly, Lorrie Boucher, Ashton Breitkreutz, and Mike Tyers. Biogrid: a general repository for interaction datasets. Nucleic acids research, 34(suppl_1):D535–D539, 2006. [241] Joshua M Stuart, Eran Segal, Daphne Koller, and Stuart K Kim. A gene-coexpression network for global discovery of conserved genetic modules. science, 302(5643):249–255, 2003.
[242] Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43):15545–15550, 2005. [243] Marissa Sumathipala, Enrico Maiorino, Scott T Weiss, and Amitabh Sharma. Network diffusion approach to predict lncrna disease associations using multi-type biological networks: Lion. Frontiers in physiology, 10:446144, 2019. [244] Kyle Swanson, Gary Liu, Denise B Catacutan, Autumn Arnold, James Zou, and Jonathan M Stokes. Generative ai for designing and validating easily synthesizable and structurally novel antibiotics. Nature Machine Intelligence, 6(3):338–353, 2024. [245] Damian Szklarczyk, Andrea Franceschini, Stefan Wyder, Kristoffer Forslund, Davide Heller, Jaime Huerta-Cepas, Milan Simonovic, Alexander Roth, Alberto Santos, Kalliopi P Tsafou, et al. String v10: protein–protein interaction networks, integrated over the tree of life. Nucleic acids research, 43(D1):D447–D452, 2015. [246] Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, et al. String v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic acids research, 47(D1):D607–D613, 2019. [247] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, 2015. [248] Cole Trapnell. Defining cell types and states with single-cell genomics. Genome research, 25(10):1491–1498, 2015. [249] Dénes Türei, Tamás Korcsmáros, and Julio Saez-Rodriguez.
Omnipath: guidelines and gateway for literature-curated signaling pathway resources. Nature methods, 13(12):966–967, 2016. [250] Mathias Uhlén, Linn Fagerberg, Björn M Hallström, Cecilia Lindskog, Per Oksvold, Adil Mardinoglu, Åsa Sivertsson, Caroline Kampf, Evelina Sjöstedt, Anna Asplund, et al. Tissue-based map of the human proteome. Science, 347(6220):1260419, 2015. [251] Giorgio Valentini, Elena Casiraghi, Luca Cappelletti, Tommaso Fontana, Justin Reese, and Peter Robinson. Het-node2vec: second order random walk sampling for heterogeneous multigraphs embedding. arXiv preprint arXiv:2101.01425, 2021. [252] Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. The numpy array: a structure for efficient numerical computation. Computing in science & engineering, 13(2):22–30, 2011. [253] Oron Vanunu, Oded Magger, Eytan Ruppin, Tomer Shlomi, and Roded Sharan. Associating genes and protein complexes with disease via network propagation. PLoS computational biology, 6(1):e1000641, 2010. [254] Nicole A Vasilevsky, Nicolas A Matentzoglu, Sabrina Toro, Joe E Flack, Harshad Hegde, Deepak R Unni, Gioconda Alyea, Joanna S Amberger, Larry Babb, James P Balhoff, et al. Mondo: Unifying diseases for the world, by the world. medRxiv, 2022. [255] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. [256] Alexei Vazquez, Alessandro Flammini, Amos Maritan, and Alessandro Vespignani. Global protein function prediction from protein-protein interaction networks. Nature biotechnology, 21(6):697–700, 2003. [257] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
[258] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. [259] Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. Deep graph infomax. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. [260] Daniel V Veres, Dávid M Gyurkó, Benedek Thaler, Kristof Z Szalay, Dávid Fazekas, Tamás Korcsmáros, and Peter Csermely. Comppi: a cellular compartment-specific database for protein–protein interaction network analysis. Nucleic acids research, 43(D1):D485–D493, 2015. [261] Marc Vidal, Michael E Cusick, and Albert-László Barabási. Interactome networks and human disease. Cell, 144(6):986–998, 2011. [262] Marc Vidal, Michael E Cusick, and Albert-László Barabási. Interactome networks and human disease. Cell, 144(6):986–998, 2011. [263] Hao Wang, Jiaxin Yang, and Jianrong Wang. Leverage large-scale biological networks to decipher the genetic basis of human diseases using machine learning. Artificial Neural Networks, pages 229–248, 2021. [264] Hao Wang, Jiaxin Yang, Yu Zhang, and Jianrong Wang. Discover novel disease-associated genes based on regulatory networks of long-range chromatin interactions. Methods, 189:22–33, 2021. [265] Minjie Yu Wang. Deep graph library: Towards efficient and scalable deep learning on graphs. In ICLR workshop on representation learning on graphs and manifolds, 2019. [266] Nian Wang, Min Zeng, Yiming Li, Fang-xiang Wu, and Min Li. Essential protein prediction based on node2vec and xgboost. Journal of Computational Biology, 28(7):687–700, 2021. [267] Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724–2743, 2017. [268] Xiujuan Wang, Natali Gulbahce, and Haiyuan Yu.
Network-based methods for human disease gene prediction. Briefings in functional genomics, 10(5):280–293, 2011. [269] Yangkun Wang, Jiarui Jin, Weinan Zhang, Yong Yu, Zheng Zhang, and David Wipf. Bag of tricks for node classification with graph neural networks. arXiv preprint arXiv:2103.13355, 2021. [270] YueQun Wang, LiYan Dong, XiaoQuan Jiang, XinTao Ma, YongLi Li, and Hao Zhang. Kg2vec: A node2vec-based vectorization model for knowledge graph. Plos one, 16(3):e0248552, 2021. [271] Zichen Wang, Caroline D Monteiro, Kathleen M Jagodnik, Nicolas F Fernandez, Gregory W Gundersen, Andrew D Rouillard, Sherry L Jenkins, Axel S Feldmann, Kevin S Hu, Michael G McDermott, et al. Extraction and analysis of signatures from the gene expression omnibus by the crowd. Nature communications, 7(1):1–11, 2016. [272] David Warde-Farley, Sylva L Donaldson, Ovi Comes, Khalid Zuberi, Rashad Badrawi, Pauline Chao, Max Franz, Chris Grouios, Farzana Kazi, Christian Tannus Lopes, et al. The genemania prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic acids research, 38(suppl_2):W214–W220, 2010. [273] James C Whisstock and Arthur M Lesk. Prediction of protein function from protein sequence and structure. Quarterly reviews of biophysics, 36(3):307–340, 2003. [274] Frank Wilcoxon. Individual comparisons by ranking methods. In Breakthroughs in statistics, pages 196–202. Springer, 1992. [275] Adelaide Woicik, Mingxin Zhang, Hanwen Xu, Sara Mostafavi, and Sheng Wang. Gemini: memory-efficient integration of hundreds of gene networks with high-order pooling. Bioinformatics, 39:i504–i512, 2023. [276] Chunlei Wu, Ian MacLeod, and Andrew I Su. Biogps and mygene.info: organizing online, gene-centric information. Nucleic acids research, 41(D1):D561–D565, 2013. [277] Chunlei Wu, Adam Mark, and Andrew I Su. Mygene.info: gene annotation query as a service. bioRxiv, page 009332, 2014.
[278] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018. [279] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020. [280] Yaochen Xie, Zhao Xu, Jingtun Zhang, Zhengyang Wang, and Shuiwang Ji. Self-supervised learning of graph neural networks: A unified review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. [281] Jiwen Xin, Adam Mark, Cyrus Afrasiabi, Ginger Tsueng, Moritz Juchler, Nikhil Gopal, Gregory S Stupp, Timothy E Putman, Benjamin J Ainscough, Obi L Griffith, et al. High-performance web services for querying gene and variant annotation. Genome biology, 17:1–7, 2016. [282] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. [283] Omry Yadan. Hydra - a framework for elegantly configuring complex applications. Github, 2019. [284] Keiichi Yamanaka, Osamu Yamamoto, and Tetsuya Honda. Pathophysiology of psoriasis: A review. The Journal of dermatology, 48(6):722–731, 2021. [285] Xiao-Ying Yan, Shao-Wu Zhang, and Song-Yao Zhang. Prediction of drug–target interaction by label propagation with mutual interaction information derived from heterogeneous network. Molecular BioSystems, 12(2):520–531, 2016. [286] Kuo Yang, Ruyu Wang, Guangming Liu, Zixin Shu, Ning Wang, Runshun Zhang, Jian Yu, Jianxin Chen, Xiaodong Li, and Xuezhong Zhou. Hergepred: heterogeneous network embedding representation for disease gene prediction. IEEE journal of biomedical and health informatics, 23(4):1805–1815, 2018.
[287] Victoria Yao, Rachel Kaletsky, William Keyes, Danielle E Mor, Aaron K Wong, Salman Sohrabi, Coleen T Murphy, and Olga G Troyanskaya. An integrative tissue-network approach to identify and test human disease genes. Nature biotechnology, 36(11):1091–1099, 2018. [288] Osman N Yogurtcu and Margaret E Johnson. Cytosolic proteins can exploit membrane localization to trigger functional assembly. PLoS computational biology, 14(3):e1006031, 2018. [289] Haiyuan Yu, Philip M Kim, Emmett Sprecher, Valery Trifonov, and Mark Gerstein. The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS computational biology, 3(4):e59, 2007. [290] Yang Yu, Shengtao Zhu, Peng Li, Li Min, and Shutian Zhang. Helicobacter pylori infection and inflammatory bowel disease: a crosstalk between upper and lower digestive tract. Cell death & disease, 9(10):961, 2018. [291] Xiang Yue, Zhen Wang, Jingong Huang, Srinivasan Parthasarathy, Soheil Moosavinasab, Yungui Huang, Simon M Lin, Wen Zhang, Ping Zhang, and Huan Sun. Graph embedding on biomedical networks: methods, applications and evaluations. Bioinformatics, 36(4):1241–1251, 2020. [292] Min Zeng, Min Li, Zhihui Fei, Fang-Xiang Wu, Yaohang Li, Yi Pan, and Jianxin Wang. A deep learning framework for identifying essential proteins by integrating multiple types of biological information. IEEE/ACM transactions on computational biology and bioinformatics, 18(1):296–305, 2019. [293] Dengyong Zhou, Olivier Bousquet, Thomas Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. Advances in neural information processing systems, 16, 2003. [294] Dongyan Zhou, Songjie Niu, and Shimin Chen. Efficient graph computation for node2vec. arXiv preprint arXiv:1805.00280, 2018. [295] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications.
AI Open, 1:57–81, 2020. [296] Kaixiong Zhou, Xiao Huang, Yuening Li, Daochen Zha, Rui Chen, and Xia Hu. Towards deeper graph neural networks with differentiable group normalization. Advances in neural information processing systems, 33:4917–4928, 2020. [297] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190–i198, 2017. [298] Marinka Zitnik, Michelle M Li, Aydin Wells, Kimberly Glass, Deisy Morselli Gysi, Arjun Krishnan, TM Murali, Predrag Radivojac, Sushmita Roy, Anaïs Baudot, et al. Current and future directions in network biology. arXiv preprint arXiv:2309.08478, 2023.

APPENDIX A

GENEPLEXUS

A.1 Additional information

A.1.1 Networks

The networks used in this study are BioGRID, STRING-EXP, InBioMap, GIANT-TN, and STRING. Detailed information about the network properties and sources can be seen in Table A.1, with the network construction method and interaction type information coming from [100]. BioGRID (version 3.4.136) is a low-throughput network that includes both genetic interactions and physical protein-protein interactions [240]. InBioMap (version 2016_09_12) is a high-throughput, scored network that contains physical protein-protein interactions as well as pathway database annotations incorporated as edges [142]. We used the “final-scores” as the edge weights. STRING (version 10.0) is a high-throughput, scored network that aggregates information from many data sources [245]. We used two different STRING networks. First, we used the “combined” network that directly includes database annotations, text-mining, ortholog information, co-expression, and physical protein interactions (referred to as STRING in this study). We also used a subset of edges in STRING that only contains experimental evidence, thus restricting the network to one constructed just from physical protein interactions in humans (referred to as STRING-EXP in this study).
For both networks, we used the corresponding relationship scores as edge weights, after normalizing them to lie between 0 and 1. The GIANT-TN (version 1.0) network is the tissue-naïve network from GIANT [80], referred to as the “Global” network on the website. It is constructed from both low- and high-throughput data, and includes information from co-expression, non-protein sources, regulatory data, and physical protein-protein interactions. The GIANT-TN network is a fully connected, scored network. To add sparsity to the GIANT-TN network, we removed all edges with scores below 0.01 (equal to the prior of the Bayesian model used to construct the network). It is worth noting here that the purpose of this study is not to compare networks against each other, but rather to determine the performance of SL methods versus LP methods on various types of networks.

Table A.1: Gene interaction networks information. LT: low-throughput, HT: high-throughput, GE: genetic, PH: physical, DA: database annotations, CE: co-expression, NP: non-protein, RE: regulatory, CC: co-citation, OR: orthologous.

Network    | Num. nodes | Num. edges | Density | Info   | Weighted | Interaction types
BioGRID    | 20,558     | 238,474    | 1.13E-3 | LT     | ✗        | GE, PH
STRING-EXP | 14,089     | 141,629    | 7.08E-4 | HT     | ✓        | PH
InBioMap   | 17,399     | 644,862    | 1.58E-3 | HT     | ✓        | PH, DA
GIANT-TN   | 25,689     | 38,904,929 | 1.92E-3 | LT, HT | ✓        | CE, NP, PH, RE
STRING     | 17,352     | 3,540,737  | 7.20E-3 | HT     | ✓        | CC, CE, OR, DA, PH

A.1.2 Model selection and hyperparameter tuning

The restart hyperparameter used in generating an influence matrix was determined by performing a grid search over values from 0.55 to 0.95 in steps of 0.1 for all networks and gene set collections, optimizing for auPRC using label propagation (Figure A.1). In general, there was not a strong dependence on the restart parameter.
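The influence-matrix construction described above can be sketched as a random walk with restart. The following is a minimal dense NumPy illustration, not the actual GenePlexus implementation; the function name and the toy graph are invented for the example.

```python
import numpy as np

def rwr_influence(adj, restart=0.85):
    """Influence matrix of a random walk with restart (illustrative).

    Column-normalizes the weighted adjacency matrix, then solves the
    stationary condition F = restart * I + (1 - restart) * W_norm @ F
    in closed form. Column j of F is the propagation profile of node j.
    """
    deg = adj.sum(axis=0)
    w_norm = adj / np.where(deg > 0, deg, 1.0)  # column-stochastic
    n = adj.shape[0]
    return restart * np.linalg.inv(np.eye(n) - (1.0 - restart) * w_norm)

# Toy 3-node path graph. Each column of F sums to 1, and most of the
# probability mass stays on the restart node at restart = 0.85.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
F = rwr_influence(A, restart=0.85)
```

In a setup like this, label propagation amounts to summing columns of F over the positive training genes, and the rows of F can serve as node feature vectors for a supervised classifier.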
It can be seen in Figure A.1 that a higher restart probability resulted in marginally better performance for the larger networks (STRING and GIANT-TN), whereas a smaller restart probability led to nominally better performance for smaller networks such as BioGRID. In this study, we used a restart parameter of 0.85 for every gene set collection–network combination, as this value offered good performance and had low variance. This was used for both LP-I and SL-I. We stress that the restart parameter was never tuned for SL-I, and thus our finding that SL methods generally outperform LP methods is not biased by this parameter tuning.

Model selection of the supervised learning classifier was done by comparing four popular classifiers implemented in the Python package scikit-learn [196]. To determine the best supervised learning classifier, we compared their performance over every gene set collection–network combination using their default hyperparameters in version 0.19 of scikit-learn (Figure A.2). Logistic regression with ℓ2 regularization is marginally better than linear support vector machines, and both of these classifiers outperform random forest and logistic regression with ℓ1 regularization.

For the model selection of the embedding technique, we chose node2vec [83] because of its competitive performance and ease of use [78]. The following hyperparameters were tuned based on the aggregated performance across all gene set collection–network combinations: 𝑝, the breadth-first search parameter; 𝑞, the depth-first search parameter; 𝑑, the embedding size; 𝑟, the number of walks per node; 𝑙, the walk length; and 𝑘, the context window size. Since 𝑝 and 𝑞 are coupled, we performed a grid search for these two parameters, leaving all others constant. Each of the other hyperparameters was tuned by leaving the rest at their default values as described in the original node2vec publication.
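The second-order random walk that 𝑝 and 𝑞 control can be sketched in plain Python. This is an illustrative unweighted re-implementation of the node2vec transition rule, not the tuned pipeline used in this study; the toy graph is invented.

```python
import random

def node2vec_walk(adj, start, walk_length, p, q, rng):
    """One second-order biased random walk in the style of node2vec.

    adj: dict mapping node -> set of neighbors (unweighted sketch).
    Returning to the previous node is reweighted by 1/p; moving to a
    neighbor of the previous node keeps weight 1; moving farther away
    is reweighted by 1/q.
    """
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break
        if len(walk) == 1:  # first step has no previous node
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = [1.0 / p if x == prev
                   else 1.0 if x in adj[prev]
                   else 1.0 / q
                   for x in nbrs]
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Toy graph; p = q = 0.1, the values selected for this study, bias the
# walk toward both backtracking and moving outward.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
walk = node2vec_walk(adj, start=0, walk_length=10, p=0.1, q=0.1,
                     rng=random.Random(0))
```

In the full method, 𝑟 such walks of length 𝑙 per node are fed to a skip-gram model with context window 𝑘 to produce the 𝑑-dimensional embeddings.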
This procedure resulted in the following hyperparameter values – 𝑝 = 0.1, 𝑞 = 0.1, 𝑑 = 512, 𝑘 = 8, 𝑙 = 120, and 𝑟 = 10 – which were used for every gene set collection–network combination.

A.1.3 Gene set collections

The gene set collections used in this work are from the Gene Ontology (from version 2 of the MyGene.info API with data retrieved on 2018-05-18; GOBPtmp, GOBP) [49, 14, 276, 281], the Kyoto Encyclopedia of Genes and Genomes (from version 6.1 of MSigDB; KEGGBP) [119], DisGeNet (version 5.0; DisGeNet, BeFree) [206], GWAS from a community challenge at https://www.synapse.org/#!Synapse:syn11944948 [47], and the Mouse Genome Informatics database (data retrieved on 2018-10-01; MGI) [37].

A.1.3.1 Pre-processing gene sets based on specificity, redundancy, and multi-functionality

Each of these six gene set collections contained anywhere from about a hundred to tens of thousands of gene sets (Table A.2) that varied widely in specificity and redundancy. The first pre-processing step we performed after downloading the data was to convert the original gene/protein IDs to Entrez gene IDs, using the gene ID conversions found in MyGene.info [276, 281]. If the original ID mapped to more than one Entrez ID, all of them were included for further analysis. Next, whenever applicable, annotations to gene sets corresponding to terms in a curated ontology were propagated along the is_a and part_of relationships to ancestor terms in the corresponding ontologies: Gene Ontology [14] for GOBP, Disease Ontology [229] for DisGeNet and BeFree, and Mammalian Phenotype Ontology [236] for MGI. Subsequent preprocessing steps were designed to ensure that the final set of gene sets from each source are specific, largely non-overlapping, and not driven by multi-attribute genes.
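The ontology propagation step described above amounts to taking the transitive closure over the is_a/part_of parent relationships. A minimal sketch, with hypothetical term IDs and gene names:

```python
def propagate_annotations(annotations, parents):
    """Propagate gene annotations to all ancestor ontology terms.

    annotations: dict term -> set of genes directly annotated to it.
    parents: dict term -> set of parent terms (is_a / part_of edges).
    Every gene annotated to a term is added to all of its ancestors.
    """
    def ancestors(term, seen=None):
        seen = set() if seen is None else seen
        for par in parents.get(term, ()):  # walk up the DAG
            if par not in seen:
                seen.add(par)
                ancestors(par, seen)
        return seen

    propagated = {t: set(g) for t, g in annotations.items()}
    for term, genes in annotations.items():
        for anc in ancestors(term):
            propagated.setdefault(anc, set()).update(genes)
    return propagated

# Hypothetical mini-ontology: t3 is_a t2 is_a t1.
anns = {"t3": {"geneA"}, "t2": {"geneB"}}
parents = {"t3": {"t2"}, "t2": {"t1"}}
result = propagate_annotations(anns, parents)
```

After propagation, the root term t1 carries both genes even though neither was directly annotated to it.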
Specificity To select specific, biologically meaningful gene sets in each collection, we sorted all the gene sets in a collection from largest to smallest based on the number of annotated genes (gene set size), manually examined their descriptions, and chose a size threshold that roughly separated large, generic gene sets from the smaller, specific ones. This threshold was 200 for GOBP and KEGGBP, 300 for MGI, 400 for BeFree, 500 for GWAS and GOBPtmp, and 600 for DisGeNet.

Table A.2: Gene set collection information. The last four columns reflect the fact that each gene set collection is slightly different for every network; these values are presented as either a range, a median value, or the number of genes in a union across all networks used in this study.

Gene set collection | Num. full gene sets | Num. non-redundant gene sets | Num. gene sets after holdout preprocessing | Gene set sizes | Median gene set sizes | Total num. genes
GOBPtmp  | 754†   | 166 | (115, 160) | (27, 452) | 174 | 9,464
GOBP     | 11,574 | 313 | (84, 96)   | (20, 181) | 76  | 5,301
KEGGBP   | 149    | 138 | (63, 96)   | (24, 181) | 51  | 3,454
DisGeNet | 4,030  | 334 | (89, 104)  | (21, 368) | 67  | 4,689
BeFree   | 2,891  | 207 | (49, 57)   | (20, 223) | 80  | 2,692
GWAS     | 169    | 74  | (30, 37)   | (24, 431) | 94  | 2,134
MGI      | 10,264 | 492 | (90, 121)  | (20, 132) | 41  | 2,716
† The GOBP temporal holdout additionally ensures at least ten genes are in the training and testing sets.

Redundancy To remove redundant gene sets within a collection, first, we calculated the Jaccard index (|𝐴∩𝐵|/|𝐴∪𝐵|) and the overlap index (|𝐴∩𝐵|/min(|𝐴|,|𝐵|)) between all pairs of gene sets, where 𝐴 and 𝐵 represent two sets of genes annotated to the gene sets. Then, we built a graph with the gene sets as the nodes, and added edges between gene set pairs if their Jaccard index was >0.5 and their overlap index was >0.7. The gene set graph constructed in this manner contained many connected components, each representing a set of highly overlapping gene sets.
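The redundancy-graph construction follows directly from the two index definitions. A minimal sketch; the gene set names and members below are hypothetical:

```python
from itertools import combinations

def redundancy_edges(gene_sets, jaccard_cut=0.5, overlap_cut=0.7):
    """Pairs of gene sets whose Jaccard AND overlap indices exceed cutoffs."""
    edges = []
    for (name_a, a), (name_b, b) in combinations(gene_sets.items(), 2):
        inter = len(a & b)
        jaccard = inter / len(a | b)               # |A∩B| / |A∪B|
        overlap = inter / min(len(a), len(b))      # |A∩B| / min(|A|,|B|)
        if jaccard > jaccard_cut and overlap > overlap_cut:
            edges.append((name_a, name_b))
    return edges

# Hypothetical gene sets: s1 and s2 are near-duplicates, s3 is distinct.
gene_sets = {
    "s1": {"g1", "g2", "g3", "g4"},
    "s2": {"g1", "g2", "g3"},
    "s3": {"g7", "g8", "g9"},
}
edges = redundancy_edges(gene_sets)
```

Connected components of the resulting graph (found with, e.g., union-find or breadth-first search over these edges) then give the groups of overlapping gene sets from which representatives are picked.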
Finally, we used the following procedure to pick representative gene sets within each component:
1. Calculate a score for each gene set equal to the sum of the proportions of genes in other linked gene sets that are contained within it (the higher this score, the more representative that gene set is).
2. Create a sorted list of all the gene sets in decreasing order of this score.
3. Pick the first gene set in the list, remove every subsequent gene set that is connected to it in the graph, and repeat this step until the sorted list is empty.
This procedure resulted in a reasonable number of non-redundant gene sets within each collection. The same Jaccard and overlap thresholds were used for all collections except MGI. For MGI, since an overlap cutoff of 0.7 still resulted in thousands of gene sets, it was further lowered to 0.5.

Multi-attribute genes Given the set of largely non-overlapping gene sets in a collection, individual genes were removed from all gene sets if they appeared in more than 10 gene sets in that collection. This step ensures that the evaluations are not biased by multi-attribute genes that can potentially be easily predicted in a non-specific manner [73]. We also note that we did not include the cellular component (CC) or molecular function (MF) classes of the Gene Ontology as part of the function classification tasks because two genes that are annotated to the same CC or MF need not be related to each other functionally.

A.1.3.2 Calculating the network properties of the gene sets

To determine how the performance of a given gene set depends on the network, for each gene set we calculated three different properties:
1. For a given gene set, 𝑇, the number of genes annotated to it is given by |𝑇|.
2. For a given gene set, 𝑇, the edge density, 𝐷_𝑇, is given by

D_T = \frac{2 \sum_{(u,v) \in T} W_{uv}}{|T|(|T| - 1)} \quad (A.1)

where 𝑊_{𝑢𝑣} is the edge weight between genes 𝑢 and 𝑣. The edge density is a measure of how tightly connected the gene set is within itself.
3.
For a given gene set, 𝑇, the segregation, 𝑆_𝑇, is given by

S_T = \frac{\sum_{(u,v) \in T} W_{uv}}{\sum_{u \in T,\, t \in V} W_{ut}} \quad (A.2)

Segregation is a measure of how isolated the gene set is from the rest of the network.

The three gene set properties are shown for all gene set collection–network combinations in Figure A.3. In general, there is little difference in the number of genes across the different prediction tasks (i.e., function, disease, and trait), except for GOBPtmp, which has the largest number of genes because its gene sets need to be larger to have at least 10 testing genes. Edge density and segregation are highest for the function gene sets (GOBPtmp, GOBP, KEGGBP) and lowest for the disease and trait gene sets (DisGeNet, BeFree, GWAS, MGI).

A.1.4 Evaluation data split

We used three different validation schemes to evaluate gene classification.

Temporal holdout validation Temporal holdout is the most stringent evaluation scheme for gene classification since it mimics the practical scenario of using current knowledge to predict the future. Since the Gene Ontology was the only source with clear date-stamps for all its annotations, temporal holdout was applied only to the GOBP gene set collection. Since the goal of this study is to use relatively recent and widely-used molecular networks, as this would reflect how these models would be deployed in practice, we chose a temporal cutoff point of January 1st, 2017. Then, for each gene set collection, genes that only had an annotation to any gene set in the collection after 2017-01-01 were assigned to the testing set and the remaining genes were assigned to the training set.
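The temporal split itself reduces to a date comparison per gene. A sketch with hypothetical first-annotation dates (the real pipeline works from the date-stamped GO annotation files):

```python
from datetime import date

def temporal_split(first_annotation, cutoff=date(2017, 1, 1)):
    """Split genes by the date of their earliest annotation.

    first_annotation: dict gene -> date of its earliest annotation in
    the collection. Genes first annotated strictly after the cutoff go
    to the test set (they were unknown before the cutoff); the rest go
    to the training set.
    """
    train, test = set(), set()
    for gene, first_seen in first_annotation.items():
        (test if first_seen > cutoff else train).add(gene)
    return train, test

# Hypothetical first-annotation dates.
first_annotation = {
    "geneA": date(2015, 6, 1),
    "geneB": date(2016, 12, 31),
    "geneC": date(2017, 3, 14),
}
train, test = temporal_split(first_annotation)
```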
Since this resulted in far fewer testing genes than in the other validation schemes, we made the following minor modifications to the gene set pre-processing procedure: the GOBP gene set collection was first filtered to remove any gene set with fewer than ten training genes or fewer than ten testing genes based on the temporal split, and the specificity threshold (the maximum number of genes annotated to a gene set) was increased from 200 to 500. Redundancy filtering and multi-attribute gene filtering were unchanged. As noted in Section 1.1, from each network resource considered in this study, we chose the most recent version of the network released before 2017-01-01 to ensure no data leakage. Finally, genes were removed from gene sets if they were not present in a given network, gene sets with fewer than ten training genes or fewer than ten testing genes were filtered out, and the remaining gene sets were used to perform the temporal holdout validation.

Study-bias holdout validation The goal of study-bias validation is to evaluate a scenario close to the real-world situation of learning from well-characterized genes to predict novel un(der)-characterized genes. Here, we defined the study-bias of each gene as the number of PubMed (http://www.pubmed.gov/) articles in which that gene is referenced, as determined from the gene2pubmed file (downloaded on 2018-10-30) from the NCBI Gene database [35]. Using this definition, for each gene set collection–network combination, we created training–testing splits in the following manner: genes were removed from gene sets if they were not present in the given network. Then, among the remaining genes, a gene was assigned to the training set if it was in the top two-thirds of the list of genes sorted by their PubMed count. The remaining genes were assigned to the testing set.
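The study-bias assignment just described can be sketched as follows. This is an illustrative function, not the study's code, and the gene symbols and PubMed counts in the example are hypothetical (in practice the counts come from the gene2pubmed file).

```python
def study_bias_split(genes, pubmed_counts):
    """Assign the top two-thirds of genes (by PubMed article count) to
    the training set and the remaining third to the testing set.

    pubmed_counts: mapping gene -> number of PubMed articles referencing it.
    """
    ranked = sorted(genes, key=lambda g: pubmed_counts.get(g, 0), reverse=True)
    cutoff = (2 * len(ranked)) // 3
    return ranked[:cutoff], ranked[cutoff:]
```

For example, with hypothetical counts {"TP53": 9000, "BRCA1": 5000, "GENEX": 3}, the well-studied TP53 and BRCA1 land in the training set and the under-characterized GENEX in the testing set.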
Finally, gene sets with fewer than ten training genes or fewer than ten testing genes were filtered out, and the remaining gene sets were used to perform the study-bias holdout validation.

Five-fold cross-validation To ensure comparability, we performed 5-fold cross-validation using the same gene sets that were used in study-bias holdout, splitting each gene set randomly into five approximately equal folds (with similar proportions of positive and negative examples) and, in rotation, using one fold as the testing set and the remaining four as the training set.

A.1.5 Evaluation metrics
We present results using three metrics: the area under the precision–recall curve (auPRC), the area under the receiver operating characteristic curve (auROC), and the precision of the top K ranked predictions (P@topK). Since each gene set collection–network combination has a different number of positive examples (and, hence, different positive:negative proportions), we normalized auPRC and P@topK by the prior. Specifically, auPRC is given by:

\mathrm{auPRC} = \log_2 \left( \frac{\mathrm{auPRC}_S}{\mathrm{prior}} \right) \quad (A.3)

where auPRC_S is the standard area under the precision–recall curve and the prior is P/(P + N), with P being the number of positive ground-truth labels and N being the number of negative ground-truth labels. The log_2 in Equation A.3 allows for the following interpretation: the number of 2-fold increases of the measured auPRC_S over what is expected given the ground-truth labels (e.g., a value of 1 indicates a 2-fold increase, a value of 2 indicates a 4-fold increase). Similarly, P@topK is given by:

\mathrm{P@topK} = \log_2 \left( \frac{\mathrm{TP}_K}{K \times \mathrm{prior}} \right) \quad (A.4)

where K is the number of top predictions to consider, TP_K is the number of true positives among the top K predictions, and the prior is the same as in Equation A.3. We set K to be the number of ground-truth positives in the testing set. P@topK can thus be interpreted as the number of 2-fold increases in the fraction of correct top-K predictions over what is expected by chance.
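Equations A.3 and A.4 can be implemented in a few lines. The following is an illustrative sketch (not the evaluation code used in the study), taking the standard auPRC_S as a precomputed input:

```python
import math

def log2_norm_auprc(auprc_standard, n_pos, n_neg):
    """Eq. A.3: log2 of the standard auPRC over the prior P / (P + N)."""
    prior = n_pos / (n_pos + n_neg)
    return math.log2(auprc_standard / prior)

def log2_p_at_topk(scores, labels, k=None):
    """Eq. A.4: log2 of TP_K / (K * prior), with K defaulting to the
    number of ground-truth positives, as in the text."""
    n_pos = sum(labels)
    prior = n_pos / len(labels)
    if k is None:
        k = n_pos
    # Rank predictions by score and count true positives among the top K.
    top = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)[:k]
    tp_k = sum(label for _, label in top)
    # TP_K = 0 yields -infinity, which must be handled separately.
    return math.log2(tp_k / (k * prior)) if tp_k > 0 else float("-inf")
```

For instance, log2_norm_auprc(0.5, 25, 75) returns 1.0: an auPRC_S of 0.5 against a prior of 0.25 is one 2-fold improvement over random.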
Of note, it is possible that TP_K = 0 if no true positive is captured within the first K predictions, which causes P@topK to become −∞. To address this issue, we set such values to the minimum score obtained across all predictions for that gene set collection–network combination. Precision-based metrics – auPRC and P@topK – are more suitable than the more popular auROC for two reasons. First, gene classification is a highly imbalanced problem with many more negative samples than positive samples, and auROC is ill-suited for imbalanced problems [226]. Second, precision can control for Type-1 error (false positives) [55]. Since the foremost reason for gene classification is to provide a list of candidate genes for further experimental study, it is more important to ensure that the top predictions are as correct as possible than to ensure that, on average, positive examples are ranked higher than negative examples. However, for completeness, we provide auROC results in this Supplemental Material (Section 2).

A.2 Additional results
Figure A.1: Restart probability hyperparameter tuning for label propagation. (A) Each point in each boxplot represents the average rank for a gene set collection–network combination, where the five restart probabilities that were tried were ranked in terms of performance (auPRC) for each gene set in a gene set collection using the standard competition ranking. A restart probability of 0.85 was chosen for this study as it resulted in good overall performance as well as low variance in performance across the different gene set collection–network combinations. (B) The performance for each individual gene set collection–network combination is compared across the five restart probabilities: 0.55, 0.65, 0.75, 0.85 and 0.95. The methods are ranked by median value of auPRC with the highest-scoring method on the left.
There is no strong dependence of auPRC on the restart probability.

Figure A.2: Comparison of classifiers for supervised learning. (A) Each point in each boxplot represents the average rank for a gene set collection–network combination, where the four classifiers were ranked by the auPRC for each gene set in a gene set collection using the standard competition ranking. Logistic regression with L2 regularization (LR-L2) was chosen as the classifier for supervised learning as it had slightly better overall performance than a linear support vector machine (SVM). (B) The auPRC for each individual gene set collection–network combination is compared across the four supervised learning classifiers: logistic regression with L1 regularization (LR-L1), LR-L2, SVM, and random forest (RF). The classifiers are ranked by median value with the best-performing one on the left.

Figure A.3: Network properties for the different gene set collections. The gene set collections can be broken up into three prediction tasks: function (GOBPtmp, GOBP, KEGGBP; reds), disease (DisGeNet, BeFree; blues), and trait (GWAS, MGI; greys). In general, there is little difference in the number of genes across the different prediction tasks (i.e., function, disease, and trait), except for GOBPtmp, which has the largest number of genes because its gene sets need to be larger to retain at least 10 testing genes. Edge density and segregation are highest for the function gene sets (GOBPtmp, GOBP, KEGGBP) and lowest for the disease and trait gene sets (DisGeNet, BeFree, GWAS, MGI).

Figure A.4: Average rank of the four methods for all evaluation metrics and validation schemes.
Each point in a boxplot represents the average rank for a gene set collection–network combination, where the four methods were ranked in terms of performance for each gene set in a gene set collection using the standard competition ranking. Different colors represent different networks and different marker shapes represent different gene set collections.

Figure A.5: Testing for a statistically significant difference between SL and LP methods using all evaluation metrics and validation schemes. For each network–gene set combination, each method is compared to the two methods from the other class (i.e., SL-A vs LP-I, SL-A vs LP-A, SL-I vs LP-I, SL-I vs LP-A). If a method was found to be significantly better than both methods from the other class (Wilcoxon rank-sum test with an FDR threshold of 0.05), the cell is annotated with that method. If both models in that class were found to be significantly better than the two methods in the other class, the cell is annotated in bold with just the class. The color scale represents the fraction of gene sets that were higher for the SL methods across all four comparisons. The first column uses GOBP temporal holdout, whereas the remaining 6 columns use study-bias holdout.
(B) SL methods show a statistically significant improvement over LP methods, especially for function prediction.

Figure A.6: Boxplots for auPRC performance across all gene set collection–network combinations. The performance for each individual gene set collection–network combination is compared across the four methods: SL-A (red), SL-I (light red), LP-I (blue), and LP-A (light blue). The methods are ranked by median value with the highest-scoring method on the left. The first column contains temporal and study-bias holdout, and the second column is 5-fold cross-validation. The scoring metric is auPRC.

Figure A.7: Boxplots for P@TopK performance across all gene set collection–network combinations.
The performance for each individual gene set collection–network combination is compared across the four methods: SL-A (red), SL-I (light red), LP-I (blue), and LP-A (light blue). The methods are ranked by median value with the highest-scoring method on the left. The first column contains temporal and study-bias holdout, and the second column is 5-fold cross-validation. The scoring metric is P@TopK.

Figure A.8: Boxplots for auROC performance across all gene set collection–network combinations. The performance for each individual gene set collection–network combination is compared across the four methods: SL-A (red), SL-I (light red), LP-I (blue), and LP-A (light blue). The methods are ranked by median value with the highest-scoring method on the left. The first column contains temporal and study-bias holdout, and the second column is 5-fold cross-validation. The scoring metric is auROC.
Figure A.9: Performance vs. network/gene set properties for all networks. SL-A is able to capture network information as efficiently as LP-I across all networks. There is no correlation between the number of genes in a gene set and performance, but there is a strong correlation between performance and both edge density and segregation. The different colored dots represent function gene sets (red; GOBP and KEGGBP), disease gene sets (blue; DisGeNet and BeFree), and trait gene sets (black; GWAS and MGI). The vertical line is the 95% confidence interval and the performance metric is auPRC.

Figure A.10: Effect size for every pair of methods. Each point is the median percent increase for every gene set collection–network combination.
(A) Functional prediction tasks using GOBP temporal holdout, (B) functional prediction tasks using study-bias holdout for GOBP and KEGGBP, and (C) disease and trait prediction tasks using study-bias holdout for DisGeNet, BeFree, GWAS, and MGI. The results are shown for auPRC, where different colors represent different networks and different marker styles represent the different gene set collections.

A.2.1 Label propagation with negative examples
In this section, we present alternative LP results that consider both positive and negative examples (LPN). We performed the same hyperparameter tuning for the restart parameter as for LP (Appendix A.1.2) and found an optimal restart parameter of 0.45 (Figure A.11). This optimal value for the restart parameter in LPN is relatively low compared to the optimal value for LP, except for the GIANT-TN network, where both LP and LPN prefer a higher restart value. It is worth noting that, just like with LP, the dependence on the restart parameter is minimal (Figure A.11B). We also include boxplots comparing label propagation with and without negative examples (Figure A.12). Lastly, we show a side-by-side comparison of the ranking analysis (Figure A.13) and the Wilcoxon analysis (Figure A.14) using label propagation with and without negative examples. Our results show that even though using negative examples slightly increases the performance of label propagation, the conclusions from the comparison against supervised learning remain unchanged.

Figure A.11: Tuning the restart probability hyperparameter when using negative examples in label propagation.
(A) Each point in each boxplot represents the average rank for a gene set collection–network combination, where the five restart probabilities (0.45, 0.55, 0.65, 0.75 and 0.85) were ranked in terms of performance (auPRC) for each gene set in a gene set collection using the standard competition ranking. A restart probability of 0.45 was chosen as optimal. (B) The performance for each individual gene set collection–network combination is compared across the five restart probabilities. The methods are ranked by median value of auPRC with the highest-scoring method on the left. There is no strong dependence of auPRC on the restart probability.

Figure A.12: Boxplots for performance across all gene set collection–network combinations for label propagation on the influence matrix with and without using negative examples. (A) The performance for each individual gene set collection–network combination is compared for label propagation with negative examples (LPN-I, blue) and label propagation without negative examples (LP-I, green). The methods are ranked by median value with the highest-scoring method on the left. Results show that LPN-I has moderately increased performance compared to LP-I. (B) Each point in the plot is the median value from one of the boxplots in A. This shows that both LPN and LP methods perform better for function prediction than for disease/trait prediction.

Figure A.13: Comparing results from average rank analysis with and without using negative examples in label propagation.
The left column shows label propagation without negative examples (LP) and the right column shows label propagation with negative examples (LPN). Each point in each boxplot represents the average rank for a gene set collection–network combination, obtained by ranking the four methods in terms of performance for each gene set in a gene set collection using the standard competition ranking. (A, D) Functional prediction tasks using GOBP temporal holdout, (B, E) functional prediction tasks using study-bias holdout for GOBP and KEGGBP, and (C, F) disease and trait prediction tasks using study-bias holdout for DisGeNet, BeFree, GWAS, and MGI. The results are shown for auPRC, where different colors represent different networks and different marker styles represent the different gene set collections. The results show no substantial difference between using and not using negative examples in label propagation.

Figure A.14: Comparing the Wilcoxon statistical test analysis with and without using negative examples in label propagation. (A) Label propagation without negative examples (LP) and (B) label propagation with negative examples (LPN). For each network–gene set combination, each method is compared to the two methods from the other class. If a method was found to be significantly better than both methods from the other class (Wilcoxon rank-sum test with an FDR threshold of 0.05), the cell is annotated with that method. If both models in that class were found to be significantly better than the two methods in the other class, the cell is annotated in bold with just the class.
The color scale represents the fraction of gene sets that were higher for the SL methods across all four comparisons. The first column uses GOBP temporal holdout, whereas the remaining 6 columns use study-bias holdout. The results show no substantial difference between using and not using negative examples in label propagation.

APPENDIX B
PECANPY

B.1 Existing node2vec implementations
We list existing open-source node2vec implementations in this section, all of which were gathered and accessed as of February 2021. Besides the original implementations, we only tested the nodevectors implementation, as none of the others contain any feature that could result in significant runtime or memory usage improvements, as detailed below. 1. Original Python implementation (https://github.com/aditya-grover/node2vec) 2. Original C++ implementation (https://github.com/snap-stanford/snap/tree/master/examples/node2vec) 3. nodevectors (https://github.com/VHRanger/nodevectors) is similar to PecanPy-SparseOTF, which computes the transition probabilities on-the-fly using the cache-optimized CSR graph object. However, it cannot handle dense networks effectively (Section 3.2.3). 4. TensorFlow implementation (https://github.com/apple2373/node2vec) replaces gensim's word2vec with TensorFlow for training embeddings using the precomputed random walks. However, we observe that the embedding step of node2vec contributes minimally to the total runtime (Figures B.1, B.2). Thus, this implementation will not provide significant runtime improvements. 5.
Python 3 implementation (https://github.com/eliorc/node2vec) is the same as the original Python implementation, but has been reformatted to be compatible with Python 3. Thus, we do not expect any significant performance improvements. 6. C++ implementation (https://github.com/thibaudmartinez/node2vec) is essentially the original C++ implementation, but with a Python API. This implementation is expected to be at best as efficient as the original C++ implementation. 7. Dependency-less C++ implementation (https://github.com/xgfs/node2vec-c) is the same as the original C++ implementation, but packaged as a standalone program without other SNAP functionalities. 8. Implementation with a sampling method different from the original alias method (https://github.com/NilsFrahm/Node2vec). This implementation is the same as the original Python implementation, with an alternative random distribution sampling approach, which does not have a significant impact on the runtime.

B.2 Additional results
Figure B.1: Fraction of runtime contributed by each stage of node2vec in different implementations using multiple cores. Each panel corresponds to a single network and each stacked bar within a panel corresponds to an individual node2vec implementation. The height of each segment within a bar represents the fraction of runtime contributed by each of the different stages of node2vec, tested in a multi-core configuration.

Figure B.2: Fraction of runtime contributed by each stage of node2vec in different implementations using a single core. Each panel corresponds to a single network and each stacked bar within a panel corresponds to an individual node2vec implementation. The height of each segment within a bar represents the fraction of runtime contributed by each of the different stages of node2vec, tested in a single-core configuration.

Figure B.3: Raw runtimes of each stage of node2vec in different implementations using multiple cores.
Each parallel plot corresponds to one of the four stages of node2vec. Each line traces the raw runtime (points on parallel y-axes) of a specific network (color) across the different implementations (x-axis). In all plots, the absence of a point for a particular network and implementation indicates that the network failed to load.

Figure B.4: Raw runtimes of each stage of node2vec in different implementations using a single core. Each parallel plot corresponds to one of the four stages of node2vec. Each line traces the raw runtime (points on parallel y-axes) of a specific network (color) across the different implementations (x-axis). In all plots, the absence of a point for a particular network and implementation indicates that the network failed to load.

Figure B.5: Effect of graph data structure on loading time and memory usage. (A) The total time for loading networks. (B) The total memory usage. Each plot contains groups of bars, each group corresponding to a network (among seven select networks), and each bar in the group corresponding to a specific network data structure.

Figure B.6: Summary of runtime and memory of PecanPy and the original implementations of node2vec using a single core. The eight networks of varying sizes and densities are along the rows. The software implementations are along the columns. The first heatmap (on the left) shows the performance of the original Python and C++ software along with the three modes of PecanPy (PreComp, SparseOTF, and DenseOTF). The adjacent 2-column heatmap (on the right) summarizes the performance of the original (best of the Python and C++ versions) and PecanPy (best of PreComp, SparseOTF, and DenseOTF) implementations. Lighter colors correspond to lower runtime in panel A and lower memory usage in panel B. Crossed grey cells indicate that the particular implementation (column) failed to run for a particular network (row).
Figure B.7: Performance of the original Python and C++ implementations and the three new implementations – PreComp, SparseOTF, and DenseOTF – on eight networks using multiple cores. The parallel plots trace the performance of different node2vec implementations (x-axis) for 8 networks (colored dots/lines; the numbers of nodes/edges are in the legend below) in terms of (A) total runtime (seconds) and (B) peak memory usage (bytes). The absence of a dot indicates the failure of a particular implementation to run for a particular network.

Figure B.8: Performance of the original Python and C++ implementations and the three new implementations – PreComp, SparseOTF, and DenseOTF – on eight networks using a single core. The parallel plots trace the performance of different node2vec implementations (x-axis) for 8 networks (colored dots/lines; the numbers of nodes/edges are in the legend below) in terms of (A) total runtime (seconds) and (B) peak memory usage (bytes). The absence of a dot indicates the failure of a particular implementation to run for a particular network.

Figure B.9: Summary of runtime and memory of PecanPy, nodevectors, and the original implementations of node2vec using multiple cores (A and B) and a single core (C and D). The eight networks of varying sizes and densities are along the rows. The software implementations are along the columns. In each panel, the first heatmap (on the left) shows the performance of the original Python and C++ software and the nodevectors software along with the three modes of PecanPy (PreComp, SparseOTF, and DenseOTF). The adjacent 2-column heatmap (on the right) summarizes the performance of the original (best of the Python and C++ versions) and PecanPy (best of PreComp, SparseOTF, and DenseOTF) implementations. Lighter colors correspond to lower runtime in panels A and C, and lower memory usage in panels B and D. Crossed grey cells indicate that the particular implementation (column) failed to run for a particular network (row).
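For reference, the core quantity that the implementations discussed in Section B.1 compute is node2vec's second-order walk bias: given the previous node t and the current node v, the weight of each neighbor x of v is scaled by 1/p if x = t, by 1 if x is also a neighbor of t, and by 1/q otherwise. PreComp precomputes these biased weights, while the OTF modes compute them during the walk. A minimal sketch with a hypothetical toy graph:

```python
def biased_weights(adj, prev, curr, p, q):
    """node2vec search bias for one step of a second-order random walk.
    adj is a dict-of-dicts weighted adjacency structure (illustrative,
    not PecanPy's CSR representation)."""
    out = {}
    for x, w in adj[curr].items():
        if x == prev:          # returning to the previous node
            alpha = 1.0 / p
        elif x in adj[prev]:   # x is a shared neighbor of prev and curr
            alpha = 1.0
        else:                  # moving outward, away from prev
            alpha = 1.0 / q
        out[x] = w * alpha
    return out

# Toy graph: previous node "t", current node "v"; "x1" is a shared neighbor.
adj = {
    "t": {"v": 1.0, "x1": 1.0},
    "v": {"t": 1.0, "x1": 1.0, "x2": 1.0},
    "x1": {"t": 1.0, "v": 1.0},
    "x2": {"v": 1.0},
}
```

With p = 2 and q = 4, biased_weights(adj, "t", "v", 2.0, 4.0) gives {"t": 0.5, "x1": 1.0, "x2": 0.25}; normalizing these weights yields the transition probabilities, which can then be sampled directly or via the alias method mentioned above.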
APPENDIX C
OBNB

C.1 Data descriptions
C.1.1 Networks
BioGRID [240] (https://biogrid-downloads.nyc3.digitaloceanspaces.com/LICENSE.txt, MIT License) is a protein interaction network curated from primary experimental evidence in the biomedical literature, as well as evidence inferred from low- and high-throughput experiments.

BioPlex [106] (https://bioplex.hms.harvard.edu/explorer/help, Creative Commons Attribution-ShareAlike 4.0 International License) is a protein interaction network whose interactions are measured by affinity purification–mass spectrometry (AP-MS) analysis shared across two cell lines (HEK293T and HCT116). This shared interaction network encodes core complexes involving many essential proteins that are vital for the cell's survival.

ComPPIHumanInt [260] (https://academic.oup.com/nar/article/43/D1/D485/2435307, CC BY-NC 4.0 License) is a context-naive version of the cellular compartment-specific ComPPI [260] networks for humans, which are constructed by combining protein interactions from nine different PPI databases (including BioGRID).

ConsensusPathDB [117] (free for academic use; see http://cpdb.molgen.mpg.de/ for more licensing info) integrates protein interaction evidence (binary protein interactions, protein complex interactions, genetic interactions, metabolic and signaling reactions, etc.) from 31 public databases, in addition to interactions curated from the literature.

FunCoup [204] (https://funcoup.org/help/, CC BY-SA 4.0 License) version 5 is a functional gene interaction network constructed by integrating a wide range of interaction evidence using a redundancy-weighted naive Bayes approach.

HIPPIE [5] (https://academic.oup.com/nar/article/45/D1/D408/2290937, CC BY-NC 4.0 License) integrates experimentally detected protein interactions from several public databases such as BioGRID.
HumanBaseTopGlobal [80] (https://humanbase.readthedocs.io/en/latest/, CC BY 4.0 License) is the tissue-naive version of the HumanBase tissue-specific gene interaction network collections, which are constructed by integrating hundreds of thousands of publicly available gene expression studies, protein–protein interactions, and protein–DNA interactions via a Bayesian approach, calibrated against high-quality, manually curated functional gene interactions.

HuMAP [62] (http://humap2.proteincomplexes.org/download, CC0 1.0 License) is a protein interaction network derived from over seven thousand protein complexes by integrating experimental evidence from public resources including AP-MS, large-scale biochemical fractionation data, proximity labeling, and RNA hairpin pulldown data.

HumanNet [107] (https://staging2.inetbio.org/humannetv3/about.php, CC BY-SA 4.0 License) is a functional gene interaction network originally designed for disease studies. It contains interaction evidence from gene co-citation in the literature, gene co-expression, pathways, domain profiles, genetic interactions, gene neighborhoods, phylogenetic profiles, and other protein interaction data. All of this interaction evidence is integrated using a Bayesian statistical framework, resulting in a single value for each pair of genes that indicates the odds ratio of their functional interaction.

HuRI [160] (http://www.interactome-atlas.org/download, CC BY-4.0 License) is a binary protein interaction network constructed via yeast two-hybrid (Y2H) screens, covering about 90% of the protein-coding genes in the human genome.

OmniPath [249] (see https://omnipathdb.org/info for the licenses collected for each database) integrates protein interactions and signaling and regulatory relationships from over 100 resources.
PCNet [100] (https://www.ndexbio.org/viewer/networks/f93f402c-86d4-11e7-a10d-0ac135e8bacf, CC BY-NC 4.0 License) is a protein interaction network constructed by requiring that each edge be supported by at least two of the 21 selected protein interaction networks, such as BioGRID and ConsensusPathDB.

ProteomeHD [131] (https://www.ndexbio.org/viewer/networks/4cb4b0f3-83da-11e9-848d-0ac135e8bacf, CC BY 4.0 License) is the subnetwork of ProteomeHD [131] containing the top 0.5% strongest co-regulation signals between pairs of proteins. Co-regulation is measured by the proteins' responses to a total of 294 biological perturbations via isotope-labeling mass spectrometry.

SIGNOR [201] (https://signor.uniroma2.it/about/, CC BY 4.0 License) contains manually curated causal signaling relationships between proteins and other biochemical molecules, such as transcriptional activation and phosphorylation.

STRING [246] (https://string-db.org/cgi/access?footer_active_subpage=licensing, CC BY 4.0 License) is a functional interaction network constructed by integrating seven types of gene interaction evidence via a probabilistic approach calibrated against the KEGG [118] pathway database. The seven evidence types are conserved genomic neighborhood, gene co-occurrence across species, gene fusion events, gene co-expression, experimental data, other databases, and text-mined interactions.

C.1.2 Annotations and Ontologies

Gene Ontology [14] (http://geneontology.org/docs/go-citation-policy/, CC BY 4.0 License) is a structured and standardized system that provides a comprehensive vocabulary for describing the molecular functions (GOMF), biological processes (GOBP), and cellular components (GOCC) associated with genes across different organisms. It aims to unify the representation of biological knowledge and enable effective analysis and interpretation of genomic data.
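Because GO terms form a directed acyclic graph, gene annotations are conventionally propagated from each annotated term to all of its ancestor terms (the "true path rule") before gene sets are constructed. The following is a minimal sketch of this up-propagation over a toy dict-encoded DAG; the term IDs and the gene label are hypothetical placeholders, and this is not the OBNB processing code.

```python
def propagate_annotations(annotations, parents):
    """Propagate gene -> term annotations up a GO-like DAG.

    annotations: dict mapping gene -> set of directly annotated terms.
    parents: dict mapping term -> set of parent terms (the DAG edges).
    Returns a dict where each gene is additionally annotated to every
    ancestor of its direct terms (the GO "true path rule").
    """
    def ancestors(term):
        seen, stack = set(), [term]
        while stack:  # iterative DFS over parent edges
            for parent in parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    return {
        gene: set(terms) | set().union(*(ancestors(t) for t in terms))
        for gene, terms in annotations.items()
    }

# Toy DAG (made-up IDs): GO:c is_a GO:b is_a GO:a.
parents = {"GO:c": {"GO:b"}, "GO:b": {"GO:a"}}
full = propagate_annotations({"geneX": {"GO:c"}}, parents)
# geneX is now annotated to GO:c and both of its ancestors.
```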
Mondo Disease Ontology [254] (https://github.com/monarch-initiative/mondo/blob/master/LICENSE, CC BY 4.0 License) is a unifying resource that integrates disease, genotype, and phenotype knowledge across diverse resources, providing a standardized knowledge graph with controlled vocabularies for diseases and phenotypes.

DisGeNET disease gene annotations [206] (https://academic.oup.com/nar/article/45/D1/D833/2290909, CC BY-NC 4.0 License): DisGeNET is a disease gene annotation database that contains a wide range of disease–gene association evidence, including curated annotations, high-throughput experiments and other inferred annotations, animal models, and literature text-mined annotations (mostly from BEFREE). By default, we only use the curated and inferred annotations.

DISEASES disease gene annotations [81] (https://diseases.jensenlab.org/Downloads, CC BY License): DISEASES is another disease gene annotation database, updated weekly by extracting disease gene annotations via text mining from the fast-growing literature. The text-mining approach uses full text rather than only titles and abstracts. Disease gene annotations from experimental data and other databases are also available.

C.1.3 Archived data

In addition to downloading and processing the data directly from the original sources, we also provide archived versions of the data preprocessed by running the default OBNB processing pipelines. The archived data is versioned with DOIs and can be found on Zenodo under the record https://zenodo.org/record/8045270.

C.2 Additional results

Table C.1: Ablation study on different initial node feature constructions for GCN and GAT. Reported values are average test APOP scores aggregated over five random seeds. The best score within each group (network × label × model) is bolded.
Network | Model | Feature | DISEASES | DisGeNET | GOBP
BioGRID | GCN | Constant | 0.373 ± 0.018 | 0.451 ± 0.210 | 0.557 ± 0.144
BioGRID | GCN | RandomNormal | 1.329 ± 0.043 | 0.867 ± 0.017 | 2.121 ± 0.114
BioGRID | GCN | OneHotLogDeg | 1.077 ± 0.057 | 0.827 ± 0.050 | 2.011 ± 0.148
BioGRID | GCN | RandomWalkDiag | 0.844 ± 0.103 | 0.702 ± 0.064 | 1.358 ± 0.044
BioGRID | GCN | Adj | 1.442 ± 0.120 | 0.899 ± 0.108 | 2.148 ± 0.045
BioGRID | GCN | RandProjGaussian | 1.288 ± 0.041 | 0.828 ± 0.034 | 2.278 ± 0.084
BioGRID | GCN | RandProjSparse | 1.264 ± 0.047 | 0.749 ± 0.057 | 2.306 ± 0.088
BioGRID | GCN | SVD | 1.259 ± 0.076 | 0.808 ± 0.030 | 2.318 ± 0.080
BioGRID | GCN | LapEigMap | 1.415 ± 0.030 | 0.963 ± 0.045 | 2.378 ± 0.098
BioGRID | GCN | LINE1 | 1.258 ± 0.041 | 0.798 ± 0.023 | 2.376 ± 0.081
BioGRID | GCN | LINE2 | 1.312 ± 0.027 | 0.857 ± 0.083 | 2.416 ± 0.048
BioGRID | GCN | Node2vec | 1.392 ± 0.051 | 0.965 ± 0.055 | 2.487 ± 0.112
BioGRID | GCN | Walklets | 1.366 ± 0.044 | 0.886 ± 0.079 | 2.438 ± 0.052
BioGRID | GCN | Embedding | 1.348 ± 0.073 | 0.788 ± 0.060 | 2.147 ± 0.113
BioGRID | GCN | AdjEmbBag | 1.338 ± 0.071 | 0.798 ± 0.080 | 2.031 ± 0.089
BioGRID | GAT | Constant | 0.399 ± 0.039 | 0.362 ± 0.071 | 0.398 ± 0.071
BioGRID | GAT | RandomNormal | 1.290 ± 0.126 | 0.755 ± 0.140 | 2.268 ± 0.047
BioGRID | GAT | OneHotLogDeg | 0.946 ± 0.289 | 0.803 ± 0.066 | 2.181 ± 0.080
BioGRID | GAT | RandomWalkDiag | 0.873 ± 0.115 | 0.554 ± 0.160 | 1.702 ± 0.068
BioGRID | GAT | Adj | 1.290 ± 0.320 | 0.594 ± 0.154 | 1.972 ± 0.230
BioGRID | GAT | RandProjGaussian | 1.357 ± 0.073 | 0.897 ± 0.070 | 2.402 ± 0.081
BioGRID | GAT | RandProjSparse | 1.351 ± 0.091 | 0.842 ± 0.063 | 2.461 ± 0.062
BioGRID | GAT | SVD | 1.019 ± 0.199 | 0.700 ± 0.081 | 2.384 ± 0.058
BioGRID | GAT | LapEigMap | 1.423 ± 0.072 | 0.914 ± 0.103 | 2.541 ± 0.042
BioGRID | GAT | LINE1 | 1.480 ± 0.073 | 0.874 ± 0.071 | 2.486 ± 0.040
BioGRID | GAT | LINE2 | 1.454 ± 0.043 | 0.888 ± 0.057 | 2.463 ± 0.138
BioGRID | GAT | Node2vec | 1.416 ± 0.126 | 0.877 ± 0.032 | 2.585 ± 0.052
BioGRID | GAT | Walklets | 1.464 ± 0.072 | 0.852 ± 0.073 | 2.392 ± 0.170
BioGRID | GAT | Embedding | 1.375 ± 0.092 | 0.834 ± 0.056 | 2.279 ± 0.197
BioGRID | GAT | AdjEmbBag | 0.645 ± 0.049 | 0.525 ± 0.041 | 1.242 ± 0.062
HumanNet | GCN | Constant | 0.385 ± 0.018 | 0.898 ± 0.873 | 0.610 ± 0.037
HumanNet | GCN | RandomNormal | 3.006 ± 0.112 | 2.511 ± 0.031 | 3.302 ± 0.104
HumanNet | GCN | OneHotLogDeg | 2.873 ± 0.134 | 2.552 ± 0.059 | 3.574 ± 0.116
HumanNet | GCN | RandomWalkDiag | 2.056 ± 0.102 | 1.672 ± 0.171 | 3.001 ± 0.149
HumanNet | GCN | Adj | 3.562 ± 0.049 | 2.788 ± 0.096 | 3.715 ± 0.077
HumanNet | GCN | RandProjGaussian | 3.346 ± 0.096 | 2.768 ± 0.036 | 3.654 ± 0.054
HumanNet | GCN | RandProjSparse | 3.341 ± 0.085 | 2.784 ± 0.021 | 3.759 ± 0.039
HumanNet | GCN | SVD | 3.211 ± 0.062 | 2.723 ± 0.064 | 3.780 ± 0.075
HumanNet | GCN | LapEigMap | 3.219 ± 0.077 | 2.676 ± 0.072 | 3.735 ± 0.068
HumanNet | GCN | LINE1 | 2.883 ± 0.097 | 2.560 ± 0.049 | 3.700 ± 0.078
HumanNet | GCN | LINE2 | 2.918 ± 0.070 | 2.563 ± 0.082 | 3.656 ± 0.038
HumanNet | GCN | Node2vec | 3.359 ± 0.070 | 2.798 ± 0.024 | 3.816 ± 0.039
HumanNet | GCN | Walklets | 3.464 ± 0.066 | 2.762 ± 0.053 | 3.842 ± 0.052
HumanNet | GCN | Embedding | 3.389 ± 0.051 | 2.637 ± 0.032 | 3.538 ± 0.040
HumanNet | GCN | AdjEmbBag | 3.499 ± 0.032 | 2.700 ± 0.066 | 3.482 ± 0.048
HumanNet | GAT | Constant | 0.280 ± 0.043 | 0.449 ± 0.047 | 0.333 ± 0.035
HumanNet | GAT | RandomNormal | 3.442 ± 0.331 | 2.758 ± 0.204 | 3.535 ± 0.126
HumanNet | GAT | OneHotLogDeg | 3.052 ± 0.312 | 2.605 ± 0.074 | 3.661 ± 0.088
HumanNet | GAT | RandomWalkDiag | 2.623 ± 0.862 | 2.340 ± 0.285 | 3.409 ± 0.059
HumanNet | GAT | Adj | 3.791 ± 0.071 | 2.943 ± 0.053 | 3.770 ± 0.029
HumanNet | GAT | RandProjGaussian | 3.390 ± 0.105 | 2.749 ± 0.134 | 3.803 ± 0.103
HumanNet | GAT | RandProjSparse | 3.477 ± 0.135 | 2.811 ± 0.126 | 3.773 ± 0.085
HumanNet | GAT | SVD | 3.691 ± 0.080 | 2.653 ± 0.101 | 3.792 ± 0.076
HumanNet | GAT | LapEigMap | 3.438 ± 0.269 | 2.837 ± 0.180 | 3.907 ± 0.034
HumanNet | GAT | LINE1 | 3.630 ± 0.058 | 2.794 ± 0.163 | 3.857 ± 0.085
HumanNet | GAT | LINE2 | 3.430 ± 0.133 | 2.867 ± 0.051 | 3.779 ± 0.123
HumanNet | GAT | Node2vec | 3.662 ± 0.036 | 2.975 ± 0.027 | 3.947 ± 0.090
HumanNet | GAT | Walklets | 3.483 ± 0.103 | 2.887 ± 0.092 | 3.885 ± 0.112
HumanNet | GAT | Embedding | 3.539 ± 0.227 | 2.797 ± 0.087 | 3.738 ± 0.090
HumanNet | GAT | AdjEmbBag | 3.592 ± 0.086 | 2.802 ± 0.065 | 3.692 ± 0.075
ComPPIHumanInt | GCN | Constant | 0.263 ± 0.026 | 0.307 ± 0.014 | 0.440 ± 0.039
ComPPIHumanInt | GCN | RandomNormal | 1.544 ± 0.024 | 1.075 ± 0.058 | 2.411 ± 0.064
ComPPIHumanInt | GCN | OneHotLogDeg | 1.394 ± 0.077 | 0.964 ± 0.043 | 2.244 ± 0.072
ComPPIHumanInt | GCN | RandomWalkDiag | 0.946 ± 0.109 | 0.759 ± 0.053 | 1.652 ± 0.081
ComPPIHumanInt | GCN | Adj | 1.471 ± 0.024 | 1.037 ± 0.074 | 2.320 ± 0.049
ComPPIHumanInt | GCN | RandProjGaussian | 1.527 ± 0.067 | 1.069 ± 0.071 | 2.649 ± 0.030
ComPPIHumanInt | GCN | RandProjSparse | 1.558 ± 0.069 | 1.030 ± 0.046 | 2.593 ± 0.058
ComPPIHumanInt | GCN | SVD | 1.459 ± 0.026 | 1.112 ± 0.056 | 2.622 ± 0.091
ComPPIHumanInt | GCN | LapEigMap | 1.626 ± 0.037 | 1.110 ± 0.061 | 2.484 ± 0.058
ComPPIHumanInt | GCN | LINE1 | 1.536 ± 0.103 | 1.057 ± 0.049 | 2.540 ± 0.049
ComPPIHumanInt | GCN | LINE2 | 1.606 ± 0.062 | 1.060 ± 0.041 | 2.588 ± 0.016
ComPPIHumanInt | GCN | Node2vec | 1.546 ± 0.080 | 1.072 ± 0.047 | 2.744 ± 0.067
ComPPIHumanInt | GCN | Walklets | 1.564 ± 0.053 | 1.086 ± 0.054 | 2.593 ± 0.034
ComPPIHumanInt | GCN | Embedding | 1.356 ± 0.040 | 0.863 ± 0.068 | 2.190 ± 0.065
ComPPIHumanInt | GCN | AdjEmbBag | 1.390 ± 0.042 | 0.929 ± 0.071 | 2.366 ± 0.080
ComPPIHumanInt | GAT | Constant | 0.323 ± 0.052 | 0.327 ± 0.056 | 0.366 ± 0.047
ComPPIHumanInt | GAT | RandomNormal | 1.386 ± 0.072 | 1.171 ± 0.107 | 2.584 ± 0.079
ComPPIHumanInt | GAT | OneHotLogDeg | 1.287 ± 0.148 | 0.756 ± 0.272 | 2.197 ± 0.222
ComPPIHumanInt | GAT | RandomWalkDiag | 1.197 ± 0.097 | 0.811 ± 0.044 | 1.933 ± 0.211
ComPPIHumanInt | GAT | Adj | 1.348 ± 0.198 | 0.791 ± 0.095 | 2.087 ± 0.035
ComPPIHumanInt | GAT | RandProjGaussian | 1.565 ± 0.027 | 1.074 ± 0.065 | 2.706 ± 0.107
ComPPIHumanInt | GAT | RandProjSparse | 1.562 ± 0.067 | 1.107 ± 0.029 | 2.728 ± 0.068
ComPPIHumanInt | GAT | SVD | 1.382 ± 0.076 | 1.050 ± 0.054 | 2.730 ± 0.058
ComPPIHumanInt | GAT | LapEigMap | 1.622 ± 0.044 | 1.192 ± 0.043 | 2.742 ± 0.072
ComPPIHumanInt | GAT | LINE1 | 1.633 ± 0.055 | 1.134 ± 0.071 | 2.625 ± 0.085
ComPPIHumanInt | GAT | LINE2 | 1.569 ± 0.034 | 1.112 ± 0.067 | 2.656 ± 0.071
ComPPIHumanInt | GAT | Node2vec | 1.572 ± 0.064 | 1.169 ± 0.079 | 2.766 ± 0.064
ComPPIHumanInt | GAT | Walklets | 1.623 ± 0.028 | 1.188 ± 0.070 | 2.789 ± 0.043
ComPPIHumanInt | GAT | Embedding | 1.544 ± 0.036 | 1.011 ± 0.082 | 2.487 ± 0.072
ComPPIHumanInt | GAT | AdjEmbBag | 0.887 ± 0.078 | 0.611 ± 0.061 | 1.724 ± 0.066
BioPlex | GCN | Constant | 0.342 ± 0.032 | 0.356 ± 0.052 | 0.488 ± 0.083
BioPlex | GCN | RandomNormal | 1.304 ± 0.016 | 0.868 ± 0.030 | 2.445 ± 0.064
BioPlex | GCN | OneHotLogDeg | 1.225 ± 0.065 | 0.909 ± 0.050 | 2.500 ± 0.044
BioPlex | GCN | RandomWalkDiag | 1.242 ± 0.043 | 0.835 ± 0.038 | 2.461 ± 0.138
BioPlex | GCN | Adj | 1.182 ± 0.067 | 0.817 ± 0.023 | 2.613 ± 0.078
BioPlex | GCN | RandProjGaussian | 1.218 ± 0.030 | 0.855 ± 0.066 | 2.583 ± 0.061
BioPlex | GCN | RandProjSparse | 1.256 ± 0.085 | 0.869 ± 0.017 | 2.563 ± 0.053
BioPlex | GCN | SVD | 1.273 ± 0.055 | 0.707 ± 0.046 | 2.513 ± 0.036
BioPlex | GCN | LapEigMap | 1.206 ± 0.038 | 0.785 ± 0.057 | 2.382 ± 0.026
BioPlex | GCN | LINE1 | 1.234 ± 0.028 | 0.790 ± 0.033 | 2.604 ± 0.073
BioPlex | GCN | LINE2 | 1.242 ± 0.059 | 0.789 ± 0.053 | 2.544 ± 0.060
BioPlex | GCN | Node2vec | 1.206 ± 0.042 | 0.784 ± 0.060 | 2.549 ± 0.074
BioPlex | GCN | Walklets | 1.215 ± 0.060 | 0.817 ± 0.045 | 2.558 ± 0.065
BioPlex | GCN | Embedding | 1.157 ± 0.024 | 0.772 ± 0.066 | 2.582 ± 0.034
BioPlex | GCN | AdjEmbBag | 1.240 ± 0.038 | 0.770 ± 0.021 | 2.582 ± 0.100
BioPlex | GAT | Constant | 0.275 ± 0.063 | 0.275 ± 0.146 | 0.569 ± 0.143
BioPlex | GAT | RandomNormal | 1.256 ± 0.070 | 0.884 ± 0.047 | 2.489 ± 0.063
BioPlex | GAT | OneHotLogDeg | 1.089 ± 0.100 | 0.793 ± 0.044 | 2.479 ± 0.098
BioPlex | GAT | RandomWalkDiag | 1.041 ± 0.082 | 0.788 ± 0.090 | 2.430 ± 0.085
BioPlex | GAT | Adj | 1.157 ± 0.029 | 0.811 ± 0.042 | 2.582 ± 0.069
BioPlex | GAT | RandProjGaussian | 1.222 ± 0.064 | 0.924 ± 0.041 | 2.588 ± 0.094
BioPlex | GAT | RandProjSparse | 1.213 ± 0.041 | 0.882 ± 0.063 | 2.632 ± 0.009
BioPlex | GAT | SVD | 1.139 ± 0.089 | 0.721 ± 0.073 | 2.539 ± 0.045
BioPlex | GAT | LapEigMap | 1.204 ± 0.062 | 0.832 ± 0.058 | 2.371 ± 0.064
BioPlex | GAT | LINE1 | 1.201 ± 0.060 | 0.802 ± 0.035 | 2.453 ± 0.074
BioPlex | GAT | LINE2 | 1.242 ± 0.034 | 0.850 ± 0.035 | 2.478 ± 0.057
BioPlex | GAT | Node2vec | 1.229 ± 0.039 | 0.768 ± 0.042 | 2.369 ± 0.061
BioPlex | GAT | Walklets | 1.196 ± 0.027 | 0.855 ± 0.073 | 2.531 ± 0.059
BioPlex | GAT | Embedding | 1.159 ± 0.044 | 0.746 ± 0.088 | 2.493 ± 0.071
BioPlex | GAT | AdjEmbBag | 1.183 ± 0.043 | 0.784 ± 0.050 | 2.522 ± 0.056
HuRI | GCN | Constant | 0.346 ± 0.015 | 0.529 ± 0.033 | 0.384 ± 0.055
HuRI | GCN | RandomNormal | 0.504 ± 0.108 | 0.676 ± 0.080 | 0.956 ± 0.125
HuRI | GCN | OneHotLogDeg | 0.552 ± 0.080 | 0.579 ± 0.076 | 1.047 ± 0.102
HuRI | GCN | RandomWalkDiag | 0.549 ± 0.107 | 0.484 ± 0.045 | 1.027 ± 0.129
HuRI | GCN | Adj | 0.587 ± 0.023 | 0.631 ± 0.032 | 0.942 ± 0.075
HuRI | GCN | RandProjGaussian | 0.676 ± 0.070 | 0.673 ± 0.046 | 1.016 ± 0.186
HuRI | GCN | RandProjSparse | 0.686 ± 0.087 | 0.689 ± 0.055 | 1.076 ± 0.101
HuRI | GCN | SVD | 0.628 ± 0.056 | 0.667 ± 0.010 | 1.005 ± 0.049
HuRI | GCN | LapEigMap | 0.581 ± 0.014 | 0.526 ± 0.045 | 0.997 ± 0.085
HuRI | GCN | LINE1 | 0.658 ± 0.069 | 0.690 ± 0.054 | 1.053 ± 0.089
HuRI | GCN | LINE2 | 0.632 ± 0.125 | 0.741 ± 0.074 | 1.106 ± 0.084
HuRI | GCN | Node2vec | 0.566 ± 0.061 | 0.738 ± 0.053 | 1.126 ± 0.114
HuRI | GCN | Walklets | 0.596 ± 0.078 | 0.681 ± 0.032 | 1.179 ± 0.056
HuRI | GCN | Embedding | 0.572 ± 0.070 | 0.606 ± 0.034 | 0.888 ± 0.109
HuRI | GCN | AdjEmbBag | 0.652 ± 0.039 | 0.660 ± 0.030 | 1.059 ± 0.076
HuRI | GAT | Constant | 0.305 ± 0.078 | 0.393 ± 0.077 | 0.345 ± 0.060
HuRI | GAT | RandomNormal | 0.524 ± 0.159 | 0.541 ± 0.131 | 0.968 ± 0.114
HuRI | GAT | OneHotLogDeg | 0.465 ± 0.060 | 0.510 ± 0.053 | 0.872 ± 0.189
HuRI | GAT | RandomWalkDiag | 0.473 ± 0.057 | 0.445 ± 0.129 | 0.972 ± 0.057
HuRI | GAT | Adj | 0.683 ± 0.109 | 0.528 ± 0.053 | 0.956 ± 0.157
HuRI | GAT | RandProjGaussian | 0.592 ± 0.086 | 0.530 ± 0.071 | 1.028 ± 0.139
HuRI | GAT | RandProjSparse | 0.603 ± 0.115 | 0.456 ± 0.085 | 1.017 ± 0.064
HuRI | GAT | SVD | 0.522 ± 0.099 | 0.412 ± 0.029 | 1.021 ± 0.099
HuRI | GAT | LapEigMap | 0.599 ± 0.082 | 0.557 ± 0.048 | 1.071 ± 0.080
HuRI | GAT | LINE1 | 0.585 ± 0.068 | 0.652 ± 0.043 | 1.101 ± 0.033
HuRI | GAT | LINE2 | 0.634 ± 0.067 | 0.743 ± 0.053 | 1.095 ± 0.054
HuRI | GAT | Node2vec | 0.577 ± 0.017 | 0.657 ± 0.067 | 1.116 ± 0.129
HuRI | GAT | Walklets | 0.630 ± 0.065 | 0.602 ± 0.032 | 1.085 ± 0.085
HuRI | GAT | Embedding | 0.647 ± 0.087 | 0.601 ± 0.020 | 0.941 ± 0.081
HuRI | GAT | AdjEmbBag | 0.603 ± 0.056 | 0.581 ± 0.040 | 0.902 ± 0.059
OmniPath | GCN | Constant | 0.415 ± 0.083 | 0.416 ± 0.044 | 0.451 ± 0.046
OmniPath | GCN | RandomNormal | 1.417 ± 0.042 | 0.910 ± 0.126 | 1.798 ± 0.073
OmniPath | GCN | OneHotLogDeg | 1.195 ± 0.041 | 0.930 ± 0.046 | 1.753 ± 0.090
OmniPath | GCN | RandomWalkDiag | 0.847 ± 0.038 | 0.762 ± 0.045 | 1.427 ± 0.056
OmniPath | GCN | Adj | 1.444 ± 0.019 | 1.118 ± 0.062 | 2.050 ± 0.049
OmniPath | GCN | RandProjGaussian | 1.381 ± 0.089 | 1.014 ± 0.053 | 1.935 ± 0.075
OmniPath | GCN | RandProjSparse | 1.338 ± 0.073 | 1.066 ± 0.060 | 1.935 ± 0.020
OmniPath | GCN | SVD | 1.281 ± 0.052 | 0.849 ± 0.024 | 1.884 ± 0.048
OmniPath | GCN | LapEigMap | 1.408 ± 0.030 | 1.108 ± 0.092 | 2.120 ± 0.040
OmniPath | GCN | LINE1 | 1.366 ± 0.075 | 0.946 ± 0.049 | 2.027 ± 0.042
OmniPath | GCN | LINE2 | 1.338 ± 0.065 | 0.963 ± 0.054 | 1.938 ± 0.096
OmniPath | GCN | Node2vec | 1.384 ± 0.046 | 1.024 ± 0.020 | 1.982 ± 0.054
OmniPath | GCN | Walklets | 1.433 ± 0.050 | 1.036 ± 0.041 | 2.101 ± 0.033
OmniPath | GCN | Embedding | 1.329 ± 0.062 | 0.831 ± 0.069 | 1.504 ± 0.048
OmniPath | GCN | AdjEmbBag | 1.373 ± 0.042 | 1.085 ± 0.024 | 1.917 ± 0.054
OmniPath | GAT | Constant | 0.353 ± 0.044 | 0.343 ± 0.077 | 0.381 ± 0.053
OmniPath | GAT | RandomNormal | 1.374 ± 0.041 | 1.034 ± 0.091 | 1.945 ± 0.104
OmniPath | GAT | OneHotLogDeg | 0.775 ± 0.142 | 0.773 ± 0.135 | 1.685 ± 0.110
OmniPath | GAT | RandomWalkDiag | 0.754 ± 0.171 | 0.657 ± 0.159 | 1.580 ± 0.181
OmniPath | GAT | Adj | 1.257 ± 0.179 | 1.184 ± 0.068 | 1.865 ± 0.095
OmniPath | GAT | RandProjGaussian | 1.146 ± 0.117 | 1.032 ± 0.041 | 2.079 ± 0.126
OmniPath | GAT | RandProjSparse | 1.216 ± 0.075 | 0.916 ± 0.121 | 2.023 ± 0.272
OmniPath | GAT | SVD | 1.032 ± 0.125 | 0.889 ± 0.061 | 1.916 ± 0.159
OmniPath | GAT | LapEigMap | 1.344 ± 0.052 | 1.103 ± 0.087 | 2.105 ± 0.061
OmniPath | GAT | LINE1 | 1.263 ± 0.082 | 0.936 ± 0.035 | 2.077 ± 0.108
OmniPath | GAT | LINE2 | 1.376 ± 0.041 | 1.031 ± 0.104 | 2.026 ± 0.081
OmniPath | GAT | Node2vec | 1.317 ± 0.040 | 1.014 ± 0.090 | 2.096 ± 0.063
OmniPath | GAT | Walklets | 1.248 ± 0.099 | 1.022 ± 0.059 | 2.190 ± 0.052
OmniPath | GAT | Embedding | 1.371 ± 0.032 | 0.917 ± 0.103 | 1.738 ± 0.023
OmniPath | GAT | AdjEmbBag | 0.752 ± 0.076 | 0.724 ± 0.140 | 1.399 ± 0.171
ProteomeHD | GCN | Constant | 0.289 ± 0.112 | 0.489 ± 0.021 | 0.729 ± 0.481
ProteomeHD | GCN | RandomNormal | 0.695 ± 0.128 | 0.583 ± 0.060 | 1.267 ± 0.074
ProteomeHD | GCN | OneHotLogDeg | 0.698 ± 0.060 | 0.692 ± 0.018 | 1.413 ± 0.078
ProteomeHD | GCN | RandomWalkDiag | 0.645 ± 0.072 | 0.665 ± 0.053 | 1.247 ± 0.049
ProteomeHD | GCN | Adj | 0.656 ± 0.060 | 0.588 ± 0.067 | 1.386 ± 0.054
ProteomeHD | GCN | RandProjGaussian | 0.617 ± 0.096 | 0.714 ± 0.070 | 1.412 ± 0.089
ProteomeHD | GCN | RandProjSparse | 0.685 ± 0.074 | 0.669 ± 0.084 | 1.461 ± 0.112
ProteomeHD | GCN | SVD | 0.809 ± 0.073 | 0.575 ± 0.081 | 1.474 ± 0.073
ProteomeHD | GCN | LapEigMap | 0.715 ± 0.017 | 0.535 ± 0.090 | 1.272 ± 0.035
ProteomeHD | GCN | LINE1 | 0.588 ± 0.061 | 0.594 ± 0.042 | 1.316 ± 0.082
ProteomeHD | GCN | LINE2 | 0.646 ± 0.071 | 0.630 ± 0.034 | 1.323 ± 0.073
ProteomeHD | GCN | Node2vec | 0.682 ± 0.049 | 0.621 ± 0.071 | 1.379 ± 0.102
ProteomeHD | GCN | Walklets | 0.635 ± 0.093 | 0.649 ± 0.082 | 1.400 ± 0.141
ProteomeHD | GCN | Embedding | 0.687 ± 0.074 | 0.539 ± 0.024 | 1.293 ± 0.158
ProteomeHD | GCN | AdjEmbBag | 0.600 ± 0.096 | 0.621 ± 0.123 | 1.480 ± 0.083
ProteomeHD | GAT | Constant | 0.375 ± 0.158 | 0.473 ± 0.082 | 0.675 ± 0.233
ProteomeHD | GAT | RandomNormal | 0.655 ± 0.115 | 0.657 ± 0.129 | 1.458 ± 0.049
ProteomeHD | GAT | OneHotLogDeg | 0.705 ± 0.115 | 0.687 ± 0.063 | 1.339 ± 0.199
ProteomeHD | GAT | RandomWalkDiag | 0.700 ± 0.065 | 0.691 ± 0.077 | 1.196 ± 0.277
ProteomeHD | GAT | Adj | 0.731 ± 0.149 | 0.625 ± 0.068 | 1.487 ± 0.080
ProteomeHD | GAT | RandProjGaussian | 0.808 ± 0.118 | 0.701 ± 0.036 | 1.491 ± 0.102
ProteomeHD | GAT | RandProjSparse | 0.725 ± 0.130 | 0.710 ± 0.061 | 1.502 ± 0.080
ProteomeHD | GAT | SVD | 0.792 ± 0.137 | 0.828 ± 0.039 | 1.556 ± 0.057
ProteomeHD | GAT | LapEigMap | 0.678 ± 0.041 | 0.534 ± 0.135 | 1.394 ± 0.079
ProteomeHD | GAT | LINE1 | 0.638 ± 0.159 | 0.627 ± 0.094 | 1.338 ± 0.120
ProteomeHD | GAT | LINE2 | 0.579 ± 0.079 | 0.599 ± 0.115 | 1.433 ± 0.076
ProteomeHD | GAT | Node2vec | 0.839 ± 0.100 | 0.702 ± 0.075 | 1.446 ± 0.089
ProteomeHD | GAT | Walklets | 0.644 ± 0.106 | 0.653 ± 0.098 | 1.544 ± 0.064
ProteomeHD | GAT | Embedding | 0.650 ± 0.141 | 0.659 ± 0.100 | 1.460 ± 0.041
ProteomeHD | GAT | AdjEmbBag | 0.685 ± 0.064 | 0.666 ± 0.079 | 1.408 ± 0.052
SIGNOR | GCN | Constant | 0.388 ± 0.066 | 0.425 ± 0.157 | 0.500 ± 0.148
SIGNOR | GCN | RandomNormal | 1.478 ± 0.050 | 1.349 ± 0.080 | 1.571 ± 0.056
SIGNOR | GCN | OneHotLogDeg | 1.427 ± 0.032 | 1.183 ± 0.087 | 1.754 ± 0.066
SIGNOR | GCN | RandomWalkDiag | 1.268 ± 0.011 | 1.022 ± 0.043 | 1.601 ± 0.127
SIGNOR | GCN | Adj | 1.461 ± 0.042 | 1.306 ± 0.058 | 1.575 ± 0.087
SIGNOR | GCN | RandProjGaussian | 1.488 ± 0.046 | 1.253 ± 0.083 | 1.764 ± 0.062
SIGNOR | GCN | RandProjSparse | 1.535 ± 0.067 | 1.369 ± 0.107 | 1.780 ± 0.071
SIGNOR | GCN | SVD | 1.396 ± 0.093 | 1.279 ± 0.071 | 1.788 ± 0.066
SIGNOR | GCN | LapEigMap | 1.597 ± 0.035 | 1.345 ± 0.042 | 1.746 ± 0.057
SIGNOR | GCN | LINE1 | 1.546 ± 0.068 | 1.354 ± 0.049 | 1.813 ± 0.101
SIGNOR | GCN | LINE2 | 1.506 ± 0.084 | 1.390 ± 0.103 | 1.708 ± 0.076
SIGNOR | GCN | Node2vec | 1.566 ± 0.071 | 1.345 ± 0.068 | 1.887 ± 0.115
SIGNOR | GCN | Walklets | 1.562 ± 0.076 | 1.333 ± 0.088 | 1.732 ± 0.049
SIGNOR | GCN | Embedding | 1.396 ± 0.035 | 1.200 ± 0.076 | 1.498 ± 0.064
SIGNOR | GCN | AdjEmbBag | 1.496 ± 0.071 | 1.229 ± 0.130 | 1.665 ± 0.100
SIGNOR | GAT | Constant | 0.279 ± 0.038 | 0.242 ± 0.051 | 0.488 ± 0.176
SIGNOR | GAT | RandomNormal | 1.603 ± 0.049 | 1.281 ± 0.060 | 1.707 ± 0.050
SIGNOR | GAT | OneHotLogDeg | 1.180 ± 0.069 | 1.009 ± 0.087 | 1.576 ± 0.047
SIGNOR | GAT | RandomWalkDiag | 1.185 ± 0.115 | 0.935 ± 0.151 | 1.759 ± 0.115
SIGNOR | GAT | Adj | 1.538 ± 0.104 | 1.251 ± 0.023 | 1.529 ± 0.053
SIGNOR | GAT | RandProjGaussian | 1.531 ± 0.085 | 1.382 ± 0.107 | 1.751 ± 0.037
SIGNOR | GAT | RandProjSparse | 1.508 ± 0.033 | 1.290 ± 0.073 | 1.715 ± 0.090
SIGNOR | GAT | SVD | 1.529 ± 0.048 | 1.178 ± 0.102 | 1.718 ± 0.017
SIGNOR | GAT | LapEigMap | 1.578 ± 0.053 | 1.326 ± 0.059 | 1.749 ± 0.081
SIGNOR | GAT | LINE1 | 1.583 ± 0.036 | 1.282 ± 0.052 | 1.844 ± 0.071
SIGNOR | GAT | LINE2 | 1.630 ± 0.038 | 1.325 ± 0.083 | 1.863 ± 0.048
SIGNOR | GAT | Node2vec | 1.519 ± 0.066 | 1.287 ± 0.047 | 1.807 ± 0.033
SIGNOR | GAT | Walklets | 1.579 ± 0.045 | 1.411 ± 0.025 | 1.762 ± 0.052
SIGNOR | GAT | Embedding | 1.474 ± 0.095 | 1.254 ± 0.051 | 1.695 ± 0.053
SIGNOR | GAT | AdjEmbBag | 1.457 ± 0.198 | 1.154 ± 0.096 | 1.405 ± 0.156

Table C.2: Baseline performance reference. Reported values are average test APOP scores aggregated over five random seeds. The best performance achieved by (1) GNNs or (2) logistic regression and label propagation is bolded for each dataset (network × label). The best performance across the two method groups is additionally colored green.

Network | Model | Feature | DISEASES | DisGeNET | GOBP
BioGRID | GCN | OneHotLogDeg | 1.111 ± 0.043 | 0.773 ± 0.035 | 2.022 ± 0.100
BioGRID | SAGE | OneHotLogDeg | 0.840 ± 0.105 | 0.665 ± 0.071 | 1.510 ± 0.114
BioGRID | GIN | OneHotLogDeg | 0.720 ± 0.061 | 0.700 ± 0.050 | 1.443 ± 0.120
BioGRID | GAT | OneHotLogDeg | 0.946 ± 0.289 | 0.552 ± 0.111 | 1.592 ± 0.408
BioGRID | GatedGCN | OneHotLogDeg | 1.014 ± 0.065 | 0.753 ± 0.047 | 1.924 ± 0.055
BioGRID | LabelProp | – | 1.210 ± 0.000 | 0.931 ± 0.000 | 1.885 ± 0.000
BioGRID | LogReg | Adj | 1.328 ± 0.001 | 0.743 ± 0.000 | 2.528 ± 0.000
BioGRID | LogReg | LapEigMap | 1.288 ± 0.001 | 0.864 ± 0.002 | 2.149 ± 0.000
BioGRID | LogReg | SVD | 0.881 ± 0.010 | 0.724 ± 0.003 | 2.088 ± 0.003
BioGRID | LogReg | LINE1 | 1.117 ± 0.024 | 0.722 ± 0.004 | 2.264 ± 0.019
BioGRID | LogReg | LINE2 | 1.132 ± 0.021 | 0.825 ± 0.007 | 2.351 ± 0.017
BioGRID | LogReg | Node2vec | 1.116 ± 0.054 | 0.836 ± 0.029 | 2.571 ± 0.015
BioGRID | LogReg | Walklets | 1.023 ± 0.052 | 0.786 ± 0.046 | 2.189 ± 0.020
HumanNet | GCN | OneHotLogDeg | 2.902 ± 0.050 | 2.452 ± 0.107 | 3.524 ± 0.061
HumanNet | SAGE | OneHotLogDeg | 2.850 ± 0.107 | 2.356 ± 0.093 | 3.326 ± 0.067
HumanNet | GIN | OneHotLogDeg | 2.378 ± 0.126 | 2.019 ± 0.113 | 3.151 ± 0.017
HumanNet | GAT | OneHotLogDeg | 3.052 ± 0.312 | 2.547 ± 0.207 | 3.571 ± 0.159
HumanNet | GatedGCN | OneHotLogDeg | 3.004 ± 0.132 | 2.327 ± 0.026 | 3.486 ± 0.047
HumanNet | LabelProp | – | 3.728 ± 0.000 | 3.059 ± 0.000 | 3.806 ± 0.000
HumanNet | LogReg | Adj | 3.812 ± 0.000 | 3.053 ± 0.000 | 3.964 ± 0.006
HumanNet | LogReg | LapEigMap | 2.737 ± 0.003 | 2.301 ± 0.000 | 3.778 ± 0.001
HumanNet | LogReg | SVD | 2.785 ± 0.002 | 2.412 ± 0.004 | 3.618 ± 0.001
HumanNet | LogReg | LINE1 | 2.178 ± 0.010 | 1.632 ± 0.005 | 3.348 ± 0.011
HumanNet | LogReg | LINE2 | 2.270 ± 0.016 | 1.679 ± 0.005 | 3.485 ± 0.004
HumanNet | LogReg | Node2vec | 3.316 ± 0.020 | 2.433 ± 0.029 | 4.036 ± 0.019
HumanNet | LogReg | Walklets | 2.670 ± 0.069 | 2.050 ± 0.115 | 3.081 ± 0.054
ComPPIHumanInt | GCN | OneHotLogDeg | 1.359 ± 0.094 | 0.993 ± 0.042 | 2.251 ± 0.064
ComPPIHumanInt | SAGE | OneHotLogDeg | 1.101 ± 0.063 | 0.724 ± 0.076 | 1.865 ± 0.091
ComPPIHumanInt | GIN | OneHotLogDeg | 0.963 ± 0.057 | 0.777 ± 0.057 | 1.621 ± 0.077
ComPPIHumanInt | GAT | OneHotLogDeg | 1.287 ± 0.148 | 0.756 ± 0.272 | 2.197 ± 0.222
ComPPIHumanInt | GatedGCN | OneHotLogDeg | 1.166 ± 0.045 | 0.861 ± 0.050 | 2.044 ± 0.094
ComPPIHumanInt | LabelProp | – | 1.352 ± 0.000 | 1.106 ± 0.000 | 2.076 ± 0.000
ComPPIHumanInt | LogReg | Adj | 1.431 ± 0.000 | 1.016 ± 0.000 | 2.707 ± 0.002
ComPPIHumanInt | LogReg | LapEigMap | 1.257 ± 0.001 | 1.045 ± 0.005 | 2.177 ± 0.002
ComPPIHumanInt | LogReg | SVD | 0.888 ± 0.003 | 0.702 ± 0.001 | 1.999 ± 0.002
ComPPIHumanInt | LogReg | LINE1 | 1.135 ± 0.023 | 0.905 ± 0.014 | 2.438 ± 0.019
ComPPIHumanInt | LogReg | LINE2 | 1.185 ± 0.021 | 0.895 ± 0.012 | 2.495 ± 0.024
ComPPIHumanInt | LogReg | Node2vec | 1.341 ± 0.034 | 1.074 ± 0.005 | 2.806 ± 0.049
ComPPIHumanInt | LogReg | Walklets | 1.073 ± 0.053 | 0.890 ± 0.032 | 2.109 ± 0.104
BioPlex | GCN | OneHotLogDeg | 1.277 ± 0.034 | 0.895 ± 0.046 | 2.535 ± 0.065
BioPlex | SAGE | OneHotLogDeg | 1.118 ± 0.043 | 0.787 ± 0.049 | 2.215 ± 0.084
BioPlex | GIN | OneHotLogDeg | 1.182 ± 0.083 | 0.822 ± 0.086 | 2.360 ± 0.054
BioPlex | GAT | OneHotLogDeg | 1.089 ± 0.100 | 0.793 ± 0.044 | 2.479 ± 0.098
BioPlex | GatedGCN | OneHotLogDeg | 0.970 ± 0.049 | 0.723 ± 0.052 | 2.303 ± 0.114
BioPlex | LabelProp | – | 0.964 ± 0.000 | 0.556 ± 0.000 | 2.174 ± 0.000
BioPlex | LogReg | Adj | 1.087 ± 0.003 | 0.838 ± 0.011 | 2.467 ± 0.001
BioPlex | LogReg | LapEigMap | 1.147 ± 0.005 | 0.903 ± 0.019 | 2.298 ± 0.017
BioPlex | LogReg | SVD | 0.824 ± 0.006 | 0.588 ± 0.001 | 2.170 ± 0.022
BioPlex | LogReg | LINE1 | 0.914 ± 0.062 | 0.653 ± 0.011 | 2.475 ± 0.042
BioPlex | LogReg | LINE2 | 0.843 ± 0.050 | 0.627 ± 0.015 | 2.321 ± 0.015
BioPlex | LogReg | Node2vec | 0.816 ± 0.073 | 0.639 ± 0.053 | 2.194 ± 0.039
BioPlex | LogReg | Walklets | 0.742 ± 0.067 | 0.688 ± 0.053 | 1.822 ± 0.046
HuRI | GCN | OneHotLogDeg | 0.529 ± 0.075 | 0.625 ± 0.089 | 1.040 ± 0.049
HuRI | SAGE | OneHotLogDeg | 0.491 ± 0.083 | 0.541 ± 0.029 | 0.937 ± 0.048
HuRI | GIN | OneHotLogDeg | 0.591 ± 0.089 | 0.477 ± 0.052 | 1.008 ± 0.113
HuRI | GAT | OneHotLogDeg | 0.465 ± 0.060 | 0.510 ± 0.053 | 0.872 ± 0.189
HuRI | GatedGCN | OneHotLogDeg | 0.602 ± 0.090 | 0.591 ± 0.098 | 0.957 ± 0.074
HuRI | LabelProp | – | 0.545 ± 0.000 | 0.598 ± 0.000 | 0.962 ± 0.000
HuRI | LogReg | Adj | 0.494 ± 0.003 | 0.455 ± 0.000 | 1.002 ± 0.003
HuRI | LogReg | LapEigMap | 0.538 ± 0.007 | 0.596 ± 0.001 | 0.985 ± 0.023
HuRI | LogReg | SVD | 0.545 ± 0.027 | 0.433 ± 0.005 | 0.721 ± 0.063
HuRI | LogReg | LINE1 | 0.608 ± 0.030 | 0.596 ± 0.030 | 1.025 ± 0.040
HuRI | LogReg | LINE2 | 0.575 ± 0.062 | 0.557 ± 0.018 | 1.017 ± 0.093
HuRI | LogReg | Node2vec | 0.633 ± 0.065 | 0.500 ± 0.043 | 0.904 ± 0.078
HuRI | LogReg | Walklets | 0.485 ± 0.071 | 0.458 ± 0.088 | 0.707 ± 0.069
OmniPath | GCN | OneHotLogDeg | 1.196 ± 0.054 | 0.983 ± 0.067 | 1.742 ± 0.107
OmniPath | SAGE | OneHotLogDeg | 0.913 ± 0.074 | 0.779 ± 0.051 | 1.539 ± 0.074
OmniPath | GIN | OneHotLogDeg | 0.795 ± 0.044 | 0.731 ± 0.030 | 1.521 ± 0.086
OmniPath | GAT | OneHotLogDeg | 0.775 ± 0.142 | 0.773 ± 0.135 | 1.685 ± 0.110
OmniPath | GatedGCN | OneHotLogDeg | 1.112 ± 0.039 | 0.898 ± 0.031 | 1.698 ± 0.059
OmniPath | LabelProp | – | 1.358 ± 0.000 | 0.897 ± 0.000 | 1.593 ± 0.000
OmniPath | LogReg | Adj | 1.051 ± 0.001 | 0.709 ± 0.000 | 1.862 ± 0.000
OmniPath | LogReg | LapEigMap | 1.319 ± 0.000 | 1.060 ± 0.004 | 1.943 ± 0.004
OmniPath | LogReg | SVD | 0.866 ± 0.001 | 0.635 ± 0.006 | 1.512 ± 0.012
OmniPath | LogReg | LINE1 | 0.834 ± 0.026 | 0.787 ± 0.007 | 1.893 ± 0.032
OmniPath | LogReg | LINE2 | 0.913 ± 0.033 | 0.692 ± 0.012 | 1.844 ± 0.032
OmniPath | LogReg | Node2vec | 1.178 ± 0.035 | 0.924 ± 0.035 | 2.125 ± 0.059
OmniPath | LogReg | Walklets | 0.915 ± 0.041 | 0.795 ± 0.045 | 1.704 ± 0.064
ProteomeHD | GCN | OneHotLogDeg | 0.690 ± 0.064 | 0.637 ± 0.066 | 1.459 ± 0.138
ProteomeHD | SAGE | OneHotLogDeg | 0.556 ± 0.107 | 0.644 ± 0.030 | 1.304 ± 0.041
ProteomeHD | GIN | OneHotLogDeg | 0.608 ± 0.083 | 0.606 ± 0.107 | 1.350 ± 0.221
ProteomeHD | GAT | OneHotLogDeg | 0.705 ± 0.115 | 0.687 ± 0.063 | 1.339 ± 0.199
ProteomeHD | GatedGCN | OneHotLogDeg | 0.521 ± 0.101 | 0.712 ± 0.044 | 1.169 ± 0.052
ProteomeHD | LabelProp | – | 0.709 ± 0.000 | 0.669 ± 0.000 | 1.036 ± 0.000
ProteomeHD | LogReg | Adj | 0.849 ± 0.047 | 0.619 ± 0.005 | 1.329 ± 0.141
ProteomeHD | LogReg | LapEigMap | 0.955 ± 0.001 | 0.721 ± 0.016 | 1.219 ± 0.039
ProteomeHD | LogReg | SVD | 0.878 ± 0.002 | 0.736 ± 0.008 | 1.508 ± 0.079
ProteomeHD | LogReg | LINE1 | 0.566 ± 0.089 | 0.667 ± 0.023 | 1.307 ± 0.060
ProteomeHD | LogReg | LINE2 | 0.576 ± 0.099 | 0.663 ± 0.032 | 1.226 ± 0.085
ProteomeHD | LogReg | Node2vec | 0.643 ± 0.076 | 0.772 ± 0.068 | 1.357 ± 0.040
ProteomeHD | LogReg | Walklets | 0.432 ± 0.050 | 0.525 ± 0.111 | 1.048 ± 0.061
SIGNOR | GCN | OneHotLogDeg | 1.387 ± 0.067 | 1.167 ± 0.079 | 1.732 ± 0.031
SIGNOR | SAGE | OneHotLogDeg | 0.924 ± 0.106 | 0.838 ± 0.086 | 1.441 ± 0.094
SIGNOR | GIN | OneHotLogDeg | 1.106 ± 0.058 | 0.870 ± 0.055 | 1.479 ± 0.090
SIGNOR | GAT | OneHotLogDeg | 1.180 ± 0.069 | 1.009 ± 0.087 | 1.576 ± 0.047
SIGNOR | GatedGCN | OneHotLogDeg | 1.067 ± 0.028 | 0.852 ± 0.060 | 1.430 ± 0.085
SIGNOR | LabelProp | – | 1.288 ± 0.000 | 1.096 ± 0.000 | 1.695 ± 0.000
SIGNOR | LogReg | Adj | 1.303 ± 0.003 | 1.052 ± 0.001 | 1.417 ± 0.004
SIGNOR | LogReg | LapEigMap | 1.306 ± 0.004 | 1.056 ± 0.005 | 1.746 ± 0.010
SIGNOR | LogReg | SVD | 0.864 ± 0.004 | 0.801 ± 0.002 | 1.183 ± 0.002
SIGNOR | LogReg | LINE1 | 1.412 ± 0.033 | 0.955 ± 0.022 | 1.781 ± 0.056
SIGNOR | LogReg | LINE2 | 1.257 ± 0.034 | 0.917 ± 0.016 | 1.572 ± 0.047
SIGNOR | LogReg | Node2vec | 1.341 ± 0.021 | 1.172 ± 0.027 | 1.684 ± 0.099
SIGNOR | LogReg | Walklets | 1.017 ± 0.101 | 0.985 ± 0.049 | 1.259 ± 0.065

[Figure C.1: two boxplot panels, (a) GCN and (b) GAT]
Figure C.1: Boxplots representing the rankings of different feature constructions when used by GNNs (lower is better). Each point in a box is the ranking of a particular node feature construction on a specific dataset. A lower rank indicates that the node feature achieved higher test performance than the others. For both GCN and GAT, node2vec appears to be the top-ranked node feature overall.
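All scores in the tables above (and the figure comparisons below) are reported in APOP. Assuming the definition used in the main text, i.e., the log2 fold-change of test average precision over the prior (the expected precision of a random ranking), the metric can be sketched as:

```python
import math

def average_precision(labels, scores):
    """Average precision of a ranking: the mean of the precision values
    computed at the rank of each positive. Assumes at least one positive."""
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda t: -t[0])]
    hits, precisions = 0, []
    for rank, label in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits

def apop(labels, scores):
    """log2 fold-change of average precision over the positive prior."""
    prior = sum(labels) / len(labels)
    return math.log2(average_precision(labels, scores) / prior)

# A perfect ranking of 2 positives among 4 genes: AP = 1, prior = 0.5,
# so APOP = log2(1 / 0.5) = 1.
score = apop([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

Under this definition a random ranking scores around 0, and each unit of APOP corresponds to a doubling of precision over the prior, which is why dense, well-annotated networks such as HumanNet can reach scores near 4.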
Figure C.2: Boxplots representing the performance improvement from using different tricks with GNNs across datasets. Each point in a box is the test APOP difference between a GNN with added tricks and the plain GNN using the OneHotLogDeg feature on a specific dataset. A positive value implies the added trick improves GNN performance.

Figure C.3: Boxplots representing the performance improvement from using C&S with logistic regression models across datasets. Each point in a box is the test APOP difference between logistic regression augmented with C&S and plain logistic regression on a specific dataset. A positive value implies C&S improves logistic regression performance.

Figure C.4: Relationships between the corrected homophily ratio and the performance difference between methods. Each panel summarizes tasks over three networks (BioGRID, HumanNet, ComPPIHumanInt) and three gene set collections (DISEASES, DisGeNET, GOBP), for a total of 1,672 tasks. In each panel, the x-axis represents the corrected homophily ratio and the y-axis represents the test APOP performance difference between the two methods. The first (A, B, C) and second (D, E, F) rows show the performance improvement of GCN and GAT, respectively, when augmented with individual tricks or the combined BoT.
The third row (G, H, I) shows the performance difference between the best GNN method and the baseline methods, including label propagation and logistic regression.

[Figure C.4 panels: (A) GCN+CS vs. GCN; (B) GCN+Node2vec vs. GCN; (C) GCN+BoT vs. GCN; (D) GAT+CS vs. GAT; (E) GAT+Node2vec vs. GAT; (F) GAT+BoT vs. GAT; (G) GAT+BoT vs. LabelProp; (H) GAT+BoT vs. LogReg+Adj; (I) GAT+BoT vs. LogReg+Node2vec]

Table C.3: Combined SOTA performance reference. LogReg* denotes the best test performance achieved by any of the logistic regression models we tested for each dataset. Colored text indicates the first (green), second (orange), and third (purple) best performance achieved for a particular dataset (network × label).

Network | Model | DISEASES | DisGeNET | GOBP
BioGRID | LabelProp | 1.210 ± 0.000 | 0.931 ± 0.000 | 1.885 ± 0.000
BioGRID | LogReg* | 1.556 ± 0.002 | 1.026 ± 0.022 | 2.571 ± 0.015
BioGRID | GCN+BoT | 1.511 ± 0.053 | 1.014 ± 0.020 | 2.411 ± 0.044
BioGRID | SAGE+BoT | 1.486 ± 0.058 | 1.031 ± 0.049 | 2.402 ± 0.061
BioGRID | GIN+BoT | 1.410 ± 0.024 | 1.007 ± 0.028 | 2.386 ± 0.022
BioGRID | GAT+BoT | 1.609 ± 0.048 | 1.037 ± 0.036 | 2.624 ± 0.070
BioGRID | GatedGCN+BoT | 1.547 ± 0.027 | 1.038 ± 0.006 | 2.517 ± 0.047
HumanNet | LabelProp | 3.728 ± 0.000 | 3.059 ± 0.000 | 3.806 ± 0.000
HumanNet | LogReg* | 3.812 ± 0.000 | 3.158 ± 0.000 | 4.053 ± 0.021
HumanNet | GCN+BoT | 3.552 ± 0.050 | 3.053 ± 0.078 | 3.921 ± 0.045
HumanNet | SAGE+BoT | 3.401 ± 0.066 | 3.052 ± 0.041 | 3.816 ± 0.083
HumanNet | GIN+BoT | 3.513 ± 0.029 | 3.054 ± 0.051 | 3.861 ± 0.063
HumanNet | GAT+BoT | 3.761 ± 0.060 | 3.100 ± 0.031 | 3.908 ± 0.086
HumanNet | GatedGCN+BoT | 3.677 ± 0.066 | 3.086 ± 0.020 | 3.889 ± 0.048
ComPPIHumanInt | LabelProp | 1.352 ± 0.000 | 1.106 ± 0.000 | 2.076 ± 0.000
ComPPIHumanInt | LogReg* | 1.644 ± 0.006 | 1.240 ± 0.009 | 2.806 ± 0.049
ComPPIHumanInt | GCN+BoT | 1.648 ± 0.012 | 1.211 ± 0.013 | 2.685 ± 0.047
ComPPIHumanInt | SAGE+BoT | 1.694 ± 0.055 | 1.210 ± 0.033 | 2.629 ± 0.082
ComPPIHumanInt | GIN+BoT | 1.608 ± 0.020 | 1.219 ± 0.006 | 2.611 ± 0.021
ComPPIHumanInt | GAT+BoT | 1.665 ± 0.035 | 1.230 ± 0.025 | 2.785 ± 0.041
ComPPIHumanInt | GatedGCN+BoT | 1.672 ± 0.053 | 1.218 ± 0.009 | 2.735 ± 0.048
BioPlex | LabelProp | 0.964 ± 0.000 | 0.556 ± 0.000 | 2.174 ± 0.000
BioPlex | LogReg* | 1.358 ± 0.006 | 0.939 ± 0.002 | 2.587 ± 0.022
BioPlex | GCN+BoT | 1.324 ± 0.027 | 0.911 ± 0.010 | 2.553 ± 0.069
BioPlex | SAGE+BoT | 1.246 ± 0.022 | 0.865 ± 0.035 | 2.513 ± 0.038
BioPlex | GIN+BoT | 1.349 ± 0.010 | 0.868 ± 0.009 | 2.504 ± 0.024
BioPlex | GAT+BoT | 1.355 ± 0.040 | 0.873 ± 0.025 | 2.548 ± 0.075
BioPlex | GatedGCN+BoT | 1.301 ± 0.035 | 0.859 ± 0.011 | 2.590 ± 0.029
HuRI | LabelProp | 0.545 ± 0.000 | 0.598 ± 0.000 | 0.962 ± 0.000
HuRI | LogReg* | 0.650 ± 0.000 | 0.656 ± 0.000 | 1.084 ± 0.020
HuRI | GCN+BoT | 0.634 ± 0.065 | 0.693 ± 0.019 | 1.229 ± 0.119
HuRI | SAGE+BoT | 0.593 ± 0.040 | 0.679 ± 0.031 | 1.190 ± 0.127
HuRI | GIN+BoT | 0.583 ± 0.042 | 0.702 ± 0.012 | 1.143 ± 0.047
HuRI | GAT+BoT | 0.667 ± 0.048 | 0.687 ± 0.045 | 1.174 ± 0.047
HuRI | GatedGCN+BoT | 0.596 ± 0.017 | 0.695 ± 0.018 | 1.195 ± 0.054
OmniPath | LabelProp | 1.358 ± 0.000 | 0.897 ± 0.000 | 1.593 ± 0.000
OmniPath | LogReg* | 1.542 ± 0.017 | 1.093 ± 0.006 | 2.125 ± 0.059
OmniPath | GCN+BoT | 1.577 ± 0.038 | 1.073 ± 0.024 | 2.052 ± 0.032
OmniPath | SAGE+BoT | 1.465 ± 0.048 | 1.041 ± 0.065 | 1.974 ± 0.040
OmniPath | GIN+BoT | 1.478 ± 0.032 | 1.104 ± 0.017 | 1.995 ± 0.046
OmniPath | GAT+BoT | 1.520 ± 0.028 | 1.083 ± 0.025 | 2.067 ± 0.050
OmniPath | GatedGCN+BoT | 1.544 ± 0.036 | 1.079 ± 0.020 | 2.122 ± 0.080
ProteomeHD | LabelProp | 0.709 ± 0.000 | 0.669 ± 0.000 | 1.036 ± 0.000
ProteomeHD | LogReg* | 0.955 ± 0.001 | 0.776 ± 0.000 | 1.519 ± 0.071
ProteomeHD | GCN+BoT | 0.764 ± 0.017 | 0.745 ± 0.026 | 1.387 ± 0.088
ProteomeHD | SAGE+BoT | 0.747 ± 0.051 | 0.734 ± 0.023 | 1.425 ± 0.100
ProteomeHD | GIN+BoT | 0.771 ± 0.034 | 0.718 ± 0.029 | 1.529 ± 0.161
ProteomeHD | GAT+BoT | 0.830 ± 0.039 | 0.727 ± 0.042 | 1.463 ± 0.052
ProteomeHD | GatedGCN+BoT | 0.829 ± 0.029 | 0.716 ± 0.022 | 1.389 ± 0.097
SIGNOR | LabelProp | 1.288 ± 0.000 | 1.096 ± 0.000 | 1.695 ± 0.000
SIGNOR | LogReg* | 1.582 ± 0.006 | 1.298 ± 0.007 | 1.894 ± 0.001
SIGNOR | GCN+BoT | 1.590 ± 0.037 | 1.273 ± 0.013 | 1.844 ± 0.031
SIGNOR | SAGE+BoT | 1.540 ± 0.017 | 1.243 ± 0.005 | 1.784 ± 0.041
SIGNOR | GIN+BoT | 1.593 ± 0.026 | 1.253 ± 0.010 | 1.850 ± 0.028
SIGNOR | GAT+BoT | 1.627 ± 0.042 | 1.260 ± 0.009 | 1.799 ± 0.034
SIGNOR | GatedGCN+BoT | 1.525 ± 0.026 | 1.254 ± 0.006 | 1.849 ± 0.029

Table C.4: Random splitting evaluation performance of different methods on the four primary OBNB datasets, evaluated in APOP ↑ aggregated over five seeds. Bold indicates the best-performing method within each method class (traditional ML or GNN). Blue indicates the evaluated performance is higher than its study-biased holdout evaluation counterpart.

Model | Features | C&S? | BioGRID DisGeNET | BioGRID GOBP | HumanNet DisGeNET | HumanNet GOBP
LabelProp | – | ✗ | 1.415 ± 0.000 | 2.654 ± 0.000 | 3.370 ± 0.000 | 4.138 ± 0.000
LogReg | Adj | ✗ | 1.546 ± 0.001 | 3.479 ± 0.000 | 3.280 ± 0.000 | 4.300 ± 0.001
LogReg | N2V | ✗ | 1.485 ± 0.037 | 3.269 ± 0.022 | 2.987 ± 0.014 | 4.107 ± 0.033
LogReg | LapEigMap | ✗ | 1.421 ± 0.000 | 2.812 ± 0.004 | 2.516 ± 0.002 | 3.876 ± 0.004
GCN | LogDeg | ✗ | 1.055 ± 0.116 | 2.076 ± 0.204 | 2.770 ± 0.045 | 3.898 ± 0.063
GCN | LogDeg | ✓ | 1.664 ± 0.066 | 2.832 ± 0.082 | 3.213 ± 0.010 | 4.134 ± 0.044
GCN† | N2V+LogDeg+Label | ✓ | 1.780 ± 0.043 | 3.330 ± 0.049 | 3.278 ± 0.023 | 4.205 ± 0.030
GCN‡ | N2V+LogDeg+Label | ✓ | 1.769 ± 0.051 | 3.380 ± 0.047 | 3.274 ± 0.013 | 4.202 ± 0.023
GAT | LogDeg | ✗ | 0.921 ± 0.290 | 2.722 ± 0.498 | 2.837 ± 0.197 | 3.838 ± 0.092
GAT | LogDeg | ✓ | 1.791 ± 0.015 | 3.076 ± 0.037 | 3.252 ± 0.021 | 4.141 ± 0.016
GAT† | N2V+LogDeg+Label | ✓ | 1.728 ± 0.063 | 3.357 ± 0.185 | 3.265 ± 0.018 | 4.226 ± 0.030
GAT‡ | N2V+LogDeg+Label | ✓ | 1.818 ± 0.013 | 3.448 ± 0.010 | 3.266 ± 0.003 | 4.235 ± 0.020
† recommended BoT; ‡ recommended BoT with fully tuned GNNs (see Appendix 6.3.4 for tuning details)

As shown in Table C.4, random-split performance is higher than its study-biased holdout counterpart in all cases. This indicates that, besides being more realistic, the study-biased holdout split is a much harder evaluation scheme and thus provides a more stringent evaluation of the tested gene classification methods.

Table C.5: Statistics for all networks in OBNB (obnbdata-0.1.0).

Network | Weighted | Num. nodes | Num. edges | Density | Category
HumanBaseTopGlobal [80] | ✓ | 25,689 | 77,807,094 | 0.117908 | Large & Dense
HuMAP [62] | ✓ | 15,433 | 35,052,604 | 0.147180 | Large & Dense
STRING [246] | ✓ | 18,480 | 11,019,492 | 0.032269 | Large
ConsensusPathDB [117] | ✓ | 17,735 | 10,611,416 | 0.033739 | Large
FunCoup [204] | ✓ | 17,892 | 10,037,478 | 0.031357 | Large
PCNet [100] | ✗ | 18,544 | 5,365,116 | 0.015603 | Large
BioGRID [240] | ✗ | 19,765 | 1,554,790 | 0.003980 | Medium
HumanNet [107] | ✓ | 18,591 | 2,250,780 | 0.006513 | Medium
HIPPIE [5] | ✓ | 19,338 | 1,542,044 | 0.004124 | Medium
ComPPIHumanInt [260] | ✓ | 17,015 | 699,620 | 0.002417 | Medium
OmniPath [249] | ✗ | 16,325 | 289,134 | 0.001085 | Small
ProteomeHD [131] | ✗ | 2,471 | 125,172 | 0.020509 | Small
HuRI [160] | ✗ | 8,100 | 103,188 | 0.001573 | Small
BioPlex [106] | ✗ | 8,108 | 71,004 | 0.001080 | Small
SIGNOR [201] | ✗ | 5,291 | 28,676 | 0.001025 | Small

Table C.6: Statistics for all datasets in OBNB (obnbdata-0.1.0).

Label | Network | Num. tasks | Num. pos. avg. | Num. pos. std. | Num. pos. med.
DISEASES | BioGRID | 145 | 178.1 | 137.4 | 127.0
DISEASES | BioPlex | 72 | 123.8 | 64.4 | 101.5
DISEASES | ComPPIHumanInt | 145 | 174.6 | 134.5 | 125.0
DISEASES | ConsensusPathDB | 144 | 177.4 | 137.5 | 126.0
DISEASES | FunCoup | 145 | 177.1 | 135.1 | 127.0
DISEASES | HIPPIE | 143 | 178.1 | 137.6 | 127.0
DISEASES | HuMAP | 123 | 168.0 | 119.2 | 120.0
DISEASES | HuRI | 50 | 130.3 | 56.7 | 112.5
DISEASES | HumanBaseTopGlobal | 149 | 178.5 | 137.7 | 129.0
DISEASES | HumanNet | 142 | 179.0 | 136.9 | 127.0
DISEASES | OmniPath | 135 | 180.2 | 131.1 | 131.0
DISEASES | PCNet | 143 | 171.8 | 130.6 | 122.0
DISEASES | ProteomeHD | 15 | 76.9 | 22.4 | 70.0
DISEASES | SIGNOR | 89 | 144.6 | 89.4 | 117.0
DISEASES | STRING | 146 | 175.4 | 135.6 | 126.0
DisGeNET | BioGRID | 305 | 208.3 | 143.1 | 159.0
DisGeNET | BioPlex | 189 | 138.6 | 71.4 | 111.0
DisGeNET | ComPPIHumanInt | 301 | 204.1 | 138.7 | 159.0
DisGeNET | ConsensusPathDB | 298 | 207.4 | 140.8 | 161.5
DisGeNET | FunCoup | 299 | 204.7 | 139.4 | 158.0
DisGeNET | HIPPIE | 306 | 208.1 | 142.9 | 159.5
DisGeNET | HuMAP | 279 | 194.3 | 126.7 | 155.0
DisGeNET | HuRI | 152 | 122.9 | 54.7 | 108.0
DisGeNET | HumanBaseTopGlobal | 287 | 219.7 | 145.7 | 173.0
DisGeNET | HumanNet | 302 | 204.2 | 140.3 | 158.5
DisGeNET | OmniPath | 298 | 199.6 | 136.0 | 153.5
DisGeNET | PCNet | 292 | 202.1 | 135.5 | 159.0
DisGeNET | ProteomeHD | 56 | 78.0 | 24.8 | 71.0
DisGeNET | SIGNOR | 219 | 147.3 | 81.9 | 124.0
DisGeNET | STRING | 296 | 208.0 | 140.6 | 162.0
GOBP | BioGRID | 114 | 89.5 | 37.1 | 76.0
GOBP | BioPlex | 38 | 77.6 | 22.6 | 76.0
GOBP | ComPPIHumanInt | 104 | 91.8 | 37.0 | 77.5
GOBP | ConsensusPathDB | 112 | 90.1 | 37.0 | 76.5
GOBP | FunCoup | 114 | 87.8 | 36.7 | 74.0
GOBP | HIPPIE | 111 | 89.2 | 37.1 | 76.0
GOBP | HuMAP | 96 | 84.6 | 32.3 | 74.0
GOBP | HuRI | 27 | 69.9 | 16.0 | 65.0
GOBP | HumanBaseTopGlobal | 115 | 89.2 | 37.3 | 76.0
GOBP | HumanNet | 117 | 88.6 | 36.9 | 75.0
GOBP | OmniPath | 106 | 88.7 | 36.2 | 74.0
GOBP | PCNet | 105 | 89.0 | 36.0 | 77.0
GOBP | ProteomeHD | 5 | 80.4 | 22.6 | 70.0
GOBP | SIGNOR | 41 | 81.3 | 22.7 | 78.0
GOBP | STRING | 116 | 88.9 | 36.6 | 75.0
GOCC | BioGRID | 71 | 96.0 | 36.3 | 87.0
GOCC | BioPlex | 35 | 77.0 | 20.9 | 71.0
GOCC | ComPPIHumanInt | 68 | 96.1 | 35.5 | 88.0
GOCC | ConsensusPathDB | 71 | 95.5 | 35.8 | 87.0
GOCC | FunCoup | 71 | 95.9 | 36.4 | 87.0
GOCC | HIPPIE | 69 | 97.3 | 35.8 | 89.0
GOCC | HuMAP | 62 | 91.8 | 33.4 | 82.0
GOCC | HuRI | 21 | 69.0 | 12.4 | 67.0
GOCC | HumanBaseTopGlobal | 71 | 96.9 | 36.3 | 88.0
GOCC | HumanNet | 70 | 95.8 | 35.6 | 88.0
GOCC | OmniPath | 67 | 94.0 | 34.6 | 84.0
GOCC | PCNet | 70 | 93.5 | 35.0 | 85.0
GOCC | ProteomeHD | 2 | 59.5 | 4.5 | 59.5
GOCC | SIGNOR | 26 | 66.4 | 11.3 | 67.5
GOCC | STRING | 69 | 96.1 | 35.6 | 88.0
GOMF | BioGRID | 63 | 97.7 | 39.8 | 85.0
GOMF | BioPlex | 25 | 79.6 | 18.0 | 79.0
GOMF | ComPPIHumanInt | 62 | 98.1 | 39.7 | 86.0
GOMF | ConsensusPathDB | 63 | 98.2 | 39.5 | 85.0
GOMF | FunCoup | 64 | 97.5 | 39.1 | 83.5
GOMF | HIPPIE | 63 | 97.7 | 39.9 | 84.0
GOMF | HuMAP | 58 | 92.1 | 36.8 | 74.5
GOMF | HuRI | 22 | 74.8 | 12.8 | 73.5
GOMF | HumanBaseTopGlobal | 62 | 99.2 | 39.9 | 86.0
GOMF | HumanNet | 61 | 98.7 | 39.5 | 84.0
GOMF | OmniPath | 64 | 95.2 | 38.5 | 83.5
GOMF | PCNet | 58 | 97.4 | 38.6 | 86.0
GOMF | ProteomeHD | 2 | 59.5 | 4.5 | 59.5
GOMF | SIGNOR | 25 | 91.4 | 23.7 | 94.0
GOMF | STRING | 62 | 98.0 | 39.3 | 82.5
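As a consistency check, the densities reported in Table C.5 appear to follow the convention density = num_edges / (num_nodes × (num_nodes − 1)), i.e., the listed edge counts are treated as directed (each undirected interaction counted once per direction). This can be verified for the rows spot-checked below; the numbers are taken directly from Table C.5.

```python
def density(num_nodes, num_edges):
    """Graph density with num_edges counted as directed edges, matching
    the convention that appears to be used in Table C.5 (an assumption
    verified against a few rows, not a documented formula)."""
    return num_edges / (num_nodes * (num_nodes - 1))

# BioGRID row of Table C.5: 19,765 nodes and 1,554,790 edges.
biogrid_density = density(19765, 1554790)
```

The same check holds for, e.g., the HuRI row (8,100 nodes, 103,188 edges, reported density 0.001573).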