TOWARDS GRAPH FOUNDATION MODEL: FROM NETWORK SCIENCE THEORY TO PRACTICE APPLICATION

By

Haitao Mao

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2025

ABSTRACT

Graphs, which abstract complex systems of relations and interactions, can model transportation networks, trading networks, epidemic networks, the World Wide Web, and more. Graph Neural Networks (GNNs), aiming to adaptively integrate both feature and structure knowledge on networks, have emerged as a popular graph representation learning technique. The key component in GNNs is the non-parameterized message-passing operator, which updates node representations via aggregating neighbor node messages. Message-passing operators generally follow the algorithm alignment design principle, where the operator should align with the task principle or the corresponding graph algorithms, e.g., Common Neighbor for friend recommendation, PageRank for web search, and the Dijkstra algorithm for finding the shortest path. Despite the initial success achieved by GNNs, most of them are end-to-end trained and evaluated on one single dataset. Consequently, an architectural overfitting phenomenon can be observed where GNNs generally cannot perform well across graphs with various properties, as they are specifically designed to beat existing methods on limited benchmark datasets. Such an obstacle makes it difficult for practitioners to determine which models will generalize effectively to unseen datasets. Therefore, a new but necessary requirement for graph representation learning is to be expressive enough to capture diverse properties and adapt across different graphs. To achieve this ultimate goal, I first endeavor to elucidate the underlying network patterns across diverse graphs with a properly designed random network model. Theoretical analysis is further conducted to principally characterize the relationship between different network patterns. Secondly, I analyze the inner mechanism of GNNs to determine which network patterns they can capture and which they cannot. With the help of the random network model, I can identify the inevitable but largely ignored limitations of GNNs. Finally, I develop Graph Foundation Models (GFMs) to address the incapability of GNNs. GFMs comprehensively consider training data diversity, expressive architecture design, and suitable training objectives, which moderates the over-reliance on algorithm alignment and captures diverse network patterns on different graphs.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my family for their unconditional love and support. Their unwavering belief in me has been a constant source of strength throughout this journey. They have stood by me through every challenge and triumph. Their patience, sacrifices, and boundless kindness have shaped me in ways words cannot fully express. This achievement would not have been possible without their love, and for that, I am forever grateful. I would like to express my gratitude to my supervisor, Dr. Jiliang Tang. Over the years, I have learned invaluable lessons from him: not only in constructing rigorous research but also in being a kind and principled person, as well as in mentoring students. I am deeply appreciative of my dissertation committee members, Dr. Sijia Liu, Dr. Hui Liu, Dr. Mikhail Galkin, and Dr. Neil Shah, for their insightful feedback and invaluable suggestions, which have greatly enriched my work.
I would like to extend my special thanks to those who offered me inspiration during challenging moments, including Dr. Yao Ma, Guangliang Liu, Justin Ding, Dr. Qiang Fu, Yuanqi Du, Yupeng Hou, Haonan Wang, Dr. Mikhail Galkin, Dr. Zhaocheng Zhu, Dr. Yuke Wang, Yanqiao Zhu, Jieyu Zhang, Yanbang Wang, Xingyue Huang, and Dr. Hong Gao. Their insights and encouragement were invaluable in helping me navigate obstacles along the way. Additionally, I am incredibly grateful for the support and encouragement of my friends and colleagues throughout my Ph.D. journey. I feel fortunate to have shared this experience with Zhikai Chen, Jingzhe Liu, Wenzhuo Tang, Kaiqi Yang, Haoyu Han, Kai Guo, Hanbing Wang, Juanhui Li, Harry Shomer, Quang Truong, and all the other members of the Data Science and Engineering Lab.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
CHAPTER 3 SOURCE FREE GRAPH DOMAIN ADAPTATION . . . . . . . . . . . . 9
CHAPTER 4 DEMYSTIFYING STRUCTURAL DISPARITY IN GRAPH NEURAL NETWORKS: CAN ONE SIZE FIT ALL? . . . . . . . . . . . . . . . . . . 31
CHAPTER 5 REVISITING LINK PREDICTION: A DATA PERSPECTIVE . . . . . . 47
CHAPTER 6 GRAPH FOUNDATION MODEL ARE ALREADY HERE . . . . . . . . 60
CHAPTER 7 CROSS-DOMAIN GRAPH DATA SCALING: A SHOWCASE WITH DIFFUSION MODELS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
CHAPTER 8 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

CHAPTER 1 INTRODUCTION

Graphs, which abstract complex systems of relations and interactions, can model transportation networks, trading networks [329], epidemic networks, the World Wide Web, and more. Graph Neural Networks (GNNs) [134, 84], aiming to adaptively integrate both feature and structure knowledge on networks, have emerged as a popular graph machine learning technique. The key component in GNNs is the non-parameterized message-passing operator, which updates node representations via aggregating neighbor node messages. The message-passing operator requires task-specific designs to match the corresponding graph algorithms, e.g., Common Neighbor [5] for friend recommendation, PageRank for web search, and the Dijkstra algorithm for finding the shortest path. Despite the initial success achieved by GNNs, the fixed, non-parameterized message-passing operators often fail to generalize to different networks with distribution shifts [190]. To break the inevitable but largely ignored “algorithmic glass ceiling” of GNNs, my research aims to develop a Graph Foundation Model (GFM) which can generalize across different graphs with diverse properties. Although graphs can have different appearances, I propose a Graph Vocabulary Hypothesis which supposes the existence of shared patterns across graphs. To achieve this ultimate goal, my research is mainly two-fold: (i) Where are the essential patterns? My research is grounded in network science [203]. I design a random network model to comprehensively describe essential network patterns. Theoretical analysis is then conducted to principally characterize the relationship between different network patterns. With a suitable network model and well-defined network patterns, I can uncover the limitations of GNNs by identifying the patterns they fail to capture, revealing their intrinsic drawbacks.
(ii) How to capture essential patterns? To address the incapability of GNNs, I design flexible GFM architectures and obtain their capability in a data-driven manner via large-scale pre-training.

Organization: The structure of this dissertation is outlined as follows: Chapter 2 provides a comprehensive summary of the evolution of graph machine learning techniques, spanning from traditional graph heuristics and Graph Neural Networks to recent GFM designs. The chapter offers a detailed architectural overview, illustrating the graph learning progression. Chapter 3 identifies the performance degradation of GNNs under domain shift. To address this challenge, I propose a model-agnostic algorithm to adapt to the target graph structure and enhance discriminative ability. Chapter 4 builds the CSBM-S model to study feature homophily and heterophily patterns simultaneously. I challenge the conventional belief that GNNs struggle with heterophily by demonstrating their effectiveness under specific conditions, while also revealing their intrinsic limitations when dealing with homophilic and heterophilic nodes together. Chapter 5 builds a latent space model capturing the fundamental network proximities crucial for link prediction, namely feature proximity (FP), local structural proximity (LSP), and global structural proximity (GSP), and reveals that current GNN-based methods predominantly capture LSP while failing to effectively model GSP and FP. Chapter 6 proposes the blueprint of Graph Foundation Models. I introduce a graph vocabulary perspective to identify fundamental transferable units in graphs, providing a principled framework for GFM development, discussing graph transferability principles, and outlining challenges such as scalability, model architecture, and the integration of large language models (LLMs) for graph-related tasks. Chapter 7 showcases the feasibility of GFMs with a graph foundation model exhibiting data scaling behavior. The proposed GFM can both predict missing edges and generate effective data augmentation to enhance other related tasks. Chapter 8 concludes this dissertation. The overarching goal of our research is to move from the GNN learning paradigm towards the new GFM learning paradigm grounded in network science insights.

CHAPTER 2 BACKGROUND

2.1 Heuristic & Network Patterns

Network patterns provide a conventional understanding of the network system by identifying fundamental network patterns, e.g., network motifs [202], and establishing key principles, e.g., the triadic closure principle [107] and the homophily principle, which are generally valid across different domains. Those principles have been widely utilized to guide the design of heuristic algorithms. Moreover, they are the most elementary principles across graphs from different domains, identifying essential invariances as the foundation to build GFMs.

Network patterns. Over the years, various theories and expert knowledge have emerged to understand and predict links in networks. We briefly review the foundational theories shaping the landscape of link prediction and network analysis. Homophily [127] is a long-standing principle in social network analysis, stemming from the shared beliefs and thoughts of individuals. It suggests that people with aligned perspectives are likely to connect with one another, despite potential differences in their social position. [211, 67, 127, 54, 201] provide further studies illustrating how homophily induces different phenomena in social networks.
It serves as the principal guidance for methods ranging from conventional PageRank [51] and label propagation [37] to recent advanced GNNs. Existing GNN architectures, often crafted based on the homophily principle, demonstrate strong performance on diverse homophilous graphs across various domains. While homophily predominates in network analysis, it is not a universal rule. In many real-world scenarios, “opposites attract", resulting in networks characterized by heterophily, where nodes are more likely to link with dissimilar nodes. GNNs built with the homophily principle often struggle with heterophilous networks, except in cases of “good heterophily” [183, 177], where GNNs can identify and leverage consistent patterns in connections between dissimilar nodes. However, most heterophilous networks are complex and varied, posing challenges for GNNs due to their irregular and intricate interaction patterns [178, 282, 187]. Consequently, GNNs’ transferability, more assured in homophilous graphs, faces significant challenges in heterophilous ones.

Triadic closure [107] is another fundamental concept in social network analysis, stemming from the observation that friends of friends become friends themselves. It suggests that two individuals who share a mutual friend are likely to become connected. [61, 237, 106, 217] provide further studies illustrating how triadic closure induces different phenomena in social networks. [245] theoretically proves the effectiveness of heuristic algorithms. More recently, [211, 8, 1] focus on investigating the interplay between triadic closure and homophily. [8] first demonstrates that triadic closure intensifies the impacts of homophily. Nonetheless, [1] points out that [8] builds on the existence of sufficient triadic closure rather than considering triadic closure and homophily simultaneously. With a modest modification, [1] finds that triadic closure can introduce individuals to those unlike themselves, consequently reducing segregation.

Heuristic algorithms. Heuristic algorithms inspired by network patterns have long served as the primary methodology for graph tasks, due to their simplicity, interpretability, and scalability. They are based on the hypothesis that nodes with higher network similarity have an increased probability of connection. The proximity score between two nodes represents their potential connection probability. This score is tailored to the characteristics of specific node pairs, viewed from various perspectives.

Table 2.1 Popular heuristics for link prediction, where Γ(i) denotes the neighbor set of vertex i.

Name | Formula | Factor
Common Neighbors (CN) | |Γ(i) ∩ Γ(j)| | LSP
Adamic-Adar (AA) | ∑_{k ∈ Γ(i) ∩ Γ(j)} 1 / log|Γ(k)| | LSP
Resource Allocation (RA) | ∑_{k ∈ Γ(i) ∩ Γ(j)} 1 / |Γ(k)| | LSP
Katz | ∑_{l=1}^{∞} λ^l |paths⟨l⟩(i, j)| | GSP
Personalized PageRank (PPR) | [π_i]_j + [π_j]_i | GSP
Feature Homophily (FH) | dis(x_i, x_j) | FP

Generally, there are three types of heuristic approaches to measure the proximity, and the formal definitions of some representative ones are listed in Table 2.1. (1) Local structural heuristics. This category includes four prevalent algorithms: Common Neighbor (CN) [212], Jaccard [112], Adamic-Adar (AA) [5], and Resource Allocation (RA) [339]. These methods are fundamentally based on the assumption that nodes sharing more common neighbors are more likely to connect. AA and RA are weighted variants of Common Neighbors. Typically, they incorporate node degree information, emphasizing that neighbors with a lower degree have a higher influence. (2) Global structural heuristics. This category includes algorithms such as Katz [126], SimRank [113], and Personalized PageRank (PPR) [26]. They consider global structural patterns via the number of paths between two nodes, where more paths indicate higher similarity and a higher likelihood of connection. Specifically, Katz counts the number of all paths between two nodes, emphasizing shorter paths by penalizing longer ones with a decay factor. SimRank assumes that two nodes are similar if they are linked to similar nodes, and PPR produces a ranking personalized to a particular node based on random walks. (3) Feature proximity heuristics are applicable when node attributes are available, incorporating side information about individual nodes. [214, 332] combine graph structure with latent features and explicit features for better performance. A minimal implementation sketch of the local structural heuristics is given below.
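To make the local heuristics in Table 2.1 concrete, the following minimal Python sketch scores a candidate node pair with CN, AA, and RA over a toy adjacency list. The graph and helper names are illustrative assumptions, not code from this dissertation; they simply instantiate the formulas above.

```python
import math

# Toy undirected graph stored as an adjacency set; purely illustrative.
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def common_neighbors(i, j):
    # CN: |Gamma(i) ∩ Gamma(j)|
    return len(adj[i] & adj[j])

def adamic_adar(i, j):
    # AA: sum over shared neighbors k of 1 / log|Gamma(k)|
    # (assumes shared neighbors have degree > 1 so the log is nonzero)
    return sum(1.0 / math.log(len(adj[k])) for k in adj[i] & adj[j])

def resource_allocation(i, j):
    # RA: sum over shared neighbors k of 1 / |Gamma(k)|
    return sum(1.0 / len(adj[k]) for k in adj[i] & adj[j])

# Score the non-adjacent pair (2, 4) with each local heuristic.
print(common_neighbors(2, 4), adamic_adar(2, 4), resource_allocation(2, 4))
```

Higher scores indicate a higher presumed probability that the pair will be connected, which is exactly how these heuristics are used to rank candidate links.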
Network Analysis. Important network analysis principles [192] fall into three primary concepts: (1) local structural proximity, corresponding to the triadic closure principle [107], where friends of friends become friends themselves. It inspires well-known conventional methods including CN, RA, and AA [5]. (2) global structural proximity, corresponding to the decay factor principle, where two nodes with more short paths between them have a higher probability of being connected. It inspires well-known conventional methods, e.g., SimRank and Katz [126, 113]. (3) feature proximity, corresponding to the homophily principle [211], where shared beliefs and thoughts can be found in connected individuals. These principles guide the evolution of link prediction algorithms, from basic heuristics to sophisticated GNNs [35, 148]. GNNs, inspired by these principles, perform well across diverse graphs in multiple domains. Moreover, [337] provides empirical evidence supporting the beneficial transferability of these guiding principles.

2.2 Graph Neural Networks

Graph Neural Networks (GNNs) have emerged as a powerful technique in Deep Learning, specifically designed for graph-structured data. They address the limitations of traditional neural networks in dealing with irregular data structures. GNNs learn node representations by aggregating neighborhoods and transforming features recursively. The node representations can then be successfully utilized in a wide range of graph-related downstream tasks [132, 325, 297, 89, 287, 346, 68, 304]. The aggregation mechanism in GNNs is often viewed as feature smoothing [150, 185, 342]. This perspective leads some recent studies [341, 50, 152, 83] to claim that GNN models are overly reliant on homophilic patterns and unsuited to capturing heterophilic patterns. To accommodate heterophilic graphs, recent works propose carefully designed GNN architectures, e.g., CPGNN [340], GGNN [303], GPRGNN [50], GCNII [42], GBK-GNN [63], ACM-GNN [176], and Bernnet [87]. More recent analyses of GNNs [14, 184, 176] indicate that even GCN [132], a vanilla GNN, can deliver strong performance on certain heterophilic graphs. According to these findings, new metrics and understandings [21, 223, 156, 176, 184, 178] have been proposed to further expose the remaining weaknesses of GNNs. A minimal sketch of the neighborhood-aggregation step underlying these models is given below.
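The following numpy sketch shows one round of mean aggregation, the non-parameterized smoothing step described above; a GCN layer follows this with a learned linear transform and nonlinearity. The matrices and names are illustrative assumptions, not code from any specific GNN library.

```python
import numpy as np

def mean_aggregate(adj: np.ndarray, x: np.ndarray) -> np.ndarray:
    """One round of non-parameterized message passing.

    adj: (n, n) binary adjacency matrix with self-loops added.
    x:   (n, d) node feature matrix.
    Returns the degree-normalized average of each node's neighborhood,
    i.e. the feature-smoothing step discussed in Section 2.2.
    """
    deg = adj.sum(axis=1, keepdims=True)      # node degrees
    return (adj @ x) / np.clip(deg, 1, None)  # neighborhood mean

# Toy example: path graph 0-1-2 with 2-dimensional features.
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]], dtype=float)      # self-loops included
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(mean_aggregate(adj, x))
```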
Advanced GNNs for the link prediction task (GNN4LP) are built on the basis of vanilla GNNs [133, 135], which learn single-node structural representations by aggregating neighborhoods and transforming features recursively, equipped with pairwise decoders. GNN4LP models augment vanilla GNNs by incorporating more complicated pairwise structural information inspired by heuristic methods. For instance, NCNC [283] and NBFNet [346] generalize the CN and Katz heuristics with neural functions to incorporate that pairwise information, thereby achieving efficiency and promising performance. Despite satisfying results in many tasks that utilize structural information, [325, 326, 155] find that vanilla GNNs remain sub-optimal, as they fail to capture important pairwise patterns for link prediction, e.g., the common neighborhood. Graph Neural Networks for Link Prediction (GNN4LP) [325, 326, 317, 279, 283, 36, 346] are then proposed, which incorporate different inductive biases to capture more pairwise information. SEAL [325], Neo-GNN [317], and NCNC [283] involve more neighbor-overlapping knowledge. BUDDY [36] and NBFNet [346] exploit higher-order structural information, e.g., the number of paths between nodes. Nonetheless, advanced GNN4LP models with better effectiveness usually have more complicated model designs, leading to efficiency issues. [310, 309, 291, 345, 140] are then proposed to improve efficiency via approximation methods, e.g., hashing algorithms, the A* algorithm, and random walk approximations.

2.3 Graph Foundation Models

Graph Foundation Models (GFMs) have recently emerged as a promising approach for learning transferable graph representations that can generalize to previously unseen graphs [189]. A key challenge in this field is the wide variation in graph structures, where both connectivity and feature patterns differ significantly. GNNs are not regarded as GFMs because they perform optimally only on one single dataset. At present, there is no widely accepted universal approach to capture graph representations, resulting in a broad range of GFM designs. Based on how well models transfer across tasks and domains, GFMs can be divided into three main categories: task-specific GFMs, domain-specific GFMs, and primitive GFMs. Below, we define each category and provide relevant examples. A task-specific/domain-specific GFM should be transferable across the specific task/domain and thus adapt to diverse downstream datasets and domain-specific tasks. A notable example of a task-specific GFM is ULTRA [72], achieving superior zero-shot knowledge graph completion performance across datasets from various domains. A task-specific GFM shows great practical benefits, as it can be trained on data-rich domains, e.g., Wikipedia knowledge graphs, and subsequently improve effectiveness in resource-limited domains, e.g., geography knowledge graphs. A domain-specific GFM instance, DiG [336], learns universal representations across various chemical tasks by leveraging domain-specific knowledge. The domain-specific GFM is highly efficient, as one model can serve all tasks while also delivering improved effectiveness compared to single-task models. A primitive GFM exhibits the capability to generalize towards a limited number of datasets and tasks. A notable example is OFA [164], which is co-trained on data ranging from citation networks and molecule graphs to knowledge graphs via a unified task formulation on node-, link-, and graph-level tasks. The OFA model can achieve comparable or even better performance than vanilla GNNs on each task. Nonetheless, OFA requires transforming all node features into text for co-training, which may not be convenient for all types of data.
This co-training paradigm may also limit its generalization to unseen tasks and domains.

CHAPTER 3 SOURCE FREE GRAPH DOMAIN ADAPTATION

3.1 Introduction

Node classification [134] is a crucial task on graph-structured data such as transaction networks [329], citation networks [262, 256, 193], and so on. Recently, Graph Neural Networks [134, 188] have greatly advanced the performance of node classification. However, most existing studies only concentrate on how to classify well on one given graph of a specific domain, while ignoring the performance degradation when applying the model to graphs from other domains due to the domain gap. For example, regarding two real-world citation networks with papers as nodes and edges representing their citations, papers published between 2000 and 2010 and papers published between 2010 and 2020 may differ significantly in the following two aspects: (1) Feature distribution shifts, as the advanced research topics and high-frequency keywords change over time. (2) Discrepancy between graph structures: due to the great success of Deep Learning, papers on machine learning and neuroscience have been more frequently cited in recent years. More concretely, we select two datasets, ACM-D and ACM-S, two subgraphs of the ACM dataset [306], to study their graph structure discrepancy. Their degree distributions are shown in Fig. 3.1. It is evident that the node degree distributions of the two graphs differ.

Figure 3.1 Discrepancy between degree distributions on ACM-D and ACM-S datasets.

The above problems, i.e., feature distribution shift and graph structure discrepancy, lead to unsatisfactory performance when transferring the GNN model across graphs from different domains to handle the same task. The naive way to achieve good results on the target graph from a different domain is to label the graph manually and retrain a new model from scratch, which is expensive and time-consuming. To solve this problem, Unsupervised Domain Adaptation (UDA), a transfer learning technique leveraging knowledge learned from a sufficiently labeled source domain to enhance the performance on an unlabeled target domain, has attracted increasing attention recently. UDA has shown great success on image data [267, 74, 243, 186] and text data [114, 55]. Recently, Unsupervised Graph Domain Adaptation (UGDA) has been proposed as a new application of UDA on graph data. It utilizes important properties of graphs, especially the structural information indicating the correlation between nodes. Generally speaking, most existing UGDA methods [307, 248, 330, 289] utilize a joint learning framework: (1) A feature encoder is trained to align the feature distributions between the source domain and target domain to mitigate the domain gap. (2) A classifier is trained on encoded features with a cross-entropy loss, supervised by source labels. The model can achieve satisfying performance with strong discriminative ability on the aligned feature distribution. However, a crucial requirement for these joint learning methods is access permission to the source data, which might be problematic for both accessibility and privacy reasons [274]. In real-world scenarios, access to the source domain is not always available (e.g., domain adaptation between two different platforms). The usage of sensitive attributes on graphs may lead to potential data leakage and other severe privacy issues.
Therefore, we propose a new scenario, Source Free Unsupervised Graph Domain Adaptation (SFUGDA), in which only the unlabeled target data and the GNN model trained from source data are available for adaptation.¹ The key challenges in this scenario are two-fold: (1) How the model can adapt well to the shifted target data distribution without accessing the source graph for aligning the feature distributions. (2) How to enhance the discriminative ability of the source model without accessing source labels for supervision.

¹Reuse of [191] from the ACM Digital Library.

In this paper, we propose SOGA, a model-agnostic SOurce free domain Graph Adaptation algorithm, which enables the GNN model trained on the labeled source graph to perform well on the unlabeled target graph. SOGA addresses these challenges with the following two components: (1) The Structure Consistency (SC) optimization objective: inspired by unsupervised graph embedding methods [267, 231], which learn node representations by preserving various graph properties using well-designed objective functions, we propose the SC objective to tune the source model to reflect the target graph structure in the model output representation space. It can adapt the source model to the shifted target data distribution. (2) The Information Maximization (IM) optimization objective is proposed to enhance the discriminative ability of the source model by maximizing the mutual information between the target graph and its corresponding output. We theoretically prove that IM can improve the confidence of predictions and raise the lower bound of the AUC metric. Moreover, in practice we usually can neither determine the source model architecture nor its training procedure, in addition to having no access to the source data. Our algorithm is also model-agnostic and can be easily applied to arbitrary GNN models. With this property, our SOGA can easily satisfy the above practical requirements. In summary, the main contributions of our work are as follows:

• We first articulate a new scenario called SFUGDA, in which we have no access to the source graph and its labels. To the best of our knowledge, this is the first work on SFUGDA.

• We propose a model-agnostic unsupervised algorithm called SOGA to tackle the challenges in SFUGDA. It can both adapt the source model to the shifted target distribution and enhance its discriminative ability with a theoretical guarantee.

• Extensive experiments are conducted on real-world datasets. Our SOGA outperforms all the baselines on four cross-domain tasks. Moreover, experimental results verify the model-agnostic property, as SOGA can be applied with different representative GNN models successfully.

3.2 Related Work

3.2.1 Comparison with topics on Domain Adaptation

Unsupervised Graph Domain Adaptation (UGDA) aims to transfer the knowledge learned on a labeled graph from the source domain to an unlabeled graph from the target domain tackling the same task. Most existing UGDA methods aim to mitigate the domain gap by aligning the source feature distribution and the target one. According to different alignment approaches, they can be roughly divided into two categories. (1) Distance-based methods like [248, 307] incorporate maximum mean discrepancy (MMD) [25] as a domain distance loss to match the distribution statistical moments at different orders. (2) Domain adversarial methods [330, 289] follow the guidance of adversarial training, which confuses generated features across the source domain and the target one to mitigate the domain discrepancy.
DANE [330] adds an adversarial regularizer inspired by LSGAN [195], while UDAGCN [289] uses a Gradient Reversal Layer [74] and a domain adversarial loss to extract cross-domain node embeddings. However, all the above UGDA methods heavily rely on access to the source data, which leads to failure in the SFUGDA scenario where source data is no longer available. In computer vision, Source Free Unsupervised Domain Adaptation is a new research task with practical value. Most existing studies focus on different strategies to generate pseudo labels on images, inspired by [243]. [154] utilizes the Deep Cluster algorithm to assign cleaner pseudo labels with a global view and an Information Maximization algorithm to minimize the prediction uncertainty. PrDA [130] uses a set-to-set distance to filter confident pseudo labels. [151] focuses on how to adapt the feature distribution on the target domain by generating similar feature samples from a GAN-based model. [305] encourages label consistency on local affinity neighborhoods based on the key observation that the target data can still form clear data clusters. However, the above methods designed for image data are not suitable for graph-structured data. Since graph node samples are naturally structured by dependencies (i.e., edges) between nodes, strategies designed for i.i.d. data such as images cannot be directly adapted. For example, even when the feature distribution stays the same, the graph may still suffer from domain gaps due to varying structural patterns. Thus, methods for SFUGDA should handle structural dependencies well. Although the graph structure introduces new challenges, various properties of the graph structure, such as structural proximity, can aid adaptation if modeled properly. Moreover, our proposed method is more practical than image-based methods, as most of them require a specifically designed source model architecture. They utilize components like BatchNorm or WeightNorm to implicitly memorize the knowledge from source data. However, it is not feasible in practice to retrain a specific source model for adaptation. In contrast, our SOGA can be combined with any GNN model without a specific design.

3.2.2 Comparison between SFUGDA and other topics on graphs

Graph self-supervised learning [271, 344, 324, 104, 105, 226] is a recent paradigm on graphs that also follows a two-stage procedure similar to SFUGDA. Typically, those methods first run an unsupervised learning procedure by creating graph-specific pretext tasks and training on the unlabeled data. This procedure aims to learn a good representation that could benefit different downstream tasks. Then a supervised fine-tuning procedure is employed with the labeled data for the specific downstream tasks. However, those methods are not applicable in the SFUGDA scenario, as there is no label information in the target domain for the supervised fine-tuning. More recently, self-supervised learning is also utilized as an unsupervised fine-tuning strategy [39, 210, 275]. These methods typically update the original model by minimizing a self-supervised loss on the target distribution. However, most are specifically designed for image data and may not be suitable for graph-structured data. [120] is the first to utilize self-supervised learning for fine-tuning in the graph domain. It focuses on learning an adaptive graph transformation from a data-centric perspective rather than learning a well-performed model.
Graph federated learning is a distributed machine learning approach for privacy preservation that has attracted great interest in the graph domain [145, 294]. These methods aggregate a server-side model from multiple decentralized edge devices without data leakage, offering a privacy-preserving mechanism. However, they require labeled data on each local device and therefore fail to address the SFUGDA scenario, where label information is only available in a single source domain.

3.3 Preliminary

Definition 1 (Node Classification). Node classification is a task to learn a conditional probability Q(y|G; Θ) to distinguish the category of each unlabeled node on a single graph G = (V, E, X, y), where Θ denotes the model parameters. V = {v_1, · · · , v_n} is the node set with n nodes and E is the edge set. X ∈ R^{n×d} is the node feature matrix, and y ∈ R^n is the partially observed node label set of which each element satisfies y_i ∈ {1, 2, · · · , k, −1}. y_i = −1 indicates that the i-th node is unlabeled, d is the feature dimension, and k is the number of categories.

Node classification differs from typical classification tasks since the latter usually assumes that different samples are i.i.d. (independent and identically distributed), whereas samples in the former case are correlated through edges. In deep learning, we use Graph Neural Networks (GNNs) [84] to model Q(y|G; Θ), the conditional probability to distinguish the category of all nodes, for capturing such relationships. Based on the assumption of localization [57], the predicted conditional distribution is usually decomposed as follows: Q(y|G; Θ) = ∏_{v_i ∈ V} q(y_i | x_i, N_i; Θ), where q(y_i | x_i, N_i; Θ) is the conditional probability to distinguish the category of a single node v_i. Notice that G includes the feature x_i and the neighbor information N_i for each node v_i. Following the GNN model, Cross Entropy is usually adopted as the loss function:

E_{v_i \sim p(v)} \Big[ -\sum_{y=1}^{k} p(y | x_i, N_i) \log q(y | x_i, N_i; Θ) \Big],    (3.1)

where p(v) is the prior distribution and p(y|x, N) is the oracle conditional distribution. As the information in the node v_i contains its own feature x_i and neighborhood information N_i, we simplify p(x_i, N_i) to p(v_i) for brevity. According to Def. 1, a typical node classification task is defined on a single graph with partial supervision. To better leverage the knowledge from a labeled graph (namely the source graph) to tackle the same node classification task on another unlabeled graph (namely the target graph), Unsupervised Graph Domain Adaptation (UGDA) [330] is proposed. We then give a clear definition of UGDA, which has already been well recognized in [247, 330, 289, 117].

Figure 3.2 The detailed procedure of the SOGA algorithm. In the SFUGDA scenario, the source training procedure and labeled source graph in the shadow box are not accessible. Our algorithm only includes the upper dashed box describing the adaptation procedure. SOGA utilizes the output of the model on the unlabeled target graph to optimize two objectives, Information Maximization and Structure Consistency, to adapt the model on the target domain.

Definition 2 (Unsupervised Graph Domain Adaptation). UGDA aims to learn a node classification model Q(y|G; Θ) based on a source graph G_s = (V_s, E_s, X_s, y_s) such that the model performs well on a target graph G_t = (V_t, E_t, X_t, y_t), where node features (i.e., X_s and X_t) and labels (i.e., y_s and y_t) express the same meanings. y_t is fully unknown, which makes the problem unsupervised in the target graph.
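To make the factorized objective in Eq. (3.1) and the labeling convention of Definition 1 concrete, below is a minimal PyTorch-style sketch of the supervised node-classification loss over a generic GNN. The `gnn` interface and variable names are illustrative assumptions, not code from this dissertation.

```python
import torch
import torch.nn.functional as F

def node_classification_loss(gnn, graph, features, labels):
    """Cross-entropy of Eq. (3.1), estimated over the labeled nodes.

    gnn:    any model mapping (graph, features) -> (n, k) logits, one row
            per node, i.e. the factorized q(y | x_i, N_i; Theta).
    labels: (n,) tensor with class ids in {0, ..., k-1} and -1 marking
            unlabeled nodes, following Definition 1.
    """
    logits = gnn(graph, features)   # (n, k) per-node predictions
    labeled = labels >= 0           # mask out the y_i = -1 nodes
    return F.cross_entropy(logits[labeled], labels[labeled])
```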
The oracle conditional distributions for the source and target graph are defined as p_s(y|x, N) and p_t(y|x, N), respectively. A general assumption in Unsupervised Graph Domain Adaptation is that the prediction tasks are almost the same, i.e., p_s(y|x, N) and p_t(y|x, N) are similar [330, 289]. Thus, according to Eq. (3.1), the main challenge for UGDA is the misalignment between the prior distribution p_s(v) of the source graph and p_t(v) of the target graph. Consequently, the majority of current approaches attempt to align the two distributions as part of their methodologies, which predominantly depends on access to the source data. Nonetheless, in practical situations, obtaining the training data of the source model is frequently challenging, primarily due to privacy concerns or data collection expenses. This gives rise to a novel problem named Source-Free Unsupervised Graph Domain Adaptation (SFUGDA), which necessitates the absence of source data access in addition to the requirements of UGDA.

3.4 Problem Statement & Methodology

3.4.1 Problem Statement & Overview

Problem Statement: Source Free Unsupervised Graph Domain Adaptation aims to learn a well-performed node classification model Q_t(y|G; Θ_t) on the target graph G_t, while the accessible information only contains two parts: (1) the well-trained source model Q_s(y|G; Θ_s) (i.e., well-performed on the source graph but not guaranteed to be well-performed on the target graph); (2) the unlabeled target graph G_t = (V_t, E_t, X_t).

Overview of Framework: As the framework outline in Fig. 3.2 shows, the well-performed source model is provided by the first procedure, the inaccessible training procedure on the source graph. As we cannot interfere with the source training procedure, the source model architecture could be an arbitrary GNN that we cannot determine. In our experiments, the well-trained source model is the model with the best performance on the validation set of the source graph. The parameters of the source model Θ_s, with primary discriminative ability, are utilized as the initialization of our model in the later adaptation procedure. The adaptation procedure is the key to solving this problem. In this procedure, we need to further adapt Q_s(y|G; Θ_s) on the unlabeled target graph G_t without information about the GNN architecture or the prior distribution of the source data. This leads to three main challenges: (1) We are required to adapt the source model to the target distribution with no access to features in the source domain; (2) The adaptation learning in the target domain is an entirely unsupervised learning procedure because of no access to labels in either the source or target domain. By only optimizing an unsupervised loss in the target domain, it is easy to lose the initial discriminative power of the source model; (3) Algorithm designs that depend on the model structure are no longer feasible because of no access to the source training procedure. In our work, we design unsupervised optimization objectives to solve the above challenges and achieve better performance on the target graph.

3.4.2 Overall Objectives

To adapt the given source model, we mainly design two optimization objectives.
One is to leverage the information stored in the given model, namely the Information Maximization (IM) optimization objective, and the other is to utilize the target graph structure, namely the Structure Consistency (SC) optimization objective, to enhance the discriminative ability of the model on the target graph. The overall objective is defined as follows:

\max_{Θ} \; L = L_{IM} + L_{SC}    (3.2)

Note that both optimization objectives are designed on the model output space R^k, where k is the number of classes. The objective can thus be easily applied to any GNN model. Details on the two objectives are presented in the following sections.

3.4.3 Information Maximization Optimization Objective

We define the IM objective as the mutual information between the inputs and outputs of the model, enhancing its discriminative ability:

L_{IM} = MI(ŷ_t, V_t) = -H(ŷ_t | V_t) + H(ŷ_t),    (3.3)

where ŷ_t is the prediction on the target domain and V_t is the information of the input nodes, containing node features X_t and information from node neighbors N_t. MI(·, ·) is the mutual information, and H(·) and H(·|·) are entropy and conditional entropy, respectively. The objective can be divided into two parts: one is to minimize the conditional entropy and the other is to maximize the entropy of the marginal distribution of ŷ_t. We introduce the implementation and the idea behind such a design for these two parts, respectively.

Conditional Entropy. The conditional entropy can be implemented by the following equation:

H(ŷ_t | V_t) = E_{x_i \sim p_t(x)} \Big[ -\sum_{y=1}^{|C|} q(y | x_i, N_i; Θ) \log q(y | x_i, N_i; Θ) \Big],    (3.4)

which can be easily optimized by sampling nodes from the prior distribution p_t(x) on the target graph. Intuitively, the goal of this objective is to enhance the certainty of predictions made on the target graph, which leads to a substantial improvement in the lower bound of the model’s effectiveness. Theoretically, we present two key lemmas. The first lemma demonstrates the manner in which the objective bolsters the confidence of the model predictions. Meanwhile, the second lemma indicates that the lower bound of the Area Under Curve (AUC) will increase when the objective is applied.

Lemma 1. When the source model is optimized by the objective Eq. (3.4) with a gradient descent optimizer and the capacity of the source model is sufficiently large, for each node v_i on the target graph, the predicted conditional distribution q(y|x_i, N_i; Θ) will converge to a vector q, where the value of η elements will be 1/η, and the other elements will be 0. η is determined by the number of categories with the maximum probability value predicted by the original source model q(y|x_i, N_i; Θ_s). Similarly, the non-zero positions of q are the indices of the categories with the maximum probability value.

In most cases, η equals one, and hence the prediction q will be a one-hot encoding vector. To further verify the effectiveness of the objective, we theoretically analyze its effect on the Area Under Curve (AUC) metric for a binary classification problem:

Lemma 2. When the original source model is trained for a binary classification problem with a discriminative ability of r_p and r_n accuracy for positive samples and negative samples on the target graph, respectively, the lower bound of the AUC can be raised from r_p × r_n to (r_p + r_n)/2 by using the conditional entropy objective Eq. (3.4).

Raising the lower bound of the AUC from r_p × r_n to (r_p + r_n)/2 is significant; for instance, if r_p = r_n = 0.7, then the absolute improvement will be 0.21.
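A minimal PyTorch-style sketch of the conditional-entropy term in Eq. (3.4), computed from a model's soft predictions on the target graph, is shown below; the tensor names are illustrative assumptions.

```python
import torch

def conditional_entropy(probs: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of H(y_hat | V_t) in Eq. (3.4).

    probs: (n, k) row-stochastic matrix of per-node predictions
           q(y | x_i, N_i; Theta) on the unlabeled target graph.
    Minimizing the returned scalar pushes each row towards a
    (near) one-hot vector, in line with Lemma 1.
    """
    eps = 1e-12  # numerical stability for log(0)
    return -(probs * (probs + eps).log()).sum(dim=1).mean()
```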
3.4.3.1 Entropy of Marginal Distribution

The entropy of the marginal distribution of ŷ_t can be calculated as:

H(ŷ_t) = -\sum_{y} q(y) \log q(y), \qquad q(y) = E_{v_i \sim p_t(v)} [ q(y | x_i, N_i; Θ) ].    (3.5)

This objective is designed to prevent the unsupervised objective from getting stuck in a bad solution where predictions concentrate on the same category. In particular, if we have additional knowledge about the prior distribution of labels p_t(y) on the target graph, a KL-divergence objective can replace Eq. (3.5):

KL(p_t(y) \| q(y)) = \sum_{y} p_t(y) \log \frac{p_t(y)}{q(y)}.    (3.6)

To summarize, we adopt Eq. (3.5) by default to balance different categories, and we can use Eq. (3.6) to approximate the real distribution with additional information about the prior label distribution.

3.4.4 Structure Consistency Optimization Objective

To adapt the source model to the shifted target domain without source data, leveraging the structural information of the target graph becomes the key solution. Thus, we design a Structure Consistency (SC) objective based on two hypotheses: (1) the probability of sharing the same label is relatively high for local neighbors; (2) the probability of sharing the same label is relatively high for nodes with the same structural role. These two hypotheses are commonly utilized in many graph embedding works [221, 78, 231], where several structure-preserving losses based on the above hypotheses are designed for learning node representations. To be specific, the SC objective is designed as follows:

L_{SC} = λ_1 \sum_{v_i ∈ V_t} \sum_{v_j ∈ V_t} J_L^{i,j} + λ_2 \sum_{v_i ∈ V_t} \sum_{v_j ∈ V_t} J_C^{i,j},
J_L^{i,j} = p_l^{(i,j)} \log σ(⟨ŷ_t^{(i)}, ŷ_t^{(j)}⟩) + (1 - p_l^{(i,j)}) \log σ(-⟨ŷ_t^{(i)}, ŷ_t^{(j)}⟩),
J_C^{i,j} = p_c^{(i,j)} \log σ(⟨ŷ_t^{(i)}, ŷ_t^{(j)}⟩) + (1 - p_c^{(i,j)}) \log σ(-⟨ŷ_t^{(i)}, ŷ_t^{(j)}⟩),    (3.7)

where ŷ_t^{(i)} is the predicted label vector for the i-th node, σ(x) = 1/(1 + e^{-x}) is the Sigmoid function, ⟨·, ·⟩ is the inner product, and λ_1 and λ_2 are hyperparameters to control the importance of the two sub-objectives. Notice that both λ_1 and λ_2 are set to the default value 1 in the experiments. p_l^{(i,j)}, p_c^{(i,j)} ∈ {0, 1} are defined by the local neighbor similarity and the structural role similarity, respectively. Specifically, if (v_i, v_j) ∈ E_t, then p_l^{(i,j)} = 1, otherwise p_l^{(i,j)} = 0. Similarly, if (v_i, v_j) ∈ S_t, then p_c^{(i,j)} = 1, otherwise p_c^{(i,j)} = 0, where S_t is a set containing the top κ structurally similar node pairs. We follow struc2vec [231] to define the structural similarity, which can be roughly understood as computing the similarity of the sorted degree sequences around two given nodes. In order to reduce the number of hyperparameters, κ is set to the size of the edge set |E_t| by default in all of our experiments, while it can be adjusted as needed. Intuitively speaking, the first and the second sub-objectives correspond to hypotheses (1) and (2), respectively. J_L^{i,j} and J_C^{i,j} are cross-entropy losses defined at the node-pair level. The objective J_L^{i,j} enlarges the prediction similarity between nodes with a connection and distinguishes nodes without a connection. Similarly, the objective J_C^{i,j} enlarges the similarity between nodes with similar structural roles and distinguishes nodes with different structural roles. Finally, we use the negative sampling technique [221] to avoid calculating the objective function for every node pair, for acceleration. Combining the objectives in Eq. (3.4), Eq. (3.5), and Eq. (3.7) as in Eq. (3.2), we obtain the overall differentiable objective of the model parameters Θ. We adopt the adaptive moment estimation method (i.e., Adam) [131] to optimize the overall objective. A compact sketch of the combined objective is given below.
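The following PyTorch-style sketch combines the IM and SC terms into the quantity maximized in Eq. (3.2). It is a simplified illustration under stated assumptions: the pair tensors, the shared negative samples, and the function names are hypothetical, and the structural-similarity pairs are assumed to be precomputed (e.g., following struc2vec), not derived here.

```python
import torch
import torch.nn.functional as F

def soga_objective(probs, pos_edges, struct_pairs, neg_pairs,
                   lam1=1.0, lam2=1.0):
    """Sketch of L = L_IM + L_SC from Eq. (3.2), written as a quantity
    to maximize (in practice one minimizes its negative with Adam).

    probs:        (n, k) softmax outputs of the GNN on the target graph.
    pos_edges:    (m, 2) long tensor of target-graph edges (p_l = 1).
    struct_pairs: (m, 2) long tensor of top-kappa structurally similar
                  pairs (p_c = 1), assumed precomputed.
    neg_pairs:    (m, 2) long tensor of sampled negative pairs (p = 0).
    """
    eps = 1e-12

    # Information Maximization, Eq. (3.3): -H(y|V) + H(y).
    cond_ent = -(probs * (probs + eps).log()).sum(1).mean()
    marginal = probs.mean(0)
    marg_ent = -(marginal * (marginal + eps).log()).sum()
    l_im = -cond_ent + marg_ent

    def pair_term(pairs, positive):
        # <y_hat_i, y_hat_j> for each pair, pushed up or down via Eq. (3.7).
        s = (probs[pairs[:, 0]] * probs[pairs[:, 1]]).sum(1)
        return F.logsigmoid(s if positive else -s).mean()

    # Structure Consistency, Eq. (3.7), with one shared negative sample set.
    l_sc = lam1 * (pair_term(pos_edges, True) + pair_term(neg_pairs, False)) \
         + lam2 * (pair_term(struct_pairs, True) + pair_term(neg_pairs, False))

    return l_im + l_sc
```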
3.5 Experiments

We conduct experiments on real-world datasets to study our proposed algorithm SOGA. We design a series of experiments to answer the following research questions:

• RQ1: How does GCN-SOGA compare with other state-of-the-art node classification methods? (GCN-SOGA represents SOGA applied to the default source domain model, GCN.)

• RQ2: How effectively can SOGA be integrated with different GNN models?

• RQ3: How do different components in SOGA contribute to its effectiveness?

• RQ4: How do different choices of the hyperparameters λ1 and λ2 affect the performance of SOGA?

• RQ5: Can GCN-SOGA learn more distinguishable node representations, as seen from visualization, compared with other baselines?

Table 3.1 Average Macro-F1 and Macro-AUC scores on the target graph in four unsupervised graph domain adaptation tasks for baseline methods and GCN-SOGA. GCN-SOGA-prior is GCN-SOGA with additional label distribution knowledge. Notice that for UGDA baselines, including DANE and UDAGCN, we give them additional access to the source data.

Methods | DBLP→ACM F1 | DBLP→ACM AUC | ACM→DBLP F1 | ACM→DBLP AUC | ACM-D→ACM-S F1 | ACM-D→ACM-S AUC | ACM-S→ACM-D F1 | ACM-S→ACM-D AUC
DeepWalk | 0.135 ± 0.012 | 0.593 ± 0.010 | 0.112 ± 0.012 | 0.613 ± 0.008 | 0.183 ± 0.012 | 0.549 ± 0.006 | 0.237 ± 0.012 | 0.573 ± 0.007
Node2vec | 0.128 ± 0.023 | 0.567 ± 0.011 | 0.080 ± 0.018 | 0.533 ± 0.004 | 0.134 ± 0.012 | 0.537 ± 0.005 | 0.219 ± 0.014 | 0.649 ± 0.003
GCN | 0.583 ± 0.002 | 0.887 ± 0.004 | 0.668 ± 0.015 | 0.937 ± 0.003 | 0.685 ± 0.005 | 0.856 ± 0.008 | 0.796 ± 0.030 | 0.924 ± 0.002
GraphSAGE | 0.418 ± 0.057 | 0.763 ± 0.054 | 0.752 ± 0.010 | 0.934 ± 0.003 | 0.407 ± 0.042 | 0.835 ± 0.013 | 0.743 ± 0.015 | 0.909 ± 0.003
GAT | 0.227 ± 0.004 | 0.831 ± 0.004 | 0.745 ± 0.036 | 0.929 ± 0.011 | 0.681 ± 0.006 | 0.854 ± 0.005 | 0.804 ± 0.007 | 0.928 ± 0.002
GRACE | 0.604 ± 0.014 | 0.908 ± 0.007 | 0.604 ± 0.014 | 0.806 ± 0.004 | 0.662 ± 0.003 | 0.876 ± 0.003 | 0.792 ± 0.007 | 0.907 ± 0.007
DGI | 0.592 ± 0.010 | 0.894 ± 0.012 | 0.621 ± 0.005 | 0.872 ± 0.010 | 0.610 ± 0.003 | 0.842 ± 0.002 | 0.808 ± 0.006 | 0.919 ± 0.007
TENT | 0.617 ± 0.007 | 0.912 ± 0.008 | 0.913 ± 0.011 | 0.957 ± 0.009 | 0.702 ± 0.015 | 0.893 ± 0.004 | 0.813 ± 0.007 | 0.922 ± 0.004
GTrans | 0.610 ± 0.003 | 0.913 ± 0.006 | 0.911 ± 0.007 | 0.948 ± 0.004 | 0.723 ± 0.021 | 0.864 ± 0.003 | 0.753 ± 0.004 | 0.876 ± 0.018
SHOT | 0.556 ± 0.004 | 0.858 ± 0.007 | 0.673 ± 0.079 | 0.938 ± 0.008 | 0.658 ± 0.011 | 0.864 ± 0.002 | 0.827 ± 0.016 | 0.916 ± 0.018
NRC | 0.561 ± 0.009 | 0.858 ± 0.009 | 0.644 ± 0.010 | 0.897 ± 0.007 | 0.629 ± 0.007 | 0.823 ± 0.010 | 0.817 ± 0.003 | 0.921 ± 0.005
DANE | 0.614 ± 0.017 | 0.906 ± 0.023 | 0.584 ± 0.008 | 0.937 ± 0.003 | 0.722 ± 0.004 | 0.888 ± 0.002 | 0.821 ± 0.004 | 0.923 ± 0.001
UDAGCN | 0.626 ± 0.070 | 0.930 ± 0.006 | 0.696 ± 0.009 | 0.953 ± 0.008 | 0.665 ± 0.010 | 0.881 ± 0.004 | 0.822 ± 0.018 | 0.928 ± 0.002
GCN-SOGA | 0.636 ± 0.003 | 0.931 ± 0.004 | 0.928 ± 0.018 | 0.988 ± 0.002 | 0.733 ± 0.005 | 0.907 ± 0.005 | 0.842 ± 0.008 | 0.951 ± 0.002
GCN-SOGA-prior | 0.650 ± 0.007 | 0.943 ± 0.008 | 0.935 ± 0.011 | 0.990 ± 0.001 | 0.737 ± 0.004 | 0.908 ± 0.005 | 0.843 ± 0.003 | 0.953 ± 0.005

3.5.1 Experiment Settings

Datasets. We use two groups of real-world graph datasets for our experiments. DBLPv8 and ACMv9 are the first group of citation networks, collected by [289] from ArnetMiner [262]. Their domain gap mainly comes from different origins (DBLP and ACM, respectively) and different publication time periods, i.e., DBLPv8 (after 2010) and ACMv9 (between 2000 and 2010).
ACM-D (Dense) and ACM-S (Sparse) are the second group of citation networks, from the ACM dataset collected by [306]. The detailed statistics of these datasets are illustrated in Tab. 3.2.

Table 3.2 Statistics of the experimental datasets.

Datasets | # Nodes | # Edges | # Features | # Labels
DBLPv8 | 5578 | 7341 | 7537 | 6
ACMv9 | 7410 | 11135 | 7537 | 6
ACM-D | 1500 | 4960 | 300 | 4
ACM-S | 1500 | 759 | 300 | 4

Baselines. We select state-of-the-art methods as baselines to verify the effectiveness of our proposed algorithm. They are (1) graph embedding methods, including DeepWalk and Node2vec [221, 78]; (2) Graph Neural Network (GNN) methods, including GCN, GraphSAGE, and GAT [134, 84, 269]; (3) SFUDA methods on images, including SHOT [154] and NRC [305]; (4) self-supervised learning methods, including DGI [271], GRACE [344], TENT [275], and GTrans [120]; (5) UGDA methods, including DANE and UDAGCN [330, 289]. DGI and GRACE are two graph self-supervised learning baselines that focus on learning good representations for downstream tasks. However, they require labeled data on the target domain for fine-tuning, which is not available in our scenario. To make a fair comparison, we employ the proposed self-supervised learning algorithms for unsupervised fine-tuning, which follows a training manner similar to our proposed SOGA. TENT and GTrans are the selected self-supervised learning algorithms for unsupervised fine-tuning from the image and graph domains, respectively. Notice that we give the UGDA methods additional access to the source data. For all baseline methods with GNNs, we utilize a two-layer GCN with hidden dimensions of 256 and 128, respectively, as the default encoder. All baseline methods are included with a careful hyperparameter check.

Reproducibility Settings. To ensure the validity of the experiments, each source dataset is randomly split into training and validation sets with a ratio of 4:1. All the experimental results are averaged over 5 runs with different random seeds [1, 3, 5, 7, 9]. Regarding the hyperparameter search setting, we do not perform any hyperparameter search for SOGA. We set the hyperparameters λ1 and λ2 to the default value of 1 in all experiments except for the hyperparameter sensitivity analysis. For the baseline methods, we apply a large-scale hyperparameter grid search on all other baselines, including GNN, SFUDA, self-supervised learning, and UGDA methods, to ensure those baseline methods reach their best performance. Our reproduction settings differ from those reported in some baseline papers, such as the five random seeds and the validation partition. After checking with their authors, we consider that this may induce experimental results different from those in the original papers. We also release our code and the experimental details in the repository here.

Table 3.3 The performance comparison on the Macro-F1 score of different GNN models with or without applying SOGA.

Methods | DBLPv8→ACMv9 | ACMv9→DBLPv8 | ACM-D→ACM-S | ACM-S→ACM-D
GCN | 0.583 ± 0.002 | 0.668 ± 0.015 | 0.685 ± 0.005 | 0.796 ± 0.030
GCN-SOGA | 0.636 ± 0.003 | 0.928 ± 0.018 | 0.736 ± 0.007 | 0.838 ± 0.008
GraphSAGE | 0.418 ± 0.057 | 0.752 ± 0.010 | 0.407 ± 0.042 | 0.743 ± 0.015
GraphSAGE-SOGA | 0.594 ± 0.086 | 0.947 ± 0.002 | 0.734 ± 0.006 | 0.820 ± 0.020
GAT | 0.227 ± 0.004 | 0.745 ± 0.036 | 0.681 ± 0.006 | 0.804 ± 0.007
GAT-SOGA | 0.592 ± 0.086 | 0.946 ± 0.001 | 0.736 ± 0.006 | 0.824 ± 0.027

Evaluation Methods. We conduct the stability evaluation to verify the stability of the algorithms. The main reason is that models trained for different numbers of epochs may have large performance differences. Generally speaking, there should be an additional validation set used to select the best training epoch. However, a validation set is not available on the unlabeled target domain. Therefore, it is of great importance to evaluate the stability of model performance on the target domain. With good stability, it is easy for models to achieve satisfying performance, since stability prevents significant performance fluctuations after convergence. In contrast, the performance of an unstable method will drop quickly after reaching the peak or fluctuate continuously. Concretely speaking, we (1) plot the line chart describing the Macro-F1 score on the target domain at each training epoch, and (2) calculate the mean and standard deviation of the Macro-F1 scores on the target domain across the epochs. To avoid the initial fluctuation before convergence, we use the epochs after the first n ones for the calculation (set to 20 by default). A flat curve with little fluctuation indicates good stability, corresponding to results with a large mean and a small standard deviation. A small sketch of this computation follows.
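The stability statistics described above reduce to a mean and standard deviation over the tail of the learning curve; the sketch below, with an illustrative score list, shows the computation under the default warm-up of 20 epochs.

```python
import numpy as np

def stability_stats(f1_per_epoch, warmup: int = 20):
    """Stability evaluation of Section 3.5.1: mean and standard deviation
    of the target-domain Macro-F1 curve after the first `warmup` epochs.
    `f1_per_epoch` is an illustrative list of per-epoch scores.
    """
    tail = np.asarray(f1_per_epoch[warmup:], dtype=float)
    return tail.mean(), tail.std()

# Example: a curve that is flat after convergence yields a high mean and
# a near-zero standard deviation, i.e. good stability.
print(stability_stats([0.40, 0.55] + [0.63] * 98))
```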
3.5.2 Overall Results

The experimental results of all baseline methods, GCN-SOGA (applying SOGA on GCN), and GCN-SOGA-prior on the Macro-F1 score and Macro-AUC score are illustrated in Tab. 3.1.

Figure 3.3 The comparison of learning curves of different UGDA methods on the DBLPv8 → ACMv9 task. The x axis denotes the training epoch, whereas the y axis denotes the Macro-F1 score on the target graph.

GCN-SOGA-prior is a variant of GCN-SOGA in which we give the SOGA algorithm additional knowledge about the prior distribution of labels p_t(y). Accordingly, we replace the default entropy of the marginal distribution in Eq. (3.5) with the KL-divergence objective in Eq. (3.6). Notice that UGDA methods other than our proposed GCN-SOGA require additional information, i.e., access permission to the source graph in the adaptation procedure. Therefore, these methods are not feasible in the SFUGDA scenario. When reproducing these methods, we give them additional access to the source data. The experimental results of GCN-SOGA on the four cross-domain tasks show consistent improvements in the Macro-F1 score and Macro-AUC score, with maximum gains of 2.1% and 4%, respectively. Additionally, GCN-SOGA-prior achieves a greater gain than GCN-SOGA on the first group. It indicates that SOGA can work better with prior label distribution awareness. The reason the performance of GCN-SOGA-prior is similar to that of GCN-SOGA on the second group is that both the ACM-D and ACM-S datasets have an almost uniform label distribution p(y). The additional prior knowledge happens to be close to the default assumption. Therefore, it is unsurprising that the performance of GCN-SOGA-prior is similar to that of GCN-SOGA. From the perspective of the different cross-domain tasks, we find that the performance on DBLPv8 → ACMv9 and ACM-D → ACM-S is much lower than on the other two tasks, which indicates their difficulty. We mainly focus on these difficult tasks and conduct further experiments on them in later sections.

Figure 3.4 The comparison of learning curves of GCN-SOGA and its variants on the DBLPv8 → ACMv9 task. The x axis denotes the training epoch, whereas the y axis denotes the Macro-F1 score on the target graph.
From the perspective of the different baseline methods, we can see that the graph embedding methods perform poorly, probably due to their lack of cross-graph similarity preservation. Though the relative position between nodes is preserved by structural proximity, similar nodes may have entirely different absolute positions in different graphs. GNNs perform better because the message-passing mechanism, such as graph convolution, can preserve the similarity of nodes if their local sub-graphs are similar, as proven by [62]. The self-supervised learning methods DGI and GRACE do not perform well since they are designed to learn a good representation rather than discriminative ability on the specific downstream task. TENT also does not work well since it does not take the complex relationships on the graph into consideration. GTrans shows unsatisfying performance on the ACM-D → ACM-S task since it transforms the target graph structure into a sparser one; however, this sparsity assumption does not always hold in reality. The SFUDA methods from the image domain, SHOT and NRC, do not show satisfying performance since they have no graph-specific design. The two UGDA methods, UDAGCN and DANE, perform best among all baselines because they implicitly mitigate the distribution gap. To give a more careful comparison between the UGDA baselines and our proposed GCN-SOGA, we conduct the stability evaluation in detail. From Fig. 3.3, we can see that GCN-SOGA, in red, illustrates stability with a high Macro-F1 score, while DANE, in blue, reveals a slower convergence. The performance of UDAGCN, in green, exhibits violent fluctuation. The fluctuation is mainly attributed to the conflict in optimizing the domain alignment loss and the cross-entropy loss. Such fluctuation causes great difficulty in deciding at which epoch to stop the adaptation procedure. Thus, the result further indicates the strength of our GCN-SOGA. Moreover, such fluctuation indicates that the alignment loss may strongly conflict with the main cross-entropy loss during optimization. This could be evidence of why those UGDA methods with additional access to the source data perform worse than our SOGA.

Table 3.4 Statistical results of the stability evaluation on GCN-SOGA and its variants. The reported results are the mean and standard deviation of Macro-F1 scores in the training procedure after 20 epochs.

Methods | DBLPv8→ACMv9 | ACMv9→DBLPv8 | ACM-D→ACM-S | ACM-S→ACM-D
GCN | 0.5832 ± 0.0000 | 0.6683 ± 0.0000 | 0.6857 ± 0.0000 | 0.7961 ± 0.0000
GCN-SOGA | 0.6151 ± 0.0005 | 0.9382 ± 0.0002 | 0.7323 ± 0.0017 | 0.8244 ± 0.0056
GCN-SOGA-IMOnly | 0.5823 ± 0.0007 | 0.9406 ± 0.3640 | 0.7227 ± 0.0170 | 0.8263 ± 0.0060
GCN-SOGA-SCOnly | 0.5576 ± 0.0004 | 0.6491 ± 0.0653 | 0.7160 ± 0.0026 | 0.3182 ± 0.0013

Figure 3.5 Performance on the target domain with different choices of λ1 and λ2: (a) performance on the DBLPv8 → ACMv9 task; (b) performance on the ACM-D → ACM-S task. The orange solid line and the corresponding shadow indicate the mean value and the standard deviation of the results, respectively, when λ2 > λ1, which means more attention on the neighborhood. The blue ones are similar except λ1 > λ2.
3.5.3 Effectiveness of SOGA on different GNN models

To demonstrate the efficacy and the model-agnostic property of our proposed algorithm, SOGA, we evaluate SOGA with different representative GNN models. Specifically, we combine SOGA with GCN, GraphSAGE, and GAT, named GCN-SOGA, SAGE-SOGA, and GAT-SOGA, respectively. The results are shown in Tab. 3.3. One observation is that SOGA brings consistent improvement on different GNN models, which verifies that SOGA is model-agnostic. Meanwhile, the poor performance of GAT and GraphSAGE on some tasks (e.g., ACM-D → ACM-S), even with a careful hyperparameter grid search, indicates a potential overfitting problem for these expressive models. Nonetheless, even in such poor cases, e.g., GraphSAGE on the ACM-D → ACM-S task, the original Macro-F1 is merely 0.407 while the performance with SOGA is 0.734, almost the same as the best result of 0.736. This indicates that SOGA can help achieve comparably good performance regardless of a poor original GNN model.

Figure 3.6 Target domain network representation visualization of the source GCN and UGDA methods using t-SNE on the task DBLPv8 → ACMv9: (a) GCN, (b) DANE, (c) UDAGCN, (d) GCN-SOGA-IMOnly, (e) GCN-SOGA. Our proposed GCN-SOGA is marked with the red box. GCN-SOGA-IMOnly is a variant of GCN-SOGA mentioned in Section 3.5.4 without the SC optimization objective.

3.5.4 Ablation Study

We conduct ablation experiments to investigate the contribution of each component. The study has two parts: the necessity of the well-trained source model, and the roles of the two optimization objectives. All experiments in this section utilize the stability evaluation for a more detailed and fair comparison. First, we verify the necessity of the indispensable component: the well-trained source model. Experiments show that only applying the two unsupervised objectives to a randomly initialized model leads to failure: the highest Macro-F1 score among all tasks is no more than 0.20, similar to random guessing. Therefore, it is of great significance to utilize the primary discriminative ability of the well-trained source model. Then we further explore the different roles of the two unsupervised optimization objectives via the stability evaluation. We propose two variants of SOGA, SOGA-IMOnly and SOGA-SCOnly, corresponding to the algorithm trained with only the IM or only the SC objective, respectively. The curves on DBLPv8 → ACMv9 are shown in Fig. 3.4, from which we make the following observations. (1) Our GCN-SOGA, in red, shows a significant performance gain and strong stability, with a flat curve after the first few epochs. (2) For GCN-SOGA-IMOnly, in blue, there is an evident drop after reaching the peak at around 10 epochs; this instability reveals the difficulty and uncertainty of achieving a good result. (3) For GCN-SOGA-SCOnly, in green, the curve is smooth after reaching the peak but shows only a small gain. Overall, the IM and SC objectives show a complementary effect: IM enhances the discriminative ability to achieve better results, while SC is responsible for maintaining stable, consistently good performance throughout the training procedure. The detailed statistical results of the stability evaluation on all tasks are illustrated in Tab. 3.4.
We can find that on easy tasks like ACMv9 → DBLPv8, GCN-SOGA-IMOnly achieves results similar to GCN-SOGA. However, on the more difficult tasks, DBLPv8 → ACMv9 and ACM-D → ACM-S, GCN-SOGA achieves a more significant and more stable improvement than GCN-SOGA-IMOnly. This further indicates the necessity of SC, especially on more difficult tasks. Moreover, we notice that GCN-SOGA-SCOnly may perform even worse than the original GCN trained on the source domain. The key reason is that no supervised signal can be utilized in the SFUGDA scenario, so both the SC and IM objectives are unsupervised. It is therefore possible to lose part of the original discriminative ability through unsupervised fine-tuning.

3.5.5 Hyperparameter sensitivity analysis

Though we have achieved impressive results with the default hyperparameter setting 𝜆1 = 𝜆2 = 1, it is still worthwhile to examine how different choices of 𝜆1 and 𝜆2 affect the performance of SOGA. Specifically, we focus on the Macro-F1 performance of GCN-SOGA as well as its stability during the training procedure under different hyperparameter settings. To this end, we conduct a neighborhood evaluation and a structure role evaluation, which assign a larger weight to 𝜆1 and 𝜆2, respectively. For the neighborhood evaluation, we run experiments 10 times with different choices where 𝜆1 > 𝜆2, and plot the line chart of the Macro-F1 score on the target domain after the first 10 training epochs, skipping initial fluctuations for brevity. The structure role evaluation is analogous except that 𝜆2 > 𝜆1. The experimental results on DBLPv8 → ACMv9 and ACM-D → ACM-S are illustrated in Fig. 3.5. The solid line is the average result of the 10 experiments, while the shaded area represents the corresponding standard deviation. We can see that the curves stay at a high level and the shaded area is narrow, which indicates good and stable performance during training. We conclude that our performance is robust to different choices of 𝜆1 and 𝜆2.

3.5.6 Visualization

We visualize node representations of the target domain generated by the baseline methods GCN, DANE, and UDAGCN, our proposed GCN-SOGA, and its variant GCN-SOGA-IMOnly (without the Structure Consistency optimization objective). The aim is to further demonstrate the performance of SOGA even without an explicit domain adaptation component to mitigate the domain discrepancy. One thing we want to point out is that the key to achieving good classification performance on the target domain is good class separability (the learned target node representations with the same label are close, while those with different labels are far apart). Mitigating the domain discrepancy with an explicit domain adaptation component, which has been adopted by many UGDA methods, is just one effective approach to enhance class separability on the target domain; it may not be necessary. A method like SOGA can also achieve good separation by fully exploiting the potential of the source model and utilizing the target graph structure. For simplicity, we choose the most difficult task, DBLPv8 → ACMv9, to visualize the node representations. The node representation is the hidden representation closest to the final fully-connected linear layer, which has the same dimension (128) in all methods. The dimensionality reduction method for visualization is t-distributed Stochastic Neighbor Embedding (t-SNE) [268]. The visualization results of the different methods are shown in Fig. 3.6.
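For reference, a 2-D projection of this kind can be produced with scikit-learn as sketched below; the variable names (hidden representations `H` of shape (n, 128) and integer labels `y`) are placeholders rather than the exact pipeline used here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# H: (n, 128) hidden representations closest to the final linear layer; y: (n,) labels.
H = np.random.randn(500, 128)          # placeholder for the real representations
y = np.random.randint(0, 5, size=500)  # placeholder for the real labels

emb2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(H)
plt.scatter(emb2d[:, 0], emb2d[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE of target-domain node representations")
plt.savefig("tsne_target.png", dpi=200)
```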
The color of each node represents its label. We can observe that the well-trained source model GCN shows low class separability on the target graph, where clusters overlap heavily. The UGDA methods, DANE and UDAGCN, are somewhat better, with clearer clustering; however, the boundaries are still difficult to identify. For our proposed GCN-SOGA, marked in the red box, clear boundaries can be seen, though some nodes are clustered mistakenly due to the limitation of the fully unsupervised adaptation procedure. We further compare GCN-SOGA with its variant GCN-SOGA-IMOnly, which lacks the SC optimization objective, to see how structural information benefits class separation. The comparison further reveals that, by using the graph structure to learn structural proximity, the source model can better adapt to the target data distribution.

3.6 Conclusion

In this work, we articulate a new scenario called Source Free Unsupervised Graph Domain Adaptation, in which the source graph is inaccessible for practical reasons such as privacy policies. Existing methods cannot work well in this scenario since feature alignment is impossible. Facing the challenges in SFUGDA, we propose our algorithm SOGA, which can be applied to arbitrary GNNs, adapting them to the target domain distribution and enhancing the discriminative ability of the source model. Extensive experiments indicate its effectiveness.

CHAPTER 4 DEMYSTIFYING STRUCTURAL DISPARITY IN GRAPH NEURAL NETWORKS: CAN ONE SIZE FIT ALL?

4.1 Introduction

Inferring missing node information is essential for understanding a graph with additional node features. It is a fundamental network problem with extensive usage in many real-world applications, including recommender systems (inferring user portraits), citation networks (predicting missing paper topics), and biological networks (identifying protein roles). This problem can be formalized as a node classification or regression task, where the goal is to predict the missing attribute utilizing the graph structure and the existing features. When GNNs are applied to node classification, they are generally believed to work well under the homophily assumption, where neighboring nodes are likely to share the same label and similar features. An example of the homophilic pattern is depicted in the upper part of Figure 4.1, where node features and node labels are denoted by colors (i.e., blue and red) and numbers (i.e., 0 and 1), respectively. We can observe that all connected nodes exhibit homophilic patterns and share the same label 0. Message passing can then be viewed as a denoising process toward a node representation that is smooth over its neighborhood. Recently, several studies have demonstrated that GNNs can also perform well on certain heterophilic graphs [14, 184, 176]. In heterophilic graphs, connected nodes tend to have different labels, which we refer to as heterophilic patterns. The example in the lower part of Figure 4.1 shows the heterophilic patterns. Based on this example, we intuitively illustrate how GNNs can work on such heterophilic patterns (lower right): after averaging features over all neighboring nodes, nodes with label 0 completely switch from their initial blue color to red, and vice versa; despite this feature alteration, the two classes remain easily distinguishable since nodes with the same label (number) share the same color (features).
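This intuition can be checked numerically: under mean aggregation, a purely heterophilic toy graph swaps the two feature "colors" but keeps the classes separable. Below is a minimal sketch with made-up features; the helper name and the toy graph are illustrative only.

```python
import numpy as np

def mean_aggregate(A, X):
    """Row-normalized mean aggregation F = D^{-1} A X."""
    deg = A.sum(axis=1, keepdims=True)
    return (A @ X) / deg

# Heterophilic toy graph: every edge connects a label-0 node (feature -1, "blue")
# to a label-1 node (feature +1, "red").
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.array([[-1.0], [1.0], [-1.0], [1.0]])   # labels y = [0, 1, 0, 1]

F = mean_aggregate(A, X)
print(F.ravel())   # [ 1. -1.  1. -1.]: colors are swapped, yet the classes stay separable
```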
However, existing studies on the effectiveness of GNNs [13, 14, 184, 176] focus solely on either homophilic or heterophilic patterns and overlook the fact that real-world graphs typically exhibit a mixture of the two.

Figure 4.1 Examples of homophilic and heterophilic patterns. Colors/numbers indicate node features/labels.

Recent studies [152, 156] reveal that many heterophilic graphs, e.g., Squirrel and Chameleon [239], contain over 20% homophilic nodes. Similarly, our preliminary study depicted in Figure 4.2 demonstrates that heterophilic nodes are consistently present in many homophilic graphs, e.g., PubMed [246] and Ogbn-arxiv [101]. Hence, real-world homophilic graphs predominantly consist of homophilic nodes as the majority structural pattern, with heterophilic nodes in the minority, while heterophilic graphs exhibit the opposite phenomenon, with heterophilic nodes in the majority and homophilic ones in the minority. To provide insights aligned with this real-world structural disparity, we revisit the toy example in Figure 4.1, considering homophilic and heterophilic patterns together. Specifically, for nodes labeled 0, both homophilic and heterophilic node features appear blue before aggregation. However, after aggregation, homophilic and heterophilic nodes with label 0 exhibit different features, appearing blue and red, respectively. Such differences may lead to performance disparity between nodes in majority and minority patterns. For instance, in a homophilic graph whose majority pattern is homophilic, GNNs are more likely to learn the association between blue features and class 0 because of the larger number of supervised signals from the majority. Consequently, nodes in the majority structural pattern can perform well, while nodes in the minority structural pattern may exhibit poor performance, indicating an over-reliance on the majority structural pattern. Observations indicate that GCN [132], a vanilla GNN, often underperforms MLP-based models on nodes with the minority pattern while outperforming them on the majority nodes. To develop a model that simultaneously captures both homophilic and heterophilic patterns, it is essential to move beyond fixed message-passing operators with pre-defined neighborhood treatment. To address this issue, I adapt transformers [255], whose self-attention mechanisms offer a powerful alternative due to their ability to consider more neighboring nodes and to dynamically weigh the importance of each node. This adaptability makes it possible to capture diverse network patterns together. Moreover, the large parameter scale of transformers further enhances their ability to model complex network patterns with extensive network data. To meet the transformer’s sequential input requirement, I convert the original node neighborhood into sequences via a random walk algorithm, which traverses connected nodes in a random manner and thereby reveals structural proximity through the sequence of visited nodes. To learn how to select neighbor nodes adaptively, I propose a masked node modeling pretext task, aiming to recover the masked neighbor nodes in the random-walk sequence. The large-scale pre-trained transformer, trained on over 100 million instances, can automatically select informative neighbor nodes with high attention scores while ignoring noisy neighbor nodes.
Consequently, the pre-trained model can achieve satisfying few-shot performance across networks, comparable to a fully supervised GNN. In contrast, a transformer with supervised training cannot generalize well and tends to overfit shortcut solutions.

4.2 Preliminaries

Semi-Supervised Node Classification (SSNC). Let 𝐺 = (𝑉, 𝐸) be an undirected graph, where 𝑉 = {𝑣1, · · · , 𝑣𝑛} is the set of 𝑛 nodes and 𝐸 ⊆ 𝑉 × 𝑉 is the edge set. Nodes are associated with node features X ∈ R𝑛×𝑑, where 𝑑 is the feature dimension. The number of classes is denoted as 𝐾. The adjacency matrix A ∈ {0, 1}𝑛×𝑛 represents graph connectivity, where A[𝑖, 𝑗] = 1 indicates an edge between nodes 𝑖 and 𝑗. D is the degree matrix with D[𝑖, 𝑖] = 𝑑𝑖, where 𝑑𝑖 denotes the degree of node 𝑣𝑖. Given a small set of labeled nodes 𝑉tr ⊆ 𝑉, the SSNC task is to predict the labels of the unlabeled nodes 𝑉 \ 𝑉tr.

Node homophily ratio is a common metric to quantify homophilic and heterophilic patterns. It is calculated as the proportion of a node’s neighbors sharing the same label as the node [220, 341, 152]. It is formally defined as $h_i = \frac{|\{u \in \mathcal{N}(v_i) : y_u = y_{v_i}\}|}{d_i}$, where $\mathcal{N}(v_i)$ denotes the neighbor set of $v_i$ and $d_i = |\mathcal{N}(v_i)|$ is its cardinality. Following [152, 63, 220], node 𝑖 is considered homophilic when the majority of its neighbors share the same label as the center node, i.e., ℎ𝑖 > 0.5. We define the graph homophily ratio ℎ as the average of node homophily ratios, $h = \frac{1}{|V|}\sum_{i \in V} h_i$. Moreover, this ratio can easily be extended to higher-order cases $h^{(k)}$ by considering the $k$-order neighborhood $\mathcal{N}_k(v_i)$.

Node subgroup refers to a subset of nodes in the graph sharing similar properties, typically homophilic or heterophilic patterns measured by the node homophily ratio. Training nodes are denoted as 𝑉tr. Test nodes 𝑉te can be categorized into 𝑀 node subgroups, $V_{\text{te}} = \bigcup_{m=1}^{M} V_m$, where nodes in the same subgroup 𝑉𝑚 share a similar structural pattern.

4.3 Effectiveness of GNN on nodes with different structural properties

In this section, we explore the effectiveness of GNNs on node subgroups exhibiting distinct structural patterns, specifically homophilic and heterophilic patterns. This differs from previous studies [184, 13, 14, 178, 176] that primarily conduct analysis on the whole graph and demonstrate effectiveness with an overall performance gain. These studies, while useful, do not provide insights into the effectiveness of GNNs on different node subgroups and may even obscure scenarios where GNNs fail on specific subgroups despite an overall performance gain. To accurately gauge the effectiveness of GNNs, we take a closer look at node subgroups with distinct structural patterns. The following experiments are conducted on two common homophilic graphs, Ogbn-arxiv [101] and PubMed [246], and two heterophilic graphs, Chameleon and Squirrel [239]. These datasets are chosen since GNNs achieve better overall performance than MLPs on them.

Figure 4.2 Node homophily ratio distributions on (a) PubMed (ℎ=0.79), (b) Arxiv (ℎ=0.63), (c) Chameleon (ℎ=0.22), and (d) Squirrel (ℎ=0.25). All graphs exhibit a mixture of homophilic and heterophilic nodes despite various graph homophily ratios ℎ.

Existence of structural pattern disparity within a graph. Before investigating the effectiveness of GNNs, we first verify that real-world graphs indeed exhibit different node subgroups with diverse structural patterns.
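As a concrete reference for the homophily measures used throughout this chapter, the node and graph homophily ratios defined above can be computed as sketched below; the helper names and the edge-list input format are illustrative assumptions.

```python
import numpy as np

def node_homophily(edge_index, labels, num_nodes):
    """h_i = |{u in N(v_i) : y_u = y_{v_i}}| / d_i for every node with d_i > 0."""
    same = np.zeros(num_nodes)
    deg = np.zeros(num_nodes)
    for u, v in edge_index:          # assume both (u, v) and (v, u) are listed
        deg[u] += 1
        same[u] += labels[u] == labels[v]
    h = np.full(num_nodes, np.nan)
    nz = deg > 0
    h[nz] = same[nz] / deg[nz]
    return h

def graph_homophily(edge_index, labels, num_nodes):
    """h = (1/|V|) * sum_i h_i, averaged over nodes with at least one neighbor."""
    return np.nanmean(node_homophily(edge_index, labels, num_nodes))
```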
We demonstrate the node homophily ratio distributions of the aforementioned datasets in Figure 4.2 and make the following observations. Obs.1: All four graphs exhibit a mixture of both homophilic and heterophilic patterns, rather than a uniform structural pattern. Obs.2: In homophilic graphs, the majority of nodes exhibit a homophilic pattern with ℎ𝑖>0.5, while in heterophilic graphs, the majority of nodes exhibit the heterophilic pattern with ℎ𝑖≤0.5. We define nodes in the majority structural pattern as majority nodes, e.g., homophilic nodes in a homophilic graph.

Figure 4.3 Performance comparison between GCN and MLP-based models on (a) PubMed (ℎ=0.79), (b) Arxiv (ℎ=0.63), (c) Chameleon (ℎ=0.22), and (d) Squirrel (ℎ=0.25). Each bar represents the accuracy gap on a specific node subgroup exhibiting a homophily ratio within the range specified on the x-axis. MLP-based models often outperform GCN (a positive value) on heterophilic nodes in homophilic graphs and on homophilic nodes in heterophilic graphs.

Examining GCN performance on different structural patterns. To examine the effectiveness of GNNs on different structural patterns, we compare the performance of GCN [132], a vanilla GNN, with two MLP-based models, the vanilla MLP and the Graphless Neural Network (GLNN) [328], on test nodes with different homophily ratios. It is evident that the vanilla MLP can have a large performance gap compared to GCN (i.e., 20% in accuracy) [184, 328, 132]; consequently, comparing an under-trained vanilla MLP with a well-trained GNN would be unfair and preclude rigorous conclusions. Therefore, we also include an advanced MLP model, GLNN, which is trained by distilling GNN predictions and exhibits performance on par with GNNs. Notably, only GCN has the ability to leverage structural information during the inference phase, while both the vanilla MLP and GLNN rely solely on node features as input. This comparison ensures a fair study of the effectiveness of GNNs in capturing different structural patterns while mitigating the effects of node features. Experimental results on the four datasets are presented in Figure 4.3. In the figure, the y-axis corresponds to the accuracy difference between the MLP-based models and GCN, where a positive value indicates that the MLP-based model outperforms GCN; the x-axis represents node subgroups whose homophily ratios fall within the given range, e.g., [0.0-0.2]. Based on the experimental results, the following key observations can be made. Obs.1: In homophilic graphs, both GLNN and MLP demonstrate superior performance on heterophilic nodes with homophily ratios in [0-0.4], while GCN outperforms them on homophilic nodes. Obs.2: In heterophilic graphs, MLP-based models often outperform on homophilic nodes yet underperform on heterophilic nodes. Notably, the vanilla MLP performance on Chameleon is worse than that of GCN across different subgroups. This can be attributed to the training difficulties encountered on Chameleon, where an unexpected similarity in node features from different classes is observed [175].
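The comparison in Figure 4.3 can be reproduced with a simple grouping routine such as the sketch below, which bins test nodes by homophily ratio and reports the accuracy gap between an MLP-based model and GCN; the function name and the prediction arrays are assumed inputs.

```python
import numpy as np

def accuracy_gap_by_homophily(h, y_true, pred_mlp, pred_gcn,
                              bins=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Return acc(MLP) - acc(GCN) for test nodes whose homophily ratio falls in each bin."""
    gaps = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = ((h >= lo) & (h <= hi)) if hi == 1.0 else ((h >= lo) & (h < hi))
        if mask.sum() == 0:
            gaps[(lo, hi)] = np.nan
            continue
        acc_mlp = (pred_mlp[mask] == y_true[mask]).mean()
        acc_gcn = (pred_gcn[mask] == y_true[mask]).mean()
        gaps[(lo, hi)] = acc_mlp - acc_gcn
    return gaps
```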
Our observations indicate that, despite the effectiveness of GCN suggested by [184, 176, 13], GCN exhibits limitations in the form of performance disparity on both homophilic and heterophilic graphs. This motivates investigating why GCN benefits majority nodes, e.g., homophilic nodes in homophilic graphs, while struggling with minority nodes.

Organization. In light of the above observations, we endeavor to understand the underlying causes of this phenomenon in the subsequent sections by answering the following research questions. Section 4.3.1 focuses on how aggregation, the fundamental mechanism in GNNs, affects nodes with distinct structural patterns differently. Building on these observations, Section 4.3.2 identifies the key factors driving performance disparity on different structural patterns via a non-i.i.d. PAC-Bayes bound. Section 4.3.3 empirically corroborates the validity of our theoretical analysis on real-world datasets.

4.3.1 How does aggregation affect nodes with structural disparity differently

In this subsection, we examine how aggregation has different effects on nodes with structural disparity, which serves as a precondition for performance disparity. Specifically, we focus on the discrepancy between nodes from the same class but with different structural patterns. For a controlled study, we adopt the contextual stochastic block model (CSBM) with two classes. It is widely used for graph analysis, including generalization [184, 13, 276, 14, 277, 71], clustering [70], fairness [116, 115], and GNN architecture design [285, 52, 292]. Typically, nodes in the CSBM are generated into two disjoint sets C1 and C2 corresponding to two classes, 𝑐1 and 𝑐2, respectively. Each node in class 𝑐𝑖 is associated with features 𝑥 ∈ R𝑑 sampled from 𝑁(𝜇𝑖, 𝐼), where 𝜇𝑖 is the feature mean of class 𝑐𝑖 with 𝑖 ∈ {1, 2}. The distance between the feature means of the two classes, 𝜌 = |𝜇1 − 𝜇2|, indicates the classification difficulty based on node features alone. Edges are then generated based on an intra-class probability 𝑝 and an inter-class probability 𝑞. For instance, a node with class 𝑐1 has probabilities 𝑝 and 𝑞 of connecting with another node in class 𝑐1 and 𝑐2, respectively. The CSBM model, denoted as CSBM(𝜇1, 𝜇2, 𝑝, 𝑞), presumes that all nodes exclusively follow either the homophilic pattern (𝑝 > 𝑞) or the heterophilic pattern (𝑝 < 𝑞). However, this assumption conflicts with real-world scenarios, where graphs often exhibit both patterns simultaneously, as shown in Figure 4.2. To mirror such scenarios, we propose a variant of CSBM, referred to as CSBM-Structure (CSBM-S), allowing the simultaneous description of homophilic and heterophilic nodes.

Definition 1 (CSBM-S(𝜇1, 𝜇2, (𝑝^(1), 𝑞^(1)), (𝑝^(2), 𝑞^(2)), Pr(homo))). The generated nodes consist of two disjoint sets C1 and C2. Each node feature 𝑥 is sampled from 𝑁(𝜇𝑖, 𝐼) with 𝑖 ∈ {1, 2}. Each set C𝑖 consists of two subgroups: C𝑖^(1) for nodes in the homophilic pattern with intra-class and inter-class edge probabilities 𝑝^(1) > 𝑞^(1), and C𝑖^(2) for nodes in the heterophilic pattern with 𝑝^(2) < 𝑞^(2). Pr(homo) denotes the probability that a node is in the homophilic pattern. C𝑖^(𝑗) denotes the nodes in class 𝑖 and subgroup 𝑗 with (𝑝^(𝑗), 𝑞^(𝑗)). We assume nodes follow the same degree distribution, with 𝑝^(1) + 𝑞^(1) = 𝑝^(2) + 𝑞^(2).
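To make the CSBM-S generative process concrete, a minimal sampler is sketched below. The function name is hypothetical, and the per-pair edge rule (averaging the two endpoints' probabilities) is one simple way to realize per-node edge probabilities, not necessarily the exact protocol used in the analysis.

```python
import numpy as np

def sample_csbm_s(n, mu1, mu2, pq_homo, pq_hetero, pr_homo, rng=None):
    """Minimal CSBM-S sampler (illustrative).

    Each node gets a class in {0, 1}, a subgroup flag (homophilic with prob. pr_homo),
    and a feature drawn from N(mu_class, I).  An unordered pair is connected with the
    average of the two endpoints' (p, q) values -- an assumption made for simplicity.
    """
    rng = np.random.default_rng(rng)
    y = rng.integers(0, 2, size=n)                       # classes
    homo = rng.random(n) < pr_homo                       # subgroup flag
    mus = np.stack([np.asarray(mu1, float), np.asarray(mu2, float)])
    X = mus[y] + rng.standard_normal((n, mus.shape[1]))  # features ~ N(mu_y, I)

    p = np.where(homo, pq_homo[0], pq_hetero[0])         # intra-class prob. per node
    q = np.where(homo, pq_homo[1], pq_hetero[1])         # inter-class prob. per node
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            prob = 0.5 * ((p[i] + p[j]) if y[i] == y[j] else (q[i] + q[j]))
            A[i, j] = A[j, i] = rng.random() < prob
    return X, A, y, homo

# Example: p^(1)+q^(1) = p^(2)+q^(2) keeps the expected degree matched across subgroups.
X, A, y, homo = sample_csbm_s(200, mu1=[1, 0], mu2=[-1, 0],
                              pq_homo=(0.15, 0.05), pq_hetero=(0.05, 0.15),
                              pr_homo=0.8, rng=0)
```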
Based on the neighborhood distributions, the mean aggregated features F = D−1AX follow Gaussian distributions on both the homophilic and heterophilic subgroups:

$$\mathbf{f}_i^{(j)} \sim N\!\left(\frac{p^{(j)}\boldsymbol{\mu}_1 + q^{(j)}\boldsymbol{\mu}_2}{p^{(j)} + q^{(j)}},\ \frac{\mathbf{I}}{\sqrt{d_i}}\right) \text{ for } i \in \mathcal{C}_1^{(j)}; \qquad \mathbf{f}_i^{(j)} \sim N\!\left(\frac{q^{(j)}\boldsymbol{\mu}_1 + p^{(j)}\boldsymbol{\mu}_2}{p^{(j)} + q^{(j)}},\ \frac{\mathbf{I}}{\sqrt{d_i}}\right) \text{ for } i \in \mathcal{C}_2^{(j)} \tag{4.1}$$

where $\mathcal{C}_i^{(j)}$ is the node subgroup of label $i$ with structural pattern $(p^{(j)}, q^{(j)})$. Our initial examination of the different effects of aggregation focuses on the aggregated feature distance between the homophilic and heterophilic node subgroups within class $c_1$.

Proposition 1. The aggregated feature mean distance between the homophilic and heterophilic node subgroups within class $c_1$ is $\left|\frac{p^{(1)}\mu_1 + q^{(1)}\mu_2}{p^{(1)} + q^{(1)}} - \frac{p^{(2)}\mu_1 + q^{(2)}\mu_2}{p^{(2)} + q^{(2)}}\right| > 0$, indicating that the aggregated features of the homophilic and heterophilic subgroups come from different distributions, whereas the corresponding mean distance before aggregation is 0, since the original node features are drawn from the same distribution regardless of the structural pattern.

This proposition suggests that aggregation introduces a distance gap between different patterns within the same class. In addition to feature differences within the same class, we further examine the discrepancy between nodes $u$ and $v$ with the same aggregated feature $\mathbf{f}_u = \mathbf{f}_v$ but different structural patterns. We measure the discrepancy by the difference of the probabilities that nodes $u$ and $v$ belong to class $c_1$, denoted as $|\mathbb{P}_1(y_u = c_1 \mid \mathbf{f}_u) - \mathbb{P}_2(y_v = c_1 \mid \mathbf{f}_v)|$, where $\mathbb{P}_1$ and $\mathbb{P}_2$ are the conditional probabilities of $y = c_1$ given the feature $\mathbf{f}$ under the structural patterns $(p^{(1)}, q^{(1)})$ and $(p^{(2)}, q^{(2)})$, respectively.

Lemma 1. Assume (1) a balanced class distribution with $\mathbb{P}(Y = 1) = \mathbb{P}(Y = 0)$ and (2) aggregated feature distributions sharing the same variance $\sigma$. When nodes $u$ and $v$ have the same aggregated features $\mathbf{f}_u = \mathbf{f}_v$ but different structural patterns $(p^{(1)}, q^{(1)})$ and $(p^{(2)}, q^{(2)})$, we have:

$$|\mathbb{P}_1(y_u = c_1 \mid \mathbf{f}_u) - \mathbb{P}_2(y_v = c_1 \mid \mathbf{f}_v)| \le \frac{\rho^2}{\sqrt{2\pi}\,\sigma}\, |h_u - h_v| \tag{4.2}$$

Notably, the above assumptions are not strictly necessary but are employed for an elegant expression. Lemma 1 implies that nodes with a small homophily ratio difference $|h_u - h_v|$ are likely to share the same class, and vice versa.

4.3.2 Why does Performance Disparity Happen? Subgroup Generalization Bound for GNNs

In this subsection, we conduct a rigorous analysis elucidating the primary causes of performance disparity across node subgroups with distinct structural patterns. We identify two key factors for satisfying test performance: (1) a test node $u$ should have a small feature distance $\min_{v \in V_{\text{tr}}} |\mathbf{f}_u - \mathbf{f}_v|$ to the training nodes $V_{\text{tr}}$, indicating that the test node can be strongly influenced by the training nodes; (2) having identified the closest training node $v$, nodes $u$ and $v$ should be likely to share the same class, i.e., $\sum_{c_i \in \mathcal{C}} |\mathbb{P}(y_u = c_i \mid \mathbf{f}_u) - \mathbb{P}(y_v = c_i \mid \mathbf{f}_v)|$ should be small. The second factor, concerning whether two close nodes share the same class, depends on the homophily ratio difference $|h_u - h_v|$, as shown in Lemma 1. Notably, since training nodes are randomly sampled, their structural patterns are likely to follow the majority pattern. Therefore, training nodes exhibit a smaller homophily ratio difference with majority test nodes, which share the same majority pattern, than with minority test nodes, resulting in performance disparity across distinct structural patterns.
To rigorously examine the role of the aggregated feature distance and the homophily ratio difference in performance disparity, we derive a non-i.i.d. PAC-Bayesian GNN generalization bound, based on the subgroup generalization bound for deterministic classifiers [181]. We begin by stating key assumptions on the graph data and the GNN model to clearly delineate the scope of our theoretical analysis.

Definition 2 (Generalized CSBM-S model). Each node subgroup $V_m$ follows the CSBM distribution $V_m \sim \text{CSBM}(\mu_1, \mu_2, p^{(m)}, q^{(m)})$, where different subgroups share the same class means but have different intra-class and inter-class probabilities $p^{(m)}$ and $q^{(m)}$. Moreover, node subgroups share the same degree distribution, i.e., $p^{(i)} + q^{(i)} = p^{(j)} + q^{(j)}$.

Instead of the CSBM-S model with one homophilic and one heterophilic pattern, we adopt the generalized CSBM-S assumption, allowing more structural patterns with different levels of homophily.

Assumption 1 (GNN model). We focus on SGC [288] with the following components: (1) a one-hop mean aggregation function $g$, with $g(X, G)$ denoting its output; (2) an MLP feature transformation $f(g_i(X, G); W_1, W_2, \cdots, W_L)$, where $f$ is a ReLU-activated $L$-layer MLP with $W_1, \cdots, W_L$ as the parameters of each layer. The largest width of all hidden layers is denoted as $b$.

Notably, although we theoretically analyze a simple GNN architecture, similar to [181, 184, 77], our analysis can easily be extended to the higher-order case, and Section 4.3.3 shows empirical success across different GNN architectures. Our main theorem builds on PAC-Bayes analysis, which typically aims to bound the generalization gap between the expected margin loss $\mathcal{L}^0_m$ on a test subgroup $V_m$ for margin 0 and the empirical margin loss $\widehat{\mathcal{L}}^{\gamma}_{\text{tr}}$ on the train subgroup $V_{\text{tr}}$ for a margin $\gamma$. These losses are commonly used in PAC-Bayes analysis [198, 197, 53, 65]. The formulation is shown as follows:

Theorem 1 (Subgroup Generalization Bound for GNNs). Let $\tilde{h}$ be any classifier in the classifier family $\mathcal{H}$ with parameters $\{\widetilde{W}_l\}_{l=1}^{L}$. For any $0 < m \le M$, $\gamma \ge 0$, and a sufficiently large number of training nodes $N_{\text{tr}} = |V_{\text{tr}}|$, there exists $0 < \alpha < \frac{1}{4}$ such that, with probability at least $1 - \delta$ over the sample of $\mathbf{y}_{\text{tr}} := \{y_i\}_{i \in V_{\text{tr}}}$, we have:
$$\mathcal{L}^0_m(\tilde{h}) \le \widehat{\mathcal{L}}^{\gamma}_{\text{tr}}(\tilde{h}) + O\!\Bigg( \underbrace{\frac{K\rho}{\sqrt{2\pi}\,\sigma}\big(\epsilon_m + |h_{\text{tr}} - h_m| \cdot \rho\big)}_{(a)} + \underbrace{\frac{b \sum_{l=1}^{L} \|\widetilde{W}_l\|_F^2\, (\epsilon_m)^{2/L}}{(\gamma/8)^{2/L}\, N_{\text{tr}}^{\alpha}}}_{(b)} + \underbrace{R}_{(c)} \Bigg) \tag{4.3}$$

The bound involves three terms. (a) shows that both a large homophily ratio difference $|h_{\text{tr}} - h_m|$ and a large aggregated feature distance $\epsilon_m = \max_{j \in V_m} \min_{i \in V_{\text{tr}}} \|g_i(X, G) - g_j(X, G)\|_2$ between the test subgroup $V_m$ and the training nodes $V_{\text{tr}}$ lead to a large generalization error; $\rho = |\mu_1 - \mu_2|$ denotes the original feature separability, independent of structure, and $K$ is the number of classes. (b) further strengthens the effect of the aggregated feature distance $\epsilon_m$, leading to a large generalization error. (c) $R$ is a term independent of the aggregated feature distance and the homophily ratio difference, given by $R = \frac{1}{N_{\text{tr}}^{2\alpha}} \ln \frac{LC(2B_m)^{1/L}}{\gamma^{1/L}\,\delta} + \frac{1}{N_{\text{tr}}^{1-2\alpha}}$, where $B_m = \max_{i \in V_{\text{tr}} \cup V_m} \|g_i(X, G)\|_2$ is the maximum feature norm; $R$ vanishes as the training size $N_{\text{tr}}$ grows.

Our theory suggests that both the homophily ratio difference and the aggregated feature distance to the training nodes are key factors contributing to performance disparity. Typically, nodes with a large homophily ratio difference and a large aggregated feature distance to the training nodes suffer performance degradation.

4.3.3 Performance Disparity Across Node Subgroups on Real-World Datasets

To empirically examine the effects identified by the theoretical analysis, we compare the performance on different node subgroups, divided by both homophily ratio difference and aggregated feature distance to the training nodes, for popular GNN models including GCN [132], SGC [288], GAT [270], GCNII [42], and GPRGNN [50].

Figure 4.4 Test accuracy disparity across node subgroups divided by both aggregated-feature distance and homophily ratio difference to the training nodes, on (a) PubMed (ℎ=0.79), (b) Arxiv (ℎ=0.63), (c) Chameleon (ℎ=0.22), and (d) Squirrel (ℎ=0.25). Each figure corresponds to a dataset, and each bar cluster corresponds to a GNN model. A clear performance decrease can be found from subgroup 1 to 5 with increasing difference to the training nodes.

Figure 4.5 Test accuracy disparity across node subgroups divided by aggregated-feature distance to the training nodes, on the same datasets and with the same layout as Figure 4.4. Each figure corresponds to a dataset, and each bar cluster corresponds to a GNN model.

Figure 4.6 Test accuracy disparity across node subgroups divided by homophily ratio difference to the training nodes, on the same datasets and with the same layout as Figure 4.4. Each figure corresponds to a dataset, and each bar cluster corresponds to a GNN model.
Specifically, test nodes are partitioned into subgroups based on their disparity scores relative to the training set, in terms of both the 2-hop homophily ratio $h^{(2)}_i$ and the 2-hop aggregated features $\mathbf{F}^{(2)} = (\tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}})^2 \mathbf{X}$, where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ and $\tilde{\mathbf{D}} = \mathbf{D} + \mathbf{I}$. For a test node $u$, we measure the node disparity by (1) selecting the closest training node $v = \arg\min_{v \in V_{\text{tr}}} \|\mathbf{F}^{(2)}_u - \mathbf{F}^{(2)}_v\|_2$ and (2) calculating the disparity score $s_u = \|\mathbf{F}^{(2)}_u - \mathbf{F}^{(2)}_v\|_2 + |h^{(2)}_u - h^{(2)}_v|$, where the first and second terms correspond to the aggregated-feature distance and the homophily ratio difference, respectively. We then sort test nodes by disparity score and divide them into 5 equally sized subgroups. Performance on the different node subgroups is presented in Figure 4.4, from which we make the following observations. Obs.1: We note a clear test accuracy degradation with respect to increasing differences in aggregated features and homophily ratios. Furthermore, we investigate the individual effects of the aggregated feature distance and the homophily ratio difference in Figures 4.5 and 4.6, respectively; an overall trend of performance decline with increasing disparity score is evident, though some exceptions are present. Obs.2: When only considering the aggregated feature distance, there is no clear trend among groups 1, 2, and 3 for GCN, SGC, and GAT on the heterophilic datasets. Obs.3: When only considering the homophily ratio difference, there is no clear trend among groups 1, 2, and 3 across the four datasets. These observations underscore the importance of both the aggregated-feature distance and the homophily ratio difference in shaping GNN performance disparity. Combining the two factors provides a more comprehensive and accurate understanding of the reasons behind GNN performance disparity.
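The subgroup construction used in Figures 4.4-4.6 can be sketched as follows; dense matrices and a hypothetical helper name are used purely for illustration.

```python
import numpy as np

def disparity_subgroups(A, X, h2, train_idx, test_idx, num_groups=5):
    """Partition test nodes into equal-sized subgroups by disparity to the training set.

    A: (n, n) adjacency, X: (n, d) features, h2: (n,) 2-hop node homophily ratios.
    The disparity score of a test node u is ||F2_u - F2_v||_2 + |h2_u - h2_v|,
    where F2 = (D~^{-1} A~)^2 X and v is u's closest training node in F2 space.
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)
    P = A_tilde / A_tilde.sum(axis=1, keepdims=True)   # D~^{-1} A~
    F2 = P @ (P @ X)                                   # 2-hop aggregated features

    scores = []
    for u in test_idx:
        dists = np.linalg.norm(F2[train_idx] - F2[u], axis=1)
        v = train_idx[int(np.argmin(dists))]
        scores.append(dists.min() + abs(h2[u] - h2[v]))
    order = np.asarray(test_idx)[np.argsort(scores)]
    return np.array_split(order, num_groups)           # subgroup 1 = smallest disparity
```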
Summary. In this section, we study GNN performance disparity on nodes with distinct structural patterns and uncover its underlying causes. We primarily investigate the impact of aggregation, the key component of GNNs, on nodes with different structural patterns in Section 4.3.1, and observe that aggregation affects nodes with different structural patterns differently, notably enhancing the discriminative ability on majority nodes. These observed performance disparities inspire us to identify the crucial factors contributing to GNN performance disparity via a non-i.i.d. PAC-Bayes bound in Section 4.3.2. The theoretical analysis indicates that test nodes with larger aggregated feature distances and homophily ratio differences to the training nodes experience performance degradation. We substantiate these findings on real-world datasets in Section 4.3.3.

4.4 Implications of graph structural disparity

In this section, we illustrate the significance of our findings on structural disparity by (1) elucidating the effectiveness of existing deeper GNNs and (2) unveiling an overlooked aspect of distribution shift in the graph out-of-distribution (OOD) problem and introducing a new OOD scenario accordingly.

4.4.1 Elucidating the effectiveness of deeper GNNs

Deeper GNNs [50, 42, 138, 3, 168, 147, 300] enable each node to capture more complex higher-order graph structure than vanilla GCN by reducing the over-smoothing problem [216, 242, 38, 168, 150]. Nonetheless, on which structural patterns deeper GNNs excel and the reason for their effectiveness remain unclear. To investigate this problem, we compare vanilla GCN with different deeper GNNs, including GPRGNN [50], APPNP [138], and GCNII [42], on node subgroups with varying homophily ratios, adhering to the same setting as Figure 4.3. Experimental results are shown in Figure 4.7. We observe that deeper GNNs primarily surpass GCN on minority node subgroups, with slight performance trade-offs on the majority node subgroups. We conclude that the effectiveness of deeper GNNs stems mainly from an improved discriminative ability on minority nodes.

Figure 4.7 Performance comparison between GCN and deeper GNNs on (a) PubMed (ℎ=0.79), (b) Ogbn-arxiv (ℎ=0.63), (c) Chameleon (ℎ=0.22), and (d) Squirrel (ℎ=0.25). Each bar represents the accuracy gap on a specific node subgroup exhibiting a homophily ratio within the range specified on the x-axis.

4.4.2 A new graph out-of-distribution scenario

The graph out-of-distribution (OOD) problem refers to the underperformance of GNNs due to distribution shifts on graphs. Many graph OOD scenarios [290, 343, 330, 166, 101, 81, 181, 190], e.g., biased training labels, time shift, and popularity shift, have been extensively studied. These OOD scenarios can typically be categorized into covariate shift with Ptrain(X) ≠ Ptest(X) and concept shift [227, 205, 286] with Ptrain(Y|X) ≠ Ptest(Y|X), where Ptrain(·) and Ptest(·) denote the train and test distributions, respectively. Existing graph concept shift scenarios [290, 81] introduce different environment variables 𝑒, resulting in P(Y|X, 𝑒train) ≠ P(Y|X, 𝑒test) and leading to spurious correlations. To address existing concept shifts, algorithms [290, 327] have been developed to capture the environment-invariant relationship P(Y|X). Nonetheless, existing concept shift settings overlook the scenario in which there is no single environment-invariant relationship P(Y|X). For instance, P(Y|Xhomo) and P(Y|Xhete) can be different, as indicated in Section 4.3.1.
Xhomo and Xhete correspond to the features of nodes in homophilic and heterophilic patterns, respectively. Notably, homophilic and heterophilic patterns constitute crucial task knowledge and cannot be treated as irrelevant environment variables, given their integral role in the node classification task. Consequently, we find that the homophily ratio difference between the train and test sets can be an important factor leading to an overlooked concept shift, which we call the graph structural shift. The practical implications of this concept shift are substantiated by the following scenarios: (1) the graph structural shift frequently occurs in most graphs, with a performance degradation on minority nodes, as depicted in Figure 4.3; (2) the graph structural shift hides in existing graph OOD scenarios; for instance, the Facebook-100 dataset [290] reveals a substantial homophily ratio difference between train and test sets, averaging 0.36, and this discrepancy could be the primary cause of OOD performance deterioration, since existing OOD algorithms [290, 121] that neglect this concept shift attain only a minimal average performance gain of 0.12%; (3) the graph structural shift is a recurrent phenomenon in numerous real-world applications, where new nodes in a graph may exhibit distinct structural patterns. For example, in a recommendation system, existing users with rich data can receive well-personalized recommendations in the exploitation stage (homophily), while new users with less data may receive diverse recommendations during the exploration stage (heterophily).

Given the prevalence and importance of the graph structural shift, we propose a new graph OOD scenario emphasizing this concept shift. Specifically, we introduce a new data split on existing datasets, namely Cora, CiteSeer, PubMed, Ogbn-Arxiv, Chameleon, and Squirrel, where majority nodes are selected for training and validation, and minority nodes for testing. This split strategy highlights the homophily ratio difference and the corresponding concept shift. To better illustrate the challenges posed by our new scenario, we conduct experiments on models including GCN, MLP, GLNN, GPRGNN, and GCNII. We also include the graph OOD algorithms SRGNN [343] and EERM [290] with GCN encoders; EERM(II) is a variant of EERM with a GCNII encoder. For a fair comparison, we also report GCN performance on an i.i.d. random split, GCN(i.i.d.), sharing the same node sizes for train, validation, and test. Results are shown in Table 4.1.

Table 4.1 Performance (Accuracy) on the proposed OOD split.

Methods      Pubmed         Ogbn-Arxiv     Squirrel       Chameleon
GCN(i.i.d)   89.18±0.15     72.99±0.14     58.09±0.71     75.09±0.79
GCN          51.04±0.16     34.94±0.07     32.13±4.93     43.35±3.47
MLP          68.38±0.43     33.17±0.37     24.57±0.77     34.78±4.97
GLNN         67.51±0.25     35.89±0.14     31.51±0.70     47.01±1.09
GCNII        67.76±0.36     36.81±0.14     37.15±1.39     41.25±2.03
GPRGNN       57.24±0.18     34.95±0.43     42.43±7.71     35.27±7.67
SRGNN        57.91±0.10     40.37±1.65     37.62±1.74     42.09±0.43
EERM         65.37±1.35     34.23±0.46     40.93±0.57     45.84±1.05
EERM(II)     67.59±0.91     40.28±0.84     44.31±0.40     48.59±0.78

The following observations can be made. Obs.1: Performance degradation can be observed by comparing the OOD setting with the i.i.d. one across the four datasets, confirming the existence of the OOD issue. Obs.2: MLP-based models and deeper GNNs generally outperform vanilla GCN, demonstrating their superiority on minority nodes.
Obs.3: Graph OOD algorithms with GCN encoders struggle to yield good performance across datasets, indicating a unique challenge relative to other graph OOD scenarios. This primarily stems from the difficulty of learning accurate relationships for both homophilic and heterophilic nodes with distinct P(Y|X). Nonetheless, it can be alleviated by selecting a deeper GNN encoder, as the homophily ratio difference may vanish in higher-order structural information, reducing the concept shift. Obs.4: EERM(II), i.e., EERM with GCNII, outperforms the variant with GCN. These observations suggest that the GNN architecture plays an indispensable role in addressing graph OOD issues, highlighting a new direction.

4.5 Conclusion

In conclusion, this work provides crucial insights into GNN performance under structural disparity, which is common in real-world scenarios. We recognize that aggregation exhibits different effects on nodes with structural disparity, leading to better performance on majority nodes than on minority nodes. This understanding also serves as a stepping stone for multiple graph applications. Our exploration mainly focuses on common datasets with clear majority structural patterns, while real-world scenarios, offering more complicated datasets, pose new challenges. Additional experiments are conducted on Actor, IGB-tiny, Twitch-gamer, and Amazon-ratings.

CHAPTER 5 REVISITING LINK PREDICTION: A DATA PERSPECTIVE

5.1 Introduction

Graphs are essential data structures that use links to describe relationships between objects. Link prediction, which aims to find missing links within a graph, is a fundamental task in the graph domain. Link prediction methods aim to estimate the proximity between node pairs, often under the assumption that similar nodes are inclined to establish connections. Originally, heuristic methods [339, 126] were proposed to predict link existence by employing handcrafted proximity features to capture important data factors, e.g., local structural proximity and feature proximity. For example, the Common Neighbors (CN) algorithm [339] assumes that node pairs with more overlap between their one-hop neighborhoods are more likely to be connected. To remove the need for handcrafted features, deep neural networks are utilized to automatically extract high-quality proximity features. In particular, Graph Neural Networks (GNNs) [135, 133, 84] have become increasingly popular owing to their excellence in modeling graph data. Nonetheless, vanilla GNNs fall short in capturing pairwise structural information [326, 155], e.g., neighborhood-overlap features, achieving only modest performance in link prediction. To address these shortcomings, Graph Neural Networks for Link Prediction (GNN4LP) [325, 279, 36] have been proposed to incorporate different inductive biases revolving around pairwise structural information. New GNN4LP designs strive to improve vanilla GNNs to capture diverse pairwise data patterns, e.g., local structural patterns [317, 283], the number of paths [346], and structural position [325]. These models have found wide applicability across a myriad of real-world graph problems from multiple domains, e.g., paper recommendation, drug interaction prediction, and protein analysis [141, 102]. A recent benchmark [149] evaluates the performance of GNN4LP models on datasets from diverse domains and finds performance disparity: there is no universally best-performing GNN4LP model, and even vanilla GCN can achieve the best performance on certain datasets.
[4, 34] reveal similar phenomena across heuristic algorithms. We conjecture that the main reasons for such phenomena are that (i) from a model perspective, different models often have preferred data patterns due to their distinct capabilities and inductive biases, and (ii) from a data perspective, graphs from different domains can originate from distinct underlying mechanisms of link formation.

Figure 5.1 Distribution disparity of Common Neighbors across datasets.

Figure 5.1 illustrates this disparity in the number of CNs on multiple benchmark datasets. Notably, edges in the ogbl-ppa and ogbl-ddi datasets tend to have many CNs. Considering both model and data perspectives, performance disparity becomes evident: certain models perform well when their preferred data patterns align with the data mechanisms of particular datasets, while others do not. This suggests that both the model and the data perspectives are significant to the success of link prediction. While mainstream research focuses on designing better models [325, 326], we opt to investigate a data-centric perspective on the development of link prediction. Such a perspective can provide essential guidance on model design and on benchmark dataset selection for comprehensive evaluation. To analyze link prediction from a data-centric perspective, we must first understand the underlying data factors across different datasets. To achieve these goals, our study proceeds as follows. (i) Drawing inspiration from well-established literature [107, 201] in network analysis, we pinpoint three key data factors for link prediction: local structural proximity, global structural proximity, and feature proximity; comprehensive empirical analyses confirm the importance of these three factors. (ii) In line with the empirical analysis, we present a latent space model for link prediction, providing a theoretical guarantee on the effectiveness of the empirically identified data factors. (iii) We conduct an in-depth analysis of the relationships among data factors under the latent space model. Our analysis reveals an incompatibility between feature proximity and local structural proximity, suggesting that high feature similarity and high local structural similarity rarely occur together within a single edge. This incompatibility sheds light on an overlooked vulnerability of GNN4LP models: they typically fall short in predicting links that primarily arise from feature proximity. (iv) Building upon these systematic understandings, we provide guidance for model design and benchmark dataset selection, along with opportunities for link prediction.

5.2 Main Analysis

In this section, we conduct analyses to uncover the key data factors for link prediction and the underlying relationships among those data factors. Since the underlying data factors contributing to link formation are difficult to examine directly from datasets, we employ heuristic algorithms as a lens to reflect their relevance. Heuristic algorithms calculate similarity scores derived from different data factors to estimate the probability that two nodes should be connected. They are well-suited for this analysis as they are simple and interpretable, rooted in principles from network analysis [211, 127]. Leveraging properly selected heuristic algorithms and well-established literature in network analysis, we endeavor to elucidate the underlying data factors for link prediction.
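The kind of statistic behind Figure 5.1 can be computed with a short routine like the sketch below, which buckets edges by their number of common neighbors; the bucket boundaries follow the figure and the helper name is illustrative.

```python
import numpy as np

def cn_bucket_proportions(A, edges,
                          buckets=((0, 1), (1, 3), (3, 10), (10, 25), (25, np.inf))):
    """Proportion of edges whose common-neighbor count falls in each bucket [lo, hi)."""
    A = np.asarray(A, dtype=float)
    cn_counts = np.array([(A[u] * A[v]).sum() for u, v in edges])
    return {f"[{lo}-{hi})": float(((cn_counts >= lo) & (cn_counts < hi)).mean())
            for lo, hi in buckets}
```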
Organization. Revolving around the data perspective on link prediction, the following subsections are organized as follows. Section 5.2.1 focuses on identifying and empirically validating the key data factors for link prediction using corresponding heuristics. In line with the empirical significance of those factors, Section 5.2.2 introduces a theoretical model for link prediction, associating data factors with node distances in a latent space; links are more likely to be established between nodes with a small latent distance. Section 5.2.3 unveils the relationships among data factors building upon the theoretical model. We clearly identify an incompatibility between the local structural proximity and feature proximity factors: it is unlikely that both large feature proximity and large local structural proximity occur within a single edge. Section 5.2.4 highlights an overlooked limitation of GNN4LP models stemming from this incompatibility.

Preliminaries & Experimental Setup. G = (V, E) is an undirected graph, where V and E are the sets of 𝑁 nodes and |E| edges, respectively. Nodes can be associated with features X ∈ R𝑛×𝑑, where 𝑑 is the feature dimension. We conduct our analysis on the Cora, Citeseer, Pubmed, ogbl-collab, ogbl-ppa, and ogbl-ddi datasets [102, 199] with the same model settings as the recent benchmark [149].

5.2.1 Underlying data factors on link prediction

Motivated by well-established understandings in network analysis [56, 281, 144] and heuristic designs [5, 126], we conjecture that there are three key data factors for link prediction. (1) Local structural proximity (LSP) [212] corresponds to the similarity of the immediate neighborhoods of two nodes. The rationale behind LSP is rooted in the principle of triadic closure [107], which posits that two nodes with more common neighbors have a higher probability of being connected. Heuristic algorithms derived from the LSP perspective include CN, RA, and AA [5], which quantify the overlap between neighborhood sets. We mainly focus on Common Neighbors (CN) in the following discussion; the CN score for nodes 𝑖 and 𝑗 is calculated as |Γ(𝑖) ∩ Γ(𝑗)|, where Γ(·) denotes the neighborhood set. (2) Global structural proximity (GSP) [126, 113] goes beyond the immediate neighborhoods of two nodes by considering their global connectivity. The rationale behind GSP is that two nodes with more paths between them have a higher probability of being connected. Heuristic algorithms derived from GSP include SimRank, Katz, and PPR [26], which extract ensemble-of-paths information. We particularly focus on the Katz heuristic in the following discussion; the Katz score for nodes 𝑖 and 𝑗 is calculated as $\sum_{l=1}^{\infty} \lambda^l\, |\text{paths}^{\langle l \rangle}(i, j)|$, where 𝜆 < 1 is a damping factor indicating the importance of higher-order information and $|\text{paths}^{\langle l \rangle}(i, j)|$ counts the number of length-𝑙 paths between 𝑖 and 𝑗.

To understand the importance of these data factors, we aim to answer the following questions: (i) Does each data factor indeed play a key role in link prediction? (ii) Does each factor provide unique rather than overlapping information? We further investigate the relationship between heuristics from the same and from different data factors, including whether heuristics from the same data factor provide similar information for link prediction and whether those from different data factors offer unique perspectives.

Figure 5.2 Overlapping ratio between top-ranked edges of different heuristic algorithms on (a) ogbl-ddi and (b) ogbl-collab. Diagonal cells compare two heuristics within the same factor, while the others compare heuristics from different factors. MRR is selected as the metric.
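For reference, the two representative heuristics used below, CN for the LSP factor and Katz for the GSP factor, can be scored as sketched here; a truncated Katz sum is used in place of the infinite series, and the helper names are illustrative.

```python
import numpy as np

def common_neighbors(A, i, j):
    """CN(i, j) = |Gamma(i) ∩ Gamma(j)| for a 0/1 adjacency matrix A."""
    return int((A[i] * A[j]).sum())

def katz(A, i, j, lam=0.05, max_len=5):
    """Truncated Katz score: sum_{l=1..max_len} lam^l * #paths of length l from i to j."""
    score, power = 0.0, np.eye(A.shape[0])
    for l in range(1, max_len + 1):
        power = power @ A              # A^l, whose (i, j) entry counts length-l paths
        score += (lam ** l) * power[i, j]
    return score
```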
To this end, we examine disparities in the predictions of different heuristics. Similar predictions imply that they provide similar information, while divergent predictions indicate that each factor provides unique information. Predictions for node pairs can be arranged in descending order according to the predicted likelihood of being connected. We primarily focus on top-ranked node pairs, since they are likely to be predicted as links and thus largely determine the efficacy of the corresponding algorithm. If two algorithms produce similar predictions, their high-likelihood edges should overlap substantially; otherwise, their overlap should be low. Experimental results are shown in Figure 5.2, which reports the overlapping ratio between the top-ranked (25%) node pairs of two heuristics, either from the same data factor or from different data factors. We make the following observations: (i) comparing two heuristics from the same factor, i.e., the diagonal cells, we observe that high-likelihood edges for one heuristic are also top-ranked by the other, indicating that heuristics from the same data factor capture similar information; (ii) comparing two heuristics derived from different factors, the overlap of top-ranked edges is much lower. These observations suggest that (i) selecting one representative heuristic per data factor can be sufficient, since heuristics from the same factor share similar predictions, and (ii) different factors are unique, as there is little overlap in their predictions.

5.2.2 Theoretical model for Link Prediction based on underlying data factors

In this subsection, we propose a theoretical model for link prediction based on the important data factors empirically analyzed above. We first introduce a latent space model and then theoretically demonstrate that the model reflects the effectiveness of the LSP, GSP, and FP factors. The latent space model [96] has been widely utilized in many domains, e.g., sociology and statistics. It is typically used to describe proximity in a latent space, where nodes that are close in the latent space are likely to share particular characteristics. In this work, we propose a latent space model for link prediction describing a graph with 𝑁 nodes that incorporates both feature and structural proximity, where each node is associated with a location in a 𝐷-dimensional latent space. The intuitions for modeling the feature and structural perspectives are as follows. The structural perspective is of primary significance in link prediction. In line with this, the latent space model connects the link prediction problem with the latent pairwise distance 𝑑, which is strongly correlated with structural proximity: a small 𝑑𝑖𝑗 indicates that nodes 𝑖 and 𝑗 share similar structural characteristics and have a high probability of being connected. Considering the feature and structural perspectives together, we develop an undirected graph model inspired by [245].
5.2.2 Theoretical model for Link Prediction based on underlying data factors

In this subsection, we propose a theoretical model for link prediction based on the important data factors empirically analyzed above. We first introduce a latent space model and then theoretically demonstrate that the model reflects the effectiveness of the LSP, GSP, and FP factors. The latent space model [96] has been widely utilized in many domains, e.g., sociology and statistics. It is typically utilized to describe proximity in a latent space, where nodes that are close in the latent space are likely to share particular characteristics. In our paper, we propose a latent space model for link prediction, describing a graph with N nodes that incorporates both feature and structural proximity, where each node is associated with a location in a D-dimensional latent space. Intuitions on modeling the feature and structure perspectives are as follows. The structural perspective is of primary significance in link prediction. In line with this, the latent space model connects the link prediction problem with the latent pairwise distance d, where d is strongly correlated with structural proximity. A small d_ij indicates that two nodes i and j share similar structural characteristics and thus have a high probability of being connected. Considering the feature and structural perspectives together, we develop an undirected graph model inspired by [245]. The detailed formulation is as follows:

P(i \sim j \mid d_{ij}) =
\begin{cases}
\dfrac{1}{1 + e^{\alpha\left(d_{ij} - \max\{r_i, r_j\}\right)}}, & d_{ij} \le \max\{r_i, r_j\}, \\
0, & d_{ij} > \max\{r_i, r_j\},
\end{cases}
\qquad (5.1)

where P(i ∼ j | d_ij) depicts the probability of forming an undirected link between i and j (i ∼ j), predicated on both the features and the structure. The latent distance d_ij indicates the structural likelihood of link formation between i and j. The feature proximity parameter β_ij ∈ [0, 1] additionally introduces the influence from the feature perspective. Moreover, the model has two parameters α and r. α > 0 controls the sharpness of the function. To ease the analysis, we set α = +∞. r_i is a connecting threshold parameter corresponding to node i. With α = +∞, the term 1/(1 + e^{α(d_ij − max{r_i, r_j})}) equals 0 if d_ij > max{r_i, r_j} and 1 otherwise. Therefore, a large r_i indicates that node i is more likely to form edges, leading to a potentially larger degree. Nodes in the graph can be associated with different r values, allowing us to model graphs with various degree distributions. Such flexibility enables our theoretical model to be applicable to more real-world graphs. We next identify how the model reveals the different important data factors in link prediction: we (i) derive heuristic scores revolving around each factor in the latent space and (ii) provide a theoretical foundation suggesting that each score can offer a suitable bound for the probability of link formation. Theoretical results underscore the effectiveness of each factor.

Effectiveness of Local Structural Proximity (LSP). We first derive the common neighbor (CN) score on the latent space model. Notably, since we focus on local structural proximity, the effect of the features is ignored; we therefore set the FP parameter β_ij = 0 for ease of analysis. Considering two nodes i and j, a common neighbor node k can be described as a node connected to both i and j. In the latent space, it should satisfy both d_ik < max{r_i, r_k} and d_kj < max{r_k, r_j}, and thus lies in the intersection between two balls with volumes V(max{r_i, r_k}) and V(max{r_k, r_j}). Notably, V(r) = V(1) r^D is the volume of a ball with radius r, where V(1) is the volume of a unit-radius hypersphere. Therefore, the expected number of common neighbor nodes is proportional to the volume of the intersection between the two balls. With this volume in the latent space, we then derive how CN provides a meaningful bound on the structural distance d_ij.

Proposition 1 (latent space distance bound with CNs). For any δ > 0, with probability at least 1 − 2δ, we have

d_{ij} \le 2\sqrt{\left(r^{\max}_{ij}\right)^{2} - \left(\frac{\eta_{ij}/N - \epsilon}{V(1)}\right)^{2/D}},

where η_ij is the number of common neighbors between nodes i and j, r^max_ij = max{r_i, r_j}, and V(1) is the volume of a unit-radius hypersphere in D-dimensional space. ε is a term independent of η_ij; it vanishes as the number of nodes N grows.

Proposition 1 indicates that a large number of common neighbors η_ij results in a smaller latent distance d_ij, leading to a high probability of an edge connection.
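The qualitative behavior in Proposition 1 can be checked with a small simulation of the latent space model under α = +∞; the sketch below uses illustrative choices of N, D, and the threshold range, and is not the dissertation's experimental setup.

import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 2
pos = rng.uniform(size=(N, D))            # latent positions
r = rng.uniform(0.05, 0.15, size=N)       # per-node connecting thresholds r_i

dist = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
radius = np.maximum(r[:, None], r[None, :])
A = (dist <= radius).astype(int)          # alpha = +infinity: hard-threshold connections
np.fill_diagonal(A, 0)

cn = A @ A                                 # common-neighbor counts for every node pair
iu = np.triu_indices(N, k=1)
mask = cn[iu] > 0
# More common neighbors should correspond to a smaller latent distance (negative correlation).
print(np.corrcoef(cn[iu][mask], dist[iu][mask])[0, 1])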
We then extend the above analysis from local structure to global structure with more complicated structural patterns.

Effectiveness of Global Structural Proximity (GSP). We first derive the number of paths between nodes i and j in the latent space. Notably, most heuristics based on the GSP factor can be viewed as a weighted number of paths. The key idea is to view each common neighbor node as a path of length ℓ = 2, serving as the basic element for paths of length ℓ > 2. We denote that nodes i and j are linked through a path of length ℓ, i.e., i = k_0 ∼ k_1 ∼ . . . ∼ k_{ℓ−1} ∼ k_ℓ = j. As we assume each node is only associated with its neighborhood, the probability that the path k_0 ∼ k_1 ∼ . . . ∼ k_{ℓ−1} ∼ k_ℓ exists can be bounded by the decomposition P(k_0 ∼ k_1 ∼ k_2) · P(k_1 ∼ k_2 ∼ k_3) · · · P(k_{ℓ−2} ∼ k_{ℓ−1} ∼ k_ℓ) = ∏_{l=1}^{ℓ−1} P(k_{l−1} ∼ k_l ∼ k_{l+1}). Notably, each element is the common neighbor probability discussed in Proposition 1, equivalent to a path with ℓ = 2. We then calculate the corresponding volume and derive how the number of paths bounds the latent distance d_ij.

Proposition 2 (latent space distance bound with the number of paths). For any δ > 0, with probability at least 1 − 2δ, we have

d_{ij} \le \sum_{n=0}^{M-2} r_n + 2\sqrt{\left(r^{\max}_{M}\right)^{2} - \left(\frac{\eta_{\ell}(i, j) - b(N, \delta)}{c(N, \delta, \ell)}\right)^{\frac{2}{D(\ell-1)}}},

where η_ℓ(i, j) is the number of paths of length ℓ between i and j in the D-dimensional Euclidean latent space, M ∈ {1, · · · , ℓ − 1} indexes the set of intermediate nodes, and b(N, δ) and c(N, δ, ℓ) are quantities depending only on N, δ, and ℓ.

Proposition 2 indicates that a large number of paths η_ℓ(i, j) results in a smaller latent distance d_ij, leading to a high probability of an edge connection. It demonstrates the effectiveness of the GSP factor.

5.2.3 Intrinsic relationship among underlying data factors

In this subsection, we conduct a rigorous analysis elucidating the intrinsic relationship among the different factors, building upon the theoretical model. Our analyses are two-fold: (i) the relationship between the structural factors, i.e., LSP and GSP; and (ii) the relationship between the feature and structural factors, i.e., FP and LSP, and FP and GSP.

The relationship between local and global structural proximity. To consider both local and global structural factors, we treat the CN algorithm as the number of paths η_ℓ(i, j) with length ℓ = 2. Therefore, the analysis between local and global structural factors can be regarded as the influence of η_ℓ(i, j) at different lengths ℓ. The key to the proof is to isolate the effect of ℓ by bounding the other ℓ-dependent terms in Proposition 2, i.e., η_ℓ(i, j) and c(N, δ, ℓ). We also ignore the feature effect to ease the structural analysis.

Lemma 2 (latent space distance bound with local and global structural proximity). For any δ > 0, with probability at least 1 − 2δ, we have

d_{ij} \le \sum_{n=0}^{M-2} r_n + 2\sqrt{\left(r^{\max}_{M}\right)^{2} - \left(\sqrt{\frac{N \ln(1/\delta)}{2}} - 1\right)^{\frac{2}{D(\ell-1)}}},

where Σ_{n=0}^{M−2} r_n and r^max_M serve as independent variables that do not change with ℓ.

Given the same number of paths η_ℓ with different lengths ℓ, a small ℓ provides a much tighter bound, i.e., a closer distance in the latent space. The bound becomes exponentially looser as ℓ increases, since the hop ℓ enters the term (√(N ln(1/δ)/2) − 1)^{2/(D(ℓ−1))} as an exponential coefficient. This indicates that (i) when both LSP and GSP are sufficient, LSP provides a tighter bound, indicating a more important role; and (ii) when LSP is deficient, e.g., the graph is sparse without many common neighbors, GSP can be more significant.
5.2.4 An overlooked vulnerability in GNN4LP models inspired by data factors

In this subsection, we delve into how the incompatibility between structural proximity and feature proximity affects the effectiveness of GNN4LP models. These models are inherently designed to learn pairwise structural representations, encompassing both feature and structural proximity. Despite their strong capability, the incompatibility between the structural and feature factors leads to potentially conflicting training signals. For example, while structural proximity patterns may imply a likely link between two nodes, feature proximity patterns might suggest the opposite. Therefore, it seems challenging for a single model to benefit both node pairs dominated by the feature proximity factor and those dominated by the structural ones. While most research primarily emphasizes the capability of GNN4LP models on structural proximity, the influence of this incompatibility remains under-explored.

To validate our statement, we conduct experiments to compare the performance of vanilla GNNs, e.g., SAGE and GCN, with advanced GNN4LP models including BUDDY, NeoGNN, and NCNC. The fundamental difference between GNN4LP models and vanilla GNNs is that vanilla GNNs only learn single-node structural representations, with limitations in capturing pairwise structural factors, while GNN4LP models go beyond this. Such a comparison sheds light on how the key capacity of GNN4LP models, i.e., capturing the pairwise structural factor, behaves under the incompatibility. Comparisons are conducted on node pairs dominated by different factors, represented as node pairs E_s \ E_f and E_f \ E_s, where only structural proximity and only feature proximity are accurately predicted, respectively. E_s and E_f denote node pairs accurately predicted with structural proximity and feature proximity, respectively.

Figure 5.3 Performance comparison between GNN4LP models (BUDDY, NeoGNN, NCNC) and SAGE on the ogbl-collab dataset. Bars represent the performance gap on node pairs dominated by feature and structural proximity, respectively. The two panels compare FP with GSP and LSP, respectively.

Experimental results are presented in Figure 5.3, where the x-axis indicates node pairs dominated by different underlying factors and the y-axis indicates the performance difference between GNN4LP models and vanilla GraphSAGE. A notable trend is that GNN4LP models generally outperform vanilla GNNs on edges governed by LSP and GSP while falling short on those governed by feature proximity. This underlines an overlooked vulnerability of GNN4LP models on node pairs dominated by the FP factor, stemming from the incompatibility between feature and structural proximity.

5.3 Guidance for practitioners on Link Prediction

In this section, we provide guidance for new model design and for selecting benchmark datasets for comprehensive evaluation, based on the above understandings from a data perspective.

5.3.1 Guidance for the model design

In Section 5.2, we highlight the incompatibility between the structural and feature proximity factors in influencing GNN4LP models. When both structural and feature factors come into play simultaneously, there is a potential for them to provide conflicting supervision to the model. Such an understanding suggests that the model design should learn the feature proximity factor and the pairwise structural ones independently before integrating their outputs, in order to mitigate such incompatibility. In particular, we apply such a strategy to SEAL [325], a representative GNN4LP model.

Figure 5.4 The original SEAL and the proposed decoupled SEAL architectures: (a) the original coupled SEAL model architecture; (b) the proposed decoupled SEAL model architecture. X_feat and X_drnl are the original node features and the structural embedding via Double-Radius Node Labeling.
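Anticipating the decoupled design of Figure 5.4b described below, the following is a minimal PyTorch Geometric-style sketch of the two-branch idea; it is an illustration rather than the dissertation's implementation, and the class name, layer sizes, and the sigmoid-bounded mixing weight alpha are assumptions.

import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv, global_mean_pool

class DecoupledSEAL(nn.Module):
    # One dedicated encoder for the original node features (X_feat) and one for the
    # DRNL structural labels (X_drnl); their subgraph-level scores are mixed by alpha.
    def __init__(self, feat_dim, drnl_dim, hidden=64):
        super().__init__()
        self.gnn_feat = SAGEConv(feat_dim, hidden)
        self.gnn_drnl = SAGEConv(drnl_dim, hidden)
        self.readout_feat = nn.Linear(hidden, 1)
        self.readout_drnl = nn.Linear(hidden, 1)
        self.alpha = nn.Parameter(torch.tensor(0.0))

    def forward(self, x_feat, x_drnl, edge_index, batch):
        h_feat = self.gnn_feat(x_feat, edge_index).relu()
        h_drnl = self.gnn_drnl(x_drnl, edge_index).relu()
        s_feat = self.readout_feat(global_mean_pool(h_feat, batch))
        s_drnl = self.readout_drnl(global_mean_pool(h_drnl, batch))
        a = torch.sigmoid(self.alpha)             # learnable mixing weight in (0, 1)
        return a * s_feat + (1.0 - a) * s_drnl    # one link score per enclosing subgraph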
Figure 5.5 Effectiveness of the proposed decoupled SEAL: (a) performance of the original and decoupled SEAL; (b) comparison between SEAL models and SAGE.

Different from vanilla GNNs, which only utilize the original node features X_feat as input, SEAL additionally employs local structural features X_drnl obtained by double-radius node labeling (DRNL) based on the nodes' structural roles. X_feat and X_drnl are concatenated and then forwarded to one single GNN, as depicted in Figure 5.4a. Therefore, the GNN must wrestle with the incompatibility between the FP and structural factors. Guided by the above understanding, we propose the decoupled SEAL, which separates the original node features X_feat and the local structural features X_drnl into different GNNs. Each dedicated GNN can learn either feature patterns or pairwise structural patterns separately. The decoupled model architecture is depicted in Figure 5.4b. Experimental results comparing the original SEAL and our proposed decoupled SEAL are illustrated in Figure 5.5a. Notably, our decoupled SEAL consistently outperforms the original, with gains reaching up to 1.03% on the large ogbl-collab dataset. Furthermore, Figure 5.5b shows comparisons with GraphSAGE, following the same setting as Figure 5.3. The decoupled SEAL demonstrates a reduced performance drop on node pairs dominated by the FP factor and a larger gain on those dominated by the structural factors.

Table 5.1 The Hit@10 performance on the newly selected datasets.

         CN      Katz    FH     MLP            SAGE           BUDDY
power    12.88   29.85   NA     5.03 ± 0.88    6.99 ± 1.16    19.88 ± 1.37
photo    18.34   13.78   7.07   12.37 ± 4.13   18.61 ± 5.97   18.09 ± 2.52

5.3.2 Guidance for Benchmark Dataset selection

With the recognized data factors and their relationships, we enumerate all potential combinations among the different data factors, illuminating the complete dataset landscape. This allows us to categorize prevalent datasets and pinpoint missing scenarios not covered by those datasets. Consequently, we introduce new datasets addressing the identified gaps and offer guidance for practitioners on more comprehensive benchmark dataset selection. In particular, we group datasets into four categories considering two main aspects: (i) From the feature perspective, we verify whether FP dominates, indicated by decent performance of FH. (ii) From the structural perspective, we verify whether GSP dominates, indicated by whether a GSP heuristic provides additional improvement over LSP (if not, then LSP dominates); Section 5.2.3 demonstrates that such a scenario happens when LSP is inadequate. Therefore, there are four categories: category 1, where both the LSP and FP factors dominate; category 2, where only the LSP factor dominates; category 3, where both the GSP and FP factors dominate; and category 4, where only the GSP factor dominates. The prevalent datasets like cora, citeseer, and pubmed are in category 3, with both GSP and FP factors dominating, while datasets like ogbl-ddi and ogbl-ppa are primarily in category 2, focusing on the LSP factor. We can then clearly identify that two significant dataset categories, 1 and 4, are not covered by existing datasets. To enable a more comprehensive evaluation beyond existing benchmark datasets, we introduce new datasets to cover these categories. This includes the unfeatured power dataset in category 4 and the photo dataset in category 1.
The categorizations of these datasets are confirmed by the experimental results illustrated in Table 5.1. We observe: (i) For the power dataset, where only GSP matters, Katz significantly outperforms the other algorithms, even the GNN4LP model BUDDY. (ii) Deep models do not show superior performance on either dataset, indicating that success on existing datasets cannot be directly extended to the new ones and suggesting potential room for improvement. We can then provide the following guidance on benchmark dataset selection for practitioners: (i) select algorithms that perform best on datasets belonging to the same category as the proposed one; (ii) select datasets from their own domain rather than datasets from other domains. To help with that, we collect most of the existing datasets for link prediction, covering most domains including biology, transportation, web, academia, and social science, assisting in a more comprehensive evaluation aligned with real-world scenarios. Details on all datasets are in the repository.

5.4 Conclusion

In this work, I explore how to infer missing edges from a data perspective, elucidating the pivotal data factors: LSP, GSP, and FP. Theoretical analyses uncover the underlying incompatibility between the structural and feature proximity factors. Inspired by this incompatibility, our work shows a positive broader impact as we identify the overlooked biased predictions of GNN4LP models and show a potential solution to address this issue. Building on this data perspective, I further propose a GFM model named UniAug (detailed in Chapter 7) to leverage the increasing scale of graph data from different domains. I collect thousands of graphs from various domains and pre-train a self-conditioned discrete diffusion model on them. In the downstream stage, I augment the graphs by preserving the original node features and generating synthetic structures. UniAug achieves satisfying performance on the link prediction task while achieving consistent performance gains as a data augmentation for node- and graph-level tasks.

CHAPTER 6
GRAPH FOUNDATION MODEL ARE ALREADY HERE

6.1 Introduction

Foundation models [23], which are pre-trained on massive data and can be adapted to tackle a wide range of downstream tasks, have achieved inimitable success in various domains, e.g., computer vision (CV) [228] and natural language processing (NLP) [29, 264]. Typically, foundation models can effectively utilize both the prior knowledge obtained from the pre-training stage and the data from downstream tasks to achieve better performance [86] and even deliver promising efficacy with few-shot task demonstrations [60, 194]. Meanwhile, graphs are vital and distinctive data structures that encapsulate non-Euclidean and intricate object relationships. Since various graphs embody unique relations, most graph learning approaches are tailored to train from scratch for a single task on a particular graph. This approach necessitates separate data collection and deployment for each individual graph and task. Consequently, an intriguing question emerges: Is it possible to devise a Graph Foundation Model (GFM) that benefits from large-scale training with better generalization across different domains and tasks?

Despite the advances of foundation models in other domains, the development of GFMs remains in its infancy. Recent research has demonstrated initial successes of GFMs in specialized areas, such as knowledge graphs [72, 73] and molecules [18]. Notably, most of these models are built on principles specific to their domains.
For instance, ULTRA for knowledge graph completion [72] draws inspiration from double equivariance for inductive link prediction [75]. However, there is still a lack of general guidance on how to build GFMs that can effectively cater to a broad spectrum of graph-based applications. The key difficulty in designing GFMs lies in finding the invariance across diverse graph data, ranging from social networks to molecular graphs with countless structural patterns, into the same representation space to achieve positive transfer. The answer from the CV and NLP domains is a shared vocabulary. In the NLP foundation models, the text is first broken down into smaller units 60 based on the vocabulary, which can be words, phrases, or symbols. In the CV foundation models, the image is mapped to a series of discrete image tokens [316, 11] based on the vision token vocabulary. The vocabulary defines the basic units in the particular domain, transferable across different tasks and datasets. Therefore, the key challenges in achieving the GFMs narrow down to how we can find the graph vocabulary, the basic transferable units underlying graphs to encode the invariance on graphs. However, finding a suitable graph vocabulary that works across diverse graphs is challenging, which is the primary focus of this paper. Our contributions: In this paper, we present a vocabulary perspective to clearly state the position of the GFM. A comprehensive review of the graph transferability principles and corresponding actionable steps is illustrated in Section 6.2, serving us the principle for future vocabulary construction and the GFM design. In Section 6.3, we discuss the potential for building the GFM following neural scaling laws from several perspectives (1) building and training vocabulary from scratch, and (2) leveraging existing LLM. Finally, we introduce more insights and open questions to inspire constructive discussions on the GFMs in Section 6.4. 6.2 Graph Transferability Principles with Actionable Steps In the last section, we investigate the key to building an effective GFM, which lies in constructing a suitable graph vocabulary to keep the essential invariance across datasets and tasks. Despite existing successes, more graph transferability principles, identifying different invariances, can serve as guidance for constructing new suitable graph vocabulary for future GFMs. We present a few actionable next steps inspired by these principles, highlighting their potential benefits. The following discussions are organized as follows: We first provide a general introduction to the graph transferability principles in Section 6.2.1. Detailed task-specific principles on node classification, link prediction, and graph classification tasks can be found in Section 6.2.2, 6.2.3, and 6.2.4, respectively. We finally discuss the principles for task transferability in Section 6.2.5. Notably, the following discussions majorly concentrate on the transferability of the graph structure. 61 6.2.1 An overview on Graph transferability principles In this subsection, we introduce principles that enable transferability on graphs, focusing on three key aspects: network analysis, expressiveness, and stability. Network analysis provides a conventional understanding of the network system by identifying fundamental graph patterns, e.g., network motif [202] and establishing the key principles, e.g., triadic closure principle [107] and homophily principle, which are generally valid across different domains. 
Those principles have been generally utilized to guide the design of advanced GNNs. For example, the state-of-the-art GNN for link prediction [283] is a Neural Common Neighbor, inspired by the triadic closure principle. Despite its effectiveness, network analysis heavily relies on expert knowledge without a provable guarantee. Expressiveness provides a theoretical background as to which functions graph neural architectures can model in general, e.g., a well-known connection that graph-level performance of GNNs is bounded by Weisfeiler-Leman tests [299, 208, 207]. The most-expressive structural representation [257] is the key concept describing that the representation of two node sets should be invariant if and only if the node sets are symmetric with a permutation equivalence. Such most-expressive structural representation serves as an important principle to design a suitable graph vocabulary that perfectly distinguishes all non-isomorphic structural patterns in multi-ary prediction tasks. Stability [241] assesses the representation sensitivity to graph perturbations. It aims to maintain a bounded gap in predictions for pairs under minor perturbations, rather than the expressiveness only distinguishing between isomorphic and non-isomorphic cases. The stability imposes a stricter constraint leading to better generalization. It can be an analogy to the constraint on the graph vocabulary where similar structure patterns should have similar representation. 6.2.2 Transferability Principles in Node Classification Network analysis. Homophily [127], which describes the phenomenon of linked nodes often sharing similar features (“birds of a feather flock together”), is a longstanding principle in social science. It serves as the principle guidance for methods ranging from conventional page-rank [51] and label propagation [37] to recent advanced GNNs. Existing GNN architectures, often crafted 62 based on the homophily principle, demonstrate strong performance on diverse homophilous graphs across various domains. This adherence to homophily not only enhances model effectiveness but also facilitates model transferability among homophilous graph datasets. Notably, successful transfers among such graphs are evidenced in [311]. While homophily predominates in network analysis, it is not a universal rule. In many real-world scenarios, “opposites attract", resulting in networks characterized by heterophily—where nodes are more likely to link with dissimilar nodes. GNNs built with the homophily principle often struggle with heterophilious networks, except in cases of “good heterophily” [183, 177], where GNNs can identify and leverage consistent patterns in connections between dissimilar nodes. However, most heterophilious networks are complex and varied, posing challenges for GNNs due to their irregular and intricate interaction patterns [178, 282, 187]. Consequently, GNNs’ transferability, more assured in homophilous graphs, is facing significant challenges in heterophilous ones. Stability. [318] theoretically establishes the relationship between transferability and network stability, demonstrating that graph filters with enhanced spectral smoothness and a smaller maximum frequency response exhibit improved transferability in terms of node features and structure, respectively. In particular, spectral smoothness, characterized by the Lipschitz constant of the graph filter function of the corresponding GNN indicates stability against edge perturbations. 
The maximum frequency response, reflecting the highest spectral frequency after applying a graph filter (essentially the largest eigenvalue of the Laplacian matrix), describes stability against feature perturbations.

Actionable steps inspired by principles. [187] illustrates the network analysis principle that a single GNN can perform well on either homophily patterns or heterophily patterns, but not both. This principle provides actionable insight for GFM design, suggesting that the graph vocabulary for homophily patterns and that for heterophily patterns should be modeled separately. Consequently, the model backbone for GFMs in node classification should not rely on a single GNN, which only excels on either homophilic or heterophilic graphs. A better architecture design choice could be (1) an adaptive GNN with different aggregation filters for homophilic and heterophilic graphs or (2) a graph transformer without a fixed aggregation process [255, 253]. [318] designs a spectral regularization term inspired by network stability to address the out-of-distribution problem. Adapting spectral regularization for GFMs could be a potential next step.

6.2.3 Transferability Principles in Link Prediction

Network Analysis. Important network analysis principles [192] fall into three primary concepts: (1) local structural proximity, corresponding to the triadic closure principle [107], where friends of friends become friends themselves. It inspires well-known conventional methods including CN, RA, and AA [5]. (2) global structural proximity, corresponding to the decay factor principle, where two nodes with more short paths between them have a higher probability of being connected. It inspires well-known conventional methods, e.g., SimRank and Katz [126, 113]. (3) feature proximity, corresponding to the homophily principle [211], where shared beliefs and thoughts can be found in connected individuals. These principles guide the evolution of link prediction algorithms, from basic heuristics to sophisticated GNNs [35, 148]. GNNs inspired by these principles perform well across diverse graphs in multiple domains. Moreover, [337] provides empirical evidence supporting the beneficial transferability of these guiding principles.

Figure 6.1 In this graph, nodes v1 and v4 are isomorphic; links (v1, v2) and (v2, v4) are not isomorphic. However, a vanilla GNN with the same node representations for v1 and v4 gives the same prediction to links (v1, v2) and (v2, v4).

Expressiveness. A vanilla GNN, equipped with only single-node permutation equivariance, cannot achieve transferability for the link prediction task due to its lack of expressiveness. An example of such failure is shown in Figure 6.1 with a featureless graph. v1 and v4 are represented identically by the vanilla GNN, as they possess identical neighborhood structures. Therefore, the similarity between v1 and v2 will be the same as that between v4 and v2, leading to identical representations and predictions for both links (v1, v2) and (v2, v4). However, according to global structural proximity, (v1, v2), with a shorter distance of 1, should be more likely to be connected. The vanilla GNN, which computes the representation of v1 solely from its neighborhood, overlooks the structural dependence with v2. As a result, this potentially leads to negative transfer, where the GNN might erroneously predict both or neither link to exist, whereas it is more likely that only (v1, v2) has a link.
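This failure mode is easy to reproduce numerically. The sketch below assumes a simple path graph v1–v2–v3–v4 as a stand-in for Figure 6.1 (the exact topology of the figure may differ, but the symmetry argument is identical): v1 and v4 are automorphic, so any message-passing encoder without pairwise labeling assigns them the same embedding, and an inner-product scorer cannot separate (v1, v2) from (v2, v4).

import numpy as np

# Path graph v1 - v2 - v3 - v4, indices 0..3, with no node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.ones((4, 4))                               # constant inputs for a featureless graph
W = np.random.default_rng(0).normal(size=(4, 4))  # shared (random) weight matrix
for _ in range(3):
    H = np.tanh(A @ H @ W)                        # vanilla sum-aggregation message passing

score = lambda u, v: float(H[u] @ H[v])           # single-node embeddings + inner product
print(np.allclose(H[0], H[3]))                    # True: v1 and v4 are indistinguishable
print(score(0, 1), score(1, 3))                   # identical scores for (v1,v2) and (v2,v4)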
To consider all possible dependencies between node pairs, we aim for the most expressive structural representation for the link prediction. This representation should be invariant if and only if links are symmetric. [325] achieves such structural representation by incorporating node labeling features that depend on both the source and the target nodes in a link. [326] further highlights the key aspects of node labeling design, including: (1) target-nodes-distinguishing, where the source and target nodes have distinct labels compared to other nodes; and (2) permutation equivariance. Node labeling methods that meet these criteria, such as double radius node labeling (DRNL) and zero-one (ZO) labeling, can produce the most expressive structural representations. Many other GNNs [312, 313, 278] can achieve similar expressiveness, serving as the potential backbone for GFMs on the link prediction task. The expressiveness representation can find the complete set of distinct relations to differentiate all nonisomorphic node pairs, thereby mitigating the risk of negative transfer in standard GNNs. [109] extends the relational Weisfeiler-Leman framework [15] to link prediction and incorporate the concept of labeling tricks to multirelational graphs. Stability. For those equally expressive structural representations, there may still be a gap in terms of their stability. For example, empirical evidence [326] shows that GNNs with DRNL labeling outperform those with ZO labeling. From the perspective of stability, it is crucial to maintain a bounded gap in predictions for pairs under minor perturbations. [278] provides a theoretical analysis identifying key properties of stable positional encoding (GNNs should be rotation and permutation 65 equivariant to positional encodings) that enhance generalization. The stable positional encoding may be directly applied towards better GFMs. Actionable step inspired by principles. [192] illustrates the network analysis principle concerning the incompatibility between structural proximity and feature proximity. Node pairs with high feature proximity are likely to be with low local structural proximity and vice versa. This incompatibility leads to over-emphasis on node pairs with high structural proximity while neglecting those with high feature proximity. This principle provides actionable insight for the GFM design, suggesting that the graph vocabulary for feature proximity patterns and structural proximity patterns should be modeled separately. Consequently, the model backbone for GFMs in link prediction should separately encode the pairwise structural proximity and the feature proximity. A GNN following the expressiveness principles could include all the important structural information relevant to the link prediction [326]. [59] utilizes in-context learning to effectively transfer expressive GNN representations to new, unseen graphs. Satisfying performance can be found across graphs from biology, transport, web, and social domains. An actionable next step could be to better utilize expressive representations for downstream graphs from specific domains. 6.2.4 Transferability Principles in Graph Classification Network Analysis. Network motifs, typically composed of small and recurrent sub-graphs, are often considered the building blocks of a graph [203, 20]. A proper selection of the motif set can cover most essential knowledge on the specific datasets. 
Graph kernels [273] are proposed to quantify motif counts or other pre-defined graph structural features and then utilize the extracted features to build a classifier such as an SVM. Although the essential motif sets of different domains are generally different, there could exist a uniform set of motifs shared across domains. In such cases, positive transfer can be found on the uniform sets; [17] shows positive transfer across neuronal connectivity networks, food webs, and electronic circuits. Therefore, we conjecture that network motifs could be the base unit of the vocabulary (a set of invariant elements) for graph classification, as they are both explainable and potentially shared across graphs.

Expressiveness. [322] proposes a unified framework to understand the ability of different GNNs to detect and count graph substructures (motifs). A more expressive GNN can detect more diverse motifs and thus construct a richer graph vocabulary. In analogy with the uniform motif sets, we conjecture that expressive GNNs are more likely to find the uniform motif sets and achieve better transferability.

Stability. [110] proposes a provably stable positional encoding that surpasses the expressive sign- and basis-invariant encoding [142] and modeling [158], enabling minimal changes to positional encodings under minor modifications to the Laplacian. The key innovation is to apply a weighted sum of eigenvectors instead of treating each eigen-subspace independently. Satisfying performance can be observed on out-of-distribution molecular graph prediction. Such a stable positional encoding may be directly applied towards better GFMs.

Actionable step inspired by principles. Inspired by the network analysis with graph kernels, one concrete next step towards GFMs could revolve around identifying frequent network motifs [95, 232] that are transferable across all graphs. Expressive GNNs with better network motif modeling capability could be a suitable architecture for GFMs.

Figure 6.2 Unifying different task formulations: (a) Link view: Given the target node, node classification is converted to link prediction between the target node and the corresponding label nodes. (b) Subgraph view: Node classification (orange node) is converted to (green) ego-graph classification. Link prediction (orange nodes) is converted to (green) induced-subgraph classification.

6.2.5 Transferability Principles across Tasks

A unified task formulation is generally employed to facilitate transferability across various tasks. The unified task formulation enables (1) enlarging the dataset size by converting datasets for different downstream tasks into one and (2) utilizing one pre-training model to serve different tasks. The significance of aligning task formulations is evident in the following example: [118] shows that using link prediction directly as a pretext task leads to negative transfer for node classification. However, by reformulating node classification into a link prediction problem [258, 108], where a node's class membership is treated as the link likelihood between the node and label nodes, positive transfer is achieved. [172, 259] further propose a subgraph view to adapt node classification as ego-graph classification, and link prediction as binary classification on the induced subgraph of the target node pair. Figure 6.2 provides illustrative examples of these two unified views.
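As a concrete illustration of the link view, the sketch below converts node classification into link prediction against label nodes; the construction is a simplified, hypothetical version of the idea in [258, 108], and `emb` is assumed to be node embeddings trained on the augmented graph.

import numpy as np

def add_label_nodes(edge_list, train_labels, num_nodes, num_classes):
    # Label node for class c gets index num_nodes + c; each labeled training node is
    # wired to its label node, turning class membership into an edge.
    aug = list(edge_list)
    for node, c in train_labels.items():
        aug.append((node, num_nodes + c))
    return aug

def classify_by_link_score(node, emb, num_nodes, num_classes):
    # Predict the class whose label node has the highest link score with `node`.
    scores = [emb[node] @ emb[num_nodes + c] for c in range(num_classes)]
    return int(np.argmax(scores))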
More recently, [164] unifies node-level, link-level, and graph-level tasks via (1) adding a virtual prompt node and (2) connecting the virtual nodes to nodes of interests, i.e., the center node for node classification, source and target nodes for link prediction, and all nodes for graph classification. A unified formulation provides the possibility for co-training all tasks together while it remains unknown whether this can be done without negative transfer. Moreover, the unified task formulation may not be necessary to achieve transfer across tasks. It is generally utilized for supervised co-training and prompt-based prediction as discussed above. A GFM can be (1) pre-trained with self-supervised tasks and (2) adapted to downstream tasks via fine-tuning without requiring specific task formulations. The success is due to the transferability principles across different tasks. However, there remains limited study in this direction. We list a few existing principles as follows. (1) Node classification and link prediction tasks share the feature homophily as an important principle. (2) [165] indicates that the global structural proximity principle on the link prediction can improve the node classification performance on the non-homophilous graph. (3) The triadic closure in the link prediction is a particular network motif utilized in the graph classification. There are more shared motifs [92, 61, 4, 143] on both graph classification and link prediction tasks. We emphasize the importance of cross-task transferability principles as an important future direction. 6.3 Neural Scaling Law on GFMs The success of the foundation model can be attributed to the validity of the neural scaling law [124] which shows performance enhancement with increasing model scale and data scale. In 68 this section, we first discuss when the neural scaling law happens in Section 6.3.1 from a graph vocabulary perspective. We then discuss techniques towards successful data scaling and model scaling in Section 6.3.2 and 6.3.3, respectively. We finally discuss the potential of leveraging large-scale LM on the graph domain in Section 6.3.4. 6.3.1 When Neural Scaling Law Happens In section 6.2, we discuss the underlying transferable principles guiding future vocabulary construction. This principle guidance has led to the successful scaling behavior in the material science domain [252, 323, 16] with the help of the geometric prior. Nevertheless, we are still cautious about whether the existing success can be extended to the graph domain. The key concern is whether graphs can strictly follow those principles. Uncertainty can be found on the human-defined graph construction criteria [28]. For instance, the construction knowledge relying on expert knowledge may lead to uncertainty in edges [308]. [45, 153] observe that mislabeled samples exist widely in all datasets, where the popular CiteSeer dataset has more than 15% wrongly labeled data. Despite the above uncertainty, different graph constructions with manual design can follow opposite principles. For instance, Ogbn-arxiv [103] and Arxiv-year [157] are two node classification datasets with identical graph information. The only difference lies in the label where Ogbn-arxiv employs paper categories and Arxiv-year uses publication years as labels, resulting in conflicting homophily and heterophily properties [187]. Therefore, when uncertainties and opposite graph constructions exist, the scaling behavior may not happen as the data do not obey the graph transferable principles. 
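The homophily conflict mentioned above can be quantified with the edge-homophily ratio; the toy sketch below (using a hypothetical 4-node cycle rather than the actual arXiv graphs) shows how identical structure yields opposite homophily properties under different label assignments.

import numpy as np

def edge_homophily(edge_index, labels):
    # Fraction of edges whose two endpoints share the same label.
    src, dst = edge_index
    return float(np.mean(labels[src] == labels[dst]))

edge_index = np.array([[0, 1, 2, 3], [1, 2, 3, 0]])            # toy 4-node cycle
print(edge_homophily(edge_index, np.array([0, 0, 1, 1])))      # 0.5 under one labeling
print(edge_homophily(edge_index, np.array([0, 0, 0, 0])))      # 1.0 under another labeling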
6.3.2 Data Scaling Data scaling refers to the phenomenon that the performance consistently improves with the increasing data scale. [40, 108] initially validate that GNNs trained in both supervised and self-supervised manners follow data scaling law on molecular property predictions, and node classification on text-attributed graphs. [31] further exhibits that the similarity between pre-training data and downstream task data serves as a prerequisite for the data scaling on graphs. Specifically, [31] provides concrete guidance on how to select the pre-training data via the graphon signal analysis and the essential network property, i.e. network entropy, respectively. Notably, all principles 69 mentioned in Section 6.2 can be applied to facilitate positive transfer with data scaling phenomena. [263] indicates the possibility to achieve scaling behavior across different domains. A limitation in current research on data scaling is graph data insufficiency, in contrast to the readily available trillion-level real-world data in CV and NLP domains. The key reasons are two-fold: (1) constructing graphs requires expert intervention e.g., defining relationships (2) intellectual property issues. Synthetic graph generation can be utilized to alleviate the data insufficiency issue, enabling more comprehensive training. Traditional graph generative models [7, 233, 6, 146] are capable of generating graphs satisfying some certain statistical properties, which still plays an important role on node-level and link-level tasks. Deep generative models on graph [122, 180, 123, 272, 161] have shown great success in generating high-quality synthetic graphs which help graph-level tasks by providing a more comprehensive description of the graph distributions space. With successful evidence of pre-training on synthetic data from other domains [204, 266], we anticipate the potential benefits of high-quality synthetic graphs. 6.3.3 Model Scaling Model scaling refers to the phenomenon that the performance consistently improves with the increasing model scale. Previous research in NLP indicates that apart from data, the backbone model constitutes a fundamental for scaling [124]. [167] primarily validates the neural scaling law on various graph tasks and model architectures under the supervised setting. However, [129] demonstrates that the GAT [269] with a larger number of parameters under- performs on the graph regression tasks compared to the smaller-sized counterparts. As a comparison, geometric GNNs scale well to predict atomic potentials in material science [252, 323, 16]. Observa- tions indicate that geometric GNNs with a good geometric-prior vocabulary design can help achieve model scaling over the vanilla GNN. Graph transformer is another popular choice for the model architecture, where geometric- prior graph vocabulary design is explicitly modeled through either a GNN encoder or positional encoding [209]. [196, 174] show that graph transformers show positive scaling capabilities for 70 molecular data under a supervised setting. More recently, [334] demonstrates vanilla transformer’s effectiveness in protein and molecular property prediction. Particularly, it views the graph as a sequence of tokens forming an Eulerian path [66], which ensures the lossless serialization, and then adopts next-token prediction to pre-train transformers. 
After fine-tuning, it achieves promising results on the protein association prediction and molecular property prediction and shows that vanilla transformers also follow the model scaling law [124]. Nonetheless, the effectiveness of transformers on other tasks remains unclear. 6.3.4 Leveraging Large-scale LMs for Graphs LLMs with successful scaling behavior have achieved tremendous success in the NLP domain. Surprisingly, well-trained LLMs can be applied to other domains with satisfying performance such as time series forecast [79], recommendation [119], and search engine [347]. Larger-scale LLMs can even capture key symmetries of crystal structures, suggesting that LLMs may possess a strong simplicity bias [218] across domains by implementing Bayesian model averaging algorithm [331]. A recent line of research on GFMs focuses on leveraging the strong capabilities of LLMs on graph tasks [46, 48, 47]. Our discussions can be roughly categorized on LLM applications (i) conventional graph tasks (such as node, edge, and graph classification), and (ii) language-driven tasks like Graph Question Answer (GQA). LLMs on conventional graph tasks. One natural way to utilize LLMs is as textual feature encoders [45]. Despite the original node features not being text, [164] manually converts them into knowledge-enhanced text descriptions and then encodes features into textual embedding. This LLM embedding approach offers the following benefits. (i) High-quality features help to achieve satisfactory performance with vanilla GCN [45]. (ii) LLMs encode diverse original features in an aligned feature space, enabling training and inference across graphs from different domains without the feature heterogeneity problem. Notably, when LLMs are utilized as feature encoders for textural understanding, the scaling law does not happen [19], meaning that a larger model does not necessarily lead to better performance. Another approach is to utilize LLMs as predictors which first fine-tune LLMs and then generate 71 predictions in a natural language form. [45, 90] treats node classification as text classification on the target node feature, illustrating promising results in the zero-shot setting. However, simply flattening graph structures into prompts does not yield additional improvement, leaving a large performance gap compared to well-trained GNNs [45]. To better encode the graph structure knowledge, methods such as GNNs [261], graph transformers [33], and non-parametric aggregations [43] are utilized as structure encoders. The encoded structural embeddings are then linearly mapped into text space as prompt tokens. LLMs generate predictions based on a concatenation of the prompt token and the textual instruction. Instead of additional graph modeling, [333] employs a novel tree-based prompt design that transforms the graph into sequence while retaining important structural semantics. This approach indicates the potential for LLMs to understand particular graph structures. Overall, proper LLM fine-tuning can achieve satisfying graph performance while efficiency may be a potential issue. LLMs on language-driven graph tasks. Instead of adapting LLMs for conventional graph tasks, LLMs can also be applied to language-driven tasks they are originally skilled in, for example, Graph Question Answer (GQA). [69, 280] apply LLMs on various GQA tasks, e.g., cycle check, and maximum flow, by describing graph structure with natural language. 
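For intuition, graph-to-text prompting in these GQA studies can be as simple as the toy sketch below; the exact templates used in [69, 280] differ, and the function here is purely illustrative.

def graph_to_prompt(edges, question="Does this graph contain a cycle?"):
    # Serialize an edge list into a natural-language description plus a question.
    desc = ", ".join(f"node {u} is connected to node {v}" for u, v in edges)
    return f"In an undirected graph, {desc}. {question} Answer yes or no."

prompt = graph_to_prompt([(0, 1), (1, 2), (2, 0)])   # a triangle, so the expected answer is "yes"
# `prompt` can then be sent to any LLM chat interface.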
More recently, [222] incorporates an external GNN tokenizer to encode graph information, achieving satisfying out-of-domain generalization to unseen graph tasks. Interestingly, [222] illustrates that equivariance is not necessary when equipped with LLMs. [91] proposes new, challenging real-world GQA tasks and a corresponding LLM-based conversational framework. This framework integrates GNNs and retrieval-augmented generation (RAG) to improve graph understanding and mitigate issues like hallucination, demonstrating effectiveness across multiple domains. Until now, most GQA challenges have focused on abstract graphs without concrete descriptions for each node, creating an obstacle to leveraging the extensive internal knowledge in LLMs. We call for more real-world GQA challenges enabling better leverage of LLM capabilities.

Overall, a consensus can be found that LLMs can serve as better textual feature encoders with superior performance. Nonetheless, it remains unclear whether LLMs should play a key role in building GFMs, given concerns about the LLM's capability to understand essential graph structures. [244, 64] theoretically observe that LLMs tend to tackle such problems sequentially and greedily [200], leading to shortcut solutions rather than a formal analysis of the graph structure.

6.4 Insights & Open Questions

In this section, we explore key insights gained from recent advancements in GFMs and highlight open questions that remain to be addressed in this evolving field.

6.4.1 Potential Redundancy on Pretext Task and Architecture Design

There are mainly two approaches to achieving transferability: (1) designing GNNs with specific geometric properties for transfer, e.g., ULTRA [72], and (2) creating pretext tasks to automatically learn these properties. [118] suggests an overlap between these approaches, indicating that pretext tasks targeting local structural information might be unnecessary, given that GNNs often inherently encode this information. Investigating the strengths and limitations of these techniques, along with providing practical guidance for their selection, could be a valuable research direction. A hypothesis might be that model design methods are more suitable for data that strictly adheres to geometric priors, while pretext task designs are more effective in the opposite scenario.

6.4.2 The Feasibility of GFMs

Graphs can be defined in different ways based on different criteria, such as similarity or influence between node pairs [28]. We can then categorize graphs based on the observability of these criteria. An observable graph is unambiguously known, e.g., whether one paper cites another paper in a citation graph. Text and images can also be viewed as specific cases of observable graphs. In contrast, unobservable graphs are manually constructed with ambiguous descriptions of the relationship, e.g., whether one gene regulates the expression of another in a gene-regulatory graph. These graphs may not naturally exist in the world, leading to uncertainty and a lack of invariant principles. It remains unknown whether GFMs can learn shared knowledge while avoiding manually introduced noisy patterns. There are concerns about the benefit of training a GFM on graphs that are neither from the same domain nor share the same downstream task. On the one hand, it seems that training on them simultaneously shows no positive transfer benefit while increasing the risk of negative transfer.
On the other hand, there may be potential undiscovered transferable patterns that could lead to success. Therefore, we pose an open question of whether there exists a universal structural representation space that can benefit all the graph tasks. 6.4.3 Broader Usage of GFM In this paper, we majorly focus on building GFMs for conventional graph-focused tasks. Notably, graph formulation provides the universal representation ability, which has a broader usage in other domains, e.g., scene graphs for Computer Vision (CV) [319, 338], bipartite graphs for linear programming [49], and physical graphs for understanding physical mechanisms [250]. To emphasize more broader usage of GFMs, we illustrate the potential advantaged usage of GFMs over existing foundation models in reasoning, computer vision, and code intelligence domains domains. Details can be found as follows. Reasoning. [111] proposes a task-specific GFM, focusing on neural algorithmic reasoning tasks. A strong reasoning capability can be found with effectiveness across sorting, searching, and dynamic programming tasks. We argue that this GFM following the theory of algorithmic alignment [296] may achieve better reasoning capability than the LLM merely relying on the textual inputs via retrieving concepts co-occur frequently in training data [225]. Computer Vision. Scene graph is a data structure representing objects, their attributes, and the relationships between them within an image, facilitating CV tasks such as image understanding and visual reasoning. However, current research remains a naive scene graph modeling with vanilla GNNs with more emphasis on image modeling. We argue that the GFM on the scene graph may help to preserve global and local scene-object relationships [320], avoiding the potential conflict or redundancy between multiple objectives which frequently appears on the recent popular Sora model [27]. Code Intelligence. Graphs, e.g., code property graph [169], control flow graph, and program dependency graph, play an important role in code-relevant tasks, e.g., vulnerability detection [169], fault localization [230], and code search [160]. Compared to sequence-based modeling with LLMs, graphs can provide a complementary perspective on the overlooked essential code attributes such 74 as syntax, control flow, and data dependencies. However, the graph modeling remains naive with unknown transferability across different program languages. Overall, GFMs demonstrate unique value compared to foundation models in other domains. However, they are limited to applications involving graph structure data. An exciting future topic is how to adaptively combine GFMs with other foundation models across different modalities toward a powerful Artificial General Intelligence (AGI). 6.5 Conclusion From the transferability principles of graphs, we review existing GFMs and ground their effectiveness from a vocabulary view to find a set of basic transferrable units across graphs and tasks. Our key perspectives can be summarized as follows: (1) Constructing a universal GFM is challenging, but domain/task-specific GFMs are approachable with the usual availability of a specific vocabulary. (2) One challenge is developing GFMs following the neural scaling law, which requires more data collection, suitable architecture design, and properly leveraging LLMs. This paper summarizes the current position of GFMs and challenges toward the next step, which may be a blueprint for GFMs to inspire relevant research. 
75 CHAPTER 7 CROSS-DOMAIN GRAPH DATA SCALING: A SHOWCASE WITH DIFFUSION MODELS 7.1 Introduction The effectiveness of existing foundation models [229, 264, 137] heavily relies on the availability of substantial amounts of data, where the relationship manifests as a scaling behavior between model performance and data scale [125]. Consistent performance gain has been observed with the increasing scale of pre-training data in both Natural Language Processing [125, 97] and Computer Vision [2, 321] domains. This data scaling phenomenon facilitates the development of general models endowed with extensive knowledge and effective data pattern recognition capabilities. In downstream applications, these models are capable of adaptively achieving performance gains across different tasks. In the context of graphs, the availability of large-scale graph databases [236] enables possible data scaling across datasets and domains. Existing works have demonstrated graph data scaling following two limited settings: in-domain pre-training [293, 173] and task-specific selection for pre-training data [32]. During the pre-training process, each graph in the pre-training pool must be validated as in-domain or relevant to the downstream dataset. Given a specific domain or task, the crucial discriminative data patterns are likely confined to a fixed set [193], leaving other potential patterns in diverse graph data distribution as noisy input. In terms of structure, graphs from different domains are particularly composed of varied patterns [203], making it hard to transfer across domains [182]. For example, considering the building blocks of the graphs, the motifs shared by the World Wide Web hyperlinks only partially align with those shared by genetic networks [203]. Therefore, closely aligning the characteristics of the pre-training graphs and the downstream data both in feature and structure is essential for facilitating positive transfer [32]. As a consequence, the necessity of such meticulous data filtering restricts these methods from scaling up graphs effectively, as they can only utilize a small part of the available data. Given the limitation of the graph pre-training methods, a pertinent question emerges: How can we effectively leverage the increasing scale of graph data across domains? 76 Rather than focusing solely on data patterns specific to particular domains, we aim to develop a model that has a comprehensive understanding of data patterns inherent across various types of graphs. In line with the principles of data scaling, we hypothesize that incorporating a broader range of training datasets can help the model build an effective and universal graph pattern library, avoiding an overemphasis on major data patterns specific to any single dataset [188]. To construct such a general-purpose model, we propose to utilize a diffusion model operating only on the structure as the backbone, for the following key reasons. (1) Unlike features, graph structures follow a uniform construction principle, namely, the connections between nodes. This allows for positive transfer across domains when the upstream and downstream data exhibit similar topological patterns [32]. In particular, while the graph representations of neurons and forward electronic circuits are derived from distinct domains, they still share common motifs [203]. (2) Current supervised and self-supervised methods tend to capture only specific patterns of graph data, with models designed for particular inductive biases [188, 193, 298]. 
For instance, graph convolutional networks (GCNs) excel in node-level representation learning by emphasizing homophily, whereas graph-level representation learning benefits from expressive GNNs capable of distinguishing complex graph structures. (3) We opt for a structure-only model due to the heterogeneous feature spaces across graphs, which often include missing features or mismatched semantics [189]. For instance, node features yield completely different interpretations in citation networks, where they represent keywords of documents, compared to molecular networks, where they denote properties of atoms. To this end, we pre-train a structure-only diffusion model on thousands of graphs, which serves as the upstream component of our framework. In the downstream stage, we employ the pre-trained diffusion model as a Universal graph structure Augmentor (UniAug) to enhance the dataset, where diffusion guidance [94, 58, 80] is employed to align the generated structure with the downstream requirements. Specifically, we generate synthetic structures with various guidance objectives, and the resulting graphs consist of generated structures and original node features. This data augmentation paradigm strategically circumvents feature heterogeneity and fully utilizes downstream inductive biases by applying carefully designed 77 downstream models to the augmented graphs in a plug-and-play manner. Empirically, we apply UniAug to graphs from diverse domains and consistently observe performance improvement in node classification, link prediction, and graph property prediction. To the best of our knowledge, this study represents the first demonstration of a data-scaling graph structure augmentor on graphs across domains. 7.2 Preliminary and Related Work Learning from unlabeled graphs Graph self-supervised learning (SSL) methods provide examples of pre-training and fine-tuning paradigm [104, 99, 128, 314, 301]. However, these methods benefit from limited data scaling due to feature heterogeneity, structural pattern differences across domains, and varying downstream inductive biases. It is worth mentioning that DCT [162] presents a pre-training and then data augmentation pipeline on molecules. Despite its impressive performance improvement on graph-level tasks, DCT is bounded with molecules and thus the use cases are limited. Graph data augmentation There have been many published works exploring graph data augmentation (GDA) since the introduction of graph neural networks (GNNs), with a focus on node-level [219, 170, 10], link-level [335, 213], and graph-level [85, 159, 179, 163, 139]. These GDA methods have been generally designed for specific tasks or particular aspects of graph data. In addition, they are often tailored for a single dataset and struggle to transfer to unseen patterns, which limits their generalizability to a broader class of applications. Diffusion models on graphs Diffusion models [93, 254, 234] are latent variable models that learn data distribution by gradually adding noise into the data and then recovering the clean input. Existing diffusion models on graphs can be classified into two main categories depending on the type of noise injected, i.e. Gaussian or discrete. Previous works employed Gaussian diffusion models both on general graphs [215, 123] and molecules [249, 302]. However, adding Gaussian noise into the adjacency matrix will destroy the sparsity of the graph, which hinders the scalability of the diffusion models [82]. 
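The sparsity argument can be illustrated numerically: the toy sketch below corrupts a sparse adjacency matrix once with Gaussian noise and once with a Bernoulli flip toward a small converging probability (the discrete kernel formalized in Eq. (7.1) below); the graph size, density, and noise levels are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
A0 = (rng.random((n, n)) < 0.01).astype(float)     # sparse toy adjacency, ~1% density

# Gaussian corruption (continuous diffusion): essentially every entry becomes non-zero.
A_gauss = 0.9 * A0 + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=(n, n))
print((A_gauss != 0).mean())                        # ~1.0, sparsity is destroyed

# Bernoulli corruption (discrete diffusion): flip entries toward a small probability pi.
alpha_t, pi = 0.9, 0.01
prob_one = alpha_t * A0 + (1 - alpha_t) * pi
A_disc = (rng.random((n, n)) < prob_one).astype(float)
print(A_disc.mean())                                # stays close to the original ~1% density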
Recent works adapted discrete diffusion models to graphs with categorical transition kernels [272, 44, 41].

Figure 7.1 The pipeline of UniAug. We pre-train a diffusion model across domains and perform structure augmentation on the downstream graphs. The augmented graphs consist of generated structures and original node features and are then processed by a downstream GNN.

We denote the adjacency matrix of a graph with $n$ nodes as $\mathbf{A}^0 \in \{0, 1\}^{n \times n}$. We write the forward process that corrupts the adjacency matrix into a sequence of latent variables as a Bernoulli distribution
$$q\big(\mathbf{A}^t \mid \mathbf{A}^{t-1}\big) = \mathrm{Bernoulli}\big(\mathbf{A}^t;\ \alpha_t \mathbf{A}^{t-1} + (1 - \alpha_t)\,\pi\big), \qquad q\big(\mathbf{A}^{t-1} \mid \mathbf{A}^t, \mathbf{A}^0\big) = \frac{q\big(\mathbf{A}^t \mid \mathbf{A}^{t-1}\big)\, q\big(\mathbf{A}^{t-1} \mid \mathbf{A}^0\big)}{q\big(\mathbf{A}^t \mid \mathbf{A}^0\big)}, \tag{7.1}$$
where $\pi$ is the converging non-zero probability, $\alpha_t$ is the noise scale, and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. Under the predict-$\mathbf{A}^0$ parameterization, the reverse process denoises the adjacency matrix with a Markov chain
$$p_\theta\big(\mathbf{A}^{t-1} \mid \mathbf{A}^t\big) \;\propto\; \sum_{\widetilde{\mathbf{A}}^0} q\big(\mathbf{A}^{t-1} \mid \mathbf{A}^t, \widetilde{\mathbf{A}}^0\big)\, \tilde{p}_\theta\big(\widetilde{\mathbf{A}}^0 \mid \mathbf{A}^t\big), \tag{7.2}$$
where $\tilde{p}_\theta(\widetilde{\mathbf{A}}^0 \mid \mathbf{A}^t)$ represents the denoising network that predicts the original adjacency matrix from the noisy adjacency matrix. The parameters are estimated by optimizing the variational lower bound on the negative log-likelihood [9]
$$L_{\mathrm{vb}} = \sum_{t=2}^{T} \mathbb{E}_{q(\mathbf{A}^t \mid \mathbf{A}^0)}\Big[ D_{\mathrm{KL}}\big( q(\mathbf{A}^{t-1} \mid \mathbf{A}^t, \mathbf{A}^0) \,\|\, p_\theta(\mathbf{A}^{t-1} \mid \mathbf{A}^t) \big) \Big] - \mathbb{E}_{q(\mathbf{A}^1 \mid \mathbf{A}^0)}\Big[ \log p_\theta\big(\mathbf{A}^0 \mid \mathbf{A}^1\big) \Big] + \mathbb{E}_{q(\mathbf{A}^0)}\Big[ D_{\mathrm{KL}}\big( q(\mathbf{A}^T \mid \mathbf{A}^0) \,\|\, p(\mathbf{A}^T) \big) \Big]. \tag{7.3}$$

7.2.1 Method

In this section, our goal is to build UniAug to understand the diverse structure patterns of graphs and perform data augmentation with a range of objectives. As illustrated in Fig. 7.1, UniAug consists of two main components: a pre-trained diffusion model and the downstream adaptation through structure augmentation. We first collect thousands of graphs from varied domains with diverse patterns. To construct a general model free of downstream inductive biases, we train a self-conditioned discrete diffusion model on graph structures. In the downstream stage, we train an MLP guidance head on top of the diffusion model with objectives across different levels of granularity. We then augment the downstream dataset by generating synthetic structures through guided generation, where the augmented graph is composed of generated structures and original node features. Subsequently, we apply the augmented data to train a task-specific model for performing downstream tasks. Below, we elaborate on the data collection process, the architecture of the discrete diffusion model, and the guidance objectives employed.

Figure 7.2 Normalized structural properties.

7.2.2 Pre-training data collection

In light of the data scaling spirit, we expect our pre-training data to contain diverse data patterns with sufficient volume. As graphs from different domains exhibit different patterns [203], we wish to build a collection of graphs from numerous domains to enable a universal graph pattern library with pre-training.
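To make the discrete corruption process in (7.1) concrete, the following is a minimal sketch of the Bernoulli forward kernel and its posterior for a dense {0, 1} adjacency matrix. It is an illustrative reading of the equations above under our own variable names, not the UniAug implementation.

```python
import numpy as np

def forward_corrupt(A0: np.ndarray, alpha_bar_t: float, pi: float,
                    rng=np.random.default_rng()):
    """Sample A_t ~ q(A_t | A_0) = Bernoulli(alpha_bar_t * A_0 + (1 - alpha_bar_t) * pi).

    Marginalizing the per-step kernel in (7.1) gives this closed form, so a
    training step can jump directly to any timestep t.
    """
    prob_edge = alpha_bar_t * A0 + (1.0 - alpha_bar_t) * pi   # element-wise Bernoulli parameter
    At = (rng.random(A0.shape) < prob_edge).astype(np.int8)
    At = np.triu(At, k=1)                                     # keep the upper triangle only
    return At + At.T                                          # symmetrize: undirected graph

def posterior_prob(At, A0, alpha_t, alpha_bar_tm1, pi):
    """q(A_{t-1} = 1 | A_t, A_0) from (7.1) via Bayes' rule, computed element-wise."""
    # q(A_t | A_{t-1} = 1) and q(A_t | A_{t-1} = 0), each a Bernoulli in A_t
    p_at_given_1 = np.where(At == 1, alpha_t + (1 - alpha_t) * pi,
                            1 - (alpha_t + (1 - alpha_t) * pi))
    p_at_given_0 = np.where(At == 1, (1 - alpha_t) * pi, 1 - (1 - alpha_t) * pi)
    # q(A_{t-1} = 1 | A_0) using the marginal \bar{alpha}_{t-1}
    p_prev_1 = alpha_bar_tm1 * A0 + (1 - alpha_bar_tm1) * pi
    num = p_at_given_1 * p_prev_1
    den = num + p_at_given_0 * (1 - p_prev_1)
    return num / np.clip(den, 1e-12, None)
```

Because the posterior factorizes over entries, the sparsity-preserving property discussed above follows directly: entries that are zero in both $\mathbf{A}^0$ and $\mathbf{A}^t$ stay overwhelmingly likely to remain zero.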
Within the publicly available graph databases, the Network Repository [236] provides a comprehensive collection of graphs with varied scales from different domains, such as biological networks, chemical networks, social networks, and many more. Among the thousands of graphs in the Network Repository, some contain irregular patterns, such as multiple levels of edges or extremely high density. To ensure the quality of the graphs, we analyze the graph properties following Xu et al. [295] and filter out the outliers. In addition, we observe that the coverage of graphs in the Network Repository is incomplete according to the network entropy and scale-free exponent, as we observe a relatively scattered space in the middle of Fig. 7.2. To fill in the gap, we include a 1,000-graph subset of the GitHub Star dataset [240]. The selected graphs are utilized to train a discrete diffusion model.

7.2.3 Pre-training through diffusion model

Diffusion models have demonstrated the ability to facilitate transferability from a data augmentation perspective on images [265, 315, 88]. Unlike traditional hand-crafted data augmentation methods, diffusion models are capable of producing more diverse patterns with high quality [265]. With the aid of diffusion guidance [94, 58], these methods can achieve domain customization tailored to specific semantic spaces [315, 88]. Despite the success of data augmentation through diffusion models on images, the non-Euclidean nature of graph structures poses challenges for data-centric learning on graphs. In addition, the fact that most graphs in the Network Repository are unlabeled exacerbates the challenges, as the absence of labeled data results in substantially lower generation quality for diffusion models [58, 12].

To address the aforementioned challenges, we propose to construct a self-conditioned discrete diffusion model on graph structures. Unlike Gaussian-based diffusion models, discrete diffusion models [98, 9, 30, 272] operate with discrete transition kernels between latent variables, as shown in Section 7.2. The key reason we opt for discrete diffusion models lies in the sparse nature of graphs, where adding Gaussian noise to the adjacency matrix results in a dense graph [82]. In contrast, discrete diffusion models effectively preserve the sparse structure of graphs during the diffusion process, thus maintaining the efficiency of the models on graphs. To accommodate unlabeled graphs, we adopt a self-supervised labeling strategy as an auxiliary conditioning procedure [76, 100]. By leveraging the self-labeling technique, we are able to upscale the diffusion model to data with more diverse patterns [76]. The self-labeling technique requires two components, i.e., a feature extractor and a self-supervised annotator.

Feature extractor. We extract graph-level features by calculating graph properties, including the number of nodes, density, network entropy, average degree, degree variance, and scale-free exponent, following Xu et al. [295]. The first two represent the scale of the graph in terms of nodes and edges, and the rest indicate the amount of information contained within a graph [295]. We compute these properties for each graph and concatenate them to obtain a graph-level representation.

Self-supervised annotator. To assign labels to graphs in a self-supervised manner, we employ clustering algorithms on the graph-level representations. The number of clusters is determined jointly by the silhouette score [238] and the separation of the graphs: the candidate numbers of clusters are chosen to ensure that different clusters are well separated, and among the candidates we select the final number of clusters by maximizing the mean silhouette coefficient over all samples.
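As a concrete illustration of the feature extractor and the self-supervised annotator, the sketch below computes simple graph-level statistics and selects the number of clusters by the silhouette score. The exact property definitions in the dissertation follow Xu et al. [295]; here, network entropy and the scale-free exponent are approximated by a degree-distribution entropy and a log-log regression slope, and the use of k-means is our assumption, since the text only specifies "clustering algorithms".

```python
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def graph_properties(G: nx.Graph) -> np.ndarray:
    """Concatenate simple graph-level statistics into one feature vector."""
    degrees = np.array([d for _, d in G.degree()], dtype=float)
    n = G.number_of_nodes()
    density = nx.density(G)
    avg_deg, var_deg = degrees.mean(), degrees.var()
    # degree-distribution entropy as a stand-in for network entropy
    probs = np.bincount(degrees.astype(int)) / n
    probs = probs[probs > 0]
    entropy = -(probs * np.log(probs)).sum()
    # crude power-law exponent estimate from the degree histogram tail
    ks = np.arange(1, int(degrees.max()) + 1)
    counts = np.array([(degrees == k).sum() for k in ks], dtype=float)
    mask = counts > 0
    slope, _ = np.polyfit(np.log(ks[mask]), np.log(counts[mask]), deg=1)
    return np.array([n, density, entropy, avg_deg, var_deg, -slope])

def self_label(graphs, candidate_ks=(4, 6, 8, 10), seed=0):
    """Assign pseudo-labels by clustering graph-level features; pick the number
    of clusters with the highest mean silhouette coefficient."""
    feats = np.stack([graph_properties(G) for G in graphs])
    feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
    best_k, best_score, best_labels = None, -1.0, None
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(feats)
        score = silhouette_score(feats, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```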
Next, we detail the parameterization of the denoising model $\tilde{p}_\theta(\widetilde{\mathbf{A}}^0 \mid \mathbf{A}^t)$ with the self-assigned graph-level labels $\mathbf{k}$. The denoising model recovers the edges of the original adjacency matrix by predicting the connectivity of the upper triangle, which can be formulated as a link prediction problem [325, 144]. Following the link prediction setup, the denoising model is composed of a graph transformer (GT) [251] and an MLP link predictor. Denoting the hidden dimension as $d$, we treat the node degrees as node features and utilize a linear mapping $f_d: \mathbb{R} \mapsto \mathbb{R}^d$ to match the dimension. Similarly, we utilize another linear mapping $f_t: \mathbb{R} \mapsto \mathbb{R}^d$ for the timestep $t$ and learnable embeddings $f_k: \{0, \ldots, K\} \mapsto \mathbb{R}^d$ for the labels $\mathbf{k}$, where $K$ is the number of clusters. The outputs are summed together and then fed into the GT. Mathematically, we have
$$\mathbf{h}^t = \mathrm{GT}\big( f_d(\mathrm{degree}(\mathbf{A}^t)) + f_t(t) + f_k(\mathbf{k}),\ \mathbf{A}^t \big), \qquad \tilde{p}_\theta\big(\widetilde{\mathbf{A}}^0_{ij} \mid \mathbf{A}^t; t, \mathbf{k}\big) \coloneqq \tilde{p}_\theta\big(\widetilde{\mathbf{A}}^0_{ij} \mid \mathbf{h}^t\big) = \mathrm{MLP}\big([\mathbf{h}^t_i, \mathbf{h}^t_j]\big). \tag{7.4}$$
With the above denoising network, our diffusion model is trained on the collected graphs by optimizing the variational lower bound in (7.3). After the pre-training process, we perform adaptive downstream enhancement through graph structure augmentation.

7.2.4 Downstream adaptation through data augmentation

The downstream phase of UniAug is to augment the graph topology through guided generation. This guidance process serves to provide downstream semantics for the diffusion model, thus bridging the gap between the pre-training distribution and the downstream datasets. Among the techniques for diffusion guidance, gradient-based methods [58, 80] offer versatile approaches by incorporating external conditions that are not present during training. For the discrete diffusion process, we opt for the gradient-based NOS method [80] due to its flexibility and efficiency. Specifically, we build an MLP regression head $g_\theta: \mathbb{R}^d \mapsto \mathbb{R}^r$ that takes the hidden representations $\mathbf{h}^t$ as the input and outputs the guidance objective of dimension $r$. Denoting $\tau$ as the temperature, $\gamma$ as the step size, $\lambda$ as the regularization strength, and $\varepsilon$ drawn from $\mathcal{N}(0, I)$, we sample from $\tilde{p}'(\widetilde{\mathbf{A}}^0 \mid \mathbf{h}^t) \propto \tilde{p}_\theta(\widetilde{\mathbf{A}}^0 \mid \mathbf{h}^t)\, \exp\big(g_\theta(\mathbf{h}^t)\big)$ via Langevin dynamics
$$\mathbf{h}^{t,\prime} \leftarrow \mathbf{h}^{t,\prime} - \gamma\, \nabla_{\mathbf{h}^{t,\prime}} \Big[ \lambda\, \mathrm{KL}\Big( \tilde{p}'\big(\widetilde{\mathbf{A}}^0 \mid \mathbf{h}^{t,\prime}\big) \,\Big\|\, \tilde{p}'\big(\widetilde{\mathbf{A}}^0 \mid \mathbf{h}^{t}\big) \Big) - g_\theta\big(\mathbf{h}^{t,\prime}\big) \Big] + \sqrt{2\gamma\tau}\,\varepsilon. \tag{7.5}$$
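A minimal PyTorch-style sketch of the guided update in (7.5) is given below. The modules `edge_decoder` (the MLP link predictor producing edge logits, as in (7.4)) and `guidance_head` (the regression head $g_\theta$) are assumed stand-ins rather than names from the UniAug codebase, and the KL term treats edges as independent Bernoulli variables, which is our simplifying assumption.

```python
import torch
import torch.nn.functional as F

def guided_hidden_update(h_t, guidance_head, edge_decoder, steps=5,
                         gamma=0.1, tau=0.01, lam=1.0):
    """Langevin-style refinement of the hidden states h^t, following (7.5).

    `edge_decoder(h)` returns per-edge logits for p~(A^0_ij | h) and
    `guidance_head(h)` returns the guidance objective to be maximized; both
    are assumed to be differentiable torch modules.
    """
    h_prime = h_t.detach().clone()
    base_logits = edge_decoder(h_t).detach()          # p~'(A^0 | h^t), kept fixed
    for _ in range(steps):
        h_prime.requires_grad_(True)
        new_logits = edge_decoder(h_prime)            # p~'(A^0 | h^{t,'})
        # KL between the two Bernoulli edge distributions, plus the guidance reward
        kl = (torch.sigmoid(new_logits) *
              (F.logsigmoid(new_logits) - F.logsigmoid(base_logits)) +
              torch.sigmoid(-new_logits) *
              (F.logsigmoid(-new_logits) - F.logsigmoid(-base_logits))).sum()
        energy = lam * kl - guidance_head(h_prime).sum()
        grad, = torch.autograd.grad(energy, h_prime)
        with torch.no_grad():
            noise = torch.randn_like(h_prime)
            h_prime = h_prime - gamma * grad + (2 * gamma * tau) ** 0.5 * noise
    return h_prime.detach()
```

The KL regularizer keeps the refined hidden states close to the unguided prediction, so the guidance steers generation without collapsing the diffusion prior.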
One key question to answer is how to choose the proper guidance objectives. Our goal is to find numerical characteristics that can best describe the structural properties of a graph. This includes supervision signals and self-supervised information at the node, edge, and graph level.

Node level. Node labels provide the supervision signal for node classification tasks. Beyond node labels, node degrees are a fundamental factor in the evolutionary process of a graph [171]. From the perspective of network analysis, centrality measures indicate the importance of nodes from various viewpoints [24]. Empirically, we observe that utilizing different node-level heuristics as guidance targets tends to yield similar outcomes. Therefore, we focus on node labels and node degrees.

Edge level. Edge-level heuristics can be broadly classified into two categories: local structural heuristics, such as Common Neighbor and Adamic Adar [5], and global structural heuristics, such as Katz [126] and SimRank [113]. Similar to node-level heuristics, empirical observations suggest that different edge-level heuristics tend to yield comparable guidance effects. In this work, we focus on the Common Neighbors (CN) heuristic due to its efficiency. Another edge-level guidance objective is to recover the adjacency matrix from the node representations in a link prediction way, similar to how we parameterize the denoising network. We anticipate that such a link prediction objective helps to align the generated graph with the downstream data at the granularity of edges.

Graph level. Graph labels offer the supervision signal for graph classes or regression targets. In addition, we incorporate graph-level properties [295] as quantitative measures to bridge the gap between the pre-training distribution and the downstream dataset. We empirically observe that graph label guidance offers significantly higher performance boosts than property guidance on graph-level tasks. Therefore, we focus on graph labels in our experiments.

We note that all the above objectives are natural choices inspired by heuristics and downstream tasks. There exist many other self-supervised objectives to be explored, such as community-level spectral change [260] and motif occurrence prediction [235]. We leave the study of these objectives as future work. With the diffusion guidance, we assemble the augmented graphs with generated structures and original node features. The augmented graphs are then fed into downstream-specific GNNs.
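To illustrate how these targets can be materialized on a downstream graph before fitting the guidance head, here is a small sketch using NetworkX. It covers only the node-degree, node-label, and Common Neighbors targets named above; the helper name and the returned dictionary layout are ours.

```python
import numpy as np
import networkx as nx

def guidance_targets(G: nx.Graph, node_labels=None):
    """Collect heuristic targets for the guidance head on a downstream graph:
    node degrees, per-edge Common Neighbor counts, and (optionally) node labels."""
    nodes = list(G.nodes())
    index = {u: i for i, u in enumerate(nodes)}
    degrees = np.array([G.degree(u) for u in nodes], dtype=float)

    # Common Neighbors for the observed edges (edge-level guidance target)
    cn = {}
    for u, v in G.edges():
        cn[(index[u], index[v])] = len(list(nx.common_neighbors(G, u, v)))

    targets = {"degree": degrees, "common_neighbors": cn}
    if node_labels is not None:
        targets["node_label"] = np.asarray([node_labels[u] for u in nodes])
    return targets
```

The guidance head is then regressed onto whichever of these targets matches the downstream task, as described above.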
7.2.5 Comparison to existing methods

The data augmentation paradigm of UniAug allows us to disentangle the upstream and the downstream. We construct a diffusion model as the upstream component to comprehend the structural patterns of graphs across various domains. In addition, we leverage downstream inductive biases with downstream-specific models in a plug-and-play manner. This allows UniAug to facilitate cross-domain transfer, offering a unified method that benefits graphs across different domains for various downstream tasks. On the contrary, existing GDA methods are typically designed for specific tasks and hard to transfer to unseen patterns. In the meantime, existing pre-training methods fail to transfer across domains due to heterogeneity in features and structures. This comparison highlights the success of UniAug as a data-scaling graph structure augmentor across domains. We summarize the comparison between methods in Table 7.1.

Table 7.1 Comparison between GDA methods, pre-training methods, and UniAug. By cross-domain transfer, we emphasize the ability of the method to train on vastly different domains and benefit all of them. GDA methods: GraphAug, CFLP, Half-Hop, FLAG; pre-training methods: AttrMask, D-SLA, GraphMAE.

                                GraphAug  CFLP  Half-Hop  FLAG  AttrMask  D-SLA  GraphMAE  UniAug
Effective on graph-level task       ✓       –      –       ✓       ✓        ✓        ✓       ✓
Effective on edge-level task        –       ✓      –       ✓       –        ✓        –       ✓
Effective on node-level task        –       –      ✓       ✓       –        –        ✓       ✓
In-domain transfer                  –       –      –       –       ✓        ✓        ✓       ✓
Cross-domain transfer               –       –      –       –       –        –        –       ✓

7.3 Results

In this section, we conduct experiments to validate the effectiveness of UniAug. We first pre-train our discrete diffusion model on thousands of graphs collected from diverse domains. For each downstream task, we train an MLP guidance head with the corresponding objectives on top of the diffusion model. We then perform structure augmentation using UniAug and subsequently train a task-specific GNN on the augmented data for prediction. Through the experiments, we aim to answer the following research questions:
• RQ1: Can UniAug benefit graphs from various domains across different downstream tasks?
• RQ2: What is the scaling behavior of UniAug corresponding to data scale and amount of compute?
• RQ3: Which components of UniAug are effective in preventing negative transfer?

7.3.1 Main results

To get a comprehensive understanding of UniAug, we evaluate it on 25 downstream datasets from 7 domains for graph property prediction, link prediction, and node classification.

Baselines. We evaluate our model against three main groups of baselines. (1) Task-specific GNNs: For graph property prediction, we use GIN [298]; for link prediction, we use GCN [136] and NCN [284]; and for node classification, we use GCN [136]. (2) Graph pre-training methods: These include AttrMask, CtxtPred, EdgePred, and InfoMax [104], JOAO [314], D-SLA [128], and GraphMAE [99]. For each of these methods, we pre-train it on the same pre-training set as UniAug. While most of the pre-training graphs lack node features, we calculate the node degrees as the input. Each method consists of three pre-trained variants with different backbone GNNs, including GIN, GCN, and GAT. We note that all these methods require the downstream graphs to have the same node feature space as the pre-training data. Therefore, in the fine-tuning stage, we replace the node features of the downstream datasets with node degrees, evaluate all three variants, and report the highest performance for each method in each task. (3) Graph data augmentation (GDA) methods: For graph property prediction, we include S-Mixup [159], GraphAug [179], FLAG [139],
GREA [163], and DCT [162]; for link prediction, we include CFLP [335]; and for node classification on heterophilic graphs, we include Half-Hop [10]. The GDA methods are implemented based on the chosen task-specific GNNs.

Table 7.2 Mean and standard deviation of accuracy (%) with 10-fold cross-validation on graph classification. The best result is bold. The last column is the average rank of each method.

Method      DD            Enzymes       Proteins      NCI1          IMDB-B        IMDB-M        Reddit-B      Reddit-12K    Collab        A.R.
GIN         75.81 ± 6.11  66.00 ± 7.52  73.32 ± 4.03  78.30 ± 3.20  71.10 ± 2.90  49.07 ± 2.81  90.85 ± 1.30  48.63 ± 1.62  74.54 ± 2.41  5.56
AttrMask    72.93 ± 3.09  23.66 ± 6.09  73.10 ± 3.90  77.67 ± 2.53  71.20 ± 2.40  48.00 ± 3.14  87.50 ± 3.31  48.00 ± 1.60  75.64 ± 1.52  8.00
CtxtPred    75.14 ± 2.67  21.67 ± 3.87  72.21 ± 4.60  78.99 ± 1.29  70.70 ± 1.55  48.20 ± 2.23  90.35 ± 2.31  47.62 ± 2.50  75.60 ± 1.49  7.67
EdgePred    75.64 ± 2.77  22.00 ± 3.32  71.22 ± 3.53  77.82 ± 2.95  70.20 ± 2.23  47.80 ± 2.42  90.80 ± 1.69  48.35 ± 1.44  74.64 ± 2.24  8.56
InfoMax     75.23 ± 3.43  22.50 ± 6.76  71.30 ± 5.18  76.94 ± 1.48  71.60 ± 2.06  46.70 ± 2.46  89.15 ± 2.84  48.98 ± 1.83  75.44 ± 1.12  8.00
JOAO        75.98 ± 2.86  22.17 ± 3.67  71.57 ± 5.31  76.87 ± 2.27  71.02 ± 1.81  48.85 ± 2.06  90.17 ± 2.13  49.01 ± 1.90  74.77 ± 1.71  7.11
D-SLA       74.66 ± 3.30  22.67 ± 4.21  71.97 ± 4.17  77.95 ± 2.11  71.92 ± 2.75  47.28 ± 1.88  89.77 ± 1.87  48.50 ± 1.33  75.99 ± 2.08  7.00
GraphMAE    76.07 ± 3.25  23.00 ± 3.64  70.45 ± 4.19  79.08 ± 2.72  71.50 ± 2.01  47.93 ± 3.03  86.10 ± 3.63  47.67 ± 1.16  74.84 ± 1.36  7.67
S-Mixup     73.12 ± 3.27  66.85 ± 7.04  74.61 ± 5.08  78.91 ± 1.61  69.61 ± 4.43  48.33 ± 5.36  88.65 ± 3.12  48.30 ± 2.50  75.89 ± 3.26  6.67
GraphAug    75.21 ± 2.63  68.14 ± 7.92  74.21 ± 3.70  79.53 ± 3.21  74.00 ± 3.41  48.11 ± 1.85  90.50 ± 3.17  49.00 ± 1.99  76.02 ± 2.67  3.67
FLAG        76.87 ± 7.21  68.35 ± 7.45  74.31 ± 4.21  79.03 ± 3.75  68.83 ± 4.67  47.21 ± 3.45  89.11 ± 2.40  47.48 ± 3.01  75.32 ± 3.13  7.00
UniAug      78.13 ± 2.61  71.50 ± 5.85  75.47 ± 2.50  80.54 ± 1.77  73.50 ± 2.48  50.13 ± 2.05  92.28 ± 1.59  49.48 ± 0.71  77.00 ± 2.02  1.11

Graph property prediction. We employ graph label guidance for UniAug throughout the graph-level tasks by training a 2-layer MLP as the guidance head on the graph labels in the training set. In the augmentation stage, we generate multiple graphs per training sample, and the generated graphs are then fed into the baseline GIN. We present the results of graph classification in Table 7.2 and of molecular property regression in Table 7.3. Three key observations emerge from the analysis: (1) Existing pre-training methods show negative transfer compared to GIN. Some special cases are the Enzymes and molecule regression datasets, where all pre-training methods fail to yield satisfactory results. In these datasets, the features are one of the driving components for graph property prediction, while the pre-training methods fail to encode such information due to incompatibility with the feature dimension. This reveals one critical drawback of the pre-training methods: their inability to handle feature heterogeneity. (2) GDA methods yield inconsistent results across different datasets. While these methods enhance performance on some datasets, they cause performance declines on others. This variability is directly reflected in the average rank, where some of them even fall behind GIN. (3) Unlike the pre-training methods and GDA methods, UniAug shows consistent performance improvements over GIN with a large margin. In the molecule regression tasks, UniAug effectively compensates for the absence of bond features and achieves performance comparable to DCT, which is a data augmentation method pre-trained on in-domain molecule graphs. These findings affirm that the pre-training and structure augmentation paradigm of UniAug effectively benefits the downstream datasets at the graph level.
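The plug-and-play use of the guided generator on a graph-classification training set can be sketched as follows. `generate_fn` is a hypothetical stand-in for label-guided sampling from the pre-trained diffusion model, and the attribute names are illustrative; the augmented list is then used to train the task-specific GIN exactly as in the non-augmented baseline.

```python
import copy

def augment_graph_dataset(train_set, generate_fn, k_per_graph=3):
    """Structure-only augmentation for a graph-level training set.

    Each element of `train_set` is assumed to expose `x` (node features),
    `adj` (adjacency), and `y` (graph label); `generate_fn(num_nodes, target)`
    stands in for guided sampling from the pre-trained diffusion model and
    returns a synthetic adjacency matrix with the requested number of nodes.
    """
    augmented = list(train_set)                      # keep every original graph
    for graph in train_set:
        for _ in range(k_per_graph):
            new_adj = generate_fn(num_nodes=graph.x.shape[0], target=graph.y)
            aug = copy.copy(graph)                   # shallow copy of the container
            aug.adj = new_adj                        # synthetic structure ...
            aug.x, aug.y = graph.x, graph.y          # ... original features and label
            augmented.append(aug)
    return augmented
```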
Table 7.3 Mean and standard deviation of the regression error with 10-fold cross-validation on molecular property regression (lower is better). The best result is bold. The last column is the average rank of each method. *Results are taken from DCT [162].

ogbg-Lipo ogbg-ESOL ogbg-FreeSolv A.R.
GINE* GIN 0.545 ± 0.019 0.766 ± 0.016 0.543 ± 0.021 0.729 ± 0.018 1.639 ± 0.146 1.613 ± 0.155
JOAO 0.859 ± 0.007 1.458 ± 0.040 3.292 ± 0.117
FLAG* GREA* DCT* UniAug 0.528 ± 0.012 0.755 ± 0.039 0.586 ± 0.036 0.805 ± 0.135 0.516 ± 0.071 0.717 ± 0.020 1.565 ± 0.098 1.829 ± 0.368 1.339 ± 0.075 0.528 ± 0.006 0.677 ± 0.026 1.448 ± 0.049
A.R. 5.00 3.67 7.00 3.00 6.00 1.33 1.67

Link prediction. We choose three guidance objectives for UniAug, including the node degree, CN, and link prediction objectives, as described in Section 7.2.4. For each objective, we train an MLP to provide guidance information. We then augment the graph structure by generating a synthetic graph and preserving the original training edges, ensuring that the augmented graph does not remove any existing edges. The augmented graph is then fed into a GCN for link prediction. We summarize the results in Table 7.4, which show similar patterns to those observed in graph property prediction: (1) Existing pre-training methods provide negative transfer, especially on datasets with node features. (2) The GDA method CFLP leads to performance drops on the datasets without features and also suffers from high computation complexity during preprocessing. (3) UniAug enhances performance across all tested datasets. In addition, we apply UniAug to NCN [284], one of the state-of-the-art methods for link prediction. The results demonstrate consistent performance boosts from UniAug when we apply NCN as the backbone. The structure augmentation paradigm of UniAug allows plug-and-play applications to any downstream-specific models, showcasing its adaptability and effectiveness.

Node classification. To demonstrate the effectiveness of UniAug in node-level tasks, we transform node classification into subgraph classification. Specifically, we extract the aggregation tree of each node, i.e., the 2-hop subgraph for a 2-layer GCN, and label the subgraph with the center node.
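A minimal sketch of this node-to-subgraph transformation, using NetworkX's ego-graph utility; the function name is ours.

```python
import networkx as nx

def node_task_as_subgraphs(G: nx.Graph, node_labels: dict, num_hops: int = 2):
    """Turn node classification into subgraph classification: for each node,
    take its `num_hops`-hop ego network (the receptive field of a
    `num_hops`-layer GCN) and label the subgraph with the center node's label."""
    samples = []
    for v in G.nodes():
        ego = nx.ego_graph(G, v, radius=num_hops)
        samples.append((ego, node_labels[v], v))   # (subgraph, label, center node)
    return samples
```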
Table 7.4 Mean and standard deviation across 10 runs on link prediction. Results are scaled ×100. The last two methods are based on NCN, while the rest are GCN-based. The best result is bold for the two backbones, respectively. The last column is the average rank of each GCN-based method.

Method       Cora MRR      Citeseer MRR  Pubmed MRR    Power Hits@10  Yeast Hits@10  Erdos Hits@10  Flickr Hits@10  A.R.
GCN          30.26 ± 4.80  50.57 ± 7.91  16.38 ± 1.30  30.61 ± 4.07   24.71 ± 4.92   35.71 ± 2.65   8.10 ± 2.58     4.14
AttrMask     13.43 ± 1.93  20.23 ± 1.29  16.39 ± 3.62  29.92 ± 2.61   25.10 ± 4.77   30.85 ± 3.13   8.77 ± 1.65     6.43
CtxtPred     15.68 ± 2.91  22.31 ± 1.31  13.10 ± 3.70  29.30 ± 3.55   22.96 ± 4.28   34.82 ± 2.55   3.61 ± 1.01     7.86
EdgePred     15.31 ± 3.54  22.91 ± 1.87  17.85 ± 4.45  29.54 ± 3.78   25.78 ± 4.51   34.65 ± 3.84   6.86 ± 3.24     5.43
InfoMax      16.35 ± 2.57  22.90 ± 1.30  15.91 ± 2.71  29.29 ± 4.72   26.33 ± 4.12   35.82 ± 4.12   3.23 ± 0.38     6.00
JOAO         17.21 ± 3.66  23.10 ± 1.41  15.33 ± 3.70  28.98 ± 4.01   26.47 ± 4.65   33.77 ± 3.05   6.01 ± 1.57     6.00
D-SLA        15.55 ± 3.12  23.05 ± 1.54  16.10 ± 3.96  29.37 ± 2.88   26.15 ± 3.32   36.02 ± 4.58   6.70 ± 2.03     5.29
GraphMAE     15.94 ± 1.73  20.35 ± 1.52  13.80 ± 1.36  27.69 ± 1.99   26.51 ± 2.92   35.63 ± 3.61   8.41 ± 2.44     6.14
CFLP         33.62 ± 6.44  55.20 ± 4.16  17.01 ± 2.75  16.02 ± 8.31   24.23 ± 5.23   28.74 ± 2.38   OOM             6.43
UniAug-GCN   35.36 ± 7.88  54.66 ± 4.55  17.28 ± 1.89  34.36 ± 1.68   27.52 ± 4.80   39.67 ± 4.51   9.46 ± 1.18     1.29
NCN          31.72 ± 4.48  58.03 ± 3.45  38.26 ± 2.56  27.36 ± 5.00   39.85 ± 5.07   36.81 ± 3.29   8.33 ± 0.92     –
UniAug-NCN   35.92 ± 7.85  61.69 ± 3.21  40.30 ± 2.53  30.20 ± 1.46   42.11 ± 5.74   39.26 ± 2.84   8.85 ± 0.90     –

Table 7.5 Mean and standard deviation of accuracy (%) across 10 splits on node classification of heterophilic graphs. The best result is bold. The last column is the average rank of each method.
GCN AttrMask CtxtPred EdgePred InfoMax JOAO D-SLA GraphMAE
Cornell Wisconsin Texas Actor Chameleon* Squirrel*
59.41 ± 6.03 51.68 ± 4.34 63.78 ± 4.80 30.58 ± 1.29 40.94 ± 3.91 39.11 ± 1.74 44.86 ± 5.43 40.81 ± 7.78 42.70 ± 5.51 39.19 ± 12.75 40.13 ± 8.60 41.05 ± 6.88 47.05 ± 4.37 53.73 ± 4.31 36.67 ± 17.23 48.04 ± 6.63 39.80 ± 16.38 44.70 ± 7.45 42.13 ± 9.58 57.06 ± 4.59 76.47 ± 2.61 69.02 ± 4.96 79.61 ± 5.56 60.54 ± 5.82 58.92 ± 4.32 59.37 ± 5.11 58.87 ± 4.06 57.06 ± 3.43 59.93 ± 4.29 63.70 ± 5.51 72.35 ± 4.27 73.51 ± 5.06 77.03 ± 4.27 25.31 ± 1.03 23.97 ± 2.63 22.99 ± 6.22 23.30 ± 4.37 24.17 ± 5.02 23.74 ± 4.06 24.69 ± 0.68 33.95 ± 0.68 33.11 ± 1.57 34.97 ± 0.55 35.81 ± 2.88 24.36 ± 4.13 21.02 ± 5.06 22.59 ± 4.91 25.81 ± 3.79 26.49 ± 4.27 37.18 ± 3.08 38.59 ± 2.89 43.84 ± 3.39 41.94 ± 2.77 30.63 ± 1.68 26.26 ± 7.50 27.94 ± 8.41 27.52 ± 9.09 31.72 ± 7.03 28.50 ± 6.90 31.94 ± 1.65 37.34 ± 2.18 41.90 ± 1.90 38.79 ± 2.61
A.R. 3.83 5.83 9.50 8.83 10.17 8.33 8.00 5.00 3.00 2.00 1.50
Half-Hop UniAug UniAug + Half-Hop 62.46 ± 7.58 68.11 ± 6.72 72.43 ± 5.81
*Chameleon and Squirrel are filtered to remove duplicated nodes [224].

Table 7.6 Results of node classification on homophily graphs. Results are scaled ×100.

         Cora                        Citeseer                    Pubmed
         GCN           UniAug        GCN           UniAug        GCN           UniAug
ACC ↑    81.75 ± 0.73  81.78 ± 0.60  70.71 ± 0.76  71.17 ± 0.58  79.53 ± 0.25  79.54 ± 0.35
SD ↓     24.51 ± 1.06  23.45 ± 0.90  22.57 ± 0.80  19.90 ± 0.81  27.02 ± 0.56  26.50 ± 0.55

Figure 7.3 Effects of pre-training amount of compute on graph classification (left) and link prediction (right), where one PF-day = $10^{15} \times 24 \times 3600 = 8.64 \times 10^{19}$ floating point operations.

We then adopt a strategy similar to graph classification and train a 2-layer classifier as a guidance head. Inspired by the success of structure augmentation on heterophilic graphs [22, 10], we evaluate UniAug on 6 heterophilic datasets. We observe phenomena similar to those seen in graph property prediction and link prediction in Table 7.5. One thing to mention is the combination of UniAug and Half-Hop.
Half-Hop offers performance improvements on four out of six datasets via data augmentation, and combining it with UniAug yields even higher results. This highlights the flexibility of UniAug and opens up possibilities for further exploration of its use cases. Given the impressive results of UniAug on heterophilic graphs, we anticipate it will also help to balance the performance disparities among nodes with different homophily ratios on homophilic graphs [188]. We split the nodes into five groups according to their homophily ratios and calculate the standard deviation (SD) across groups. As shown in Table 7.6, UniAug matches the performance of vanilla GCN and also reduces the performance discrepancies corresponding to the SD.

7.3.2 Scaling behavior of UniAug

In light of the neural scaling law [125, 97, 2, 321, 167], we expect UniAug to benefit from an increased volume of data and a larger compute budget. In this subsection, we investigate the scaling behavior of UniAug in terms of data scale and amount of compute for pre-training.

Data scale. During the data collection process, we prepare three versions of the training data with increasing magnitude. We first sample 10 graphs per category from the Network Repository [236] to build a SMALL collection. Next, we gather all the graphs from the Network Repository and filter out large-scale graphs and outliers for a FULL collection. In addition, we add a 1,000-graph subset of the GitHub Star dataset from TUDataset [206] to enlarge the coverage of diverse patterns and form an EXTRA collection. We pre-train three versions of UniAug on the three collections, respectively, and evaluate them on graph classification and link prediction. As shown in Fig. 7.4, we observe a clear trend of increasing performance as we enlarge the scale of the pre-training data. This paves the way to scale up UniAug to even more pre-training graphs.

Figure 7.4 Effects of pre-training data scale on graph classification (left) and link prediction (right). The groups SMA, FUL, and EXT represent the SMALL, FULL, and EXTRA data collections.

Amount of compute. We sought to understand how effectively our diffusion model can learn data patterns as we continue to train it. To this end, we checkpointed UniAug every 2,000 epochs ($5 \times 10^{-3}$ PF-days) while training on the EXTRA collection, and then applied it to graph classification and link prediction tasks. The results are illustrated in Fig. 7.3. We observe that downstream performance generally improves with prolonged training, while the trend slows down for some datasets when we reach 8,000 epochs. We take the checkpoint at the 10,000th epoch for evaluations. Given the scaling behavior observed, we anticipate UniAug to become even more effective with additional resources.

7.3.3 Preventing negative transfer

In the previous parts of the experiments, we showcase the positive transfer of UniAug across different tasks. We now investigate which aspects of the design prevent negative transfer. UniAug consists of two main components: a pre-trained diffusion model and the structure augmentation through guided generation.

Table 7.7 Demonstration of negative transfer on graph classification (up) and link prediction (down).
                Enzymes       Proteins      IMDB-B        IMDB-M
GIN             66.00 ± 7.52  73.32 ± 4.03  71.10 ± 2.90  49.07 ± 2.81
UniAug          71.50 ± 5.85  75.47 ± 2.50  73.50 ± 2.48  50.13 ± 2.05
w/o self-cond   71.11 ± 7.50  73.31 ± 4.63  71.50 ± 2.27  49.00 ± 2.74
w/o guidance    62.17 ± 3.93  71.15 ± 4.56  53.80 ± 3.29  35.33 ± 3.17
w/ cross-guide  51.50 ± 7.64  72.46 ± 4.35  71.10 ± 2.38  49.20 ± 2.59

In the pre-training process, we inject self-supervised graph labels into the diffusion model, and we compare against the performance of its unconditioned counterpart. Regarding the augmentation process, we examine the impact of diffusion guidance by exploring outcomes when the guidance is either removed or applied using another dataset from a different domain (cross-guide). We summarize the results in Table 7.7 for graph classification and link prediction. All modifications investigated lead to performance declines in both tasks. We observe that removing the guidance results in significant negative transfer for graph classification, while the effects of self-conditioning are more pronounced for link prediction. We conclude that both the self-conditioning strategy and the diffusion guidance are crucial in preventing negative transfer, underscoring their importance in the design of UniAug.

7.4 Conclusion

In this work, we propose a graph structure augmentation pipeline, UniAug, to leverage the increasing scale of graph data. We collect thousands of graphs from various domains and pre-train a self-conditioned discrete diffusion model on them. In the downstream stage, we augment the graphs by preserving the original node features and generating synthetic structures. We apply UniAug to node-, link-, and graph-level tasks and achieve consistent performance gains. We have successfully developed a showcase that benefits from cross-domain graph data scaling using diffusion models. One limitation of the current analysis is the absence of an investigation into the effects of model parameters due to limited resources. Given the scaling behavior of UniAug in terms of data scale and amount of compute, we anticipate that a large-scale model will provide significant performance improvements. One future direction is to investigate the adaptation of fast sampling methods to the discrete diffusion models on graphs. This will lead to lower time complexity and enable broader application scenarios.

CHAPTER 8
CONCLUSION

This dissertation advances graph machine learning by transitioning from Graph Neural Networks (GNNs) to Graph Foundation Models (GFMs), aiming for generalization across diverse graphs. By identifying essential network patterns and addressing GNN limitations under domain shifts, homophily-heterophily trade-offs, and link prediction challenges, this work lays the foundation for scalable, transferable graph learning. The proposed Graph Vocabulary Hypothesis and GFM blueprint pave the way for robust, adaptable models, marking a shift from task-specific GNNs to universally applicable GFMs. Ultimately, this research bridges network science insights with modern graph learning, setting the stage for future advancements in graph machine learning.

BIBLIOGRAPHY

[1] Rediet Abebe, Nicole Immorlica, Jon Kleinberg, Brendan Lucier, and Ali Shirali. On the effect of triadic closure on network segregation. In Proceedings of the 23rd ACM Conference on Economics and Computation, pages 249–284, 2022.

[2] Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, and Hanie Sedghi. Exploring the limits of large scale pre-training.
In International Conference on Learning Representations, 2022. Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In international conference on machine learning, pages 21–29. PMLR, 2019. Ghadeer AbuOda, Gianmarco De Francisci Morales, and Ashraf Aboulnaga. Link prediction via higher-order motif features. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I, pages 412–429. Springer, 2020. Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social networks, 25(3):211–230, 2003. Edo M Airoldi, David Blei, Stephen Fienberg, and Eric Xing. Mixed membership stochastic blockmodels. Advances in neural information processing systems, 21, 2008. Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002. Aili Asikainen, Gerardo Iñiguez, Javier Ureña-Carrión, Kimmo Kaski, and Mikko Kivelä. Cumulative effects of triadic closure and homophily in social networks. Science Advances, 6(19):eaax7310, 2020. Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021. [10] Mehdi Azabou, Venkataramana Ganesh, Shantanu Thakoor, Chi-Heng Lin, Lakshmi Sathidevi, Ran Liu, Michal Valko, Petar Veličković, and Eva L Dyer. Half-hop: A graph upsampling approach for slowing down message passing. In International Conference on Machine Learning, pages 1341–1360. PMLR, 2023. [11] Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, and Alexei A Efros. Sequential modeling enables scalable learning for large vision models. arXiv preprint arXiv:2312.00785, 2023. 94 [12] Fan Bao, Chongxuan Li, Jiacheng Sun, and Jun Zhu. Why are conditional generative models better than unconditional ones? arXiv preprint arXiv:2212.00362, 2022. [13] Aseem Baranwal, Kimon Fountoulakis, and Aukosh Jagannath. Graph convolution for semi- supervised classification: Improved linear separability and out-of-distribution generalization. arXiv preprint arXiv:2102.06966, 2021. [14] Aseem Baranwal, Kimon Fountoulakis, and Aukosh Jagannath. Effects of graph convo- lutions in multi-layer networks. In The Eleventh International Conference on Learning Representations, 2023. [15] Pablo Barcelo, Mikhail Galkin, Christopher Morris, and Miguel Romero Orth. Weisfeiler and leman go relational. In Learning on Graphs Conference, pages 46–1. PMLR, 2022. [16] Ilyes Batatia, Philipp Benner, Yuan Chiang, Alin M. Elena, Dávid P. Kovács, Janosh Riebesell, Xavier R. Advincula, Mark Asta, William J. Baldwin, Noam Bernstein, Arghya Bhowmik, Samuel M. Blau, Vlad Cărare, James P. Darby, Sandip De, Flaviano Della Pia, Volker L. Deringer, Rokas Elijošius, Zakariya El-Machachi, Edvin Fako, Andrea C. Ferrari, Annalena Genreith-Schriever, Janine George, Rhys E. A. Goodall, Clare P. Grey, Shuang Han, Will Handley, Hendrik H. Heenen, Kersti Hermansson, Christian Holm, Jad Jaafar, Stephan Hofmann, Konstantin S. Jakob, Hyunwook Jung, Venkat Kapil, Aaron D. Kaplan, Nima Karimitari, Namu Kroupa, Jolla Kullgren, Matthew C. Kuner, Domantas Kuryla, Guoda Liepuoniute, Johannes T. Margraf, Ioan-Bogdan Magdău, Angelos Michaelides, J. 
Harry Moore, Aakash A. Naik, Samuel P. Niblett, Sam Walton Norwood, Niamh O’Neill, Christoph Ortner, Kristin A. Persson, Karsten Reuter, Andrew S. Rosen, Lars L. Schaaf, Christoph Schran, Eric Sivonxay, Tamás K. Stenczel, Viktor Svahn, Christopher Sutton, Cas van der Oord, Eszter Varga-Umbrich, Tejs Vegge, Martin Vondrák, Yangshuai Wang, William C. Witt, Fabian Zills, and Gábor Csányi. A foundation model for atomistic materials chemistry, 2023. [17] Federico Battiston, Giulia Cencetti, Iacopo Iacopini, Vito Latora, Maxime Lucas, Alice Patania, Jean-Gabriel Young, and Giovanni Petri. Networks beyond pairwise interactions: Structure and dynamics. Physics Reports, 874:1–92, 2020. [18] Dominique Beaini, Shenyang Huang, Joao Alex Cunha, Gabriela Moisescu-Pareja, Olek- sandr Dymov, Samuel Maddrell-Mander, Callum McLean, Frederik Wenkel, Luis Müller, Jama Hussein Mohamud, et al. Towards foundational models for molecular learning on large-scale multi-task datasets. arXiv preprint arXiv:2310.04292, 2023. [19] Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. Llm2vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961, 2024. [20] Austin R Benson, David F Gleich, and Jure Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016. 95 [21] Sannat Singh Bhasin, Vaibhav Holani, and Divij Sanjanwala. What do graph convolutional neural networks learn? arXiv preprint arXiv:2207.01839, 2022. [22] Wendong Bi, Lun Du, Qiang Fu, Yanlin Wang, Shi Han, and Dongmei Zhang. Make heterophily graphs better fit gnn: A graph rewiring approach. arXiv preprint arXiv:2209.08264, 2022. [23] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. [24] Stephen P Borgatti. Centrality and network flow. Social networks, 27(1):55–71, 2005. [25] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alex J Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006. [26] Sergey Brin and Lawrence Page. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer networks, 56(18):3825–3833, 2012. [27] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. [28] Ivan Brugere, Brian Gallagher, and Tanya Y Berger-Wolf. Network structure inference, a survey: Motivations, methods, and applications. ACM Computing Surveys (CSUR), 51(2):1–39, 2018. [29] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. [30] Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279, 2022. [31] Yuxuan Cao, Jiarong Xu, Carl Yang, Jiaan Wang, Yunchao Zhang, Chunping Wang, Lei Chen, and Yang Yang. When to pre-train graph neural networks? 
an answer from data generation perspective! arXiv preprint arXiv:2303.16458, 2023. [32] Yuxuan Cao, Jiarong Xu, Carl Yang, Jiaan Wang, Yunchao Zhang, Chunping Wang, Lei Chen, and Yang Yang. When to pre-train graph neural networks? an answer from data generation perspective! arXiv preprint arXiv:2303.16458, 2023. [33] Ziwei Chai, Tianjie Zhang, Liang Wu, Kaiqiao Han, Xiaohai Hu, Xuanwen Huang, and Yang 96 Yang. Graphllm: Boosting graph reasoning ability of large language model. arXiv preprint arXiv:2310.05845, 2023. [34] Deepayan Chakrabarti. Avoiding biases due to similarity assumptions in node embeddings. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 56–65, 2022. [35] Benjamin Paul Chamberlain, Sergey Shirobokov, Emanuele Rossi, Fabrizio Frasca, Thomas Markovich, Nils Hammerla, Michael M Bronstein, and Max Hansmire. Graph neural networks for link prediction with subgraph sketching. arXiv preprint arXiv:2209.15486, 2022. [36] Benjamin Paul Chamberlain, Sergey Shirobokov, Emanuele Rossi, Fabrizio Frasca, Thomas Markovich, Nils Hammerla, Michael M Bronstein, and Max Hansmire. Graph neural networks for link prediction with subgraph sketching. In ICLR, 2023. [37] Nitesh V Chawla and Grigoris Karakoulas. Learning from labeled and unlabeled data: An empirical study across techniques and domains. Journal of Artificial Intelligence Research, 23:331–366, 2005. [38] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 3438–3445, 2020. [39] Dian Chen, Dequan Wang, Trevor Darrell, and Sayna Ebrahimi. Contrastive test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 295–305, 2022. [40] Dingshuo Chen, Yanqiao Zhu, Jieyu Zhang, Yuanqi Du, Zhixun Li, Qiang Liu, Shu Wu, and Liang Wang. Uncovering neural scaling laws in molecular representation learning. In Thirty- seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. [41] Jialin Chen, Shirley Wu, Abhijit Gupta, and Zhitao Ying. D4explainer: In-distribution explanations of graph neural network via discrete denoising diffusion. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. [42] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In International conference on machine learning, pages 1725–1735. PMLR, 2020. [43] Runjin Chen, Tong Zhao, Ajay Jaiswal, Neil Shah, and Zhangyang Wang. Llaga: Large language and graph assistant. arXiv preprint arXiv:2402.08170, 2024. [44] Xiaohui Chen, Jiaxing He, Xu Han, and Liping Liu. Efficient and degree-guided graph generation via discrete diffusion modeling. In International Conference on Machine Learning, 97 pages 4585–4610. PMLR, 2023. [45] Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Haifang Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, and Jiliang Tang. Exploring the potential of large language models (llms) in learning on graphs. ArXiv, abs/2307.03393, 2023. [46] Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al. Exploring the potential of large language models (llms) in learning on graphs. ACM SIGKDD Explorations Newsletter, 25(2):42–61, 2024. 
[47] Zhikai Chen, Haitao Mao, Jingzhe Liu, Yu Song, Bingheng Li, Wei Jin, Bahare Fatemi, Anton Tsitsulin, Bryan Perozzi, Hui Liu, et al. Text-space graph foundation models: Comprehensive benchmarks and new insights. arXiv preprint arXiv:2406.10727, 2024. [48] Zhikai Chen, Haitao Mao, Hongzhi Wen, Haoyu Han, Wei Jin, Haiyang Zhang, Hui Liu, and Jiliang Tang. Label-free node classification on graphs with large language models (llms). arXiv preprint arXiv:2310.04668, 2023. [49] Ziang Chen, Jialin Liu, Xinshang Wang, and Wotao Yin. On representing linear programs by graph neural networks. In The Eleventh International Conference on Learning Representations, 2022. [50] Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations. [51] Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations, 2021. [52] Yoonhyuk Choi, Jiho Choi, Taewook Ko, and Chong-Kwon Kim. Is signed message essential for graph neural networks? arXiv preprint arXiv:2301.08918, 2023. [53] Eugenio Clerico, George Deligiannidis, and Arnaud Doucet. Wide stochastic networks: In International Conference on Algorithmic Gaussian limit and pac-bayesian training. Learning Theory, pages 447–470. PMLR, 2023. [54] Sergio Currarini, Matthew O Jackson, and Paolo Pin. An economic model of friendship: Homophily, minorities, and segregation. Econometrica, 77(4):1003–1045, 2009. [55] Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Co-clustering based classification for out-of-domain documents. In Pavel Berkhin, Rich Caruana, and Xindong Wu, editors, Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 12-15, 2007, pages 210–219. ACM, 2007. 98 [56] Nur Nasuha Daud, Siti Hafizah Ab Hamid, Muntadher Saadoon, Firdaus Sahran, and Nor Badrul Anuar. Applications of link prediction in social networks: A review. Journal of Network and Computer Applications, 166:102716, 2020. [57] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems, 29, 2016. [58] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021. [59] Kaiwen Dong, Haitao Mao, Zhichun Guo, and Nitesh V Chawla. Universal link predictor by in-context learning. arXiv preprint arXiv:2402.07738, 2024. [60] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022. [61] Yuxiao Dong, Reid A Johnson, Jian Xu, and Nitesh V Chawla. Structural diversity and homophily: A study across more than one hundred big networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 807–816, 2017. [62] Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. Learning structural node embeddings via diffusion wavelets. In Association for Computing Machinery Special Interest Group on Knowledge Discovery and Data Mining, pages 1320–1329. ACM, 2018. [63] Lun Du, Xiaozhou Shi, Qiang Fu, Xiaojun Ma, Hengyu Liu, Shi Han, and Dongmei Zhang. 
Gbk-gnn: Gated bi-kernel graph neural networks for modeling both homophily and heterophily. In Proceedings of the ACM Web Conference 2022, pages 1550–1558, 2022. [64] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jian, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D Hwang, et al. Faith and fate: Limits of transformers on compositionality. arXiv preprint arXiv:2305.18654, 2023. [65] Gintare Karolina Dziugaite, Kyle Hsu, Waseem Gharbieh, Gabriel Arpino, and Daniel Roy. On the role of data in pac-bayes bounds. In International Conference on Artificial Intelligence and Statistics, pages 604–612. PMLR, 2021. [66] Jack Edmonds and Ellis L. Johnson. Matching, euler tours and the chinese postman. Mathematical Programming, 5:88–124, 1973. [67] Anna Evtushenko and Jon Kleinberg. The paradox of second-order homophily in networks. Scientific Reports, 11(1):13360, 2021. 99 [68] Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. Graph neural networks for social recommendation. In The world wide web conference, pages 417–426, 2019. [69] Bahare Fatemi, Jonathan Halcrow, and Bryan Perozzi. Talk like a graph: Encoding graphs for large language models. arXiv preprint arXiv:2310.04560, 2023. [70] Santo Fortunato and Darko Hric. Community detection in networks: A user guide. Physics reports, 659:1–44, 2016. [71] Kimon Fountoulakis, Amit Levi, Shenghao Yang, Aseem Baranwal, and Aukosh Jagannath. Graph attention retrospective. arXiv preprint arXiv:2202.13060, 2022. [72] Mikhail Galkin, Xinyu Yuan, Hesham Mostafa, Jian Tang, and Zhaocheng Zhu. Towards foundation models for knowledge graph reasoning. arXiv preprint arXiv:2310.04562, 2023. [73] Mikhail Galkin, Jincheng Zhou, Bruno Ribeiro, Jian Tang, and Zhaocheng Zhu. Zero-shot logical query reasoning on any knowledge graph. arXiv preprint arXiv:2404.07198, 2024. [74] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189. PMLR, JMLR.org, 2015. [75] Jianfei Gao, Yangze Zhou, and Bruno Ribeiro. Double permutation equivariance for knowledge graph completion. arXiv preprint arXiv:2302.01313, 2023. [76] Shanghua Gao, Zhong-Yu Li, Ming-Hsuan Yang, Ming-Ming Cheng, Junwei Han, and Philip Torr. Large-scale unsupervised semantic segmentation. IEEE transactions on pattern analysis and machine intelligence, 2022. [77] Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning, pages 3419–3430. PMLR, 2020. [78] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016. [79] Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew G Wilson. Large language models are zero-shot time series forecasters. Advances in Neural Information Processing Systems, 36, 2024. [80] Nate Gruver, Samuel Stanton, Nathan Frey, Tim GJ Rudner, Isidro Hotzel, Julien Lafrance- Vanasse, Arvind Rajpal, Kyunghyun Cho, and Andrew G Wilson. Protein design with guided discrete diffusion. Advances in Neural Information Processing Systems, 36, 2024. 100 [81] Shurui Gui, Xiner Li, Limei Wang, and Shuiwang Ji. Good: A graph out-of-distribution benchmark. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 
[82] Kilian Konstantin Haefeli, Karolis Martinkus, Nathanaël Perraudin, and Roger Wattenhofer. Diffusion models for graphs benefit from discrete state spaces. In NeurIPS 2022 Workshop: New Frontiers in Graph Learning, 2022. [83] Jonathan Halcrow, Alexandru Mosoi, Sam Ruth, and Bryan Perozzi. Grale: Designing networks for graph learning. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2523–2532, 2020. [84] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017. [85] Xiaotian Han, Zhimeng Jiang, Ninghao Liu, and Xia Hu. G-mixup: Graph data augmentation for graph classification. In International Conference on Machine Learning, pages 8230–8248. PMLR, 2022. [86] Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang Zhang, et al. Pre-trained models: Past, present and future. AI Open, 2:225–250, 2021. [87] Mingguo He, Zhewei Wei, Hongteng Xu, et al. Bernnet: Learning arbitrary graph spectral filters via bernstein approximation. Advances in Neural Information Processing Systems, 34:14239–14251, 2021. [88] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and XIAOJUAN QI. Is synthetic data from generative models ready for image recognition? In The Eleventh International Conference on Learning Representations, 2023. [89] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 639–648, 2020. [90] Xiaoxin He, Xavier Bresson, Thomas Laurent, Adam Perold, Yann LeCun, and Bryan Hooi. Harnessing explanations: Llm-to-lm interpreter for enhanced text-attributed graph representation learning, 2023. [91] Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering. arXiv preprint arXiv:2402.07630, 2024. [92] Justus Isaiah Hibshman, Daniel Gonzalez, Satyaki Sikdar, and Tim Weninger. Joint subgraph- 101 to-subgraph transitions: Generalizing triadic closure for powerful and interpretable graph modeling. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 815–823, 2021. [93] [94] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv:2207.12598, 2022. arXiv preprint [95] Tomaž Hočevar and Janez Demšar. A combinatorial approach to graphlet counting. Bioinfor- matics, 30(4):559–565, 2014. [96] Peter D Hoff, Adrian E Raftery, and Mark S Handcock. Latent space approaches to social network analysis. Journal of the american Statistical association, 97(460):1090–1098, 2002. [97] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022. [98] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. 
Advances in Neural Information Processing Systems, 34:12454–12465, 2021. [99] Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. Graphmae: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 594–604, 2022. [100] Vincent Tao Hu, David W Zhang, Yuki M Asano, Gertjan J Burghouts, and Cees GM Snoek. Self-guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18413–18422, 2023. [101] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020. [102] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020. [103] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020. [104] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure 102 Leskovec. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265, 2019. [105] Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. Gpt-gnn: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1857–1867. ACM, 2020. [106] Hong Huang, Yuxiao Dong, Jie Tang, Hongxia Yang, Nitesh V Chawla, and Xiaoming Fu. Will triadic closure strengthen ties in social networks? ACM Transactions on Knowledge Discovery from Data (TKDD), 12(3):1–25, 2018. [107] Hong Huang, Jie Tang, Lu Liu, JarDer Luo, and Xiaoming Fu. Triadic closure pattern analysis and prediction in social networks. IEEE Transactions on Knowledge and Data Engineering, 27(12):3374–3389, 2015. [108] Qian Huang, Hongyu Ren, Peng Chen, Gregor Kržmanc, Daniel Zeng, Percy Liang, and Jure Leskovec. Prodigy: Enabling in-context learning over graphs. arXiv preprint arXiv:2305.12600, 2023. [109] Xingyue Huang, Miguel Romero Orth, Ismail Ilkan Ceylan, and Pablo Barcelo. A theory of link prediction via relational weisfeiler-leman. arXiv preprint arXiv:2302.02209, 2023. [110] Yinan Huang, William Lu, Joshua Robinson, Yu Yang, Muhan Zhang, Stefanie Jegelka, and Pan Li. On the stability of expressive positional encodings for graph neural networks. arXiv preprint arXiv:2310.02579, 2023. [111] Borja Ibarz, Vitaly Kurin, George Papamakarios, Kyriacos Nikiforou, Mehdi Bennani, Róbert Csordás, Andrew Joseph Dudzik, Matko Bošnjak, Alex Vitvitskyi, Yulia Rubanova, et al. A generalist neural algorithmic learner. In Learning on graphs conference, pages 2–1. PMLR, 2022. [112] Paul Jaccard. Distribution comparée de la flore alpine dans quelques régions des alpes occidentales et orientales. Bulletin de la Murithienne, (31):81–92, 1902. [113] Glen Jeh and Jennifer Widom. Simrank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, 2002. [114] Jing Jiang and ChengXiang Zhai. 
[115] Zhimeng Jiang, Xiaotian Han, Chao Fan, Zirui Liu, Xiao Huang, Na Zou, Ali Mostafavi, and Xia Hu. Topology matters in fair graph learning: a theoretical pilot study.
[116] Zhimeng Jiang, Xiaotian Han, Chao Fan, Zirui Liu, Na Zou, Ali Mostafavi, and Xia Hu. Fmp: Toward fair graph message passing against topology bias. arXiv preprint arXiv:2202.04187, 2022.
[117] Lichen Jin, Yizhou Zhang, Guojie Song, and Yilun Jin. Active domain transfer on network embedding. In Proceedings of The Web Conference 2020, pages 2683–2689. ACM / IW3C2, 2020.
[118] Wei Jin, Tyler Derr, Haochen Liu, Yiqi Wang, Suhang Wang, Zitao Liu, and Jiliang Tang. Self-supervised learning on graphs: Deep insights and new direction. arXiv preprint arXiv:2006.10141, 2020.
[119] Wei Jin, Haitao Mao, Zheng Li, Haoming Jiang, Chen Luo, Hongzhi Wen, Haoyu Han, Hanqing Lu, Zhengyang Wang, Ruirui Li, et al. Amazon-m2: A multilingual multi-locale shopping session dataset for recommendation and text generation. Advances in Neural Information Processing Systems, 36:8006–8026, 2023.
[120] Wei Jin, Tong Zhao, Jiayuan Ding, Yozen Liu, Jiliang Tang, and Neil Shah. Empowering graph representation learning with test-time graph transformation. arXiv preprint arXiv:2210.03561, 2022.
[121] Wei Jin, Tong Zhao, Jiayuan Ding, Yozen Liu, Jiliang Tang, and Neil Shah. Empowering graph representation learning with test-time graph transformation. In The Eleventh International Conference on Learning Representations, 2023.
[122] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Hierarchical generation of molecular graphs using structural motifs. In International conference on machine learning, pages 4839–4848. PMLR, 2020.
[123] Jaehyeong Jo, Seul Lee, and Sung Ju Hwang. Score-based generative modeling of graphs via the system of stochastic differential equations. In International Conference on Machine Learning, pages 10362–10383. PMLR, 2022.
[124] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[125] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[126] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.
[127] Kazi Zainab Khanam, Gautam Srivastava, and Vijay Mago. The homophily principle in social network analysis. arXiv preprint arXiv:2008.10383, 2020.
[128] Dongki Kim, Jinheon Baek, and Sung Ju Hwang. Graph self-supervised learning with accurate discrepancy learning. Advances in Neural Information Processing Systems, 35:14085–14098, 2022.
[129] Jinwoo Kim, Tien Dat Nguyen, Seonwoo Min, Sungjun Cho, Moontae Lee, Honglak Lee, and Seunghoon Hong. Pure transformers are powerful graph learners. arXiv, abs/2207.02505, 2022.
[130] Youngeun Kim, Donghyeon Cho, Priyadarshini Panda, and Sungeun Hong. Progressive domain adaptation from a source pre-trained model. arXiv preprint arXiv:2007.01524, 2020.
[131] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[132] Thomas Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ArXiv, abs/1609.02907, 2016.
[133] Thomas N Kipf and Max Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning, 2016.
[134] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, abs/1609.02907, 2017.
[135] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
[136] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
[137] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[138] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In International Conference on Learning Representations, 2018.
[139] Kezhi Kong, Guohao Li, Mucong Ding, Zuxuan Wu, Chen Zhu, Bernard Ghanem, Gavin Taylor, and Tom Goldstein. Robust optimization as data augmentation for large-scale graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 60–69, 2022.
[140] Lecheng Kong, Yixin Chen, and Muhan Zhang. Geodesic graph neural network for efficient graph representation learning. Advances in Neural Information Processing Systems, 35:5896–5909, 2022.
[141] István A Kovács, Katja Luck, Kerstin Spirohn, Yang Wang, Carl Pollis, Sadie Schlabach, Wenting Bian, Dae-Kyum Kim, Nishka Kishore, Tong Hao, et al. Network-based prediction of protein interactions. Nature communications, 10(1):1240, 2019.
[142] Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618–21629, 2021.
[143] Nils M Kriege, Fredrik D Johansson, and Christopher Morris. A survey on graph kernels. Applied Network Science, 5(1):1–42, 2020.
[144] Ajay Kumar, Shashank Sheshar Singh, Kuldeep Singh, and Bhaskar Biswas. Link prediction techniques, applications, and performance: A survey. Physica A: Statistical Mechanics and its Applications, 553:124289, 2020.
[145] Anusha Lalitha, Osman Cihan Kilinc, Tara Javidi, and Farinaz Koushanfar. Peer-to-peer federated learning on graphs. arXiv preprint arXiv:1901.11173, 2019.
[146] Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. Kronecker graphs: an approach to modeling networks. Journal of Machine Learning Research, 11(2), 2010.
[147] Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? In Proceedings of the IEEE/CVF international conference on computer vision, pages 9267–9276, 2019.
[148] Juanhui Li, Harry Shomer, Haitao Mao, Shenglai Zeng, Yao Ma, Neil Shah, Jiliang Tang, and Dawei Yin. Evaluating graph neural networks for link prediction: Current pitfalls and new benchmarking. arXiv preprint arXiv:2306.10453, 2023.
[149] Juanhui Li, Harry Shomer, Haitao Mao, Shenglai Zeng, Yao Ma, Neil Shah, Jiliang Tang, and Dawei Yin. Evaluating graph neural networks for link prediction: Current pitfalls and new benchmarking. arXiv preprint arXiv:2306.10453, 2023.
[150] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
[151] Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. Model adaptation: Unsupervised domain adaptation without source data. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9641–9650. Computer Vision Foundation / IEEE, 2020.
[152] Xiang Li, Renyu Zhu, Yao Cheng, Caihua Shan, Siqiang Luo, Dongsheng Li, and Wei Qian. Finding global homophily in graph neural networks when meeting heterophily. In International Conference on Machine Learning, 2022.
[153] Yuwen Li, Miao Xiong, and Bryan Hooi. Graphcleaner: Detecting mislabelled samples in popular graph learning benchmarks. arXiv preprint arXiv:2306.00015, 2023.
[154] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, pages 6028–6039. PMLR, 2020.
[155] Shuming Liang, Yu Ding, Zhidong Li, Bin Liang, Yang Wang, Fang Chen, et al. Can gnns learn heuristic information for link prediction? 2022.
[156] Derek Lim, Felix Hohne, Xiuyu Li, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, and Ser Nam Lim. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. Advances in Neural Information Processing Systems, 34:20887–20902, 2021.
[157] Derek Lim, Felix Hohne, Xiuyu Li, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, and Ser Nam Lim. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. Advances in Neural Information Processing Systems, 34:20887–20902, 2021.
[158] Derek Lim, Joshua David Robinson, Lingxiao Zhao, Tess Smidt, Suvrit Sra, Haggai Maron, and Stefanie Jegelka. Sign and basis invariant networks for spectral graph representation learning. In The Eleventh International Conference on Learning Representations, 2022.
[159] Hongyi Ling, Zhimeng Jiang, Meng Liu, Shuiwang Ji, and Na Zou. Graph mixup with soft alignments. In International Conference on Machine Learning, pages 21335–21349. PMLR, 2023.
[160] Xiang Ling, Lingfei Wu, Saizhuo Wang, Gaoning Pan, Tengfei Ma, Fangli Xu, Alex X Liu, Chunming Wu, and Shouling Ji. Deep graph matching and searching for semantic code retrieval. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(5):1–21, 2021.
[161] Gang Liu, Eric Inae, Tong Zhao, Jiaxin Xu, Tengfei Luo, and Meng Jiang. Data-centric learning from unlabeled graphs with diffusion model. arXiv preprint arXiv:2303.10108, 2023.
[162] Gang Liu, Eric Inae, Tong Zhao, Jiaxin Xu, Tengfei Luo, and Meng Jiang. Data-centric learning from unlabeled graphs with diffusion model. Advances in neural information processing systems, 36, 2024.
[163] Gang Liu, Tong Zhao, Jiaxin Xu, Tengfei Luo, and Meng Jiang. Graph rationalization with environment-based augmentations. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1069–1078, 2022.
[164] Hao Liu, Jiarui Feng, Lecheng Kong, Ningyue Liang, Dacheng Tao, Yixin Chen, and Muhan Zhang. One for all: Towards training one graph model for all classification tasks. arXiv preprint arXiv:2310.00149, 2023.
[165] Haoyu Liu, Ningyi Liao, and Siqiang Luo. Simga: A simple and effective heterophilous graph neural network with efficient global aggregation. arXiv preprint arXiv:2305.09958, 2023.
[166] Hongrui Liu, Binbin Hu, Xiao Wang, Chuan Shi, Zhiqiang Zhang, and Jun Zhou. Confidence may cheat: Self-training on graph neural networks under distribution shift. In Proceedings of the ACM Web Conference 2022, pages 1248–1258, 2022.
[167] Jingzhe Liu, Haitao Mao, Zhikai Chen, Tong Zhao, Neil Shah, and Jiliang Tang. Neural scaling laws on graphs. arXiv preprint arXiv:2402.02054, 2024.
[168] Meng Liu, Hongyang Gao, and Shuiwang Ji. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 338–348, 2020.
[169] Ruitong Liu, Yanbin Wang, Haitao Xu, Bin Liu, Jianguo Sun, Zhenhao Guo, and Wenrui Ma. Source code vulnerability detection: Combining code language models and code property graphs. arXiv preprint arXiv:2404.14719, 2024.
[170] Songtao Liu, Rex Ying, Hanze Dong, Lanqing Li, Tingyang Xu, Yu Rong, Peilin Zhao, Junzhou Huang, and Dinghao Wu. Local augmentation for graph neural networks. In International conference on machine learning, pages 14054–14072. PMLR, 2022.
[171] Yang-Yu Liu, Jean-Jacques Slotine, and Albert-László Barabási. Controllability of complex networks. Nature, 473(7346):167–173, 2011.
[172] Zemin Liu, Xingtong Yu, Yuan Fang, and Xinming Zhang. Graphprompt: Unifying pre-training and downstream tasks for graph neural networks. In Proceedings of the ACM Web Conference 2023, pages 417–428, 2023.
[173] Zhiyuan Liu, Yaorui Shi, An Zhang, Enzhi Zhang, Kenji Kawaguchi, Xiang Wang, and Tat-Seng Chua. Rethinking tokenizer and decoder in masked graph modeling for molecules. Advances in Neural Information Processing Systems, 36, 2024.
[174] Shuqi Lu, Zhifeng Gao, Di He, Linfeng Zhang, and Guolin Ke. Highly accurate quantum chemical property prediction with uni-mol+. arXiv preprint arXiv:2303.16982, 2023.
[175] Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Xiao-Wen Chang, and Doina Precup. When do we need gnn for node classification? arXiv preprint arXiv:2210.16979, 2022.
[176] Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, and Doina Precup. Revisiting heterophily for graph neural networks. In Advances in Neural Information Processing Systems.
[177] Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, and Doina Precup. Is heterophily a real nightmare for graph neural networks to do node classification? arXiv preprint arXiv:2109.05641, 2021.
[178] Sitao Luan, Chenqing Hua, Minkai Xu, Qincheng Lu, Jiaqi Zhu, Xiao-Wen Chang, Jie Fu, Jure Leskovec, and Doina Precup. When do graph neural networks help with node classification: Investigating the homophily principle on node distinguishability. arXiv preprint arXiv:2304.14274, 2023.
[179] Youzhi Luo, Michael McThrow, Wing Yee Au, Tao Komikado, Kanji Uchino, Koji Maruhashi, and Shuiwang Ji. Automated data augmentations for graph classification. arXiv preprint arXiv:2202.13248, 2022.
[180] Youzhi Luo, Keqiang Yan, and Shuiwang Ji. Graphdf: A discrete flow model for molecular graph generation. In International Conference on Machine Learning, pages 7192–7203. PMLR, 2021.
[181] Jiaqi Ma, Junwei Deng, and Qiaozhu Mei. Subgroup generalization and fairness of graph neural networks. Advances in Neural Information Processing Systems, 34:1048–1061, 2021.
[182] Qian Ma, Haitao Mao, Jingzhe Liu, Zhehua Zhang, Chunlin Feng, Yu Song, Yihan Shao, and Yao Ma. Do neural scaling laws exist on graph self-supervised learning? arXiv preprint arXiv:2408.11243, 2024.
[183] Yao Ma, Xiaorui Liu, Neil Shah, and Jiliang Tang. Is homophily a necessity for graph neural networks? arXiv preprint arXiv:2106.06134, 2021.
[184] Yao Ma, Xiaorui Liu, Neil Shah, and Jiliang Tang. Is homophily a necessity for graph neural networks? ArXiv, abs/2106.06134, 2021.
[185] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1202–1211, 2021.
[186] Haitao Mao, Xu Chen, Qiang Fu, Lun Du, Shi Han, and Dongmei Zhang. Neuron campaign for initialization guided by information bottleneck theory. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 3328–3332, 2021.
[187] Haitao Mao, Zhikai Chen, Wei Jin, Haoyu Han, Yao Ma, Tong Zhao, Neil Shah, and Jiliang Tang. Demystifying structural disparity in graph neural networks: Can one size fit all? arXiv preprint arXiv:2306.01323, 2023.
[188] Haitao Mao, Zhikai Chen, Wei Jin, Haoyu Han, Yao Ma, Tong Zhao, Neil Shah, and Jiliang Tang. Demystifying structural disparity in graph neural networks: Can one size fit all? Advances in Neural Information Processing Systems, 36, 2024.
[189] Haitao Mao, Zhikai Chen, Wenzhuo Tang, Jianan Zhao, Yao Ma, Tong Zhao, Neil Shah, Michael Galkin, and Jiliang Tang. Graph foundation models. arXiv preprint arXiv:2402.02216, 2024.
[190] Haitao Mao, Lun Du, Yujia Zheng, Qiang Fu, Zelin Li, Xu Chen, Shi Han, and Dongmei Zhang. Source free unsupervised graph domain adaptation. arXiv preprint arXiv:2112.00955, 2021.
[191] Haitao Mao, Lun Du, Yujia Zheng, Qiang Fu, Zelin Li, Xu Chen, Shi Han, and Dongmei Zhang. Source free graph unsupervised domain adaptation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 520–528, 2024.
[192] Haitao Mao, Juanhui Li, Harry Shomer, Bingheng Li, Wenqi Fan, Yao Ma, Tong Zhao, Neil Shah, and Jiliang Tang. Revisiting link prediction: A data perspective. arXiv preprint arXiv:2310.00793, 2023.
[193] Haitao Mao, Juanhui Li, Harry Shomer, Bingheng Li, Wenqi Fan, Yao Ma, Tong Zhao, Neil Shah, and Jiliang Tang. Revisiting link prediction: a data perspective. In The Twelfth International Conference on Learning Representations, 2024.
[194] Haitao Mao, Guangliang Liu, Yao Ma, Rongrong Wang, and Jiliang Tang. A data generation perspective to the mechanism of in-context learning. arXiv preprint arXiv:2402.02212, 2024.
[195] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In IEEE International Conference on Computer Vision, pages 2794–2802. IEEE Computer Society, 2017.
[196] Dominic Masters, Josef Dean, Kerstin Klaser, Zhiyi Li, Sam Maddrell-Mander, Adam Sanders, Hatem Helal, Deniz Beker, Ladislav Rampášek, and Dominique Beaini. Gps++: An optimised hybrid mpnn/transformer for molecular property prediction. arXiv preprint arXiv:2212.02229, 2022.
[197] Andreas Maurer. A note on the pac bayesian theorem. arXiv preprint cs/0411099, 2004.
[198] David McAllester. Simplified pac-bayesian margin bounds. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pages 203–215. Springer, 2003.
[199] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3:127–163, 2000.
[200] R Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638, 2023.
[201] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415–444, 2001.
[202] Filippo Menczer, Santo Fortunato, and Clayton A. Davis. A First Course in Network Science. Cambridge University Press, 2020.
[203] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827, 2002.
[204] Samarth Mishra, Rameswar Panda, Cheng Perng Phoo, Chun-Fu Richard Chen, Leonid Karlinsky, Kate Saenko, Venkatesh Saligrama, and Rogerio S Feris. Task2sim: Towards effective pre-training and transfer from synthetic data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9194–9204, 2022.
[205] Jose G Moreno-Torres, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V Chawla, and Francisco Herrera. A unifying view on dataset shift in classification. Pattern recognition, 45(1):521–530, 2012.
[206] Christopher Morris, Nils M Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. Tudataset: A collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663, 2020.
[207] Christopher Morris, Yaron Lipman, Haggai Maron, Bastian Rieck, Nils M. Kriege, Martin Grohe, Matthias Fey, and Karsten Borgwardt. Weisfeiler and leman go machine learning: The story so far. Journal of Machine Learning Research, 24(333):1–59, 2023.
[208] Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 4602–4609, 2019.
[209] Luis Müller, Mikhail Galkin, Christopher Morris, and Ladislav Rampášek. Attending to graph transformers. arXiv preprint arXiv:2302.04181, 2023.
[210] Chaithanya Kumar Mummadi, Robin Hutmacher, Kilian Rambach, Evgeny Levinkov, Thomas Brox, and Jan Hendrik Metzen. Test-time adaptation to distribution shift by confidence maximization and input transformation. arXiv preprint arXiv:2106.14999, 2021.
[211] Yohsuke Murase, Hang Hyun Jo, János Török, János Kertész, Kimmo Kaski, et al. Structural transition in social networks. 2019.
[212] Mark EJ Newman. Clustering and preferential attachment in growing networks. Physical review E, 64(2):025102, 2001.
[213] Trung-Kien Nguyen and Yuan Fang. Diffusion-based negative sampling on graphs for link prediction. arXiv preprint arXiv:2403.17259, 2024.
[214] Maximilian Nickel, Xueyan Jiang, and Volker Tresp. Reducing the rank in relational factorization models by including observable patterns. Advances in Neural Information Processing Systems, 27, 2014.
[215] Chenhao Niu, Yang Song, Jiaming Song, Shengjia Zhao, Aditya Grover, and Stefano Ermon. Permutation invariant graph generation via score-based generative modeling. In International Conference on Artificial Intelligence and Statistics, pages 4474–4484. PMLR, 2020.
[216] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. arXiv preprint arXiv:1905.10947, 2019.
[217] Tore Opsahl. Triadic closure in two-mode networks: Redefining the global and local clustering coefficients. Social networks, 35(2):159–167, 2013.
[218] Madhur Panwar, Kabir Ahuja, and Navin Goyal. In-context learning through the bayesian prism. In The Twelfth International Conference on Learning Representations, 2023.
[219] Hyeonjin Park, Seunghun Lee, Sihyeon Kim, Jinyoung Park, Jisu Jeong, Kyung-Min Kim, Jung-Woo Ha, and Hyunwoo J Kim. Metropolis-hastings data augmentation for graph neural networks. Advances in Neural Information Processing Systems, 34:19010–19020, 2021.
[220] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-gcn: Geometric graph convolutional networks. In International Conference on Learning Representations.
[221] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014.
[222] Bryan Perozzi, Bahare Fatemi, Dustin Zelle, Anton Tsitsulin, Mehran Kazemi, Rami Al-Rfou, and Jonathan Halcrow. Let your graph do the talking: Encoding structured data for llms. arXiv preprint arXiv:2402.05862, 2024.
[223] Oleg Platonov, Denis Kuznedelev, Artem Babenko, and Liudmila Prokhorenkova. Characterizing graph datasets for node classification: Beyond homophily-heterophily dichotomy. arXiv preprint arXiv:2209.06177, 2022.
[224] Oleg Platonov, Denis Kuznedelev, Michael Diskin, Artem Babenko, and Liudmila Prokhorenkova. A critical look at the evaluation of GNNs under heterophily: Are we really making progress? In The Eleventh International Conference on Learning Representations, 2023.
[225] Ben Prystawski and Noah D Goodman. Why think step-by-step? reasoning emerges from the locality of experience. arXiv preprint arXiv:2304.03843, 2023.
[226] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1150–1160, 2020.
[227] Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. MIT Press, 2008.
[228] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[229] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[230] Md Nakhla Rafi, Dong Jae Kim, An Ran Chen, Tse-Hsun Chen, and Shaowei Wang. Towards better graph neural network-based fault localization through enhanced code representation. arXiv preprint arXiv:2404.04496, 2024.
[231] Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo. struc2vec: Learning node representations from structural identity. In Association for Computing Machinery Special Interest Group on Knowledge Discovery and Data Mining, pages 385–394. ACM, 2017.
[232] Pedro Ribeiro, Pedro Paredes, Miguel EP Silva, David Aparicio, and Fernando Silva. A survey on subgraph counting: concepts, algorithms, and applications to network motifs and graphlets. ACM Computing Surveys (CSUR), 54(2):1–36, 2021.
[233] Garry Robins, Pip Pattison, Yuval Kalish, and Dean Lusher. An introduction to exponential random graph (p*) models for social networks. Social networks, 29(2):173–191, 2007.
[234] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[235] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. Advances in neural information processing systems, 33:12559–12571, 2020.
[236] Ryan Rossi and Nesreen Ahmed. The network data repository with interactive graph analytics and visualization. In Proceedings of the AAAI conference on artificial intelligence, volume 29, 2015.
[237] Ryan A Rossi, Anup Rao, Sungchul Kim, Eunyee Koh, and Nesreen Ahmed. From closing triangles to closing higher-order motifs. In Companion Proceedings of the Web Conference 2020, pages 42–43, 2020.
[238] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987.
[239] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2):cnab014, 2021.
[240] Benedek Rozemberczki, Oliver Kiss, and Rik Sarkar. Karate club: an api oriented open-source python framework for unsupervised learning on graphs. In Proceedings of the 29th ACM international conference on information & knowledge management, pages 3125–3132, 2020.
[241] Luana Ruiz, Luiz F. O. Chamon, and Alejandro Ribeiro. Transferability properties of graph neural networks. IEEE Transactions on Signal Processing, 71:3474–3489, 2023.
[242] T Konstantin Rusch, Michael M Bronstein, and Siddhartha Mishra. A survey on oversmoothing in graph neural networks. arXiv preprint arXiv:2303.10993, 2023.
[243] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In International Conference on Machine Learning, pages 2988–2997. PMLR, 2017.
[244] Abulhair Saparov and He He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240, 2022.
[245] Purnamrita Sarkar, Deepayan Chakrabarti, and Andrew W Moore. Theoretical justification of popular link prediction heuristics. In IJCAI proceedings-international joint conference on artificial intelligence, volume 22, page 2722, 2011.
[246] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
[247] Xiao Shen, Quanyu Dai, Fu-lai Chung, Wei Lu, and Kup-Sze Choi. Adversarial deep network embedding for cross-network node classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 2991–2999, 2020.
[248] Xiao Shen, Quanyu Dai, Sitong Mao, Fu-lai Chung, and Kup-Sze Choi. Network together: Node classification via cross-network deep network embedding. IEEE Transactions on Neural Networks and Learning Systems, 32(5):1935–1948, 2020.
[249] Chence Shi, Shitong Luo, Minkai Xu, and Jian Tang. Learning gradient fields for molecular conformation generation. In International conference on machine learning, pages 9558–9568. PMLR, 2021.
[250] Hongzhi Shi, Jingtao Ding, Yufan Cao, Li Liu, Yong Li, et al. Learning symbolic models for graph-structured physical mechanism. In The Eleventh International Conference on Learning Representations, 2022.
[251] Yunsheng Shi, Zhengjie Huang, Shikun Feng, Hui Zhong, Wenjin Wang, and Yu Sun. Masked label prediction: Unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509, 2020.
[252] Nima Shoghi, Adeesh Kolluru, John R. Kitchin, Zachary W. Ulissi, C. Lawrence Zitnick, and Brandon M. Wood. From molecules to materials: Pre-training large generalizable models for atomic property prediction, 2023.
[253] Harry Shomer, Yao Ma, Haitao Mao, Juanhui Li, Bo Wu, and Jiliang Tang. Lpformer: An adaptive graph transformer for link prediction. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2686–2698, 2024.
[254] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
[255] Yu Song, Haitao Mao, Jiachen Xiao, Jingzhe Liu, Zhikai Chen, Wei Jin, Carl Yang, Jiliang Tang, and Hui Liu. A pure transformer pretraining framework on text-attributed graphs. arXiv preprint arXiv:2406.13873, 2024.
[256] Yu Song and Donglin Wang. Learning on graphs with out-of-distribution nodes. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1635–1645, 2022.
[257] Balasubramaniam Srinivasan and Bruno Ribeiro. On the equivalence between positional node embeddings and structural graph representations. arXiv preprint arXiv:1910.00452, 2019.
[258] Mingchen Sun, Kaixiong Zhou, Xin He, Ying Wang, and Xin Wang. Gppt: Graph pre-training and prompt tuning to generalize graph neural networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, pages 1717–1727, New York, NY, USA, 2022. Association for Computing Machinery.
[259] Xiangguo Sun, Hong Cheng, Jia Li, Bo Liu, and Jihong Guan. All in one: Multi-task prompting for graph neural networks. 2023.
[260] Shiyin Tan, Dongyuan Li, Renhe Jiang, Ying Zhang, and Manabu Okumura. Community-invariant graph contrastive learning. arXiv preprint arXiv:2405.01350, 2024.
[261] Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang. Graphgpt: Graph instruction tuning for large language models. arXiv preprint arXiv:2310.13023, 2023.
[262] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. Arnetminer: extraction and mining of academic social networks. In Association for Computing Machinery Special Interest Group on Knowledge Discovery and Data Mining, pages 990–998. ACM, 2008.
[263] Wenzhuo Tang, Haitao Mao, Danial Dervovic, Ivan Brugere, Saumitra Mishra, Yuying Xie, and Jiliang Tang. Cross-domain graph data scaling: A showcase with diffusion models. arXiv preprint arXiv:2406.01899, 2024.
[264] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[265] Brandon Trabucco, Kyle Doherty, Max A Gurinas, and Ruslan Salakhutdinov. Effective data augmentation with diffusion models. In The Twelfth International Conference on Learning Representations, 2024.
[266] Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024.
[267] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. Computer Science, abs/1412.3474, 2014.
[268] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(2605):2579–2605, 2008.
[269] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[270] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. stat, 1050:20, 2017.
[271] Petar Velickovic, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. ICLR (Poster), 2(3):4, 2019.
[272] Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. Digress: Discrete denoising diffusion for graph generation. In The Eleventh International Conference on Learning Representations, 2023.
[273] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. Graph kernels. Journal of Machine Learning Research, 11:1201–1242, 2010.
[274] Paul Voigt and Axel Von dem Bussche. The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 10:3152676, 2017.
[275] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.
[276] Haonan Wang, Jieyu Zhang, Qi Zhu, and Wei Huang. Augmentation-free graph contrastive learning. arXiv preprint arXiv:2204.04874, 2022.
[277] Haonan Wang, Jieyu Zhang, Qi Zhu, and Wei Huang. Can single-pass contrastive learning work for both homophilic and heterophilic graph? arXiv preprint arXiv:2211.10890, 2022.
[278] Haorui Wang, Haoteng Yin, Muhan Zhang, and Pan Li. Equivariant and stable positional encoding for more powerful graph neural networks. In International Conference on Learning Representations, 2021.
[279] Haorui Wang, Haoteng Yin, Muhan Zhang, and Pan Li. Equivariant and stable positional encoding for more powerful graph neural networks. arXiv preprint arXiv:2203.00199, 2022.
[280] Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, and Yulia Tsvetkov. Can language models solve graph problems in natural language? In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[281] Hui Wang and Zichun Le. Seven-layer model in complex networks link prediction: A survey. Sensors, 20(22):6560, 2020.
[282] Junfu Wang, Yuanfang Guo, Liang Yang, and Yunhong Wang. Understanding heterophily for graph neural networks. arXiv preprint arXiv:2401.09125, 2024.
[283] Xiyuan Wang, Haotong Yang, and Muhan Zhang. Neural common neighbor with completion for link prediction. arXiv preprint arXiv:2302.00890, 2023.
[284] Xiyuan Wang, Haotong Yang, and Muhan Zhang. Neural common neighbor with completion for link prediction. In The Twelfth International Conference on Learning Representations, 2024.
[285] Rongzhe Wei, Haoteng Yin, Junteng Jia, Austin R Benson, and Pan Li. Understanding non-linearity in graph neural networks from the bayesian-inference perspective. In Advances in Neural Information Processing Systems.
[286] Gerhard Widmer and Miroslav Kubat. Learning in the presence of concept drift and hidden contexts. Machine learning, 23:69–101, 1996.
[287] Oliver Wieder, Stefan Kohlbacher, Mélaine Kuenemann, Arthur Garon, Pierre Ducrot, Thomas Seidel, and Thierry Langer. A compact review of molecular property prediction with graph neural networks. Drug Discovery Today: Technologies, 37:1–12, 2020.
[288] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In International conference on machine learning, pages 6861–6871. PMLR, 2019.
[289] Man Wu, Shirui Pan, Chuan Zhou, Xiaojun Chang, and Xingquan Zhu. Unsupervised domain adaptive graph convolutional networks. In The Web Conference 2020, pages 1457–1467. ACM / IW3C2, 2020.
[290] Qitian Wu, Hengrui Zhang, Junchi Yan, and David Wipf. Handling distribution shifts on graphs: An invariance perspective. arXiv preprint arXiv:2202.02466, 2022.
[291] Wei Wu, Bin Li, Chuan Luo, and Wolfgang Nejdl. Hashing-accelerated graph neural networks for link prediction. In Proceedings of the Web Conference 2021, pages 2910–2920, 2021.
[292] Xinyi Wu, Zhengdao Chen, William Wang, and Ali Jadbabaie. A non-asymptotic analysis of oversmoothing in graph neural networks. arXiv preprint arXiv:2212.10701, 2022.
[293] Jun Xia, Chengshuai Zhao, Bozhen Hu, Zhangyang Gao, Cheng Tan, Yue Liu, Siyuan Li, and Stan Z. Li. Mole-BERT: Rethinking pre-training graph neural networks for molecules. In The Eleventh International Conference on Learning Representations, 2023.
[294] Han Xie, Jing Ma, Li Xiong, and Carl Yang. Federated graph classification over non-iid graphs. Advances in Neural Information Processing Systems, 34, 2021.
[295] Jiarong Xu, Renhong Huang, Xin Jiang, Yuxuan Cao, Carl Yang, Chunping Wang, and Yang Yang. Better with less: A data-active perspective on pre-training graph neural networks. Advances in Neural Information Processing Systems, 36:56946–56978, 2023.
[296] Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. What can neural networks reason about? ICLR 2020, 2020.
[297] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations.
[298] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
[299] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.
[300] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International conference on machine learning, pages 5453–5462. PMLR, 2018.
[301] Minghao Xu, Hang Wang, Bingbing Ni, Hongyu Guo, and Jian Tang. Self-supervised graph-level representation learning with local and global structure. In International Conference on Machine Learning, pages 11548–11558. PMLR, 2021.
[302] Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations, 2022.
[303] Yujun Yan, Milad Hashemi, Kevin Swersky, Yaoqing Yang, and Danai Koutra. Two sides of the same coin: Heterophily and oversmoothing in graph convolutional neural networks. In 2022 IEEE International Conference on Data Mining (ICDM), pages 1287–1292. IEEE, 2022.
[304] Ruichao Yang, Xiting Wang, Yiqiao Jin, Chaozhuo Li, Jianxun Lian, and Xing Xie. Reinforcement subgraph reasoning for fake news detection. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2253–2262, 2022.
[305] Shiqi Yang, Joost van de Weijer, Luis Herranz, Shangling Jui, et al. Exploiting the intrinsic neighborhood structure for source-free domain adaptation. Advances in Neural Information Processing Systems, 34:29393–29405, 2021.
[306] Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, and Shangling Jui. Unsupervised domain adaptation without source data by casting a bait. Computer Vision and Pattern Recognition, abs/2010.12427, 2020.
[307] Shuwen Yang, Guojie Song, Yilun Jin, and Lun Du. Domain adaptive classification on heterogeneous information networks. In International Joint Conference on Artificial Intelligence, pages 1410–1416. ijcai.org, 2020.
[308] Hongbin Ye, Ningyu Zhang, Hui Chen, and Huajun Chen. Generative knowledge graph construction: A review. arXiv preprint arXiv:2210.12714, 2022.
[309] Haoteng Yin, Muhan Zhang, Jianguo Wang, and Pan Li. Surel+: Moving from walks to sets for scalable subgraph-based graph representation learning. arXiv preprint arXiv:2303.03379, 2023.
[310] Haoteng Yin, Muhan Zhang, Yanbang Wang, Jianguo Wang, and Pan Li. Algorithm and system co-design for efficient subgraph-based graph representation learning. arXiv preprint arXiv:2202.13538, 2022.
[311] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 974–983, 2018.
[312] Jiaxuan You, Jonathan M Gomes-Selman, Rex Ying, and Jure Leskovec. Identity-aware graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 10737–10745, 2021.
[313] Jiaxuan You, Rex Ying, and Jure Leskovec. Position-aware graph neural networks. In International conference on machine learning, pages 7134–7143. PMLR, 2019.
[314] Yuning You, Tianlong Chen, Yang Shen, and Zhangyang Wang. Graph contrastive learning automated. In International Conference on Machine Learning, pages 12121–12132. PMLR, 2021.
[315] Zebin You, Yong Zhong, Fan Bao, Jiacheng Sun, Chongxuan Li, and Jun Zhu. Diffusion models and semi-supervised learners benefit mutually with few labels. Advances in Neural Information Processing Systems, 36, 2024.
[316] Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.
[317] Seongjun Yun, Seoyoon Kim, Junhyun Lee, Jaewoo Kang, and Hyunwoo J Kim. Neo-gnns: Neighborhood overlap-aware graph neural networks for link prediction. Advances in Neural Information Processing Systems, 34:13683–13694, 2021.
[318] Yuning You, Tianlong Chen, Zhangyang Wang, and Yang Shen. Graph domain adaptation via theory-grounded spectral regularization. In The Eleventh International Conference on Learning Representations, 2023.
[319] Guangyao Zhai, Evin Pinar Örnek, Shun-Cheng Wu, Yan Di, Federico Tombari, Nassir Navab, and Benjamin Busam. Commonscenes: Generating commonsense 3d indoor scenes with scene graphs. arXiv preprint arXiv:2305.16283, 2023.
[320] Guangyao Zhai, Evin Pınar Örnek, Shun-Cheng Wu, Yan Di, Federico Tombari, Nassir Navab, and Benjamin Busam. Commonscenes: Generating commonsense 3d indoor scenes with scene graphs. Advances in Neural Information Processing Systems, 36, 2024.
[321] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12104–12113, 2022.
[322] Bohang Zhang, Jingchu Gai, Yiheng Du, Qiwei Ye, Di He, and Liwei Wang. Beyond weisfeiler-lehman: A quantitative framework for GNN expressiveness. In The Twelfth International Conference on Learning Representations, 2024.
[323] Duo Zhang, Xinzijian Liu, Xiangyu Zhang, Chengqian Zhang, Chun Cai, Hangrui Bi, Yiming Du, Xuejian Qin, Jiameng Huang, Bowen Li, Yifan Shan, Jinzhe Zeng, Yuzhi Zhang, Siyuan Liu, Yifan Li, Junhan Chang, Xinyan Wang, Shuo Zhou, Jianchuan Liu, Xiaoshan Luo, Zhenyu Wang, Wanrun Jiang, Jing Wu, Yudi Yang, Jiyuan Yang, Manyi Yang, Fu-Qiang Gong, Linshuang Zhang, Mengchao Shi, Fu-Zhi Dai, Darrin M. York, Shi Liu, Tong Zhu, Zhicheng Zhong, Jian Lv, Jun Cheng, Weile Jia, Mohan Chen, Guolin Ke, Weinan E, Linfeng Zhang, and Han Wang. Dpa-2: Towards a universal large atomic model for molecular and material simulation. arXiv preprint arXiv:2312.15492, 2023.
[324] Hengrui Zhang, Qitian Wu, Junchi Yan, David Wipf, and Philip S Yu. From canonical correlation analysis to self-supervised graph neural networks. Advances in Neural Information Processing Systems, 34:76–89, 2021.
[325] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. Advances in neural information processing systems, 31, 2018.
[326] Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. Labeling trick: A theory of using graph neural networks for multi-node representation learning. Advances in Neural Information Processing Systems, 34:9061–9073, 2021.
[327] Shengyu Zhang, Kun Kuang, Jiezhong Qiu, Jin Yu, Zhou Zhao, Hongxia Yang, Zhongfei Zhang, and Fei Wu. Stable prediction on graphs with agnostic distribution shift. arXiv preprint arXiv:2110.03865, 2021.
[328] Shichang Zhang, Yozen Liu, Yizhou Sun, and Neil Shah. Graph-less neural networks: Teaching old mlps new tricks via distillation. ArXiv, abs/2110.08727, 2021.
[329] Yanci Zhang, Yutong Lu, Haitao Mao, Jiawei Huang, Cien Zhang, Xinyi Li, and Rui Dai. Company competition graph. arXiv preprint arXiv:2304.00323, 2023.
[330] Yizhou Zhang, Guojie Song, Lun Du, Shuwen Yang, and Yilun Jin. Dane: Domain adaptive network embedding. International Joint Conference on Artificial Intelligence, abs/1906.00684:4362–4368, 2019.
[331] Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization. arXiv preprint arXiv:2305.19420, 2023.
[332] He Zhao, Lan Du, and Wray Buntine. Leveraging node attributes for incomplete relational data. In International conference on machine learning, pages 4072–4081. PMLR, 2017.
[333] Jianan Zhao, Le Zhuo, Yikang Shen, Meng Qu, Kai Liu, Michael Bronstein, Zhaocheng Zhu, and Jian Tang. Graphtext: Graph reasoning in text space. arXiv preprint arXiv:2310.01089, 2023.
[334] Qifang Zhao, Weidong Ren, Tianyu Li, Xiaoxiao Xu, and Hong Liu. Graphgpt: Graph learning with generative pre-trained transformers. arXiv preprint arXiv:2401.00529, 2023.
[335] Tong Zhao, Gang Liu, Daheng Wang, Wenhao Yu, and Meng Jiang. Learning from counterfactual links for link prediction. In International Conference on Machine Learning, pages 26911–26926. PMLR, 2022.
[336] Shuxin Zheng, Jiyan He, Chang Liu, Yu Shi, Ziheng Lu, Weitao Feng, Fusong Ju, Jiaxi Wang, Jianwei Zhu, Yaosen Min, He Zhang, Shidi Tang, Hongxia Hao, Peiran Jin, Chi Chen, Frank Noé, Haiguang Liu, and Tie-Yan Liu. Towards predicting equilibrium distributions for molecular systems with deep learning. arXiv preprint arXiv:2306.05445, 2023.
[337] Wenqing Zheng, Edward W Huang, Nikhil Rao, Zhangyang Wang, and Karthik Subbian. You only transfer what you share: Intersection-induced graph transfer learning for link prediction. arXiv preprint arXiv:2302.14189, 2023.
[338] Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, and Yin Li. Learning to generate scene graph from natural language supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1823–1834, 2021.
[339] Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. Predicting missing links via local information. The European Physical Journal B, 71:623–630, 2009.
[340] Jiong Zhu, Ryan A Rossi, Anup Rao, Tung Mai, Nedim Lipka, Nesreen K Ahmed, and Danai Koutra. Graph neural networks with heterophily. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11168–11176, 2021.
[341] Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. Advances in Neural Information Processing Systems, 33:7793–7804, 2020.
[342] Meiqi Zhu, Xiao Wang, Chuan Shi, Houye Ji, and Peng Cui. Interpreting and unifying graph neural networks with an optimization framework. In Proceedings of the Web Conference 2021, pages 1215–1226, 2021.
[343] Qi Zhu, Natalia Ponomareva, Jiawei Han, and Bryan Perozzi. Shift-robust gnns: Overcoming the limitations of localized graph training data. Advances in Neural Information Processing Systems, 34:27965–27977, 2021.
[344] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131, 2020.
[345] Zhaocheng Zhu, Xinyu Yuan, Louis-Pascal Xhonneux, Ming Zhang, Maxime Gazeau, and Jian Tang. Learning to efficiently propagate for reasoning on knowledge graphs. arXiv preprint arXiv:2206.04798, 2022.
[346] Zhaocheng Zhu, Zuobai Zhang, Louis-Pascal Xhonneux, and Jian Tang. Neural bellman-ford networks: A general graph neural network framework for link prediction. Advances in Neural Information Processing Systems, 34:29476–29490, 2021.
[347] Lixin Zou, Haitao Mao, Xiaokai Chu, Jiliang Tang, Wenwen Ye, Shuaiqiang Wang, and Dawei Yin. A large scale search dataset for unbiased learning to rank. Advances in Neural Information Processing Systems, 35:1127–1139, 2022.