ENHANCING LINK PREDICTION ON GRAPHS: A MULTIFACETED APPROACH By Harry Shomer A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy 2025 ABSTRACT Graphs are a common way of representing real-world structured data. A graph is composed of nodes connected with one another via edges (i.e., “links”), where a link models how such nodes are related to one another. Due to the prevalence of graph-structured data, machine learning on graph data has become very popular. Link prediction, which attempts to predict unseen links in a graph, is a fundamental task on graphs. Link prediction has a multitude of real-world applications, including in recommender systems, knowledge graphs, and biology. In recent years, a flurry of methods have been introduced that make use of graph neural networks (GNNs) for this task. However, we find that multiple limitations impede our ability to create effective link prediction models that can perform in real- world settings. First, we find that the current method of evaluating link prediction models is both unrealistic and too easy, resulting in inflated model performance that doesn’t reflect real-world performance. Second, we observe that current methods are limited in their ability to model various patterns in link formation. This poses a challenge in real-world datasets where links can form for a number of reasons. Third, efficiency is an important concern in link prediction, as effective models are often expensive to run. Lastly, we find that link prediction models on knowledge graphs suffer from degree bias, where poor representations are learnt for lower-degree nodes, leading to subpar performance. In this thesis, we first uncover the root cause of these fundamental limitations. I will then introduce our attempts to combat these problems, through the design of new evaluation strategies and new model design. Through the introduction of these new approaches, we help promote better use of link prediction in more realistic scenarios that can occur in the wild. To my parents and family for their support. iii ACKNOWLEDGEMENTS During my PhD, I have been lucky to have been mentored and supported by many different people. First, I’d like to thank my advisor Dr. Jiliang Tang, for his advice, encouragement, and support throughout my five years at MSU. In many ways, I got incredibly lucky to have stumbled into having Dr. Tang as my advisor. Before joining Dr. Tang’s lab, I had no prior research experience. In fact, I knew absolutely nothing about Graph Machine Learning. Due to my lack of research experience, I had a lot of learning to do. Nonetheless, Dr. Tang was very patient with me, standing by me as I struggled to get my first paper accepted (which didn’t occur until the middle of my third year). He has imparted onto me an immeasurable amount of wisdom, not only in the area of research but in mentoring and resilience. I will be forever grateful for his help and support in helping shape me into the researcher I am today. I would like to express my gratitude to my PhD committee members: Dr. Tai-Quan "Winson" Peng, Dr. Kevin Liu, and Dr. Neil Shah. Their helpful questions, comments, and encouragement were greatly appreciated. I have been lucky to have had many amazing and supportive colleagues during my PhD. First, I am deeply thankful to the many colleagues and labmates in the Data Science & Engineering (DSE) lab at MSU. This includes: Dr. Yao Ma, Dr. Hamid Karimi, Dr. Jamell Dacon, Dr. Wenqi Fan, Dr. Tyler Derr, Yaxin Li, Dr. Haochen Liu, Dr. Hua Liu, Dr. Xiaorui Liu, Dr. Wei Jin, Dr. Wentao Wang, Dr. Xiaoyang Wang, Dr. Yiqi Wang, Dr. Zhiwei Wang, Dr. Han Xu, Dr. Xiangyu Zhao, Dr. Hongzhi Wen, Pengfei He, Dr. Haitao Mao, Dr. Juanhui Li, Jiayuan Ding, Haoyu Han, Yingqian Cui, Jie Ren, Kaiqi Yang, Hang Li, Yuxuan Wan, Zhikai Chen, Kai Guo, Wenzhuo Tang, Dr. Remy Liu, Yuping Lin, Jay Revolinsky, Yu Song, Jingzhe Liu, Xinnan Dai, Hanbing Wang, Shenglai Zeng, Li Ma, Shen Dong, Quang Truong, Bingheng Li, Bo Wang, Yucheng Chu. In particular, I’d like to emphasize Dr. Yao Ma and Dr. Wei Jin, who both provided valuable encouragement and advice as mentors during my first couple of years at MSU. I’d also like to extend my thanks to my other collaborators, including: Dr. Geri Skenderi, Dr. Bo Wu, Dr. Jiejun Xu, Dr. Neil Shah, Dr. Yu Wang, Dr. Hui Liu, Jiawei Xu, Dr. Yasemin Copur-Gencturk, Dr. Zhigang Hua, Dr. Tong iv Zhao, Dr. Amin Javari, Dr. Charu C. Aggarawl, and Dr. Shaylynn Crum-Dacon. I would also like to give a special thank you to the members of the Shiu Lab at MSU. I spent a lot of time at the Shiu Lab and am incredibly appreciative for being considered a “honorary labmate”. In particular I’d to thank Dr. Shinhan Shiu and Dr. Melissa Lehti-Shiu for their support and for inviting me to their house for many lab parties. I’d also like to thank the various members of the lab, including: Kenia Segura Abá, Brianna Brown, Thilanka Ranaweera, Edmond Anderson, Dr. Huan Chen, and Dr. Jyothi Kumar. I am beyond grateful to consider all of you my friends. Last, but not least, I’d like give my upmost thanks to my family and friends for their unconditional love and support. v TABLE OF CONTENTS CHAPTER 1 1.1 Motivation . . 1.2 Contributions . . 1.3 Outline . INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 2 3 CHAPTER 2 EVALUATING GRAPH NEURAL NETWORKS FOR LINK PREDICTION: CURRENT PITFALLS AND NEW BENCHMARKING . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Introduction . 5 8 2.2 Preliminaries 2.3 Fair Comparison Under the Existing Setting . . . . . . . . . . . . . . . . . . . 10 2.4 New Evaluation Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 3 Introduction . LPFORMER: AN ADAPTIVE GRAPH TRANSFORMER FOR LINK PREDICTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 . 3.1 3.2 Background . . . 23 3.3 The Proposed Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 4 Introduction . TOWARD DEGREE BIAS IN EMBEDDING-BASED KNOWLEDGE GRAPH COMPLETION . . . . . . . . . . . . . . . . . 41 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 . . 4.1 . . 43 4.2 Related Work . . . . 44 4.3 Preliminary Study . 4.4 The Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.5 Regularizing Effect of KG-Mixup . . . . . . . . . . . . . . . . . . . . . . . . 52 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.6 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 5 DISTANCE-BASED PROPAGATION FOR EFFICIENT KNOWLEDGE GRAPH REASONING . . . . . . . . . . . . . . . . . . 62 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.1 Introduction . 5.2 Preliminary . . 63 5.3 The Proposed Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 . 5.4 Experiment 5.5 Related Work . . 77 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 . 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 6 CONCLUSION AND FUTURE DIRECTIONS . . . . . . . . . . . . . . 79 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 vi APPENDIX A EVALUATING GRAPH NEURAL NETWORKS FOR LINK PREDICTION: CURRENT PITFALLS AND NEW BENCHMARKING . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 APPENDIX B LPFORMER: AN ADAPTIVE GRAPH TRANSFORMER FOR LINK PREDICTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 APPENDIX C TOWARD DEGREE BIAS IN EMBEDDING-BASED KNOWLEDGE GRAPH COMPLETION . . . . . . . . . . . . . . . . 122 APPENDIX D DISTANCE-BASED PROPAGATION FOR EFFICIENT KNOWLEDGE GRAPH REASONING . . . . . . . . . . . . . . . . . 127 vii CHAPTER 1 INTRODUCTION 1.1 Motivation Graphs are ubiquitous data structures, being used to represent data across many domains. Recent work in Graph ML has shown tremendous promise in modeling graphs [55] in many real- world applications such as in the sciences and finance. One common downstream task on graphs is link prediction (LP), which attempts to predict whether two nodes in a graph are connected. LP has a wide span of real-world applications ranging from recommender systems [46], biological networks [57], and knowledge graph completion [100]. However, several key challenges prevent wider adoption of such models: (1) Bias & Efficiency: Graph ML models are known to often suffer from bias [103] and poor efficiency [20]. Fixing these problems, requires understanding their root causes at both the model and data level. (2) Performance on Diverse Data: To be viable for widespread use, graph-based models must be performant on data from a diverse set of domains [72]. This is necessary to promote widespread adoption of those models. (3) Real-World Evaluation: To correctly judge the effectiveness of different models, model evaluation must correspond to real-world model usage. However, this is not always the case [62], resulting in a mismatch between offline evaluation and actual usage. All these problems come together to present a set of unique challenges to using LP in realistic settings. To help solve the varied and multifaceted set of challenges faced by LP, we propose to use data- centric approach. By data-centric, we mean that the problems are in fact related to the data itself, and can be solved through an approach that emphasizes the data. In particular, (1) By analyzing the data, we can glean valuable insights into the core problems that effect performance, efficiency, and bias. (2) These insights can then be leveraged to motivate the design of more scalabale and effective models. (3) Furthermore, by understanding the data, we can determine if our evaluation is aligned with actual use of these models. Through this, we can then introduce newer evaluation procedures or benchmark datasets that better reflect realistic model performance. Next, I describe how this framework is applied to enrich our ability to perform link prediction 1 on graphs. 1.2 Contributions My research applies the aforementioned framework to enhancing the task of link prediction (LP). Specifically, my work concerns advancing LP in two main areas. The first is LP on uni- relational graphs from various domains (this is typically just referred to as “link prediction” in literature, a convention we adhere to). The second is LP on knowledge graphs (KGs), which is generally referred to as KG completion (KGC). KGs are a type of multi-relational graph that encodes facts as links. As such, LP on KGs is analogous to predicting new facts. For each task, we use the knowledge gained from understanding the data to promote more effective models and evaluation procedures. Our contributions are as follows: 1. We observe that there are several issues that plague the existing evaluation of models for LP. First, there is a lack of unified settings used across papers, resulting in different evaluation metrics, data splits, and training procedures being used. This complicates any one-to-one comparison between models. Second, we find that the previous evaluation setting is unrealistic and doesn’t correspond to real-world evaluation of LP. To address these issues, we first conduct a benchmarking of all prominent models and datasets under the same settings, to better compare performance. We then propose a new evaluation setting, HeaRT, which more closely aligns with how LP is conducted. We find that the previous evaluation procedure greatly overestimates the performance of LP methods. 2. Work has shown that multiple types of factors need to be used to adequately perform LP [72]. However, most existing methods are either unable to model all such factors. Furthermore, those that can are prohibitively expensive. In pursuit of an expressive and scalable model, we propose LPFormer, which efficiently and adaptively stresses different LP factors for each link being predicted (i.e., the “target link”). We achieve this by considering the relationship between both nodes in the target link in the context of the graph. The relationship is extracted via a learnable attention mechanism between the target link and the nodes in the graph. Using attention allows for the factors to vary by target link, thereby stressing different underlying factors for different 2 target links. We demonstrate the ability of LPFormer on both the original and HeaRT evaluations settings. 3. A common concern when learning on graphs is degree bias. This causes graph algorithms to learn poorer representations for nodes of lower degree, thereby degrading their performance on downstream tasks. This is harmful as the final learnt model will be biased against nodes of lower degree. Motivated by this problem we asked – how does degree bias effect KG completion? We observed that unlike other graph tasks, it is not the node degree that best correlates with performance on KGs but rather the number of links where the relation and the node being predicted co-occur together. To obviate this special type of degree bias, we propose a model- agnostic data augmentation technique, KG-Mixup, which creates additional synthetic links for those samples of lower degree and can empirically improve performance. 4. Recently, GNN-based models that propagate path information, “path-based GNNs”, have gained prominence for KG reasoning. But, to achieve satisfying performance, they require a large number of layers, rendering them inefficient in real-world applications. However, we observe that such models tend to propagate messages that either have redundant path information or contain no information (i.e., “empty”). From these observations, we propose a new model TAGNet, that works by reducing the number of redundant or empty messages propagated via a set of constraints on the messages passed. We observe that TAGNet can propagate up to 90% less messages than baseline models, while achieving similar or even better performance than other models. 1.3 Outline In Chapter 2, we observe that the current evaluation setting for LP is unrealistic. We then propose a new method – HeaRT, which more closely resembles LP in real-world application. In Chapter 3, we introduce a new method for LP, which uses a Transformer architecture to model the various underlying link-formation mechanisms. In Chapter 4, we study the problem of degree bias in KG completion. We show that it manifests itself differently from other graph tasks. We then propose a simple data augmentation strategy, KG-Mixup, to help alleviate the bias. In Chapter 5, 3 we aim to improve the efficiency of path-based GNNs for KG completion. We show that existing methods tend to propagate many unnecessary messages, a propose a new method TAGNet to remove them. Lastly, we conclude the dissertation in Chapter 6 with a discussion of future research directions. 4 CHAPTER 2 EVALUATING GRAPH NEURAL NETWORKS FOR LINK PREDICTION: CURRENT PITFALLS AND NEW BENCHMARKING 2.1 Introduction The task of link prediction is to determine the existence of an edge between two unconnected nodes in a graph. Existing link prediction algorithms attempt to estimate the proximity of different pairs of nodes in the graph, where node pairs with a higher proximity are more likely to interact [69]. Link prediction is applied in many different domains including social networks [27], biological networks [57], and recommender systems [46]. Graph neural networks (GNNs) [125] have gained prominence in recent years with many new frameworks being proposed for a variety of different tasks. Corresponding to the rise in popularity of GNNs, there has been a number of studies that attempt to critically examine the effectiveness of different GNNs on various tasks. This can be seen for the task of node classification [101], graph classification [34], knowledge graph completion (KGC) [95, 3, 105], and others [32]. However, despite a number of new GNN-based methods being proposed [139, 20, 136, 120] for link prediction, there is currently no work that attempts to carefully examine recent advances in link prediction methods. Upon examination, we find that there are several pitfalls in regard to model evaluation that impede our ability to properly evaluate current methods. This includes: 1. Lower than Actual Performance. We observe that the current performance of multiple models is underreported. For some methods, such as standard GNNs, this is due to poor hyperparameter tuning. Once properly tuned, they can even achieve the best overall performance on some metrics (see SAGE [42] in Table 2.1). Furthermore, for other methods like Neo-GNN [136] we can achieve around an 8.5 point increase in Hits@50 on ogbl-collab relative to the originally reported performance. This results in Neo-GNN achieving the best overall performance on ogbl-collab in our study (see Table 2.2). Such problems obscure the true performance of different models, making it difficult to draw reliable conclusions from the current results. 2. Lack of Unified Settings. For Cora, Citeseer, and Pubmed datasets [131], there exists no 5 unified data split and evaluation metrics used for each individually. For the data split, some works [115, 153] use a single fixed train/valid/test split with percentages 85/5/10%. More recent works [20, 120] use 10 random splits of size 70/10/20%. In terms of the evaluation metrics, some studies [20, 120] use ranking-based metrics such as MRR or Hits@K while others [54, 153] report the area under the curve (AUC). This is despite multiple studies that argue that AUC is a poor metric for evaluating link prediction [129, 47]. Additionally, for both the planetoid (i.e., Cora, Citeseer and Pubmed) and ogbl-collab datasets, some methods incorporate the validation edges during testing [20, 43], while others [136, 120] don’t. This lack of a unified setting makes it difficult to draw a comparison and hampers our ability to determine which methods perform best on these datasets. 3. Unrealistic Evaluation Setting. During the evaluation, we are given a set of true samples (i.e., positive samples) and a set of false samples (i.e., negative samples). We are tasked with learning a classifier 𝑓 that assigns a higher probability to the positive samples than the negatives. The current evaluation setting uses the same set of randomly selected negative samples for each positive sample. We identify two potential problems with the current evaluation procedure. (1) It is not aligned with real-world settings. In a real-world scenario, we typically care about predicting links for a specific node. For example, in friend recommendations, we aim to recommend friends for a specific user 𝑢. To evaluate such models for 𝑢, we strive to rank node pairs including 𝑢. However, this does not hold in the current setting as 𝑢 is not included in most of the negative samples. (2) The current evaluation setting makes the task too easy. As such, it may not reflect the model performance in real-world applications. This is because the nodes in a randomly selected negative “node pair” are likely to be unrelated to each other. As shown in Figure 2.1, almost all negative samples in the test data have no common neighbors, a typically strong heuristic, making them trivial to classify them. To account for these issues, we propose to first conduct a fair and reproducible evaluation among current link prediction methods under the existing evaluation setting. We then design a new evaluation strategy that is more aligned with a real-world setting and detail our results. Our key 6 (a) ogbl-collab (b) ogbl-ppa (c) ogbl-citation2 Figure 2.1 Common neighbor distribution for the positive and negative test samples for the ogbl-collab, ogbl-ppa, and ogbl-citation2 datasets under the existing evaluation setting. contributions are summarized below: 1. Reproducible and Fair Comparison. We conduct a fair comparison of different models across multiple common datasets. To ensure a fair comparison, we tune all models on the same set of hyperparameters. We further evaluate different models using multiple types of evaluation metrics. For the Planetoid datasets [131], we further use a unified data split to facilitate a point of comparison between models. To the best of our knowledge, there are no recent efforts to comprehensively benchmark link prediction methods (several exist for KGC [105, 3, 95]). Furthermore, we open-source the implementation in our analysis to enable others in their analyses. 2. New Evaluation Setting. We recognize that the current negative sampling strategy used in evaluation is unrealistic and easy. To counter these issues, we first use a more realistic setting of tailoring the negatives to each positive sample. This is achieved by restricting them to be corruptions of the positive sample (i.e., containing one of its two nodes as defined in Eq. (2.3)). Given the prohibitive cost of utilizing all possible corruptions, we opt instead to only rank against 𝐾 negatives for each positive sample. In order to choose the most relevant and difficult corruptions, we propose a Heuristic Related Sampling Technique (HeaRT), which selects them based on a combination of multiple heuristics. This creates a more challenging task than the previous evaluation strategy and allows us to better assess the capabilities of current methods. 7 The rest of the paper is structured as follows. In Section 2.2 we introduce the models, datasets, and settings used for conducting a fair comparison between methods. In Section 2.3 we show the results of the fair comparison under the existing evaluation setting and discuss our main observations. Lastly, in Section 2.4 we introduce our new evaluation setting. We then detail and discuss the performance of different methods using our new setting. 2.2 Preliminaries 2.2.1 Task Formulation In this section we formally define the task of link prediction. Let G = {V, E} be a graph where V and E are the set of nodes and edges in the graph, respectively. Furthermore, let 𝑋 ∈ R |𝑉 |×𝑑 be a set of 𝑑-dimensional features for each node. Link prediction aims to predict the likelihood of a link existing between two nodes given the structural and feature information. For a pair of nodes 𝑢 and 𝑣, the probability of a link existing, 𝑝(𝑢, 𝑣), is therefore given by: 𝑝(𝑢, 𝑣) = 𝑝(𝑢, 𝑣 | G, 𝑋). (2.1) Traditionally, 𝑝(𝑢, 𝑣) was estimated via non-learnable heuristic methods [74, 65]. More recently, methods that use learnable parameters have gained popularity [139, 20]. These methods attempt to estimate 𝑝(𝑢, 𝑣) via a learnable function 𝑓 such that: 𝑝(𝑢, 𝑣) = 𝑓 (𝑢, 𝑣 | G, 𝑋, Θ), (2.2) where Θ represents a set of learnable parameters. A common choice of 𝑓 are graph neural networks [55]. In the next subsection we detail the various link prediction methods used in this study. 2.2.2 Link Prediction Methods In this section we given an overview of the different methods used in this study. Conventional methods [74, 65] often exploit hand-craft graph structural properties (i.e., heuristics) between node pairs. GNNs attempt to learn the structural information to facilitate link prediction [141, 120, 20]. Given the strong performance of pairwise-based heuristics [136, 120], some recent works use both GNNs and pairwise information, demonstrating strong performance. 8 For our study, we consider both traditional and state-of-the-art GNN-based models. They can be roughly organized into four categories. 1) Heuristic methods: Common Neighbor (CN) [81], Adamic Adar (AA) [2], Resource Allocation (RA) [151], Shortest Path [65], and Katz [49]. These methods define a score to indicate the link existence based on the graph structure. Among them, CN, AA, and RA are based on the common neighbors, while Shortest Path and Katz are based on the path information. 2) Embedding methods: Matrix factorization (MF) [74], Multilayer Perceptron (MLP) and Node2Vec [39]. These methods are trained to learn low-dimensional node embeddings that are used to predict the likelihood of node pairs existing. 3) GNN methods: GCN [132], GAT [115], SAGE [42], and GAE [54]. These methods attempt to integrate the multi-hop graph structure based on the message passing paradigm. 4) GNN + Pairwise Information methods: Standard GNN methods, while powerful, are not able to capture link-specific information [141]. As such, works have been proposed that augment GNN methods by including additional information to better capture the relation between the nodes in the link we are predicting. SEAL [141], BUDDY [20], and NBFNet [153] use the subgraph features. Neo-GNN [136], NCN [120], and NCNC [120] are based on common neighbor information. Lastly, PEG [117] uses the positional encoding derived from the graph structure. 2.2.3 Datasets and Experimental Settings In this section we summarize the datasets and evaluation and training settings. We note that the settings depend on the specific dataset. More details are given in Appendix A.3. Datasets. We limit our experiments to homogeneous graphs, which are the most commonly used datasets for link prediction. This includes the small-scale datasets, i.e., Cora, Citeseer, Pubmed [131], and large-scale datasets in the OGB benchmark [43], i.e., ogbl-collab, ogbl-ddi, ogbl- ppa, and ogbl-citation2. We summarize the statistics and split ratio of each dataset in Appendix A.3. Metrics. For evaluation, we use both the area under the curve (AUC) and ranking-based metrics, i.e., mean reciprocal rank (MRR) and Hits@K. For Cora, Citeseer, and Pubmed we adopt 𝐾 ∈ {1, 3, 10, 100}. We note that 𝐾 = 100 is reported in some recent works [20, 120]). However due to the small number of negatives used during evaluation (e.g., ≈ 500 for Cora and Citeseer) 9 𝐾 = 100 is likely not informative. For the OGB datasets, we adopt 𝐾 ∈ {20, 50, 100} to keep consistent with the original study [43]. Please see Appendix A.2.1 for the formal definitions of the various evaluation metrics. Hyperparameter Ranges. We conduct a grid hyperparameter search across a comprehensive range of values. For Cora, Citeseer, and Pubmed this includes: learning rate (0.01, 0.001), dropout (0.1, 0.3, 0.5), weight decay (1e-4, 1e-7, 0), number of model layers (1, 2, 3), number of prediction layers (1, 2, 3), and the embedding size (128, 256). Due to the large size of the OGB datasets, it’s infeasible to tune over such a large range. Therefore, following the most commonly used settings among published hyperparameters, we fix the weight decay to 0, the number of model and prediction layers to be 3, and the embedding size to be 256. The best hyperparameters are chosen based on the validation performance. We note that several exceptions exist to these ranges when they result in significant performance degradations (see Appendix A.3 for more details). We further follow the existing setting and only sample one negative sample per positive sample during training. Existing Evaluation Settings. In the evaluation stage, the same set of randomly sampled negatives are used for all positive samples. We note that one exception is ogbl-citation2, where they randomly sample 1000 negative samples per positive sample. For Cora, Citeseer, and Pubmed the number of negative samples is equal to the number of positive samples. For the OGB datasets, we use the existing fixed set of randomly chosen negatives found in [43]. Furthermore, for ogbl-collab we follow the existing protocol [43] and include the validation edges in the training graph during testing. This setting is adopted on ogbl-collab under both the existing and new evaluation setting. 2.3 Fair Comparison Under the Existing Setting In this section, we conduct a fair comparison among link prediction methods. This comparison is spurred by the multiple pitfalls noted in Section 5.1, which include lower-than-actual model performance, multiple data splits, and inconsistent evaluation metrics. These pitfalls hinder our ability to fairly compare different methods. To rectify this, we conduct a fair comparison adhering to the settings listed in section 2.2.3. 10 Table 2.1 Results on Cora, Citeseer, and Pubmed(%) under the existing evaluation setting. We highlight the results ranked first, second, and third as green, blue, and orange, respectively. Heuristic Embedding GNN GNN+Pairwise Info Models CN AA RA Shortest Path Katz Node2Vec MF MLP GCN GAT SAGE GAE SEAL BUDDY Neo-GNN NCN NCNC NBFNet PEG Cora Citeseer Pubmed MRR AUC MRR AUC MRR AUC 20.99 31.87 30.79 12.45 27.4 37.29 ± 8.82 14.29 ± 5.79 31.21 ± 7.90 32.50 ± 6.87 31.86 ± 6.08 37.83 ± 7.75 29.98 ± 3.21 26.69 ± 5.89 26.40 ± 4.40 22.65 ± 2.60 32.93 ± 3.80 29.01 ± 3.83 37.69 ± 3.97 22.76 ± 1.84 70.85 70.96 70.96 81.08 81.17 90.97 ± 0.64 80.29 ± 2.26 95.32 ± 0.37 95.01 ± 0.32 93.90 ± 0.32 95.63 ± 0.27 95.08 ± 0.33 90.59 ± 0.75 95.06 ± 0.36 93.73 ± 0.36 96.76 ± 0.18 96.90 ± 0.28 92.85 ± 0.17 94.46 ± 0.34 28.34 29.37 27.61 31.82 38.16 44.33 ± 8.99 24.80 ± 4.71 43.53 ± 7.26 50.01 ± 6.04 48.69 ± 7.53 47.84 ± 6.39 63.33 ± 3.14 39.36 ± 4.99 59.48 ± 8.96 53.97 ± 5.88 54.97 ± 6.03 64.03 ± 3.67 38.17 ± 3.06 56.12 ± 6.62 67.49 67.49 67.48 75.5 75.37 94.46 ± 0.59 75.92 ± 3.25 94.45 ± 0.32 95.89 ± 0.26 96.25 ± 0.20 97.39 ± 0.15 97.06 ± 0.22 88.52 ± 1.40 96.72 ± 0.26 94.89 ± 0.60 97.04 ± 0.26 97.65 ± 0.30 91.06 ± 0.15 96.15 ± 0.41 14.02 16.66 15.63 7.15 21.44 34.61 ± 2.48 19.29 ± 6.29 16.52 ± 4.14 19.94 ± 4.24 18.63 ± 7.75 22.74 ± 5.47 16.67 ± 0.19 38.06 ± 5.18 23.98 ± 5.11 31.45 ± 3.17 35.65 ± 4.60 25.70 ± 4.48 44.73 ± 2.12 21.05 ± 2.85 63.9 63.9 63.9 74.64 74.86 93.14 ± 0.18 93.06 ± 0.43 98.34 ± 0.10 98.69 ± 0.06 98.20 ± 0.07 98.87 ± 0.04 97.47 ± 0.08 97.77 ± 0.40 98.2 ± 0.05 98.71 ± 0.05 98.98 ± 0.04 99.14 ± 0.03 98.34 ± 0.02 96.97 ± 0.39 The results are split into two tables. The results for Cora, Citeseer, and Pubmed are shown in Table 2.1 and OGB in Table 2.2. For simplicity, we only present the AUC and MRR for Cora, Citeseer, and Pubmed. For OGB datasets, we include AUC and the original ranking metric reported in [43] to allow a convenient comparison (Hits@20 for ogbl-ddi, Hits@50 for ogbl-collab, Hits@100 for ogbl-ppa, and MRR for ogbl-citation2). We use “>24h" to denote methods that require more than 24 hours for either training one epoch or evaluation. OOM indicates that the algorithm requires over 50Gb of GPU memory. Since ogbl-ddi has no node features, we mark the MLP results with a “N/A". Additional results in terms of other metrics are presented in Appendix A.6. We have several noteworthy observations concerning the methods, the datasets, the evaluation settings, and the overall results. We highlight the main observations below. Observation 1: Better than Reported Performance. We find that for some models we are able to achieve superior performance compared to what is reported by recent studies. For instance, in our study Neo-GNN [136] achieves the best overall test performance on ogbl-collab with a Hits@50 of 66.13. In contrast, the reported performance in [136] is only 57.52, which would rank seventh under our current setting. This is because the original study [136] does not follow the standard setting 11 Table 2.2 Results on OGB datasets (%) under the existing evaluation setting. We highlight the results ranked first, second, and third as green, blue, and orange, respectively. Heuristic Embedding GNN GNN+Pairwise Info Models CN AA RA Shortest Path Katz Node2Vec MF MLP GCN GAT SAGE GAE SEAL BUDDY Neo-GNN NCN NCNC NBFNet PEG ogbl-collab ogbl-ddi ogbl-ppa Hits@50 AUC Hits@20 AUC Hits@100 AUC 61.37 64.17 63.81 46.49 64.33 49.06 ± 1.04 41.81 ± 1.67 35.81 ± 1.08 54.96 ± 3.18 55.00 ± 3.28 59.44 ± 1.37 OOM 63.37 ± 0.69 64.59 ± 0.46 66.13 ± 0.61 63.86 ± 0.51 65.97 ± 1.03 OOM 49.02 ± 2.99 17.73 82.78 18.61 82.78 6.23 82.78 0 96.51 17.73 90.54 34.69 ± 2.90 96.24 ± 0.15 23.50 ± 5.35 83.75 ± 1.77 95.91 ± 0.08 N/A 49.90 ± 7.23 97.89 ± 0.06 31.88 ± 8.83 97.11 ± 0.09 49.84 ± 15.56 98.08 ± 0.03 7.09 ± 6.02 OOM 25.25 ± 3.90 95.65 ± 0.29 29.60 ± 4.75 96.52 ± 0.40 20.95 ± 6.03 98.23 ± 0.05 97.83 ± 0.04 76.52 ± 10.47 98.20 ± 0.05 70.23 ± 12.11 OOM 94.45 ± 0.89 >24h 30.28 ± 4.92 95.2 95.43 96.51 59.07 95.2 99.78 ± 0.04 99.46 ± 0.10 N/A 99.86 ± 0.03 99.63 ± 0.21 99.96 ± 0.00 75.34 ±15.96 97.97 ± 0.19 99.81 ± 0.02 98.06 ± 2.00 99.97 ± 0.00 99.97 ± 0.01 >24h 99.45 ± 0.04 97.22 27.65 97.23 32.45 49.33 97.24 99.13 0 97.22 27.65 99.77 ± 0.00 26.24 ± 0.96 99.46 ± 0.10 28.4 ± 4.62 90.23 ± 0.00 0.45 ± 0.04 99.84 ± 0.03 29.57 ± 2.90 OOM OOM 99.82 ± 0.00 41.02 ± 1.94 OOM OOM 99.79 ± 0.02 48.80 ± 5.61 99.56 ± 0.02 47.33 ± 1.96 97.30 ± 0.14 48.45 ± 1.01 62.63 ± 1.15 99.95 ± 0.01 62.61 ± 0.76 99.97 ± 0.01 OOM OOM OOM OOM ogbl-citation2 MRR 74.3 75.96 76.04 >24h 74.3 45.04 ± 0.10 50.57 ± 12.14 38.07 ± 0.09 84.85 ± 0.07 OOM 83.06 ± 0.09 OOM 86.93 ± 0.43 87.86 ± 0.18 83.54 ± 0.32 89.27 ± 0.05 89.82 ± 0.43 OOM OOM of including validation edges in the graph during testing. This setting, as noted in Section 2.2.3, is used by all other methods on ogbl-collab. However it was omitted by [136], resulting in lower reported performance. Furthermore, on ogbl-citation2 [43], our results for the heuristic methods are typically around 75% MRR. This significantly outperforms previously reported results, which report an MRR of around 50% [141, 20]. The disparity arises as previous studies treat the ogbl- citation2 as a directed graph when applying heuristic methods. However, for GNN-based methods, ogbl-citation2 is typically converted to a undirected graph. We remedy this by also converting ogbl-citation2 to an undirected graph when computing the heuristics, leading to a large increase in performance. Furthermore, with proper tuning, conventional baselines like GCN [55] and GAE [54] generally exhibit enhanced performance relative to what was originally reported across all datasets. For example, we find that GAE can achieve the second best MRR on Citeseer and GCN the third best Hits@20 on ogbl-ddi. A comparison of the reported results and ours are shown in Table 2.3. We note that we report AUC for Cora, Citeseer, Pubmed as it was used in the original study. These observations suggest that the performance of various methods are better than what was reported in their initial publications. However, many studies [20, 120, 141] only report the original performance 12 Table 2.3 Comparison of ours and the reported results for GCN and GAE. GCN Reported Ours ogbl-collab Hits@50 47.14 ± 1.45 54.96 ± 3.18 ogbl-ppa Hits@100 18.67 ± 1.32 29.57 ± 2.90 ogbl-ddi Hits@20 37.07 ± 5.07 49.90 ± 7.23 ogbl-citation2 MRR 84.74 ± 0.21 84.85 ± 0.07 GAE Reported Ours Cora AUC 91.00 ± 0.01 95.08 ± 0.33 Citeseer AUC 89.5 ± 0.05 97.06 ± 0.22 Pubmed AUC 96.4 ± 0.00 97.47 ± 0.08 for comparison, which has the potential to lead to inaccurate conclusions. Observation 2: Divergence from Reported Results on ogbl-ddi. We observe that our results in Table 2.2 for ogbl-ddi differ from the reported results. Outside of GCN, which reports better performance, most other GNN-based methods report a lower-than-reported performance. For example, for BUDDY we only achieve a Hits@20 of 29.60 vs. the reported 78.51 (see Appendix A.4 for a comprehensive comparison among methods). We find that the reason for this difference depends on the method. BUDDY [20] reported 1 using 6 negatives per positive sample during training, leading to an increase in performance. Neo-GNN [136] first pretrains the GNN under the link prediction task, and then uses the pretrained model as the initialization for Neo-GNN.2 For a fair comparison among methods, we only use 1 negative per positive sample in training and we don’t apply the pretraining. For other methods, we find that a weak relationship between the validation and test performance complicates the tuning process, making it difficult to find the optimal hyperparameters. Please see Appendix A.5 for a more in-depth study and discussion. Observation 3: High Model Standard Deviation. The results in Tables 2.1 and 2.2 present the mean performance and standard deviation when training over 10 seeds. Generally, we find that for multiple datasets the standard deviation of the ranking metrics is often high for most models. For example, the standard deviation for MRR can be as high as 8.82, 8.96, or 7.75 for Cora, Citeseer, and Pubmed, respectively. Furthermore, on ogbl-ddi the standard deviation of Hits@20 reaches as high as 10.47 and 15.56. A high variance indicates unstable model performance. This makes it difficult to compare results between methods as the true performance lies in a larger range. This further complicates replicating model performance, as even large differences with the reported results may still fall within variance (see observation 2). Later in Section 2.4.3 we find that our 1https://github.com/melifluos/subgraph-sketching 2https://github.com/seongjunyun/Neo-GNNs 13 new evaluation can reduce the model variance for all datasets (see Table 2.6). This suggests that the high variance is related to the current evaluation procedure. Observation 4: Inconsistency of AUC vs. Ranking-Based Metrics. The AUC score is widely adopted to evaluate recent advanced link prediction methods [54, 153]. However, from our results in Tables 2.1 and 2.2 we observe that there exists a disparity between AUC and ranking-based metrics. In some cases, the AUC score can be high when the ranking metric is very low or even 0. For example, the Shortest Path heuristic records a Hits@K of 0 on ogbl-ppa. However, the AUC score is > 99%. Furthermore, even though RA records the third and fifth best performance on ogbl-ppa and ogbl-collab, respectively, it has a lower AUC score than Shortest Path on both. Previous works [47, 129] argued that AUC is not a proper metric for link prediction. This is due to the inapplicability of AUC for highly imbalanced problems [28, 99]. 2.4 New Evaluation Setting In this section, we introduce a new setting for evaluating link prediction methods. We first discuss the unrealistic nature of the current evaluation setting in Section 2.4.1. We then present our new evaluation setting in Section 2.4.2, which aims to align better with real-world scenarios. Lastly, in Section 2.4.3, we present and discuss the results based on our new evaluation setting. 2.4.1 Issues with the Existing Evaluation Setting The existing evaluation procedure for link prediction is to rank a positive sample against a set of 𝐾 randomly selected negative samples. The same set of 𝐾 negatives are used for all positive samples (with the exception of ogbl-citation2 which uses 1000 per positive sample). We demonstrate that there are multiple issues with this setting, making it difficult to properly evaluate the effectiveness of current models. Issue 1: Non-Personalized Negative Samples. The existing evaluation setting uses the same set of negative samples for all positive samples (outside of ogbl-citation2). This strategy, referred to as global negative sampling [123], is not a commonly sought objective. Rather, we are often more interested in predicting links that will occur for a specific node. Take, for example, a social network that connects users who are friends. In this scenario, we may be interested in recommending new 14 (a) Negative sample genera- tion for one positive sample. (b) Process of determining negative samples that contain a node 𝑎. Figure 2.2 Pipeline for generating the hard negative samples for a positive sample (a, b). friends to a user 𝑢. This requires learning a classifier 𝑓 that assigns a probability to a link existing. When evaluating this task, we want to rank links where 𝑢 connects to an existing friend above those where they don’t. For example, if 𝑢 is friends with 𝑎 but not 𝑏, we hope that 𝑓 (𝑢, 𝑎) > 𝑓 (𝑢, 𝑏). However, the existing evaluation setting doesn’t explicitly test for this. Rather it compares a true sample (𝑢, 𝑎) with a potentially unrelated negative sample, e.g., (𝑐, 𝑑). This is not aligned with the real-world usage of link prediction on such graphs. Issue 2: Easy Negative Samples. The existing evaluation setting randomly selects negative samples to use. However given the large size of most graphs (see Table A.1 in Appendix A.3), randomly sampled negatives are likely to choose two nodes that bear no relationship to each other. Such node pairs are trivial to classify. We demonstrate this by plotting the distribution of common neighbors (CN), a strong heuristic, for all positive and negative test samples in Figure 2.1. Almost all the negative samples contain no CNs, making them easy to classify. We further show that the same problem afflicts even the smaller datasets in Figure A.1 in Appendix A.1. These observations suggest that a more realistic evaluation strategy is desired. At the core of this challenge is which negative samples to use during evaluation. We discuss our design for solving this in the next subsection. 15 2.4.2 Heuristic Related Sampling Technique (HeaRT) In this subsection, we introduce new strategy for evaluating link prediction methods. To address the concerns outlined in Section 2.4.1, we design a new method for sampling negatives during evaluation. Our strategy, HeaRT, solves these challenges by: (a) personalizing the negatives to each sample and (b) using heuristics to select hard negative samples. This allows for the negative samples to be directly related to each positive sample while also being non-trivial. We further discuss how to ensure that the negative samples are both personalized and non-trivial for a specific positive sample. From our discussion in Section 2.4.1, we are motivated in personalizing the negatives to each positive sample. Since the positive samples in the current datasets are node pairs, we seek to personalize the negatives to both nodes in the positive sample. Extending our example in Section 2.4.1, this is analogous to restricting the negatives to contain one of the two users from the original friendship pair. As such, for a positive sample (𝑢, 𝑎), the negative samples will belong to the set: 𝑆(𝑢, 𝑎) = {(𝑢′, 𝑎) | 𝑢′ ∈ V} ∪ {(𝑢, 𝑎′) | 𝑎′ ∈ V}, (2.3) where V is the set of nodes. This is similar to the setting used for knowledge graph completion (KGC) [13] which uses all such samples for evaluation. However, one drawback of evaluating each positive sample against the entire set of possible corruptions is the high computational cost. To mitigate this issue we consider only utilizing a small subset of 𝑆(𝑢, 𝑎) during evaluation. The key challenge is how to generate a subset of 𝑆(𝑢, 𝑎). If we randomly sample from 𝑆(𝑢, 𝑎), we risk only utilizing easy negative samples. This is one of the issues of the existing evaluation setting (see Issue 2 in Section 2.4.1), whereby randomly selecting negatives, they unknowingly produce negative samples that are too easy. We address this by selecting the negative samples via a combination of multiple heuristics. Since heuristics typically correlate well with performance, we ensure that the negative samples will be non-trivial to classify. This is similar to the concept of candidate generation [41, 33], which only ranks a subset of candidates that are most likely to be true. 16 Table 2.4 Results on Cora, Citeseer, and Pubmed (%) under HeaRT. We highlight the results ranked first, second, and third as green, blue, and orange, respectively. Heuristic Embedding GNN GNN+Pairwise Info Models CN AA RA Shortest Path Katz Node2Vec MF MLP GCN GAT SAGE GAE SEAL BUDDY Neo-GNN NCN NCNC NBFNet PEG Cora Citeseer Pubmed MRR Hits@10 MRR Hits@10 MRR Hits@10 9.78 11.91 11.81 5.04 11.41 14.47 ± 0.60 6.20 ± 1.42 13.52 ± 0.65 16.61 ± 0.30 13.84 ± 0.68 14.74 ± 0.69 18.32 ± 0.41 10.67 ± 3.46 13.71 ± 0.59 13.95 ± 0.39 14.66 ± 0.95 14.98 ± 1.00 13.56 ± 0.58 15.73 ± 0.39 20.11 24.10 24.48 15.37 22.77 32.77 ± 1.29 15.26 ± 3.39 31.01 ± 1.71 36.26 ± 1.14 32.89 ± 1.27 34.65 ± 1.47 37.95 ± 1.24 24.27 ± 6.74 30.40 ± 1.18 31.27 ± 0.72 35.14 ± 1.04 36.70 ± 1.57 31.12 ± 0.75 36.03 ± 0.75 8.42 10.82 10.84 5.83 11.19 21.17 ± 1.01 7.80 ± 0.79 22.62 ± 0.55 21.09 ± 0.88 19.58 ± 0.84 21.09 ± 1.15 25.25 ± 0.82 13.16 ± 1.66 22.84 ± 0.36 17.34 ± 0.84 28.65 ± 1.21 24.10 ± 0.65 14.29 ± 0.80 21.01 ± 0.77 18.68 22.20 22.86 16.26 24.84 45.82 ± 2.01 16.72 ± 1.99 48.02 ± 1.79 47.23 ± 1.88 45.30 ± 1.3 48.75 ± 1.85 49.65 ± 1.48 27.37 ± 3.20 48.35 ± 1.18 41.74 ± 1.18 53.41 ± 1.46 53.72 ± 0.97 31.39 ± 1.34 45.56 ± 1.38 2.28 2.63 2.47 0.86 3.01 3.94 ± 0.24 4.46 ± 0.32 6.41 ± 0.25 7.13 ± 0.27 4.95 ± 0.14 9.40 ± 0.70 5.27 ± 0.25 5.88 ± 0.53 7.56 ± 0.18 7.74 ± 0.30 5.84 ± 0.22 8.58 ± 0.59 >24h 4.4 ± 0.41 4.78 5.51 4.9 0.38 5.98 8.51 ± 0.77 9.42 ± 0.87 15.04 ± 0.67 15.22 ± 0.57 9.99 ± 0.64 20.54 ± 1.40 10.50 ± 0.46 12.47 ± 1.23 16.78 ± 0.53 17.88 ± 0.71 13.22 ± 0.56 18.81 ± 1.16 >24h 8.70 ± 1.26 An overview of the generation process is given in Figure 2.2. For each positive sample, we generate 𝐾 negative samples. To allow personalization to both nodes in the positive sample equally, we sample 𝐾/2 negatives with each node. For the heuristics, we consider RA [151], PPR [15], and feature similarity. A more detailed discussion on the negative sample generation is given in Appendix A.7. It’s important to note that our work centers specifically on negative sampling during the evaluation stage (validation and test). This is distinct from prior work that concerns the negatives sampled used during the training phase [130, 91]. As such, the training process remains unaffected under both the existing and HeaRT setting. 2.4.3 Results and Discussion In this subsection we present our results when utilizing HeaRT. We follow the parameter ranges introduced in Section 2.2.3. For all datasets we use 𝐾 = 500 negative samples per positive sample during evaluation. Furthermore for ogbl-ppa we only use a small subset of the validation and test positive samples (100K each) for evaluation. This is because the large size of the validation and test sets (see Table A.1 in Appendix A.3) makes HeaRT prohibitively expensive. 17 Table 2.5 Results on OGB datasets (%) under HeaRT. We highlight the results ranked first, second, and third as green, blue, and orange, respectively. Models CN AA RA Shortest Path Katz Node2Vec MF MLP GCN GAT SAGE GAE SEAL BUDDY Neo-GNN NCN NCNC NBFNet PEG ogbl-collab ogbl-ddi ogbl-ppa ogbl-citation2 MRR Hits@20 MRR Hits@20 MRR Hits@20 MRR Hits@20 4.20 5.07 6.29 2.66 6.31 4.68 ± 0.08 4.89 ± 0.25 5.37 ± 0.14 6.09 ± 0.38 4.18 ± 0.33 5.53 ± 0.5 OOM 6.43 ± 0.32 5.67 ± 0.36 5.23 ± 0.9 5.09 ± 0.38 4.73 ± 0.86 OOM 4.83 ± 0.21 16.46 19.59 24.29 15.98 24.34 16.84 ± 0.17 18.86 ± 0.40 16.15 ± 0.27 22.48 ± 0.81 18.30 ± 1.42 21.26 ± 1.32 OOM 21.57 ± 0.38 23.35 ± 0.73 21.03 ± 3.39 20.84 ± 1.31 20.49 ± 3.97 OOM 18.29 ± 1.06 6.71 6.97 8.70 0 6.71 11.14 ± 0.95 13.99 ± 0.47 N/A 13.46 ± 0.34 12.92 ± 0.39 12.60 ± 0.72 3.49 ± 1.73 9.99 ± 0.90 12.43 ± 0.50 10.86 ± 2.16 12.86 ± 0.78 >24h >24h 12.05 ± 1.14 38.69 39.75 44.01 0 38.69 63.63 ± 2.05 59.50 ± 1.68 N/A 64.76 ± 1.45 66.83 ± 2.23 67.19 ± 1.18 17.81 ± 9.80 49.74 ± 2.39 58.71 ± 1.63 51.94 ± 10.33 65.82 ± 2.66 >24h >24h 50.12 ± 6.55 17.11 17.83 17.79 >24h 14.10 14.67 ± 0.18 8.72 ± 2.60 16.32 ± 0.07 19.98 ± 0.35 OOM 41.73 43.12 43.34 >24h 35.55 42.68 ± 0.20 29.64 ± 7.30 43.15 ± 0.10 51.72 ± 0.46 OOM 25.70 26.85 28.34 0.54 25.70 18.33 ± 0.10 22.47 ± 1.53 0.98 ± 0.00 26.94 ± 0.48 OOM 27.27 ± 0.30 OOM 68.25 70.22 71.50 1.31 68.25 53.42 ± 0.11 70.71 ± 4.82 1.47 ± 0.00 68.38 ± 0.73 OOM 69.49 ± 0.43 OOM OOM 48.62 ± 1.93 29.71 ± 0.71 76.77 ± 0.94 20.60 ± 1.28 47.81 ± 0.37 19.17 ± 0.20 71.50 ± 0.68 27.70 ± 0.33 21.68 ± 1.14 43.17 ± 0.53 16.12 ± 0.25 64.81 ± 2.26 35.06 ± 0.26 81.89 ± 0.31 23.35 ± 0.28 53.76 ± 0.20 51.69 ± 1.48 33.52 ± 0.26 82.24 ± 0.40 OOM OOM 19.61 ± 0.54 OOM OOM 22.05 ± 0.12 53.13 ± 0.15 OOM OOM OOM OOM OOM The results are shown in Table 2.4 (Cora, Citeseer, Pubmed) and Table 2.5 (OGB). For simplicity, we only include the MRR and Hits@10 for Cora, Citeseer, Pubmed, and the MRR and Hits@20 for OGB. Additional results for other metrics can be found in Appendix A.8. We note that most datasets, outside of ogbl-ppa, exhibit much lower performance than under the existing setting. This is though we typically use much fewer negative samples in the new setting, implying that the negative samples produced by HeaRT are much harder. We highlight the main observations below. Observation 1: Better Performance of Simple Models. We find that under HeaRT, “simple" baseline models (i.e., heuristic, embedding, and GNN methods) show a greater propensity to outperform their counterparts via ranking metrics than under the existing setting. Specifically, we focus on MRR in Table 2.1, 2.4, and 2.5, and the corresponding ranking-based metrics in Table 2.2. Under the existing setting, such methods only rank in the top three for any dataset a total of 5 times. However, under HeaRT this occurs 10 times. Furthermore, under the existing setting only 1 “simple" method ranks best overall while under HeaRT there are 3. This suggests that recent advanced methods may have benefited from the easier negative samples in the existing setting. Another interesting observation is that on ogbl-collab, heuristic methods achieve strong per- 18 Table 2.6 Mean model standard deviation for the existing setting and HeaRT. We use Hits@20 for ogbl-ddi, Hits@50 for ogbl-collab, Hits@100 for ogbl-ppa, and MRR otherwise. Dataset Existing HeaRT % Change Cora Citeseer Pubmed ogbl-collab ogbl-ppa ogbl-ddi ogbl-citation2 5.19 5.94 4.14 1.49 2.13 6.77 1.39 0.79 0.88 0.35 0.96 0.36 3.49 0.59 -85% -85% -92% -36% -83% -48% -58% formance and are able to outperform more complicated models. Specifically, we find that Katz is ranked the second and RA the third. Note that this underscores the significance of the common neighbors information (i.e., paths of length 2), as this information is incorporated in both RA and Katz. Of note is that ogbl-collab is a dynamic collaboration graph, which is different from other datasets. Because of this, the negative sampling strategy also differs slightly from the other datasets. Please see Appendix A.9 for more discussion. Observation 2: Lower Model Standard Deviation. We observed earlier that, under the existing evaluation setting, the model variance across seeds was high (see observation 3 in Section 2.3). This complicates model comparison as the model performance is unreliable. Interestingly, we find that HeaRT is able to dramatically reduce the variance for all datasets. We demonstrate this by first calculating the mean standard deviation across all models on each individual dataset. This was done for both evaluation settings with the results compared. As demonstrated in Table 2.6, the mean standard deviation decreases for all datasets. This is especially true for Cora, Citeseer, and Pubmed, which each decrease by over 85%. Such a large decrease in standard deviation is noteworthy as it allows for a more trustworthy and reliable comparison between methods. We posit that this observation is caused by a stronger alignment between the positive and negative samples under our new evaluation setting. Under the existing evaluation setting, the same set of negative samples is used for all positive samples. One consequence of this is that a single positive sample may bear little to no relationship to the negative samples (see Section 2.4.1 for more discussion). However, under our new evaluation setting, the negatives for a positive sample 19 are a subset of the corruptions of that sample. This allows for a more natural comparison via ranking-based metrics as the samples are more related and can be more easily compared. Observation 3: Lower Model Performance. We observe that the majority of datasets exhibit a significantly reduced performance in comparison to the existing setting. For example, under the existing setting, models typically achieve a MRR of around 30, 50, and 30 on Cora, Citeseer, and Pubmed (Table 2.1), respectively. However, under HeaRT the MRR for those datasets is typically around 20, 25, and 10 (Table 2.4). Furthermore for ogbl-citation2, the MRR of the best performing model falls from a shade under 90 on the existing setting to slightly over 20 on HeaRT. Lastly, we note that the performance on ogbl-ppa actually increases. This is because we only utilize a small subset of the total test set when evaluating on HeaRT, nullifying any comparison between the two settings. These outcomes are observed despite HeaRT using much fewer negative samples than the original setting. This suggests that the negative samples generated by HeaRT are substantially more challenging than those used in the existing setting. This underscores the need to develop more advanced methodologies that can tackle harder negatives samples like in HeaRT. 2.5 Conclusion In this work we have revealed several pitfalls found in recent works on link prediction. To over- come these pitfalls, we first establish a benchmarking that facilitates a fair and consistent evaluation across a diverse set of models and datasets. By doing so, we are able to make several illuminating observations about the performance and characteristics of various models. Furthermore, based on several limitations we observed in the existing evaluation procedure, we introduce a more practical setting called HeaRT (Heuristic Related Sampling Technique). HeaRT incorporates a more real- world evaluation setting, resulting in a better comparison among methods. By introducing a more rigorous and realistic assessment, HeaRT could guide the field towards more effective models, thereby advancing the state of the art in link prediction. 20 CHAPTER 3 LPFORMER: AN ADAPTIVE GRAPH TRANSFORMER FOR LINK PREDICTION 3.1 Introduction Link prediction (LP) attempts to predict unseen edges in a graph. It has been adopted in many applications including recommender systems [46], social networks [27], and drug discovery [1]. Traditionally, hand-crafted heuristics were used to identify new links in the graph [81, 151, 2]. Heuristics are often chosen based on factors that typically correlate well with the formation of new links. For example, a popular heuristic is common neighbors (CNs), which assume that the links are more likely to exist between node pairs with more shared neighbors. It has been found that these factors, which we refer to as “LP Factors”, often stem from the local and global structural information and feature proximity [72]. We give an example in Figure 4.1 that demonstrates different heuristic scores for multiple candidate links. Each heuristic score corresponds to one of the LP factors: CNs for local information, Katz for global, and Feat-Sim for feature proximity. We can observe that the pair (source, 5) has the highest CN and Katz score of the candidate links, indicating an abundance of local and global structural information between the pair. On the other hand, the feature similarity for (source, 5) is the lowest among the candidate links. This indicates that different LP factors and heuristics have distinct assumptions about why links are formed. More recently, message passing neural networks (MPNNs) [37], which are able to learn effective node representations via message passing, have been widely adopted for LP tasks. They predict the existence of a link by combining the node representations of both nodes in the link. However, such a node-centric view is unable to incorporate the pairwise information between the nodes in the link. Because of this, conventional MPNNs have been demonstrated to be poor link predictors due to their limited capability to learn effective and expressive link representations [142, 102]. To address this issue, recent efforts [140, 154] have attempted to move beyond the node-centric view of traditional MPNNs by equipping them with pairwise information specific to the link being predicted (i.e. the “target link”) [140, 154]. This is done by customizing the message passing process to each 21 Figure 3.1 Example of multiple heuristic scores for the candidate links (source, 5), (source, 6), and (source, 7). Each heuristic corresponds to a different LP factor – local (CNs), global (Katz), and feature proximity (Feat-Sim). target link. However, a concern with this approach is that it can be prohibitively expensive [20], as message passing needs to be run for each individual target link. This is as opposed to traditional MPNNs which only run message passing once for all target links. To overcome these inefficiencies, recent methods [136, 20, 121] have instead explored ways to inject pairwise information into the model, without individualizing the message passing to each target link. This is done by decoupling the message passing and link-specific pairwise information. By doing so, the message passing only needs to be done once for all target links. To include the pairwise information, these methods, which we refer to as “Decoupled Pairwise MPNNs” (DP- MPNNs), instead learn a “pairwise encoding” to encode the pairwise relationship of the target link. The choice of pairwise encoding is often based on heuristics that correspond to common LP factors (e.g., common neighbors). DP-MPNNs have gained attention as they can achieve promising performance while being much more efficient than methods that customize the message passing mechanism. However, DP-MPNNs are often limited in the choice of pairwise encoding, using a one- size-fits-all solution for all target links. This has two limitations. (1) The pairwise encoding may fail to consider some integral LP factors. For example, NCNC [121] only considers the 1- hop neighborhood when computing the pairwise encoding, thereby ignoring the global structural information. This suggests the need for a pairwise encoding that considers multiple types of LP factors. (2) The pairwise encoding uses the same LP factors for all target links. This assumes that all target links need the same factors. However, it may not necessarily be true. Recently, [72] have 22 shown that different LP factors are necessary to classify different target links. It is evident that even for the same dataset, multiple LP factors are needed to properly predict all target links. This further applies to different datasets, where certain factors are more prominent than others. As such, it faces tremendous challenges when considering multiple types of LP factors. While one factor may effectively model some target links, it will fail for other target links where those patterns aren’t present. It is therefore desired to consider different LP factors for different target links. These observations motivate us to ask – can we design an efficient method that can adaptively determine which LP factors to incorporate for each individual target link? Essentially, it requires a pairwise encoding that (a) models multiple LP factors, (b) can be tailored to fit each individual target link, and (c) is efficient to calculate. By doing so, we can flexibly adapt the pairwise information based on the existing needs of each target link. To achieve this, we propose LPFormer – Link Prediction TransFormer. LPFormer is a type of graph Transformer [78] designed specifically for link prediction. Given a target link (𝑎, 𝑏), LPFormer models the pairwise encoding via an attention module that learns how 𝑎 and 𝑏 relate in the context of various LP factors. This allows for a more customizable set of pairwise encodings that are specific to each target link. Extensive experiments validate that LPFormer can achieve SOTA on a variety of benchmark datasets. We further demonstrate that LPFormer is better at modeling several types of LP factors, highlighting its adaptability, while also maintaining efficiency on denser graphs. 3.2 Background 3.2.1 Related Work Link prediction (LP) aims to model how links are formed in a graph. The process by which links are formed, i.e., link formation, is often governed by a set of underlying factors [8, 65]. We refer to these as “LP factors”. Two categories of methods are used for modeling these factors – heuristics and MPNNs. We describe each class of methods. We further include a discussion on existing graph transformers. Heuristics for Link Prediction. Heuristics methods [81, 151] attempt to explicitly model the LP factors via hand-crafted measures. Recently, [72] have shown that there are three main 23 factors that correlate with the existence of a link: (1) local structural information, (2) global structural information, and (3) feature proximity. Local structural information only considers the immediate neighborhood of the target link. Representative methods include Common Neighbors (CN) [81], Adamic Adar (AA) [2], and Resource Allocation (RA) [151]. They are predicated on the assumption that nodes that share a greater number neighbors exhibit a higher probability of forming connections. Global structural information further considers the global structure of the graph. Such methods include Katz [49] and Personalized PageRank (PPR) [15]. These methods posit that nodes interconnected by a higher number of paths are deemed to have larger similarity and, therefore, are more likely to form connections. Lastly, feature proximity assumes nodes with more similar features connect [79]. Previous work [83, 145] have shown that leveraging the node features are helpful in predicting links. Lastly, we note that [72] has recently shown that to properly predict a wide variety of links, it’s integral to incorporate all three of these factors. MPNNs for Link Prediction. Message Passing Neural Networks (MPNNs) [37] aim to learn node representations via the message passing mechanism. Traditional MPNNs have been used for LP including GCN [53], SAGE [42], and GAE [54]. However, they have been shown to be suboptimal for LP as they aren’t expressive enough to capture important pairwise patterns [141, 102]. SEAL [140] and NBFNet [154] try to address this by customizing the message passing process to each target link. This allows for the message passing to learn pairwise information specific to the target link. However, these methods have been shown to be unduly expensive as they require a separate round of message passing for each target link. As such, recent methods have been proposed to instead decouple the message passing and pairwise information [136, 20, 121], reducing the time needed to do message passing. Such methods include NCN/NCNC [121] which exploit the common neighbor information and BUDDY [20] and Neo-GNN [136] which consider the global structural information. Graph Transformers. Recent work has attempted to extend the original Transformer [114] architecture to graph-structured data. Graphormer [133] learns node representations by attending all nodes to each other. To properly model the structural information, they propose to use multiple 24 types of structural encodings (i.e., structural, centrality, and edge). SAN [58] further considers the use of the Laplacian positional encodings (LPEs) to enhance the learnt structural information. Al- ternatively, TokenGT [51] considers all nodes and edges as tokens in the sequence when performing attention. Due to the large complexity of these models, they are unable to scale to larger graphs. To address this, several graph transformers [22, 124] have been proposed for node classification that attempt to efficiently attend to the graph. However, while some work [23, 85] have formulated transformers for knowledge graph completion, to our knowledge, there are no graph transformers designed specifically for LP on uni-relational graphs. 3.2.2 Preliminaries We denote a graph as G = {V, E}, where V and E are the sets of nodes and edges in G, respectively. The adjacency matrix is represented as 𝐴 ∈ R|𝑉 |×|𝑉 |. The 𝑑-dimensional node features are represented by the matrix 𝑋 ∈ R|𝑉 |×𝑑. The set of neighbors for a node 𝑣 is given by N (𝑣). The set of overlapping neighbors between two nodes 𝑎 and 𝑏, i.e., the common neighbors (CNs), is expressed by N CN (𝑎,𝑏). We further denote the set of nodes that are 1-hop neighbors of only one of 𝑎 or 𝑏 as N 1 (𝑎,𝑏). Lastly, the personalized pagerank (PPR) score for a root node 𝑣 and an arbitrary node 𝑢 is given by ppr(𝑣, 𝑢). (𝑎,𝑏) and the nodes that are >1-hop from both nodes as N >1 3.3 The Proposed Framework In Section 5.1, we highlighted the importance of adaptively modeling multiple types of LP factors. However, current methods that use pairwise encodings, i.e., DP-MPNNs, struggle to appropriately achieve this goal. This is due to two issues: (1) They only attempt to model a subset of the potential LP factors (e.g., only local structural information), limiting their ability to model multiple factors. (2) They use a one-size-fits-all approach in regard to pairwise encoding, using the same combination of LP factors for each target link. These issues strongly limit the potential of such methods to properly model a variety of different target links. To overcome these problems, we propose LPFormer, a new transformer-based method that can adaptively customize the pairwise information for each target link by considering a variety of different LP factors in an efficient manner. 25 Figure 3.2 An overview of LPFormer. (1) Encode the nodes via a MPNN. (2) For a given target link, we determine which nodes to attend to ( ˆN (𝑎, 𝑏)) via the PPR-based thresholding technique in Eq. (3.10). (3) The pairwise encoding is computed by attending to each node, 𝑢 ∈ ˆN (𝑎, 𝑏) using the feature and relative positional encoding rpe(𝑎,𝑏,𝑢). (4) The pairwise encoding, node representations, and counts of different node types are concatenated and used to compute the final probability of the target link existing. 3.3.1 A General View of Pairwise Encodings Recent MPNNs for LP use a decoupled strategy to include the pairwise information [20, 121, 136]. These methods, DP-MPNNs, predict the existence of a link (𝑎, 𝑏) via both the node representations and a pairwise encoding 𝑠(𝑎, 𝑏). They follow the formulation below: 𝐻 = MPNN( 𝐴, 𝑋), (cid:16) (cid:16) 𝑝(𝑎, 𝑏) = 𝜎 MLP h𝑎 ⊙ h𝑏 ∥ 𝑠(𝑎, 𝑏) (cid:17)(cid:17) , (3.1) where ℎ𝑖 is the representation of node 𝑖 encoded by the MPNN. Various DP-MPNNs adopt different ways to model the pairwise encoding. For example, NCN [121] models the pairwise encoding 𝑠(𝑎, 𝑏) as the summation of the node representations of the CNs. The definitions of 𝑠(𝑎, 𝑏) for other prominent DP-MPNNs can be found in Appendix B.1. The pairwise encodings in these existing methods are typically manually selected or extracted from the graph, which limits the LP factors they can cover. For example, 𝑠(𝑎, 𝑏) in NCN and NCNC only capture the local structural information. BUDDY [20] ignores the node features when computing the pairwise encoding. 26 To flexibly model multiple types of LP factors, we propose a general formulation for pairwise encodings as follows, 𝑠(𝑎, 𝑏) = ∑︁ 𝑢∈V 𝑤(𝑎, 𝑏, 𝑢) ⊙ ℎ(𝑎, 𝑏, 𝑢), (3.2) where 𝑤(𝑎, 𝑏, 𝑢) measures the importance of node 𝑢 to (𝑎, 𝑏), and ℎ(𝑎, 𝑏, 𝑢) is the encoding of node 𝑢 relative to (𝑎, 𝑏). By considering which nodes should be considered for (𝑎, 𝑏) and how they are related to the node pair, Eq. (3.2) can model different LP factors by manually defining 𝑤(𝑎, 𝑏, 𝑢) and ℎ(𝑎, 𝑏, 𝑢). In particular, we demonstrate how the heuristic methods corresponding to different LP factors can fit into this framework. Common Neighbors (CNs) [81]: CNs considers the local structural information and is defined for a pair of nodes (𝑎, 𝑏) as N CN (𝑎,𝑏) = N (𝑎) ∩N (𝑏). Eq. (3.2) is equal to the CNs when ℎ(𝑎, 𝑏, 𝑢) = 1 and: 𝑤(𝑎, 𝑏, 𝑢) = 1, when 𝑢 ∈ N (𝑎) ∩ N (𝑏) 0, else       . (3.3) Katz Index [49]: The Katz index models the global structural information. It is defined as weighted summation of the number of paths of different lengths connecting 𝑎 and 𝑏 and a decay weight 𝛽 ∈ [0, 1], Katz(𝑎, 𝑏) = ∞ ∑︁ 𝛽𝑙 𝐴𝑙 𝑎,𝑏. 𝑙=1 This is equivalent to Eq. (3.2) where 𝑤(𝑎, 𝑏, 𝑢) = (cid:205)∞ 𝑙=1 𝛽𝑙 𝑒𝑇 𝑎 𝐴𝑙 and ℎ(𝑎, 𝑏, 𝑢) = 𝑒𝑇 𝑏 , when 𝑢 = 𝑏 0, else    ,    where 𝑒𝑖 ∈ B|V | is a one-hot vector for a node 𝑖. Feature Similarity: The feature similarity of the pair of nodes (𝑎, 𝑏) is expressed by dis(x𝑎, x𝑏) where x𝑎 are the node features of node 𝑎 and dis(·) is a distance function (e.g., euclidean distance). This can be rewritten as Eq. (3.2) by substituting 𝑤(𝑎, 𝑏, 𝑢) = dis(x𝑎, x𝑢) and ℎ(𝑎, 𝑏, 𝑢) = 𝑒𝑇 𝑏 . These examples demonstrate that the general formulation can indeed model many different LP factors including local and global structural information and feature proximity. We further show 27 in Appendix B.2 that Eq. (3.2) can model a variety of additional LP factors including RA [151], the pairwise encodings used in NCN/NCNC [121] and Neo-GNN [136]. However, fitting these methods into the formulation in Eq. (3.2) requires manually defining both 𝑤(𝑎, 𝑏, 𝑢) and ℎ(𝑎, 𝑏, 𝑢). This constrains the information represented by 𝑠(𝑎, 𝑏) based on the choice of design. Motivated by this, in the next section we introduce our method that does not rely on a handcrafting both 𝑤(𝑎, 𝑏, 𝑢) and ℎ(𝑎, 𝑏, 𝑢). 3.3.2 Modeling Pairwise Encodings via Attention In Section 3.3.1, we introduced a general formulation for pairwise encodings in Eq. (3.2), which is able to capture a variety of different LP factors. However, it requires manually defining both terms in the equation. This limits our ability to customize the pairwise information to each target link. As such, we further aim to move beyond a one-size-fits-all pairwise encoding, and enable the model to produce customized pairwise encoding for each target link. This allows the model to handle more realistic graphs that often contain multiple prominent LP factors for different target links as shown in [72]. In particular, we consider the following question: How can we model Eq (3.2) such that it can customize the used LP factors to each target link? We consider parameterizing both 𝑤(𝑎, 𝑏, 𝑢) and ℎ(𝑎, 𝑏, 𝑢). This allows us to learn how to personalize them to each target link. To achieve this, we leverage softmax attention [6]. This is due to its ability to dynamically learn the relevance of different nodes to the target link. As such, for multiple target links, it can emphasize the contributions of different nodes, thereby flexibly modeling different LP factors. We note that since the attention is between different sequences (i.e., a target link and nodes), it can be considered a form of cross attention [114]. To enhance the adaptability of the pairwise encoding for various links, it is essential to incor- porate various types of information. This allows the attention mechanism to discern and prioritize relevant information for each target link, facilitating the effective modeling of diverse LP factors. In particular, we consider two types of information. The first is the feature information. This includes the feature representation of both nodes in the target link and the node being attended 28 to. The node features are included due to their role in link formation and relationship to structural information [79]. Second, we consider the relative positional information. The relative posi- tional information reflects the relative position in the graph of a node 𝑢 to the target link (𝑎, 𝑏) in the local and global structural context. Due to the importance of local and global structural information [31, 45], it is vital to properly encode both. By including both the structural and feature information, we are able to cover the space of potential LP factors (see Section 3.2.1). We denote the feature representation of a node 𝑢 as h𝑢 and the relative positional encoding (RPE) as rpe(𝑎,𝑏,𝑢). The node importance 𝑤(𝑎, 𝑏, 𝑢) is modeled via attention as follows: ˜𝑤(𝑎, 𝑏, 𝑢) = 𝜙 (cid:16) 𝑤(𝑎, 𝑏, 𝑢) = (cid:17) , h𝑎, h𝑏, h𝑢, rpe(𝑎,𝑏,𝑢) exp( ˜𝑤(𝑎, 𝑏, 𝑢)) (cid:205)𝑣∈ ¯V (𝑎,𝑏) exp( ˜𝑤(𝑎, 𝑏, 𝑢)) , (3.4) where ¯V (𝑎, 𝑏) = V \ {𝑎, 𝑏}. The attention weight 𝑤(𝑎, 𝑏, 𝑢) can be considered as the impact of a node 𝑢 on (𝑎, 𝑏) relative to all nodes in G. This allows the model to emphasize different LP factors for each target link. The node encoding ℎ(𝑎, 𝑏, 𝑢) includes the features of node 𝑢 in conjunction with the RPE and is defined as: ℎ(𝑎, 𝑏, 𝑢) = W (cid:104) h𝑢 ∥ rpe(𝑎,𝑏,𝑣) (cid:105) . (3.5) By substituting Eq. (3.4) and Eq. (3.5) into Eq. (3.2) we can compute the pairwise information 𝑠(𝑎, 𝑏). We further define 𝜙(·) in Eq. (3.4) as the GATv2 [17] attention mechanism. The detailed formulation is given in Appendix B.4. The feature representations h𝑖 are computed via a MPNN. We use GCN [55] in this work. However, it is unclear how to properly encode the RPE of a node 𝑢 relative to (𝑎, 𝑏), rpe(𝑎,𝑏,𝑢). We aim to design the RPE to capture both the local and global structural relationship between the node and target link while also being efficient to calculate. In the next section, we discuss our solution for modeling rpe(𝑎,𝑏,𝑢). 3.3.3 PPR-Based Relative Positional Encodings In this section, we introduce our strategy for computing the RPE of a node 𝑢 relative to a target link (𝑎, 𝑏). Intuitively, we want the RPE to reflect the positional relationship between 𝑢 and 29 (𝑎, 𝑏) such that different types of information (i.e., local vs. global) are encoded differently. Using Figure 4.1 as an example, since node 3 is a CN of (source, 5) we expect it to have a much different relationship to the target link than node 6, which is a 2-hop neighbor of both nodes. An enticing option is to use the double radius node labeling (DRNL) trick introduced by [140]. However, [20] have shown it to be prohibitively expensive to calculate for larger graphs. Furthermore, existing RPEs are typically infeasible to calculate on larger graphs as they often rely on pairwise distances or the eigenvectors of the Laplacian [88]. As such, we seek an RPE that can both distinguish the relationship of different nodes to the target link while also being efficient to calculate. To motivate our RPE design, we draw inspiration from the following Proposition. Proposition 1. Consider a target link (𝑎, 𝑏) and a node 𝑢 ∈ V \ {𝑎, 𝑏}. The PPR [15] score of a root node 𝑖 and target node 𝑗 with teleportation probability 𝛼 is denoted by ppr(𝑖, 𝑗). Let 𝑟 𝑘 𝑎 (𝑢) be the probability of a walk of length 𝑘 beginning at node 𝑎 and terminating at 𝑢. We define 𝑎,𝑏 (𝑢) := 𝑟 𝑘 𝑟 𝑘 𝑏 (𝑢). We also define a weight 𝛾 𝑘 := 𝛼(1 − 𝛼) 𝑘 for all walks of length 𝑘. The PPR scores, 𝑝 𝑝𝑟 (𝑎, 𝑢) and 𝑝 𝑝𝑟 (𝑏, 𝑢), along with the random walk probabilities of disparate 𝑎 (𝑢) + 𝑟 𝑘 lengths, are interconnected through the following relationship. Γ(𝑎, 𝑏, 𝑢) = ppr(𝑎, 𝑢) + ppr(𝑏, 𝑢) = ∞ ∑︁ 𝑘=0 𝛾 𝑘𝑟 𝑘 𝑎,𝑏 (𝑢). (3.6) The detailed proof is given in Appendix B.3. From Proposition 1, we can make the following observations: (1) The PPR scores encode the weighted sum of the probabilities of different length random walks connecting two nodes. (2) Walks of shorter length are given higher importance, as evidenced by the dampening factor 𝛾 𝑘 = 𝛼(1 − 𝛼) 𝑘 which decays with the increase in 𝑘. These observations imply that – a larger value of Γ(𝑎, 𝑏, 𝑢) correlates with the existence of many shorter walks connecting node 𝑢 to the both nodes in the target link (𝑎, 𝑏). Therefore, the PPR scores can be used as an intuitive and useful method to understand the structural relationship between node 𝑢 and both nodes in the target link (𝑎, 𝑏). If both scores, ppr(𝑎, 𝑢) and ppr(𝑏, 𝑢), are high, there exists a high probability that many shorter walks connect 30 𝑢 to both nodes in the target link. This implies that node 𝑢 has a stronger impact on the nodes in the target link. On the other hand, if both PPR scores are low, there is likely very little relationship between 𝑢 and the target link. This allows for a convenient way of differentiating how a node structurally relates to the target link. Furthermore, we note that the PPR matrix can be efficiently pre-computed using the algorithm introduced by [5], allowing for easy computation and use. Following this idea, to calculate the RPE of a node 𝑢, we use the PPR scores of a node 𝑢 relative to both nodes in the target link (𝑎, 𝑏). Instead of considering the sum of PPR scores as in Proposition 1, we further parameterize Γ(·) via an MLP, rpe(𝑎,𝑏,𝑢) = MLP (ppr(𝑎, 𝑢), ppr(𝑏, 𝑢)) . (3.7) By introducing learnable parameters to Γ(·), it allows for the model learn the importance of individual PPR scores and how they interact with each other. To ensure that Eq. (3.7) is invariant to the order of the nodes in the target link, i.e., (𝑎, 𝑏) and (𝑏, 𝑢), we further set the RPE to be equal to the summation of the representations given by both (𝑎, 𝑏) and (𝑏, 𝑎): rpe(𝑎,𝑏,𝑢) = rpe(𝑎,𝑏,𝑢) + rpe(𝑏,𝑎,𝑢). (3.8) However, a concern with Eq. (3.8) is that it is not guaranteed to be able to distinguish certain types of nodes from each other. For example, it is necessary to clearly distinguish CNs from other nodes due to their important role in link formation [81]. To overcome this issue, we fit three separate MLPs for when 𝑢 is a: CN of (𝑎, 𝑏), a 1-hop neighbor of either 𝑎 and 𝑏, and a >1-hop neighbor of both 𝑎 and 𝑏. This ensures that we can properly distinguish between these three types of nodes. We verify the effectiveness of this design in Section 3.4.4. Lastly, we note that while other work [75, 63] has considered the use of random-walk based positional encodings, they are only designed for use on the node-level and are unable to be used for link-level tasks like LP. 3.3.4 Efficiently Attending to the Graph Context The proposed attention mechanism in Section 3.3.2 attends to all nodes in the graph, sans those in the link itself. This makes it difficult to scale to large graphs. Motivated by selective [73] and sparse [25] attention, we opt to attend to only a small portion of the nodes. 31 At a high level, we are interested in determining a subset of nodes ˆN (𝑎, 𝑏) ∈ V to attend to for the target link (𝑎, 𝑏). Our goal is to choose the set of nodes ˆN (𝑎, 𝑏) such that they are (a) few in number to improve scalability and (b) provide important contextual information to the pair (𝑎, 𝑏) to best learn the pairwise information. This can be achieved by only considering all nodes where the importance of the node 𝑢 to the target link (𝑎, 𝑏) is considered high. Formally, we can write this as the following where I (𝑎, 𝑏, 𝑢) is a function that denotes the importance of a node 𝑢 to the target link (𝑎, 𝑏): ˆN (𝑎, 𝑏) = {𝑢 ∈ V \ {𝑎, 𝑏} | I (𝑎, 𝑏, 𝑢) > 𝜂}. (3.9) The threshold 𝜂 allows us to distinguish those nodes that are sufficiently important to the target link. This allows for a simple and efficient way of determining the set ˆN (𝑎, 𝑏). However, what do we use to model the importance I (𝑎, 𝑏, 𝑢)? For ease of optimization and better efficiency, we avoid parameterizing the function I (𝑎, 𝑏, 𝑢). Instead, we want to choose a metric such that can properly serve as a proxy for the importance of a node 𝑢 to (𝑎, 𝑏) while also being concentrated in a small subset of nodes. Such a metric will allow Eq. (3.9) to choose a small but influential set of nodes to attend to. A measure that satisfies both criteria is Personalized Pagerank (PPR) [15]. In Section 3.3.3 we discussed that the PPR score can serve as a good tool to model the influence of a one node on another. Furthermore, existing work [38, 80, 5] shows that the PPR scores tend to be highly localized in a small subset of nodes. Therefore by making I (𝑎, 𝑏, 𝑢) contingent on the PPR scores of (𝑎, 𝑢) and (𝑏, 𝑢) we can extract a small but important set of nodes to attend to for the target link. Following this idea, for a target link (𝑎, 𝑏), we keep all nodes whose PPR score is above some threshold 𝜂 relative to both nodes in the target link. As such, we only keep a node 𝑢 if it is related in some capacity to at least one of the nodes in the target link. Similarly to Section 3.3.3, we treat CN, 1-Hop, and >1-Hop nodes differently by applying a different threshold for them. The filtered node set for each category of nodes is given by: ˆN 𝜋 (𝑎,𝑏) = {𝑢 ∈ N 𝜋 (𝑎,𝑏) | ppr(𝑎, 𝑢) > 𝜂𝜋, ppr(𝑏, 𝑢) > 𝜂𝜋}, (3.10) 32 Table 3.1 Dataset statistics. The split ratio is the % of samples for train/validation/test. Cora Citeseer Pubmed ogbl-collab ogbl-ddi ogbl-ppa ogbl-citation2 #Nodes #Edges Split Ratio 2,708 5,278 85/5/10 3,327 4,676 85/5/10 18,717 44,327 85/5/10 235,868 1,285,465 92/4/4 4,267 1,334,889 80/10/10 576,289 30,326,273 70/20/10 2,927,963 30,561,187 98/1/1 where ˆN 𝜋 (𝑎,𝑏) is the filtered node set for all nodes of the type 𝜋 ∈ {CN, 1−Hop, >1−Hop} and 𝜂𝜋 is the corresponding PPR threshold. We note that while other work [10, 134] has used PPR to filter the nodes on the node-level, no existing work has done so on the link-level. We corroborate this design by demonstrating that LPFormer can achieve SOTA performance in LP (Section 3.4.2) while achieving a faster runtime than the second-best method, NCNC [121], on denser graphs (Section 3.4.7). This is despite the fact that LPFormer can attend to a wider variety of nodes. We further show in Section 3.4.5 that the performance is stable with regards to the values of 𝜂 chosen, allowing us to easily choose a proper threshold on any dataset. 3.3.5 LPFormer We now define the overall framework – LPFormer. The overall procedure is given in Figure 3.2: (1) We first learn node representations from the input adjacency and node features via an MPNN. We note that this step is agnostic to the target link. (2) For a target link (𝑎, 𝑏) we extract the nodes to attend to, i.e. ˆN (𝑎, 𝑏). This is done via the PPR thresholding technique defined in Section 3.3.4. (3) We apply 𝐿 layers of attention, using the mechanism defined in Section 3.3.2. The output is the pairwise encoding 𝑠(𝑎, 𝑏). (4) We generate the prediction of the target link using three types of information: the element-wise product of the node representation, the pairwise encoding, and the number of CN, 1-Hop, and >1-Hop nodes identified by Eq. (3.10). The score function is given by: 𝑝(𝑎, 𝑏) = 𝜎 (cid:16) MLP (cid:16) h𝑎 ⊙ h𝑏 ∥ 𝑠(𝑎, 𝑏) ∥ | ˆN CN (𝑎,𝑏) | ∥ | ˆN 1 (𝑎,𝑏) | ∥ | ˆN >1 (𝑎,𝑏) | (cid:17)(cid:17) (3.11) We demonstrate in Section 3.4.4 that the inclusion of the node counts is helpful, as it provides complementary information to the pairwise encoding. 3.4 Experiments In this section, we conduct extensive experiments to validate the effectiveness of LPFormer. Specifically, we attempt to answer the following questions: (RQ1) Can LPFormer consistently 33 outperform baseline methods on a variety of different benchmark datasets? (RQ2) Is LPFormer able to model a variety of different LP factors? (RQ3) Can LPFormer be run efficiently on large dense graphs? We further conduct studies ablating each component of our model and analyzing the effect of the PPR-based threshold on performance. 3.4.1 Experimental Settings Datasets. We include Cora, Citeseer, and Pubmed [131] and ogbl-collab, ogbl-ppa, ogbl- ddi, and ogbl-citation2 [44]. Furthermore, for Cora, Citeseer, and Pubmed we experiment under a single fixed split (see Appendix B.5.1 for further discussion). The detailed statistics for each dataset are shown in Table A.1. Baseline Models. We compare LPFormer against a wide variety of baselines including: CN [81], AA [2], RA [151], GCN [55], SAGE [42], GAE [54], SEAL [140], NBFNet [154], Neo-GNN [136], BUDDY [20], and NCNC [121]. Results on Cora, Citeseer, and Pubmed are taken from [61]. Results for the heuristic methods are from [44]. All other results are either from their respective study or [20]. Hyperparameters: The learning rate is tuned from {1𝑒−3, 5𝑒−3}, the decay from {0.95, 0.975, 1}, and the dropout from [0, 0.7], and the weight decay from {0, 1𝑒−4, 1𝑒−7}. The size of the hidden dimension is set to 64 for ogbl-ppa and ogbl-citation2, 128 for Cora, Pubmed, and ogbl-collab, and 256 for Citeseer. Lastly, the PPR threshold is tuned from {1𝑒−2, 1𝑒−3, 1𝑒−4}. Evaluation Metrics. Each positive target link is evaluated against a set of given negative links. The rank of the positive link among the negatives is used to evaluate performance. The two types of metrics that are used to evaluate this ranking are Hits@K and MRR. For the OGB datasets we use the metric used in the original study. This includes Hits@50 for ogbl-collab, Hits@100 for ogbl-ppa and MRR for ogbl-citation2. For Cora, Citeseer, Pubmed we follow [61] and use MRR. Lastly, the same set of negative links is used for all positive links except on ogbl-citation2, where [44] provides a customized set of 1000 negatives for each individual positive link. 34 Table 3.2 Results on benchmark datasets. OOM is an out of memory error. We highlight the results ranked first, second, and third as green, blue, and orange, respectively. Citeseer Pubmed ogbl-collab ogbl-ppa ogbl-citation2 Mean Rank MRR MRR H@50 H@100 MRR Metric CN AA RA GCN SAGE GAE Cora MRR 20.99±0.00 31.87±0.00 30.79±0.00 32.50±6.87 37.83±7.75 29.98±3.21 28.34±0.00 29.37±0.00 27.61±0.00 50.01±6.04 47.84±6.39 63.33±3.14 14.02±0.00 16.66±0.00 15.63±0.00 19.94±4.24 22.74±5.47 16.67±0.19 SEAL NBFNet 26.69±5.89 37.69±3.97 39.36±4.99 38.17±3.06 38.06±5.18 44.73±2.12 Neo-GNN 22.65±2.60 BUDDY 26.40±4.40 NCN 32.93±3.80 NCNC 29.01±3.83 53.97±5.88 59.48±8.96 54.97±6.03 64.03±3.67 31.45±3.17 23.98±5.11 35.65±4.60 25.70±4.48 56.44±0.00 64.35±0.00 64.00±0.00 44.75±1.07 48.10±0.81 OOM 64.74±0.43 OOM 57.52±0.37 65.94±0.58 64.76±0.87 66.61±0.71 27.65±0.00 32.45±0.00 49.33±0.00 18.67±1.32 16.55±2.40 OOM 48.80±3.16 OOM 49.13±0.60 49.85±0.20 61.19±0.85 61.42±0.73 51.47±0.00 51.89±0.00 51.98±0.00 84.74±0.21 82.60±0.36 OOM 87.67±0.32 OOM 87.26±0.84 87.56±0.11 88.09±0.06 89.12±0.40 LPFormer 39.42±5.78 65.42±4.65 40.17±1.92 68.14±0.51 63.32±0.63 89.81±0.13 11.0 8.5 8.7 8.0 7.7 NA 6.2 NA 7.0 5.7 3.8 3.8 1.2 3.4.2 Main Results We present the results of LPFormer compared with baselines on multiple benchmark datasets. Note that we omit ogbl-ddi from the main results due to recent issues discovered by [61] (see Appendix B.5.2 for more details). The results are shown in Table 4.1. We observe that LPFormer can achieve SOTA performance on 5/6 datasets, significantly outperforming other baselines. Moreover, LPFormer is also the most consistent of all the methods, achieving strong performance on all datasets. This is as opposed to previous SOTA methods, NCNC and BUDDY, which tend to struggle on Cora and Pubmed. We attribute the consistency of LPFormer to the flexibility of our model, allowing it to customize the LP factors needed to each link and dataset. 3.4.3 Performance by LP Factor In this section, we measure the ability of LPFormer to capture a variety of different LP factors. To measure this, we identify all positive target links when there is only one dominant LP factor. For example, one group would contain all target links where the only dominant factor is the local structural information. We focus on links that correspond to one of the three groups identified in [72]: local structural information, global structural information, and feature proximity. 35 We identify these groups by using popular heuristics as proxies for each factor. For local structural information, we use CNs [81], for global structural information we use PPR [15] as it’s the most computationally efficient of all global methods, and for feature proximity, we use the cosine similarity of the features. Using these heuristics, we determine if only one factor is dominant by comparing the relative score of each heuristic. This is done by first computing the score for each factor 𝑖 for the target link (𝑎, 𝑏) – 𝑠𝑖 (𝑎, 𝑏). For each factor, we then compute the score corresponding to the 𝑝-th percentile among all links, ˆ𝑠𝑖. We choose a larger value of 𝑝 (i.e. 90%) such that a score ≥ ˆ𝑠𝑖 indicates that a significant amount of pairwise information exists for that factor. For a single target link, we then compare the score of each factor 𝑠𝑖 (𝑎, 𝑏) to ˆ𝑠𝑖. If 𝑠𝑖 (𝑎, 𝑏) ≥ ˆ𝑠𝑖 is true for only one factor, this implies that the score for only one factor is “high”. Therefore there is a notable amount of pairwise information existing for only one factor for the link (𝑎, 𝑏). This ensures that only one factor is strongly expressed. If this is true, we then assign the target link (𝑎, 𝑏) to factor 𝑖. Please see Appendix B.5.4 for a more detailed explanation. We demonstrate the results on Cora, Citeseer, and ogbl-collab in Figure 3.3. We observe that LPFormer typically performs best for each individual LP factor on all datasets. Furthermore, it is also the most consistently well-performing on each factor as compared to other methods. For example, on Cora the other methods struggle for links that correspond to the feature proximity factor. LPFormer, on the other hand, is able to significantly outperform them on those target links, performing around 33% better than the second best method. Lastly, we note that most methods tend to perform well on the links corresponding to the global factor, even if they don’t explicitly model such information. This is caused by a strong correlation that tends to exist between local and global structural information, often resulting in considerable overlap between both factors [72]. These results show that LPFormer can indeed adapt to multiple types of LP factors, as it can consistently perform well on samples belonging to a variety of different LP factors. Additional results are given in Appendix B.5.5. 36 (a) Cora (b) Citeseer (c) ogbl-collab Figure 3.3 Performance on links that contain one dominant LP factor. Results are on (a) Cora, (b) Citeseer, and (c) ogbl-collab. 3.4.4 Ablation Study We further include an ablation study to verify the effectiveness of the proposed components in LPFormer. In particular, we introduce 6 variants of LPFormer. (a) w/o Learnable Att: No attention is learned. As such, we set all attention weights to 1 and remove the RPE. (b) w/o Features in Att: We remove the node feature information from the attention mechanism. (c) w/o RPE in Att: We remove the RPE from the attention mechanism. (d) w/o PPR RPE: We replace the PPR-based RPE with a learnable embedding for each of CN, 1-Hop, and >1-Hop nodes. (e) w/o PPR RPE by Node Type: We don’t fit a separate function for each node type when determining the PPR RPE (see Section 3.3.3). Instead we use one for all nodes. (f) w/o Counts: We remove the counts of different nodes from the scoring function. The results are shown in Table 4.4. We include ogbl-collab, ogbl-ppa, and Citeseer. We observe that ablating a component always decreases the performance. However, the magnitude of the decrease is dataset-dependent. For example, on ogbl-collab, ablating the feature information in the attention marginally affects the performance. However, on ogbl-ppa and Citeseer, removing the feature information results in a large decrease in performance. On the other hand, while removing learnable attention results in a modest decrease on ogbl-ppa, for the other two datasets we see a large drop. This highlights the importance of each component of our framework, as they are each necessary for consistently strong performance across multiple datasets. 37 Table 3.3 Ablation Study on LPFormer. Method w/o Learnable Att w/o Features in Att w/o RPE in Att w/o PPR RPE w/o PPR RPE by Node Type w/o Counts LPFormer ogbl-collab 65.05±0.50 68.04±0.79 65.26±0.56 67.09±0.51 67.95±0.54 67.75±0.41 68.14±0.51 ogbl-ppa 62.77±1.03 56.98±1.55 61.20±0.69 61.91±1.22 62.92±1.06 44.37±1.89 63.32±0.63 Citeseer 56.23±1.75 53.40±9.30 56.70±3.79 51.96±15.2 57.40±5.71 54.39±5.30 65.42±4.65 Table 3.4 Effect of Varying the PPR Thresholds. Threshold ogbl-collab ogbl-citation2 1-Hop >1−Hop 68.24±0.25 67.73±0.65 67.60±0.31 68.24±0.25 67.08±0.65 68.14±0.51 1-Hop 89.81±0.13 89.49±0.18 89.49±0.16 >1−Hop 89.14±0.22 89.81±0.13 89.26±0.39 1e-4 1e-2 1 3.4.5 Effect of the PPR Thresholds We examine the effect of varying the PPR threshold for both 1-Hop and >1−Hop nodes as described in Eq. (3.10). The results for ogbl-collab and ogbl-citation2 are shown in Table 3.4. When varying the 1-Hop threshold, we fix the value of the >1−Hop threshold to 1e-2 for both datasets. When varying the >1−Hop threshold, we fix the value of the 1-Hop threshold to 1e-4 for both datasets. We can observe that modifying the threshold has little effect on the underlying performance of the model. For both datasets, a value of 1e-2 works well for the >1−Hop threshold and 1e-4 works well for the 1-Hop threshold. We typically find that setting both values to 1e-2 provides a good trade-off between performance and efficiency. 3.4.6 Performance on HeaRT Setting We further test the performance of our method on the HeaRT [61] evaluation setting, which considers a more realistic and difficult evaluation setting for link prediction. This is done by introducing a much harder and more realistic set of negative samples during evaluation. [61] observe that this results in a large decrease in performance on all datasets. Furthermore, compared to the 38 Cora 9.78 11.91 11.81 Models CN AA RA GCN SAGE GAE Table 3.5 Results (MRR) under HeaRT. We highlight the results ranked first, second, and third as green, blue, and orange, respectively. Citeseer Pubmed ogbl-collab ogbl-ddi ogbl-ppa ogbl-citation2 Mean Rank 8.42 10.82 10.84 2.28 2.63 2.47 4.20 5.07 6.29 6.71 6.97 8.70 25.70 26.85 28.34 17.11 17.83 17.79 16.61 ± 0.30 14.74 ± 0.69 18.32 ± 0.41 21.09 ± 0.88 21.09 ± 1.15 25.25 ± 0.82 7.13 ± 0.27 9.40 ± 0.70 5.27 ± 0.25 6.09 ± 0.38 5.53 ± 0.5 OOM 13.46 ± 0.34 12.60 ± 0.72 3.49 ± 1.73 26.94 ± 0.48 27.27 ± 0.30 OOM 19.98 ± 0.35 22.05 ± 0.12 OOM SEAL NBFNet 10.67 ± 3.46 13.56 ± 0.58 13.16 ± 1.66 14.29 ± 0.80 5.88 ± 0.53 >24h 6.43 ± 0.32 OOM 9.99 ± 0.90 >24h 29.71 ± 0.71 OOM 20.60 ± 1.28 OOM BUDDY Neo-GNN NCN NCNC 13.71 ± 0.59 13.95 ± 0.39 14.66 ± 0.95 14.98 ± 1.00 22.84 ± 0.36 17.34 ± 0.84 28.65 ± 1.21 24.10 ± 0.65 7.56 ± 0.18 7.74 ± 0.30 5.84 ± 0.22 8.58 ± 0.59 5.67 ± 0.36 5.23 ± 0.9 5.09 ± 0.38 4.73 ± 0.86 12.43 ± 0.50 10.86 ± 2.16 12.86 ± 0.78 >24h 27.70 ± 0.33 21.68 ± 1.14 35.06 ± 0.26 33.52 ± 0.26 19.17 ± 0.20 16.12 ± 0.25 23.35 ± 0.28 19.61 ± 0.54 LPFormer 16.80 ± 0.52 26.34 ± 0.67 9.99 ± 0.52 7.62 ± 0.26 13.20 ± 0.54 40.25 ± 0.24 24.70 ± 0.55 11.1 9.6 8.1 4.7 4.7 NA 6.4 NA 5.9 7.4 4.4 4.8 1.4 original evaluation setting, MPNNs designed specifically for link prediction are often outperformed by heuristics or other MPNNs. The full results can be found in Table 3.5. We observe that LPFormer performs considerably better than all other models. For instance, the mean rank of LPFormer is 3.1x better than the 2nd best-performing model, NCN. This indeed shows the advantage of LPFormer, as it can consistently achieve extraordinary performance across all datasets under the much more challenging HeaRT evaluation setting. This is as opposed to other LP-specific methods that often perform similarly to standard MPNN methods. 3.4.7 Runtime Analysis In this section, we compare the runtime of LPFormer against NCNC, which is the strongest performing baseline. The results are shown in Figure 3.4 on all four OGB datasets We further include the mean degree of each dataset in parentheses. We observe that LPFormer shines on denser datasets, taking significantly less time to train one epoch. This is despite that LPFormer can attend to nodes beyond the 1-hop radius of the target link. This underscores the importance of the PPR thresholding technique introduced in Section 3.3.4, as it allows for efficient attention to a wider variety of nodes. Lastly, we note that LPFormer struggles on the ogbl-citation2 dataset due to the large number of nodes in the dataset (i.e., 2,927,963), which requires the sparse PPR matrix to be quite large. For future work we plan on exploring pre-computing the necessary PPR scores as 39 an efficient pre-processing step, thereby removing the need to store the costly PPR matrix. Please see Appendix B.5.7 for more details. Figure 3.4 Comparison of training time of 1 epoch between LPFormer and NCNC. The mean degree is in parentheses. 3.5 Conclusion In this paper we introduce a new framework, LPFormer, that aims to integrate a wider variety of pairwise information for link prediction. LPFormer does this via a specially designed graph transformer, which adaptively considers how a node pair relate to each other in the context of the graph. Extensive experiments demonstrate that LPFormer can achieve SOTA performance on a wide variety of benchmark datasets while retaining efficiency. We further demonstrate LPFormer’s supremacy at modeling multiple types of LP factors. For future work, we plan on exploring other methods of incorporating multiple LP factors with an emphasis on global structural information. We also plan to investigate the potential of alternative relative positional encodings. 40 CHAPTER 4 TOWARD DEGREE BIAS IN EMBEDDING-BASED KNOWLEDGE GRAPH COMPLETION 4.1 Introduction Knowledge graphs (KGs) are a specific type of graph where each edge represents a single fact. Each fact is represented as a triple (ℎ, 𝑟, 𝑡) that connects two entities ℎ and 𝑡 with a relation 𝑟. KGs have been widely used in many real-world applications such as recommendation [18], drug discovery [77], and natural language understanding [67]. However, the incomplete nature of KGs limits their applicability in those applications. To address this limitation, it is desired to perform a KG completion (KGC) task, i.e., predicting unseen edges in the graph thereby deducing new facts [93]. In recent years, the embedding-based methods [13, 29, 7, 100] that embed a KG into a low-dimensional space have achieved remarkable success on KGC tasks and enable downstream applications. However, a common issue in graph-related tasks is degree bias [107, 56], where nodes of lower degree tend to learn poorer representations and have less satisfactory downstream performance. Recent studies have validated this issue for various tasks on homogeneous graphs such as classi- fication [107, 146, 68] and link prediction [56]. However, KGs are naturally heterogeneous with multiple types of nodes and relations. Furthermore, the study of degree bias on KGs is rather limited. Therefore, in this work, we ask whether degree bias exists in KGs and how it affects the model performance in the context of KGC. To answer the aforementioned question, we perform preliminary studies to investigate how the degree affects the KGC performance. Take a triple (ℎ, 𝑟, 𝑡) as one example. The in-degree of entity 𝑡 is the number of triples where 𝑡 is the tail entity. Furthermore, we define the tail-relation degree as the number of triples where 𝑡 and the relation 𝑟 co-occur as the tail and relation (Eq. (4.2)). An example in Figure 4.1 is the tail-relation pair (Germany, Has Country). Since the pair only co-occurs as a relation and tail in one triple, their tail-relation degree is 1. Our preliminary studies (Section 4.3) suggest that when predicting the tail entity 𝑡, the in-degree of 𝑡 and especially the 41 Figure 4.1 Example of multiple facts in a KG. Since each country only co-occurs with the relation Has Country as the tail on one edge, they each only have a tail-relation degree of one with the relation Has Country. tail-relation degree of (𝑡, 𝑟) plays a vital role. That is, when predicting the tail for a triple (ℎ, 𝑟, 𝑡), the number of triples where the entity 𝑡 and relation 𝑟 co-occur as an entity-relation pair correlates significantly with performance during KGC. Going back to our example, since Germany and Has Country only co-occur as a relation and tail in one triple their tail-relation degree is low, thus making it difficult to predict Germany for the query (Europe, Has Country, ?). Given the existence of degree bias in KGC, we aim to alleviate the negative effect brought by degree bias. Specifically, we are tasked with improving the performance of triples with low tail-relation degrees while maintaining the performance of other triples with a higher tail-relation degree. Essentially, it is desired to promote the engagement of triples with low tail-relation degrees during training so as to learn better embeddings. To address this challenge, we propose a novel data augmentation framework. Our method works by augmenting entity-relation pairs that have low tail-relation degrees with synthetic triples. We generate the synthetic triples by extending the popular Mixup [137] strategy to KGs. Our contributions can be summarized as follows: • Through empirical study, we identify the degree bias problem in the context of KGC. To the best of our knowledge, no previous work has studied the problem of degree bias from the perspective of entity-relation pairs. • We propose a simple yet effective data augmentation method, KG-Mixup, to alleviate the degree bias problem in KG embeddings. 42 • Through empirical analysis, we show that our proposed method can be formulated as a form of regularization on the low tail-relation degree samples. • Extensive experiments have demonstrated that our proposed method can improve the performance of lower tail-relation degree triples on multiple benchmark datasets without compromising the performance on triples of higher degree. 4.2 Related Work KG Embedding: TransE [13] models the embeddings of a single triple as a translation in the embedding space. Multiple works model the triples as a tensor factorization problem, including [84, 127, 7, 4]. ConvE [29] learns the embeddings by modeling the interaction of a single triple via a convolutional neural network. Other methods like R-GCN [100] modify GCN [55] for relational graphs. Imbalanced/Long-Tail Learning: Imbalanced/Long-Tail Learning considers the problem of model learning when the class distribution is highly uneven. SMOTE [21], a classic technique, attempts to produce new synthetic samples for the minority class. Recent work has focused on tackling imbalance problems on deeper models. Works such as [90, 106, 66] address this problem by modifying the loss for different samples. Another branch of work tries to tackle this issue by utilizing ensemble modeling [119, 150, 122]. Degree Bias: [76] demonstrate the existence of popularity bias in popular KG datasets, which causes models to inflate the score of entities with a high degree. [11] show the existence of entity degree bias in biomedical KGs. [93] demonstrate that the performance is positively correlated with the number of source peers and negatively with the number of target peers. [56] analyze the degree bias of random walks. To alleviate this issue, they propose a debiasing method that utilizes random graphs. In addition, many studies have focused on allaying the effect of degree bias for the task of node classification including [107, 146, 68]. However, there is no work that focuses on how the intersection of entity and relation degree bias effects embedding-based KGC. Data Augmentation for Graphs There is a line of works studying data augmentation for homo- geneous graphs [149, 147]. Few of these works study the link prediction problem [148] but they 43 (a) In and Out Degree Analysis (b) Tail In-Degree Analysis (c) Controlling for Degrees Figure 4.2 MRR when predicting the tail for TuckER on FB15K-237 when varying the (a) in-degree and out-degree of the head and tail entity, (b) tail-relation and other-relation in-degree, and (c) other-relation in-degree for smaller sub-ranges of the tail-relation degree. do not address the issues in KGs. To augment KGs, [108] generate synthetic triples via adversarial learning; [118] use a GAN to create stronger negative samples; [59] use rule mining to identify new triples to augment the graph with. However, all these methods do not augment with for the purpose of degree bias in KGs and hence are not applicable to the problem this paper studies. 4.3 Preliminary Study In this section we empirically study the degree bias problem in KGC. We focus on two repre- sentative embedding based methods ConvE [29] and TuckER [7]. We first introduce some notations. We denote G = {V, R, E} as a KG with entities V, relations R, and edges E. Each edge represents two entities connected by a single relation. We refer to an edge as a triple and denote it as (ℎ, 𝑟, 𝑡) where ℎ is referred to as the head entity, 𝑡 the tail entity, and 𝑟 the relation. Each entity and relation is represented by an embedding. We represent the embedding for a single entity 𝑣 as x𝑣 ∈ R𝑛𝑣 and the embedding for a relation 𝑟 as x𝑟 ∈ R𝑛𝑟 , where 𝑛𝑣 and 𝑛𝑟 are the dimensions of the entity and relation embeddings, respectively. We further define the degree of an entity 𝑣 as 𝑑𝑣 and the in-degree (𝑑 (𝑖𝑛) ) and out-degree (𝑑 (𝑜𝑢𝑡) ) as the number of 𝑣 𝑣 triples where 𝑣 is the tail and head entity, respectively. Lastly, KGC attempts to predict new facts that are not found in the original KG. This involves predicting the tail entities that satisfy (ℎ, 𝑟, ?) and the head entities that satisfy (?, 𝑟, 𝑡). Follow- ing [30], we augment all triples (ℎ, 𝑟, 𝑡) with its inverse (𝑡, 𝑟 −1, ℎ). As such, predicting the head 44 entity of (ℎ, 𝑟, 𝑡) is analogous to predicting the tail entity for (𝑡, 𝑟 −1, ?). Under such a setting, KGC can be formulated as always predicting the tail entity. Therefore, in the remainder of this work, we only consider KGC as predicting the tail entities that satisfy (ℎ, 𝑟, ?). In the following subsections, we will explore the following questions: (1) Does degree bias exist in typical KG embedding models? and (2) Which factor in a triple is related to such bias? To answer these questions, we first study how the head and tail entity degree affect KGC performance in Section 4.3.1. Then, we investigate the impact of the frequency of entity-relation pairs co-occurring on KGC performance in Section 4.3.2. 4.3.1 Entity Degree Analysis We first examine the effect that the degree of both the head and tail entities have on KGC per- formance. We perform our analysis on the FB15K-237 dataset [111], a commonly used benchmark in KGC. Since a KG is a directed graph, we postulate that the direction of an entity’s edges matters. We therefore split the degree of each entity into its in-degree and out-degree. We measure the performance using the mean reciprocal rank (MRR). Note that the degree metrics are calculated using only the training set. Figure 4.2a displays the results of TuckER (see Section D.3.5 for more details) on FB15K-237 split by both entities and degree type. From Figure 4.2a we observe that when varying the tail entity degree value, the resulting change in test MRR is significantly larger than when varying the degree of head entities. Furthermore, the MRR increases drastically with the increase of tail in-degree (blue line) while there is a parabolic-like relationship when varying the tail out-degree (orange line). From these observations we can conclude: (1) the degree of the tail entity (i.e. the entity we are trying to predict) has a larger impact on test performance than the degree of the head entity; (2) the tail in-degree features a more distinguishing and apparent relationship with performance than the tail out-degree. Due to the page limitation, the results of ConvE are shown in Appendix C.4, where we have similar observations. These results suggest that KGC displays a degree bias in regards to the in-degree. Next, we will examine which factors of a triple majorly contribute to such degree bias. 45 4.3.2 Entity-Relation Degree Analysis In the previous subsection, we have demonstrated the relationship between the entity degree and KGC performance. However, it doesn’t account for the interaction of the entities and relation. Therefore, we further study how the presence of both the relation and entities in a triple together impact the KGC performance. We begin by defining the number of edges that contains the relation 𝑟 and an entity 𝑣 as the relation-specific degree: 𝑑𝑣,𝑟 = |{(ℎ, 𝑟, 𝑡) ∈ E | ℎ = 𝑣 ∨ 𝑡 = 𝑣}|. (4.1) Based on the results in Section 4.3.1, we posit that the main indicator of performance is the in- degree of the tail entity. We extend this idea to our definition of relation-specific degree by only counting co-occurrences of an entity and relation when the entity occurs as the tail. For simplicity we refer to this as the tail-relation degree and define it as: 𝑑 (𝑡𝑎𝑖𝑙) 𝑣,𝑟 = |{(ℎ, 𝑟, 𝑣) ∈ E}|. (4.2) The tail-relation degree can be understood as the number of edges that an entity 𝑣 shares with 𝑟, where 𝑣 occupies the position we are trying to predict (i.e. the tail entity). We further refer to the number of in-edges that 𝑣 doesn’t share with 𝑟 as “Other-Tail Relation” degree. This is calculated as the difference between the in-degree of entity 𝑣 and the tail-relation degree of 𝑣 and relation 𝑟, i.e. 𝑑 (𝑖𝑛) It is easy to verify that the in-degree of an entity 𝑣 is the summation of 𝑣 − 𝑑 (𝑡𝑎𝑖𝑙) 𝑣,𝑟 . the tail-relation degree and “Other-Tail Relation” degree. We use Figure 4.1 as an example of the tail-relation degree. The entity Sweden co-occurs with the relation Has Country on one edge. On that edge, Sweden is the tail entity. Therefore the tail-relation degree of the pair (Sweden, Has Country) is one. We note that a special case of the tail-relation degree is relation-level semantic evidence defined by [64]. Figure 4.2b displays the MRR when varying the value of the tail-relation and “Other-Tail Relation” degree of the tail entity. From the results, we note that while both degree metrics correlate with performance, the performance when the other-tail-relation degree in the range [0, 3) is quite high. Since both metrics are highly correlated, it is difficult to determine which metric 46 is more important for the downstream performance. Is the “Other-Tail Relation” the determining factor for performance or is it the tail-relation degree? We therefore check the performance when controlling for one another. Figure 4.2c displays the results when varying the “Other-Tail Relation” degree for specific sub-ranges of the tail-relation degree. From this figure, we see that the tail- relation degree exerts a much larger influence on the KGC performance as there is little variation between bars belonging to the same subset. Rather the tail-relation degree (i.e. the clusters of bars) has a much larger impact. Therefore, we conclude that for a single triple, the main factor of degree bias is the tail-relation degree of the tail entity. Remark. Our analysis differs from traditional research on degree bias. While previous works focus only on the degree of the node, we focus on a specific type of frequency among entity-relation pairs. This is vital as the frequencies of both the entities and relations are important in KGs. Though we only analyze KGs, findings from our analysis could be applicable to other types of heterogeneous graphs. 4.4 The Proposed Method Grounded in the observations in Section 4.3.2, one natural idea to alleviate the degree bias in KGC is to compensate the triples with low tail-relation degrees. Based on this intuition, we propose a new method for improving the KGC performance of triples with low tail-relation degrees. Our method, KG-Mixup, works by augmenting the low tail-relation degree triples during training with synthetic samples. This strategy has the effect of increasing the degree of an entity-relation pair with a low tail-relation degree by creating more shared edges between them. Therefore, KG-Mixup is very general and can further be used in conjunction with any KG embedding technique. 4.4.1 General Problem In Section 4.3.2 we showed that the tail-relation degree of the tail entity strongly correlates with higher performance in KGC. As such we seek to design a method that can increase the performance of such low-degree entity-relation pairs without sacrificing the performance of high-degree pairs. To solve this problem, we consider data augmentation. Specifically, we seek to create synthetic triples for those entity-relations pairs with a low tail-relation degree. In such a way we are creating 47 more training triples that contain those pairs, thereby “increasing” their degree. For each entity- relation pair with a tail-relation degree less than 𝜂, we add 𝑘 synthetic samples, which can be formulated as follows: ˜E𝑣,𝑟 = E𝑣,𝑟 ∪{( ˜ℎ, ˜𝑟, ˜𝑡)}𝑘 𝑖=1 𝑑 (𝑡𝑎𝑖𝑙) 𝑣,𝑟 < 𝜂, E𝑣,𝑟 else, (4.3)    where (ℎ, 𝑟, 𝑣) ∈ E𝑣,𝑟 are the original training triples with the relation 𝑟 and the tail entity 𝑣, ( ˜ℎ, ˜𝑟, ˜𝑡) is a synthetic sample, and ˜E𝑣,𝑟 is the new set of triples to use during training. Challenges. We note that creating the synthetic samples as shown in Eq. (4.3) is non-trivial and there are a number of challenges: 1. How do we produce the synthetic samples for KG triples that contain multiple types of embed- dings? 2. How do we promote diversity in the synthetic samples ( ˜ℎ, ˜𝑟, ˜𝑡)? We want them to contain sufficient information from the original entity and relation embeddings we are augmenting, while also being distinct from similar triples in E𝑣,𝑟. 3. How do we achieve such augmentation in a computationally efficient manner? These challenges motivate us to design a special data augmentation algorithm for knowledge graph completion and we detail its core techniques in the next subsection. 4.4.2 KG-Mixup We now present our solution for producing synthetic samples as described in Eq. (4.3). Inspired by the popular Mixup [137] strategy, we strive to augment the training set by mixing the represen- tations of triples. We draw inspiration from mixup as (1) it is an intuitive and widely used data augmentation method, (2) it is able to promote diversity in the synthetic samples via the randomly drawn value 𝜆, and (3) it is computationally efficient (see Section D.2.2). We now briefly describe the general mixup algorithm. We denote the representations of two samples as 𝑥1 and 𝑥2 and their labels 𝑦1 and 𝑦2. Mixup creates a new sample ˜𝑥 and label ˜𝑦 by combining both the representations and labels via a random value 𝜆 ∈ [0, 1] drawn from 48 Algorithm 4.1 KG-Mixup Training Procedure Require: 𝐺 = {𝑉, 𝑅, E} 𝑘, 𝜂 𝑋, 𝑊 ⊲ Training graph ⊲ # of samples to generate and degree threshold ⊲ Model embeddings and parameters if 𝑑 (𝑡 𝑎𝑖𝑙) < 𝜂 then 𝑡𝑖 ,𝑟𝑖 𝐶 = {(ℎ∗, 𝑟 ∗, 𝑡𝑖) ∈ E} 𝑆 = Rand-Sample(𝐶, 𝑘) Emix = {Mix(𝑒, 𝑠) ∀𝑠 ∈ 𝑆} for 𝑒 = (ℎ𝑖, 𝑟𝑖, 𝑡𝑖) ∈ E do 1: Randomly initialize 𝑋 and 𝑊 2: Pre-train to obtain initial 𝑋 3: Randomly re-initialize 𝑊 4: while not converged do 5: 6: 7: 8: 9: 10: 11: 12: 13: end for 14: 15: end while 16: return 𝑋 and 𝑊 Emix = {} else end if Update model parameters on {𝑒} ∪ Emix ⊲ Eq. (4.6) 𝜆 ∼ Beta(𝛼, 𝛼) such that: ˜𝑥 = 𝜆x1 + (1 − 𝜆)x2, ˜𝑦 = 𝜆y1 + (1 − 𝜆)y2. (4.4) (4.5) We adapt this strategy to our studied problem for a triple (ℎ, 𝑟, 𝑡) where the tail-relation degree is below a degree threshold, i.e. 𝑑 (𝑡𝑎𝑖𝑙) creating 𝑘 synthetic samples {( ˜ℎ, ˜𝑟, ˜𝑡)}𝑘 < 𝜂. For such a triple we aim to augment the training set by 𝑖=1. This is done by mixing the original triple with 𝑘 other 𝑡,𝑟 triples {(ℎ𝑖, 𝑟𝑖, 𝑡𝑖)}𝑘 𝑖=1. However, directly adopting mixup to KGC leads to some problems: (1) Since each sample doesn’t contain a label (Eq. 4.5) we are unable to perform label mixing. (2) While standard mixup randomly selects samples to mix with, we may want to utilize a different selection criteria to better enhance those samples with a low tail-relation degree. (3) Since each sample is composed of multiple components (entities and relations) it’s unclear how to mix two samples. We go over these challenges next. 49 4.4.2.1 Label Incorporation in KGC We first tackle how to incorporate the label information as shown in Eq. (4.5). Mixup was originally designed for classification problems, making the original label mixing straightforward. However, for KGC, we have no associated label for each triple. We therefore consider the entity we are predicting as the label. For a triple 𝑒1 = (ℎ1, 𝑟1, 𝑡1) where we are predicting 𝑡1, the label would be considered the entity 𝑡1. 4.4.2.2 Mixing Criteria Per the original definition of Mixup, we would then mix 𝑒1 with a triple belonging to the set {(ℎ2, 𝑟2, 𝑡2) ∈ E | 𝑡2 ≠ 𝑡1}. However, since our goal is to predict 𝑡1 we wish to avoid mixing it. Since we want to better predict 𝑡1, we need to preserve as much tail (i.e. label) information as possible. As such, we only consider mixing with other triples that share the same tail and belong to the set {(ℎ2, 𝑟2, 𝑡1) ∈ E | ℎ1 ≠ ℎ2, 𝑟1 ≠ 𝑟2}. Our design is similar to SMOTE [21], where only samples belonging to the same class are combined. We note that while it would be enticing to only consider mixing with triples containing the same entity-relation pairs, i.e. (ℎ2, 𝑟1, 𝑡1) ∈ E𝑡1,𝑟1, this would severely limit the number of possible candidate triples as the tail-relation degree can often be as low as one or two for some pairs. 4.4.2.3 How to Mix? We now discuss how to perform the mixing of two samples. Given a triple 𝑒1 = (ℎ1, 𝑟1, 𝑡1) of low tail-relation degree we mix it with another triple that shares the same tail (i.e. label) such that 𝑒2 = (ℎ2, 𝑟2, 𝑡1). Applying Eq. (4.4) to 𝑒1 and 𝑒2, a synthetic triple ˜𝑒 = ( ˜ℎ, ˜𝑟, ˜𝑡) is equal to: ˜𝑒 = 𝜆𝑒1 + (1 − 𝜆)𝑒2, ˜𝑒 = 𝜆(ℎ1, 𝑟1, 𝑡1) + (1 − 𝜆) (ℎ2, 𝑟2, 𝑡1), ˜𝑒 = 𝜆(xℎ1 , x𝑟1 , x𝑡1) + (1 − 𝜆) (xℎ2 , x𝑟2 , x𝑡1), (4.6) (4.7) (4.8) 50 where xℎ𝑖 and x𝑟 𝑗 represent the entity and relation embeddings, respectively. We apply the weighted sum to the head, relation, and tail, separately. Each entity and relation are therefore equal to: 𝑥 ˜ℎ = 𝜆xℎ1 + (1 − 𝜆)xℎ2 , 𝑥 ˜𝑟 = 𝜆x𝑟1 + (1 − 𝜆)x𝑟2 , 𝑥 ˜𝑡 = x𝑡1 . (4.9) (4.10) (4.11) We use Figure 4.1 to illustrate an example. Let 𝑒1 = (Europe, Has Country, Germany) be the triple we are augmenting. We mix it with another triple with the tail entity Germany. We consider the triple 𝑒2 = (Belgium, Borders, Germany). The mixed triple is represented as ˜𝑒 = (Europe + Belgium, Has Country + Borders, Germany). As 𝑒1 contains the continent that Germany belongs to and 𝑒2 has the country it borders, we can understand the synthetic triple ˜𝑒 as conveying the geographic location of Germany inside of Europe. This is helpful when predicting Germany in the original triple 𝑒1, since the synthetic sample imbues the representation of Germany with more specific geographic information. 4.4.3 KG-Mixup Algorithm for KGC We utilize the binary cross-entropy loss when training each model. The loss is optimized using the Adam optimizer [52]. We also include a hyperparameter 𝛽 for weighting the loss on the synthetic samples. The loss on a model with parameters 𝜃 is therefore: L (𝜃) = L𝐾𝐺 (𝜃) + 𝛽LMix(𝜃) , (4.12) where L𝐾𝐺 is the loss on the original KG triples and LMix is the loss on the synthetic samples. The full algorithm is displayed in Algorithm 4.1. We note that we first pre-train the model before training with KG-Mixup, to obtain the initial entity and relation representations. This is done as it allows us to begin training with stronger entity and relation representations, thereby improving the generated synthetic samples. 4.4.4 Algorithmic Complexity We denote the algorithmic complexity of a model 𝑓 (e.g. ConvE [29] or TuckER [7]) for a single sample 𝑒 as 𝑂 ( 𝑓 ). Assuming we generate 𝑁 negative samples per training sample, the 51 training complexity of 𝑓 over a single epoch is: 𝑂 (𝑁 · |E | · 𝑂 ( 𝑓 )) , (4.13) where |E | is the number of training samples. In KG-Mixup, in addition to scoring both the positive and negative samples, we also score the synthetic samples created for all samples with a tail-relation degree below a threshold 𝜂. We refer to that set of samples below the degree threshold as Ethresh. We create 𝑘 synthetic samples per 𝑒 ∈ Ethresh. As such, our algorithm scores an additional 𝑘 · |Ethresh| samples for a total of 𝑁 · |E | + 𝑘 · |Ethresh| samples per epoch. Typically the number of negative samples 𝑁 >> 𝑘. Both ConvE and TuckER use all possible negative samples per training sample while we find 𝑘 = 5 works well. Furthermore, by definition, Ethresh ⊆ E rendering |E | >> |Ethresh|. We can thus conclude that 𝑁 · |E | >> 𝑘 · |Ethresh|. We can therefore express the complexity of KG-Mixup as: ≈ 𝑂 (𝑁 · |E | · 𝑂 ( 𝑓 )) . (4.14) This highlights the efficiency of our algorithm as its complexity is approximately equivalent to the standard training procedure. 4.5 Regularizing Effect of KG-Mixup In this section, we examine the properties of KG-Mixup and show it can be formulated as a form of regularization on the entity and relation embeddings of low tail-relation degree samples following previous works [19, 138]. We denote the mixup loss with model parameters 𝜃 over samples 𝑆 as LMix(𝜃). The set 𝑆 contains those samples with a tail-relation degree below a threshold 𝜂 (see line 6 in Algorithm 4.1). The embeddings for each sample 𝑒𝑖 = (ℎ𝑖, 𝑟𝑖, 𝑡) ∈ 𝑆 is mixed with those of a random sample 𝑒 𝑗 = (ℎ 𝑗 , 𝑟 𝑗 , 𝑡) that shares the same tail. The embeddings are combined via a random value 𝜆 ∼ Beta(𝛼, 𝛼) as shown in Eq. (4.9), thereby producing the synthetic sample ˜𝑒 = ( ˜ℎ, ˜𝑟, 𝑡). The formulation for Lmix(𝜃) is therefore: LMix(𝜃) = 1 𝑘 |𝑆| |𝑆| ∑︁ 𝑘 ∑︁ 𝑖=1 𝑗=1 𝑙𝜃 ( ˜𝑒, ˜𝑦) , 52 (4.15) where 𝑘 synthetic samples are produced for each sample in 𝑆, and ˜𝑦 is the mixed binary label. Following Theorem 1 in [19] we can rewrite the loss as the expectation over the synthetic samples as, LMix(𝜃) = 1 |𝑆| |𝑆| ∑︁ 𝑖=1 E𝜆, 𝑗 𝑙𝜃 ( ˜𝑒, ˜𝑦) , (4.16) where 𝜆 ∼ D𝜆 and 𝑗 ∼ Uniform(E𝑡). The distribution D𝜆 = Beta[ 1 ,1] (𝛼, 𝛼) and the set E𝑡 contains all samples (ℎ 𝑗 , 𝑟 𝑗 , 𝑡) with tail 𝑡. Since the label 𝑦 for both samples 𝑖 and 𝑗 are always 1, rendering 2 ˜𝑦 = 1, we can simplify Eq. (4.16) arriving at: LMix(𝜃) = 1 |𝑆| |𝑆| ∑︁ 𝑖=1 E𝜆, 𝑗 𝑙𝜃 ( ˜𝑒) . (4.17) For the above loss function, we have the following theorem. Theorem 1. The mixup loss LMix(𝜃) defined in Eq. (4.17) can be rewritten as the following where the loss function 𝑙𝜃 is the binary cross-entropy loss, L (𝜃) is the loss on the original set of augmented samples 𝑆, and R1(𝜃) and R2(𝜃) are two regularization terms, LMix(𝜃) = L (𝜃) + R1(𝜃) + R2(𝜃). (4.18) The regularization terms are given by the following where each mixed sample ˜𝑒 is composed of a low tail-relation degree sample 𝑒𝑖 and another sample with the same tail entity 𝑒 𝑗 : R1(𝜃) = R2(𝜃) = 𝜏 |𝑆| 𝜏 |𝑆| |𝑆| ∑︁ 𝑘 ∑︁ 𝑖=1 |𝑆| ∑︁ 𝑗=1 𝑘 ∑︁ 𝑖=1 𝑗=1 (1 − 𝜎 ( 𝑓 (𝑒𝑖))) (1 − 𝜎 ( 𝑓 (𝑒𝑖))) 𝜕 𝑓 (𝑒𝑖)𝑇 𝜕𝑥 ˜ℎ Δℎ, 𝜕 𝑓 (𝑒𝑖)𝑇 𝜕𝑥 ˜𝑟 Δ𝑟, (4.19) (4.20) with 𝜏 = E𝜆∼D𝜆 (1 − 𝜆), Δℎ = (cid:16) 𝑥ℎ 𝑗 − 𝑥ℎ𝑖 (cid:17) , Δ𝑟 = (cid:16) 𝑥𝑟 𝑗 − 𝑥𝑟𝑖 (cid:17) , 𝜎 is the sigmoid function, and 𝑓 is the score function. We provide the detailed proof of Theorem 1 in Appendix C.6. Examining the terms in Eq (4.18), we can draw the following understandings on KG-Mixup: 53 1. The inclusion of L (𝜃) implies that the low tail-relation degree samples are scored an additional time when being mixed. This can be considered as a form of oversampling on the low tail-relation degree samples. 2. If the probability is very high, i.e. 𝜎( 𝑓 (𝑒𝑖)) ≈ 1, both R1 and R2 cancel out. This is intuitive as if the current parameters perform well for the original low-degree sample, there is no need to make any adjustments. 3. We can observe that R1 and R2 enforce some regularization on the derivatives as well as the difference between the embeddings Δℎ and Δ𝑟. This motivates us to further examine the difference between the embeddings. In Section 4.6.3, we find that our method does indeed produce more similar embeddings, indicating that our method exerts a smoothing effect among mixed samples. 4.6 Experiment In this section we conduct experiments to demonstrate the effectiveness of our approach on multiple benchmark datasets. We further compare the results of our framework to other methods commonly used to address bias. In particular we study if KG-Mixup can (a) improve overall KGC performance and (b) increase performance on low tail-relation degree triples without degrading performance on other triples. We further conduct studies examining the effect of the regularization terms, ascertaining the importance of each component in our framework, and the ability of KG- Mixup to improve model calibration. 4.6.1 Experimental Settings 4.6.1.1 Datasets We conduct experiments on three datasets including FB15K-237 [111], CoDEx-M [97], and NELL-995 [126]. We omit the commonly used dataset WN18RR [29] as a majority of entities have a degree less than or equal to 3, and as such does not exhibit any degree bias towards triples with a low tail-relation degree. The statistics of each dataset is shown in Table C.1. 54 4.6.1.2 Baselines We compare the results of our method, KG-Mixup, with multiple popular methods proposed for addressing imbalanced problems. Such methods can be used to mitigate bias caused by the initial imbalance. In our case, an imbalance in tail-relation degree causes algorithms to be biased against triples of low tail-relation degree. Specifically, we implement: (a) Over-Sampling triples below a degree threshold 𝜂. We over-sample 𝜂 − 𝑑 (𝑡𝑎𝑖𝑙) times, (b) Loss Re-Weighting [135], which assigns 𝑣,𝑟 a higher loss to triples with a low tail-relation degree, (c) Focal Loss [66], which assigns a higher weight to misclassified samples (e.g. low degree triples). 4.6.1.3 Evaluation Metrics To evaluate the model performance on the test set, we report the mean reciprocal rank (MRR) and the Hits@k for 𝑘 = 1, 10. Following [13], we report the performance using the filtered setting. 4.6.1.4 Implementation Details In this section, we detail the training procedure used to train our framework KG-Mixup. We conduct experiments on our framework using two different KG embedding models, ConvE [29] and TuckER [7]. Both methods are widely used to learn KG embeddings and serve as a strong indicator of our framework’s efficacy. We use stochastic weight averaging (SWA) [48] when training our model. SWA uses a weighted average of the parameters at different checkpoints during training for inference. Previous work [89] has shown that SWA in conjunction with data augmentation can increase performance. Lastly, the synthetic loss weighting parameter 𝛽 is determined via hyperparameter tuning on the validation set. 4.6.2 Main Results In this subsection we evaluate KG-Mixup on multiple benchmarks, comparing its test perfor- mance against the baseline methods. We first report the overall performance of each method on the three datasets. We then report the performance for various degree bins. The top results are bolded with the second best underlined. Note that the Standard method refers to training without any additional method to alleviate bias. Table 4.1 contains the overall results on each method and dataset. The performance is reported 55 Table 4.1 Knowledge Graph Completion (KGC) Comparison. Model Method FB15K-237 NELL-995 CoDEx-M MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 Standard 33.04 23.95 51.23 50.87 44.14 61.48 31.70 24.34 45.60 ConvE + Over-Sampling + Loss Re-weighting + Focal Loss + KG-Mixup (Ours) 30.45 32.32 32.08 34.33 21.85 23.32 23.29 25.00 47.81 50.19 50.09 53.11 48.63 50.89 50.43 51.08 40.99 43.83 44.00 43.52 60.78 62.17 60.70 63.22 27.13 28.38 27.99 31.71 20.17 21.12 20.93 23.49 40.11 42.68 41.48 47.49 Standard 35.19 26.06 53.47 52.11 45.51 62.26 31.67 24.46 45.73 TuckER + Over-Sampling + Loss Re-weighting + Focal Loss + KG-Mixup (Ours) 34.77 35.25 34.02 35.83 25.48 26.08 24.79 26.37 53.53 53.34 52.48 54.78 50.36 51.91 49.57 52.24 44.04 45.76 43.28 45.78 60.40 61.05 58.91 62.14 29.97 31.58 31.47 31.90 22.27 24.32 24.05 24.15 44.19 45.41 45.60 46.54 Table 4.2 MRR for tail-relation degree bins. The range for the zero, low, medium and high bins are [0, 1), [1, 10), [10, 50), and [50, ∞), respectively. Model Method FB15K-237 NELL-995 CoDEx-M Zero Low Medium High Zero Low Medium High Zero Low Medium High Standard 7.34 12.35 34.95 70.97 35.37 57.16 65.99 91.90 8.38 7.97 34.64 ConvE + Over-Sampling + Loss Re-weighting + Focal Loss + KG-Mixup (Ours) 8.37 5.03 7.52 10.90 12.45 9.89 11.89 13.92 33.01 30.56 33.96 35.74 68.75 63.34 68.75 70.72 36.67 36.16 34.72 35.38 57.33 57.96 58.00 59.56 56.09 63.69 65.60 65.41 79.57 89.52 90.89 90.64 8.09 8.79 6.78 9.74 7.52 7.09 6.80 8.96 Standard 10.41 14.65 38.49 71.39 37.02 58.21 69.17 90.55 9.99 8.29 TuckER + Over-Sampling + Loss Re-weighting + Focal Loss + KG-Mixup (Ours) 12.25 10.61 10.84 11.83 14.28 14.40 13.53 15.61 36.79 37.66 37.00 39.45 70.50 72.28 69.28 70.86 34.50 36.59 34.18 36.12 55.46 59.00 53.60 60.73 65.68 67.19 62.67 71.67 93.47 91.17 91.02 92.27 10.98 10.44 9.68 9.14 7.76 8.62 8.17 8.70 29.51 29.09 33.42 32.63 35.23 32.50 35.00 33.95 32.38 65.29 54.80 58.10 56.96 64.38 63.94 60.25 63.39 64.13 65.28 for both ConvE and TuckER. KG-Mixup achieves for the best MRR and Hits@10 on each dataset for ConvE. For TuckER, KG-Mixup further achieves the best MRR on each dataset and the top Hits@10 for two. Note that the other three baseline methods used for alleviating bias, on average, perform poorly. This may be due to their incompatibility with relational structured data where each sample contains multiple components. It suggests that we need dedicated efforts to handle the degree bias in KGC. We further report the MRR of each method for triples of different tail-relation degree. We split the triples into four degree bins of zero, low, medium and high degree. The range of each bin is [0, 1), [1, 10], [10, 50), and [50, ∞), respectively. KG-Mixup achieves a notable increase in 56 performance on low tail-relation degree triples for each dataset and embedding model. KG-Mixup increases the MRR on low degree triples by 9.8% and 5.3% for ConvE and TuckER, respectively, over the standard trained models on the three datasets. In addition to the strong increase in low degree performance, KG-Mixup is also able to retain its performance for high degree triples. The MRR on high tail-relation degree triples degrades, on average, only 1% on ConvE between our method and standard training and actually increases 1% for TuckER. Interestingly, the performance of KG-Mixup on the triples with zero tail-relation degree isn’t as strong as the low degree triples. We argue that such triples are more akin to the zero-shot learning setting and therefore different from the problem we are studying. Lastly, we further analyzed the improvement of KG-Mixup over standard training by comparing the difference in performance between the two groups via the paired t-test. We found that for the results in Table 4.1, 5/6 are statistically significant (p<0.05). Furthermore, for the performance on low tail-relation degree triples in Table 4.2, all results (6/6) are statistically significant. This gives further justification that our method can improve both overall and low tail-relation degree performance. 4.6.3 Regularization Analysis In this subsection we empirically investigate the regularization effects of KG-Mixup discussed in Section 4.5. In Section 4.5 we demonstrated that KG-Mixup can be formulated as a form of regularization. We further showed that one of the quantities minimized is the difference between the head and relation embeddings of the two samples being mixed, 𝑒𝑖 and 𝑒 𝑗 , such that (𝑥ℎ 𝑗 − 𝑥ℎ𝑖 ) and (𝑥𝑟 𝑗 − 𝑥𝑟𝑖 ). Here 𝑒𝑖 is the low tail-relation degree sample being augmented and 𝑒 𝑗 is another sample that shares the same tail. We deduce from this that for low tail-relation degree samples, KG-Mixup may cause their head and relation embeddings to be more similar to those of other samples that share same tail. Such a property forms a smoothing effect on the mixed samples, which facilitates a transfer of information to the embeddings of the low tail-relation degree sample. We investigate this by comparing the head and relation embeddings of all samples that are augmented with all the head and relation embeddings that also share the same tail entity. We 57 denote the set of all samples below some tail-relation degree threshold 𝜂 as Ethresh and all samples with tail entity 𝑡 as E𝑡. Furthermore, we refer to all head entities that are connected to a tail 𝑡 as H𝑡 = {ℎ 𝑗 | (ℎ 𝑗 , 𝑟 𝑗 , 𝑡) ∈ E𝑡 } and all such relations as R𝑡 = {𝑟 𝑗 | (ℎ 𝑗 , 𝑟 𝑗 , 𝑡) ∈ E𝑡 }. For each sample (ℎ𝑖, 𝑟𝑖, 𝑡) ∈ Ethresh we compute the mean euclidean distance between the (1) head embedding xℎ𝑖 and all xℎ 𝑗 ∈ H𝑡 and (2) the relation embedding x𝑟𝑖 and all x𝑟 𝑗 ∈ R𝑡. For a single sample 𝑒𝑖 the mean head and relation embedding distance are given by ℎdist(𝑒𝑖) and 𝑟dist(𝑒𝑖), respectively. Lastly, we take the mean of both the head and relation embeddings mean distances across all 𝑒 ∈ Ethresh, 𝐷rel = Mean (𝑟dist(𝑒𝑖) | 𝑒𝑖 ∈ Ethresh) , 𝐷head = Mean (ℎdist(𝑒𝑖) | 𝑒𝑖 ∈ Ethresh) . (4.21) (4.22) Both 𝐷head and 𝐷rel are shown in Table 4.3 for models fitted with and without KG-Mixup. We display the results for ConvE on FB15K-237. For both the mean head and relation distances, KG-Mixup produces smaller distances than the standardly-trained model. This aligns with our previous theoretical understanding of the regularization effect of the proposed method: for samples for which we augment during training, their head and relation embeddings are more similar to those embeddings belonging to other samples that share the same tail. This to some extent forms a smoothing effect, which is helpful for learning better representations for the low-degree triplets. Table 4.3 Mean Embedding Distances on FB15K-237. Embedding Type Head Entity Relation w/o KG-Mixup KG-Mixup 1.18 1.09 1.21 1.13 % Decrease -7.6% -6.6% 4.6.4 Ablation Study In this subsection we conduct an ablation study of our method on the FB15K-237 dataset using ConvE and TuckER. We ablate both the data augmentation strategy and the use of stochastic weight averaging (SWA) separately to ascertain their effect on performance. We report the overall test MRR and the low tail-relation degree MRR. The results of the study are shown in Table 4.4. 58 KG-Mixup achieves the best overall performance on both embedding models. Using only our data augmentation strategy leads to an increase in both the low degree and overall performance. On the other hand, while the SWA-only model leads to an increase in overall performance it degrades the low degree performance. We conclude from these observations that data augmentation component of KG-Mixup is vital for improving low degree performance while SWA helps better maintain or even improve performance on the non-low degree triples. Table 4.4 Ablation Study on FB15K-237. Method ConvE TuckER Low Overall Low Overall Standard + SWA + Augmentation KG-Mixup (Ours) 12.35 12.27 13.99 13.92 33.04 33.69 33.67 34.33 14.65 14.18 15.64 15.61 35.19 35.77 35.62 35.83 4.6.5 Parameter Study In this subsection we study how varying the number of generated synthetic samples 𝑘 and the degree threshold 𝜂 affect the performance of KG-Mixup. We consider the values 𝑘 ∈ {1, 5, 10, 25} and 𝜂 ∈ {2, 5, 15}. We report the MRR for both TuckER and ConvE on the CoDEx-M dataset. Figure 4.3a displays the performance when varying the degree threshold. Both methods peak at a value of 𝜂 = 5 and perform worst at 𝜂 = 15. Figure 4.3b reports the MRR when varying the number of synthetic samples generated. Both methods peak early with ConvE actually performing best at 𝑘 = 1. Furthermore, generating too many samples harms performance as evidenced by the sharp drop in MRR occurring after 𝑘 = 5. 4.6.6 Model Calibration In this subsection we demonstrate that KG-Mixup is effective at improving the calibration of KG embedding models. Model calibration [40] is concerned with how well calibrated a models prediction probabilities are with its accuracy. Previous work [87] have discussed the desirability of calibration to minimize bias between different groups in the data (e.g. samples of differing degree). Other work [116] has drawn the connection between out-of-distribution generalization and model 59 (a) Varying degree threshold (b) Varying # of samples Figure 4.3 MRR of TuckER and ConvE on CoDEx-M (a) when varying the degree threshold and (b) when varying the number of samples generated. calibration, which while not directly applicable to our problem is still desirable. Relevant to our problem, [110] has shown that Mixup is effective at calibrating deep models for the tasks of image and text classification. As such, we investigate if KG-Mixup is helpful at calibrating KG embedding models for KGC. Table 4.5 Expected Calibration Error (ECE). Lower is better. Model Method FB15K-237 NELL-995 CoDEx-M Low Overall Low Overall Low Overall ConvE TuckER Standard KG-Mixup Standard KG-Mixup 0.19 0.08 0.20 0.07 0.15 0.05 0.35 0.1 0.34 0.08 0.63 0.26 0.27 0.08 0.56 0.20 0.28 0.02 0.05 0.01 0.26 0.09 0.34 0.06 We compared the expected calibration error (see Appendix C.5 for more details) between models trained with KG-Mixup and those without on multiple datasets. We report the calibration in Table 4.5 for all samples and those with a low tail-relation degree. We find that in every instance KG-Mixup produces a better calibrated model for both ConvE and TuckER. These results suggest another reason for why KG-Mixup works; a well-calibrated model better minimizes the bias between different groups in the data [87]. This is integral for our problem where certain groups 60 of data (i.e. triples with low tail-relation degree) feature bias. 4.7 Conclusion We explore the problem of degree bias in KG embeddings. Through empirical analysis we find that when predicting the tail 𝑡 for a triple (ℎ, 𝑟, 𝑡), a strong indicator performance is the number of edges where 𝑟 and 𝑡 co-occur as the relation and tail, respectively. We refer to this as the tail-relation degree. We therefore propose a new method, KG-Mixup, that can be used in conjunction with any KG embedding technique to improve performance on triples with a low tail-relation degree. It works by augmenting lower degree entity-relation pairs with additional synthetic triples during training. To create synthetic samples we adapt the Mixup [137] strategy to KGs. Experiments validate its usefulness. For future work we plan on expanding our method to path-based techniques such as NBFNet [153]. 61 CHAPTER 5 DISTANCE-BASED PROPAGATION FOR EFFICIENT KNOWLEDGE GRAPH REASONING 5.1 Introduction Knowledge graphs (KGs) encode facts via edges in a graph. Because of this, one can view the task of predicting unknown edges (i.e. link prediction) as analogous to uncovering new facts. This task is referred to as knowledge graph completion (KGC) and has attracted a bevy of research over the past decade [14, 112, 100, 154]. Most work has focused on learning quality representations for all nodes (i.e. entities) and edge types (i.e. relations) in the graph to facilitate KGC. Recently, methods [154, 96, 143], have been introduced that move away from the embedding- based approach and focus instead on learning directly from path-based information. One recent GNN-based method, NBFNet [154], draws inspiration from the Bellman-Ford algorithm by com- puting path information through dynamic programming. By doing so, it learns pairwise embeddings between all node pairs in an inductive fashion. It achieves state-of-the-art performance in both the transductive and inductive KGC settings. In this work, we refer to such methods as path-based GNNs. However, a downside of path-based GNNs is their inefficiency. This limits their ability in large real-world graphs. Furthermore, it inhibits their ability to propagate deeply in the graph. Two recent methods have been proposed to address the inefficiency problem, i.e., A∗Net [152] and AdaProp [144], by only propagating to a subset of nodes every iteration. However, they still tend to propagate unnecessary and redundant messages. For path-based GNNs, only the source node is initialized with a non-zero message at the beginning of the propagation process. Such models often run a total of 𝑇 layers, where, in each layer, all nodes aggregate messages from their neighboring edges. We identify that this design is inefficient by making the following two observations. (1) Empty Messages: In the propagation process, a node only obtains non-empty messages when the number of propagation layers is ≥ the shortest path distance between the source and the node. This means that a large number of nodes far from the source node only aggregate “empty” messages in the early propagation 62 layers. Nonetheless, path-based GNN models such as NBFnet propagate these unnecessary “empty messages” in these early propagation layers. (2) Redundant Messages: To ensure path information from the source reach distant nodes, the number of layers 𝑇 needs to be sufficiently large. However, a large 𝑇 induces the propagation of redundant messages for those nodes that are close to the source node. Intuitively, short paths contain more significant information than long ones [50]. The “close” nodes typically aggregate enough information from shorter paths in the early propagation layers. Propagating messages for longer paths in later layers for “close” nodes does not provide significant information and needlessly adds to the complexity. More details on these two observations are provided in Section 5.3.1. To address these limitations and make the propagation process more efficient, we aim to develop an algorithm that limits the propagation of “empty” and “redundant” messages. In particular, we propose a new method TAGNet - TruncAted propaGation Network. TAGNet only aggregates paths in a fixed window for each source-target pair, which can be considered a form of path pruning. Our contributions can be summarized as follows: • We propose a new path-based GNN, TAGNet, which customizes the amount of path-pruning for each source-target node pair. • We demonstrate that the complexity of TAGNet is independent of the number of layers, allowing for efficient deep propagation. • Extensive experiments demonstrate that TAGNet reduces the number of aggregated messages by up to 90% while matching or even slightly outperforming NBFNet on multiple KG benchmarks. 5.2 Preliminary In this section, we first introduce the notation used throughout the paper. We then introduce the path formulation from [154], the generalized Bellman-Ford algorithm [9], and NBFNet [154]. 5.2.1 Notations We denote a KG as G = {V, R, E} with entities V, relations R, and edges E. An edge is denoted as a triple and is of the form (𝑠, 𝑞, 𝑜) where 𝑠 is the subject, 𝑞 the query relation, and 𝑜 the object. for an incomplete fact (𝑠, 𝑞, ?). In such a problem, we refer to the node entity 𝑠 as 63 (a) Example of propagating for three iterations with 𝛿=1. (b) Update status of nodes when 𝛿=1. Figure 5.1 Example of our algorithm when 𝛿 = 1. We note that an undirected blue edge indicates that both nodes aggregate each other. A directed edge indicates that only the head node aggregates the tail node. E.g., at iteration 2 node 2 aggregates node 1, however node 1 doesn’t aggregate node 2. the source node and any possible answer ? as the target node. Lastly, we denote the shortest path distance between nodes 𝑠 and 𝑜 as dist(𝑠, 𝑜). We assume an edge weight of 1 since KGs typically don’t contain edge weights. 5.2.2 Path Formulation [154] introduce a general path formulation for determining the existence of an edge (𝑠, 𝑞, 𝑜). They consider doing so by aggregating all paths between 𝑠 and 𝑜, conditional on the query 𝑞. We denote the maximum path length as 𝑇 (in their paper they set 𝑇 = ∞), 𝑃𝑡 𝑠,𝑜 represents all paths of length 𝑡 connecting nodes 𝑠 and 𝑜, and w𝑞 (𝑒𝑖) is the representation of an edge 𝑒𝑖 conditional on the relation 𝑞. The representation of an edge (𝑠, 𝑞, 𝑜) is given by h𝑞 (𝑠, 𝑜): h𝑞 (𝑠, 𝑜) = 𝑇 (cid:202) (cid:202) | 𝑝| (cid:204) 𝑡=1 𝑝∈𝑃𝑡 𝑠,𝑜 𝑖=1 w𝑞 (𝑒𝑖). (5.1) [154] show that this formulation can capture many existing graph algorithms including the Katz index [50], Personalized PageRank [15] and others. 5.2.3 Generalized Bellman-Ford Due to the exponential relationship between path length and the number of paths, calculating Eq. (5.1) for large 𝑇 is unfeasible. As such, [154] instead model Eq. (5.1) via the generalized Bellman-Ford algorithm [9] which recursively computes such path information in a more efficient 64 manner. It is formulated as: h(0) 𝑞 (𝑠, 𝑜) = 1𝑞 (𝑠 = 𝑜), h(𝑡) 𝑞 (𝑠, 𝑜) = (cid:18) (cid:202) (𝑥,𝑟,𝑜)∈E (𝑜) h(𝑡−1) 𝑞 (𝑠, 𝑥) ⊗ w𝑞 (𝑥, 𝑟, 𝑜) (cid:19) ⊕ h(0) 𝑞 (𝑠, 𝑜), (5.2) (5.3) where E (𝑜) represents all edges with 𝑜 as the object entity, i.e., (∗, ∗, 𝑜). [154] prove that 𝑇 iterations of the generalized Bellman-Ford is equal to Eq. (5.1) with a max path length of 𝑇. 5.2.4 NBFNet [154] extend Eq. (5.2) via the inclusion of learnable parameters. w𝑞 (𝑥, 𝑟, 𝑜) is replaced with a learnable embedding w𝑞 (𝑟) for each relation 𝑟. A linear transformation is further included in the aggregation. It is formulated as the following where for convenience we set h(𝑡) 𝑞 (𝑠, 𝑜) = h(𝑡) 𝑜 and w𝑞 (𝑥, 𝑟, 𝑜) = w𝑞 (𝑟): h(0) 𝑜 = INDICATOR(𝑢, 𝑣, 𝑞), h(𝑡) 𝑜 = AGG (cid:16)(cid:110) MSG(h(𝑡−1) 𝑥 , w𝑞 (𝑟)) | (𝑥, 𝑟, 𝑜) ∈ E (𝑜) (cid:111) ∪ {h(0) 𝑜 } (cid:17) . (5.4) The representation of the source node h(0) 𝑠 is initialized to a learnt embedding, q𝑟, corresponding to the query relation 𝑟. For all other nodes (𝑜 ≠ 𝑠), they learn a separate initial embedding. However in practice they simply initialize the other nodes to the 0 vector. For the AGG function they consider the sum, max, min and PNA operations. For the MSG function they consider the TransE [14], DistMult [127], and RotatE [104] operators. The final representation is passed to a score function 𝑓 which is modeled via an MLP. 5.3 The Proposed Framework In this section, we propose a new approach to improve the efficiency of path-based GNN models. Inspired by two observations in Section 5.3.1, we proposed a simple but effective distance-based pruning strategy. We then introduce a truncated version of the generalized Bellman-Ford algorithm that achieves the goal of our proposed pruning strategy. Finally, we describe a neural network model based on the truncated Bellman-Ford. 65 5.3.1 Motivation In this subsection, we discuss the motivation behind our framework design. In particular, we suggest that the inefficiency of path-based GNNs is mainly due to two observations: (1) the aggregation of many empty messages and (2) the proliferation of redundant messages when the number of layers is large. Next, we detail our observations and how they inspire us to design a more efficient method. Observation #1: Empty Messages. Most path-based GNNs aggregate empty messages that do not contain any path information. This has the effect of increasing the model complexity without any obvious benefit. We provide an illustrative example. In Figure 5.1a, during the first iteration, node 7 will try to aggregate path information from node 6. However, all node representations, outside of the source, are initialized to zero ("empty messages"). Hence, a non-informative “empty message” will be passed to node 7 from node 6. In fact, in the first iteration, only the 1-hop neighbors of the source aggregate non-empty messages which contains information on paths with length 1. Only after two iterations will node 6 contain path information from the source. Therefore aggregating any messages before the third iteration will not lead to any path information for node 7. However, both NBFNet [154] and A∗Net [152] will aggregate such messages, leading to increased complexity without any gain in additional path information. This observation suggests that a node 𝑜 of distance dist(𝑠, 𝑜) from the source can only aggregate path information from iteration 𝑡 = dist(𝑠, 𝑜) onwards. Observation #2: Redundant Messages. Due to their design, path-based GNNs with 𝑇 layers can only learn representations for nodes within 𝑇 hops of the source node. However, since the time complexity of all existing methods is proportional to the number of layers, learning representations for nodes far from the source (i.e., distant nodes) can be very inefficient. In particular, as we discussed in Section 5.1, this mainly afflicts target nodes closer to the source. Again, we utilize Figure 5.1a for illustration. In the first two iterations the node 4 aggregates two paths including (source, 4) and (source, 3, 4). These paths provide significant information between the source and 4. Comparatively, in the 6-th iteration node 4 aggregates paths1 of length 6, which reach further 1Strictly, these walks are not paths, as they contain repeated nodes and edges. In this paper, we follow the convention of the path-based GNN papers to loosely call them paths. 66 nodes and return to node 4. Since these paths already contain information present in shorter paths, little information is gained by aggregating them. Our empirical study in Section 5.4.3 also verifies that aggregating paths of longer length relative to the target node have little to no positive effect on performance. These two observations suggest that the efficiency of path-based GNN methods is low when there are nodes of diverse distances to the source. We verify this by analyzing the distance distribution for all test samples on the WN18RR [30] dataset. For each sample we calculate the shortest path distance between both nodes and plot the distribution of the distances over all samples. The results are shown in Figure 5.2. We note that around 25% of samples have a shortest distance ≥ 5. To aggregate information for these distant nodes, it is necessary to set 𝑇 to ≥ 5. In this case, nodes of larger distance will propagate empty messages for the first few iterations (Observation 1). Furthermore, about 35% of the samples have a shortest distance of 1. Such samples will aggregate redundant messages after a few iterations (Observation 2). Our Design Goal: The key to improving the efficiency of path-based GNNs is to modify their aggregation scheme. In particular, based on the aggregation scheme of path-based GNNs, all target nodes are aggregating paths with lengths ranging from 1 to 𝑇. Such paths contain many empty and redundant messages. To reduce the aggregation of those non-informative messages, we propose to customize the aggregations for each target node. Specifically, for close nodes, we do not aggregate long paths as they are redundant. For distant nodes, we do not aggregate short paths as they are empty. As such, we customize the aggregation process for each target node according to its distance from the source. Based on this intuition, we reformulate the path formulation, Eq. (5.1), as follows. x𝑞 (𝑠, 𝑜) = dist(𝑠,𝑜)+𝛿 (cid:202) (cid:202) | 𝑝| (cid:204) 𝑡=dist(𝑠,𝑜) 𝑝∈𝑃𝑡 𝑠,𝑜 𝑖=1 𝑤(𝑒𝑖), (5.5) where 𝛿 ≥ 0 is an offset. The parameter 𝛿 can be considered as a form of path pruning as it controls the paths we aggregate relative to the shortest path distance. For example, when 𝛿 = 0, it only aggregates those paths of the shortest distance for all node pairs. Empirical observations in Section 5.4.3 validate our use of pruning based on an offset 𝛿. 67 Figure 5.2 Test Distance Distribution for WN18RR. Due to the high complexity of Eq. (5.5), it is not practical to directly calculate it. Hence, based on the generalized Bellman-Ford algorithm [9], we propose a truncated version of the Bellman-Ford algorithm for calculating Eq. (5.5) in a more efficient fashion. 5.3.2 Truncated Bellman-Ford From our design goal, we are interested in capturing all paths of length dist(𝑠, 𝑜) ≤ 𝑙 ≤ dist(𝑠, 𝑜) + 𝛿. To achieve this goal, for node 𝑜, we begin aggregating at iteration 𝑡 = dist(𝑠, 𝑜) and stop aggregation after iteration 𝑡 = dist(𝑠, 𝑜) + 𝛿. This helps avoid aggregating empty messages before dist(𝑠, 𝑜)-th iteration and redundant messages after dist(𝑠, 𝑜) + 𝛿 iterations. However, during the iterations between dist(𝑠, 𝑜) and dist(𝑠, 𝑜) + 𝛿, there are still potential empty messages. For example, any node 𝑣 with the shortest distance to source larger than dist(𝑠, 𝑜) + 𝛿 always contains empty messages during these iterations. Hence, to further avoid aggregating many empty messages, we only allow aggregation from a subset of the neighboring nodes of 𝑜. More formally, we formulate the above intuition into the following constrained edge set C(𝑠, 𝑜, 𝑡) through which node 𝑜 aggregates information at iteration 𝑡. C(𝑠, 𝑜, 𝑡) =    ∅, if 𝑡 < dist(𝑠, 𝑜) or 𝑡 > dist(𝑠, 𝑜) + 𝛿 {(𝑣, 𝑟, 𝑜) ∈ E (𝑜) | dist(𝑠, 𝑣) < dist(𝑠, 𝑜) + 𝛿}, else (5.6) 68 Based on this constraint set of edges for node 𝑜, we update the generalized Bellman-Ford algorithm (Eq. 5.2) as follows where C = C(𝑠, 𝑜, 𝑡): x(𝑡 ) 𝑞 (𝑠, 𝑜) = (cid:18) (cid:202) (𝑣,𝑟 ,𝑜) ∈ C x(𝑡 −1) 𝑞 (𝑠, 𝑣) ⊗ w𝑞 (𝑣, 𝑟, 𝑜) (cid:19) ⊕ x(0) 𝑞 (𝑠, 𝑜). (5.7) The following theorem shows that the aggregation scheme proposed in Eq. (5.7) results in aggre- gation of the correct paths as described in Eq. (5.5). Theorem 2. Given a source node 𝑠, query 𝑞, and target node 𝑜, the final representation, x𝐹 𝑞 (𝑠, 𝑜) only aggregates all path representations whose path length is between dist(𝑠, 𝑜) and dist(𝑠, 𝑜) + 𝛿 for all 𝑜 ∈ 𝑉. It therefore contains all information present in Eq. (5.5) such that, x𝐹 𝑞 (𝑠, 𝑜) = dist(𝑠,𝑜)+𝛿 (cid:202) (cid:202) | 𝑝| (cid:204) 𝑡=dist(𝑠,𝑜) 𝑝∈𝑃𝑡 𝑠,𝑜 𝑖=1 𝑤(𝑒𝑖). (5.8) The detailed proof of Theorem 2 is provided in Appendix D.1. This design has the following advantages. (1) We don’t begin aggregating messages until layer 𝑡 = dist(𝑠, 𝑜). This helps avoid the aggregation of many empty messages for nodes far from the source. (2) We stop aggregating messages at layer 𝑡 = dist(𝑠, 𝑜) + 𝛿. This ensures that for close nodes we don’t aggregate many redundant messages. Furthermore, it ensures that we will always aggregate paths of 𝛿 + 1 different lengths for all target nodes regardless of their distance from the source. (3) In Section D.2.2, we demonstrate that the complexity of this design is independent of the number of layers, allowing for deep propagation. An Illustrative Example. We given an example of the effect of constraints on propagation in Figure 5.1 where 𝑠 = source. Figure 5.1a shows the involved nodes and edges over three iterations when 𝛿 = 1. We observe that only a portion of the nodes and edges are involved at any one iteration. For example, at iteration 1 only the 1-hop neighbors and the edges connecting them to the source are involved. This is because they are the only nodes and edges able to receive any path information at that stage. Figure 5.1b details the update status of nodes by distance from the source node. We note how as the iteration increases the number of nodes updated shift to the right in groups of two. Furthermore since we only iterate for three iterations, the 4+ hop neighbors never update as there is no available path information for them until iteration 4. 69 5.3.3 Degree Messages An effect of pruning paths, especially with low 𝛿, is that it can lead to very few messages being aggregated. This is especially true for smaller or sparser graphs. One consequence of few messages being aggregated is that it can make it difficult for a node to discern the properties of its neighborhood (e.g. degree). We give an example of node 4 in Figure 5.1. For each of the first 2 iterations, it only aggregates messages from 2/4 of it’s neighbors. As such, it never aggregates messages from all its neighbors at the same iteration. This can lead to a failure of node 4 to properly discern it’s degree, as the number of non-empty messages in each iteration is only a portion of the overall degree. Since the degree is known to be an important factor in link prediction [81, 2], we want to preserve the degree information for all nodes. In order to preserve the degree information for each node, we consider encoding the degree via the use of pseudo messages. Specifically, we want to add enough messages such that the total number of messages aggregated for a node 𝑜 is equivalent to its degree. We refer to such messages as degree messages. Going back to our example in Figure 5.1, for node 4 at iteration 1 and 2 we would add 2 degree messages so that the total number of messages is 4. Formally, we denote the degree of a node 𝑜 as 𝑏𝑜. The number of messages to add at iteration 𝑡 is given by 𝜌𝑜 = 𝑏𝑜 − |𝐶 (𝑠, 𝑜, 𝑡)|. For the value of the messages, we learn a separate embedding denoted as x(𝑡) deg that is the same across all nodes. Since the value of each message is the same we can avoid explicitly aggregating each degree message individually. Instead, we just aggregate one message that is equal to the number of degree messages multiplied by the degree embedding, x(𝑡) deg (𝑠, 𝑜) = 𝜌𝑜 · x(𝑡) deg , (5.9) where x(𝑡) deg (𝑠, 𝑜) is the value of the degree message for node 𝑜 at iteration 𝑡. This edge is then added to the set of messages to be aggregated, 𝐶 (𝑠, 𝑜, 𝑡). Since this is equivalent to computing and aggregating only one edge, it has no effect on the model complexity. Experimental results in Section 5.4.4 validate the effectiveness of degree messages. 70 5.3.4 GNN Formulation We follow similar conventions to NBFNet when converting Eq. (5.6) and Eq. (5.7) to a GNN. We denote the embedding of a source node 𝑠 and arbitrary target node 𝑜 as x𝑞 (𝑠, 𝑜). We further represent the indicator query embeddings as x𝑞 and the layer-wise relation embeddings as x(𝑡) 𝑟 . We utilize the INDICATOR function described in Section 5.2.4, PNA [26] for the AGGREGATE function, and DistMult [127] for the MSG function. The probability of a link existing between a source-target pair is determined via a score function 𝑓 . Both the final representation of the pair and the query embedding are given as input. The output of 𝑓 is then passed to a sigmoid to produce a probability, 𝑝(𝑠, 𝑜) = 𝜎 (cid:16) (cid:16) 𝑓 𝑞 (𝑠, 𝑜), x𝑞 xF (cid:17)(cid:17) , (5.10) where xF 𝑞 (𝑠, 𝑜) is the final pair representation. The full algorithm is detailed in Appendix D.2.1. We run a total of 𝑇 layers. We further show in in Appendix D.2.2 that time complexity is independent of the number of layers. This enables TAGNet to propagate for more layers than existing path-based GNNs. Furthermore, due to its general design, TAGNet can also be integrated with other efficiency- minded methods like A∗Net. This is described in more detail in Appendix D.2.3. Extensive experiments in Sections 5.4.1 and 5.4.2 also demonstrate that combining both methods can signif- icantly reduce the number of messages propagated by A∗Net without sacrificing performance. 5.3.5 Target-Specific 𝛿 A drawback of our current design is that we assume a single offset 𝛿 for all possible node pairs. However, for some pairs we may want to consider propagating more or less iterations. For example, in Figure 5.1 we may only want to consider 𝛿 = 0 for the target node 2 due to the limited number of paths connecting it to the source. However for node 4, which is concentrated in a denser portion of the subgraph, we may want to consider a higher value of 𝛿 such as 1 or 2 to capture more path information. We next detail our method for achieving this. 71 5.3.5.1 Target-Specific 𝛿 via Attention A target-specific 𝛿 can be attained by realizing the connection between the hidden representa- tions and the value of 𝛿. Let’s denote the value of the hyperparameter 𝛿 as ˆ𝛿. For a source-target node pair (𝑠, 𝑜), we only aggregate paths from length dist(𝑠, 𝑜) to dist(𝑠, 𝑜) + ˆ𝛿. At iteration 𝑡 = dist(𝑠, 𝑜) we aggregate paths of length dist(𝑠, 𝑜) and at iteration 𝑡 = dist(𝑠, 𝑜) + 1 only those paths of length dist(𝑠, 𝑜) + 1, and so on until 𝑡 = dist(𝑠, 𝑜) + ˆ𝛿. The set of hidden representations for a node pair is as follows where for convenience we represent x𝑞 (𝑠, 𝑜) as x(𝑠,𝑜): Hiddens(𝑠, 𝑜) = (cid:104) xdist(𝑠,𝑜) (𝑠,𝑜) , · · · , x(dist(𝑠,𝑜)+ ˆ𝛿) (𝑠,𝑜) (cid:105) . (5.11) The first hidden representation only contains paths of shortest length and therefore corresponds to 𝛿 = 0. Since the paths accumulate over hidden representations via a self-loop, x(dist(𝑠,𝑜)+1) (𝑠,𝑜) contains all paths of length dist(𝑠, 𝑜) and dist(𝑠, 𝑜) + 1, corresponding to 𝛿 = 1. As such, the final hidden representation is equivalent to 𝛿 = ˆ𝛿. Therefore, choosing a target-specific 𝛿 is achieved by selecting one of the hidden representations as the final representation. We utilize attention to determine which value of 𝛿 is best for a specific target node. This is formulated as the following: xF (𝑠,𝑜) = ˆ𝛿 ∑︁ 𝛿=0 (𝑠,𝑜)x(dist(𝑠,𝑜)+𝛿) 𝛼𝛿 (𝑠,𝑜) , (5.12) where 𝛼𝛿 (𝑠,𝑜) is the corresponding attention weight for the hidden representation x(dist(𝑠,𝑜)+𝛿) (𝑠,𝑜) . For each possible value of 𝛿, 𝛼𝛿 (𝑠,𝑜) is given by: (cid:16) x(dist(𝑠,𝑜)+𝛿) (𝑠,𝑜) ˜𝛼𝛿 (𝑠,𝑜) = 𝑔 𝛼𝛿 (𝑠,𝑜) = Softmax( ˜𝛼𝛿 (𝑠,𝑜)). (cid:17) , x𝑞 We model 𝑔 as an MLP that takes both the hidden representation and the query embedding as input. Taking inspiration from A∗Net [152], we conjecture that a well-learned score function can help determine which representations are better than others. As such, we further consider modeling 𝑔 as its own function or having it share parameters with the score function 𝑓 , Eq. (5.10). Lastly, we show in Appendix D.2.2 that the time complexity is unchanged when using a target-specific 𝛿. 72 Table 5.1 Transductive Results. Best results are in bold and the 2nd best underlined. Method Type Method FB15k-237 WN18RR MRR Hits@1 Hits@10 MRR Hits@1 Hits@10 Embeddings GNNs Path-Based TAGNet TransE DistMult ComplEx 0.294 0.241 0.247 0.273 R-GCN CompGCN 0.355 DRUM 0.343 RED-GNN 0.374 0.392 AdaProp 0.415 NBFNet A∗Net 0.414 + A∗Net Fixed 𝛿 Specific 𝛿 0.409 0.421 0.417 - 0.155 0.158 0.182 0.264 0.255 0.283 0.309 0.321 0.324 0.323 0.328 0.328 0.465 0.419 0.428 0.456 0.535 0.516 0.558 0.555 0.599 0.592 0.577 0.602 0.592 0.226 0.43 0.44 0.402 0.479 0.486 0.533 0.553 0.551 0.547 0.555 0.562 0.565 - 0.39 0.41 0.345 0.443 0.425 0.485 0.502 0.497 0.490 0.502 0.509 0.513 0.501 0.49 0.51 0.494 0.546 0.586 0.624 0.652 0.666 0.658 0.657 0.667 0.667 Table 5.2 Inductive Results (evaluated with Hits@10). Ours results are averaged over 5 runs. Method NeuralLP DRUM GraIL RED-GNN AdaProp NBFNet A∗Net TAGNet + A∗Net TAGNet (fixed 𝛿) TAGNet (specific 𝛿) v1 0.468 0.474 0.429 0.483 0.470 0.607 0.535 0.541 0.596 0.596 FB15k-237 v3 v2 v4 v1 WN18RR v3 v2 0.586 0.595 0.424 0.629 0.651 0.704 0.638 0.646 0.700 0.698 0.571 0.571 0.424 0.603 0.620 0.667 0.610 0.604 0.677 0.675 0.593 0.593 0.389 0.621 0.614 0.668 0.630 0.623 0.666 0.661 0.772 0.777 0.760 0.799 0.798 0.826 0.810 0.813 0.816 0.818 0.749 0.747 0.776 0.780 0.836 0.798 0.803 0.805 0.796 0.803 0.476 0.477 0.409 0.524 0.582 0.568 0.544 0.535 0.534 0.544 v4 0.706 0.702 0.687 0.721 0.732 0.694 0.743 0.745 0.734 0.737 5.4 Experiment In this section, we evaluate the effectiveness of our proposed framework on KGC under both the transductive and inductive settings. We also empirically analyze the efficiency and conduct ablation studies on each component. The experimental details are listed in Appendix D.3. We note that for a fair comparison between path-based GNNs, we run each model using 6 layers and a hidden dimension of 32 as is done in both [154] and [152]. Please see Appendix D.3.2 for more details. 73 5.4.1 Effectiveness of TAGNet In this subsection, we present the results of TAGNet compared with baselines on both trans- ductive and inductive settings. We further detail the results when combining TAGNet with A∗Net. Transductive Setting: The results on the transductive setting are shown in Table 5.1. We observe that TAGNet achieves strong performance with just a fixed 𝛿. In particular, it outperforms A∗Net and AdaProp on most metrics. Also compared to NBFnet, which doesn’t utilize pruning, TAGNet achieves comparable or even stronger performance. This indicates that the proposed pruning strategy mostly reduces redundant aggregations that do not impair the models effectiveness. Inductive Setting: Table 5.2 shows the results on the inductive setting. TAGNet achieves strong performance on both datasets. In particular, it achieves comparable performance to the non-pruning version of NBFNet. Furthermore, TAGNet significantly outperforms A∗Net and AdaProp on the FB15k-237 splits, demonstrating the advantage of the proposed pruning strategy. TAGNet + A∗Net: We further test combining the pruning strategy of both TAGNet and A∗Net together (see Appendix D.2.3 for more details). Compared to A∗Net, we observe that TAGNet+A∗Net achieves comparable if not better performance under all settings despite aggregat- ing much fewer messages (see subsection 5.4.2). This suggests that the pruning strategy in A∗Net fails to prune many irrelevant paths, allowing TAGNet to work complementary to it. 5.4.2 Efficiency of TAGNet In this subsection, we empirically evaluate the efficiency of our model against NBFNet. Specif- ically, we compare the mean number of messages aggregated per sample during training. Figure 5.3 shows the % decrease in the number of messages of TAGNet as compared to NBFNet. All models are fit with 6 layers. We observe two trends. The first is that both FB15k-237 datasets follow a similar relationship that is close to what’s expected of the worst-case complexity detailed in Appendix D.2.2. On the other hand, the WN18RR datasets pass much fewer messages as they hover above 90% for all 𝛿. This is likely because WN18RR is a very sparse graph. This gives TAGNet plenty of opportunities to prune paths. We further compare the efficiency of just A∗Net and A∗Net + TAGNet. As before, we calculate 74 Figure 5.3 % Decrease in NBFNet Messages. the total number of messages passed for both methods. We fix 𝛿 = 2. Table 5.3 show the % decrease in the number of messages when utilizing both techniques compared to just A∗Net. We observe a large reduction in both the inductive and transductive setting. Since the performance of A∗Net + TAGNet is on par with just A*Net, it suggests that A*Net fails to prune many unneeded messages that do not improve performance. Furthermore, we find that the reduction in the number of messages becomes more pronounced with more layers, suggesting that TAGNet is even more useful when deep propagation is necessary. Table 5.3 % Decrease in # Msgs for A∗Net vs. A∗Net + TAGNet. Dataset 6 Layers 7 Layers 8 Layers FB15k-237 FB15k-237 v1 WN18RR WN18RR v1 39% 30% 10% 25% 51% 44% 17% 37% 59% 66% 26% 46% 5.4.3 Effect of 𝛿 In this subsection, we evaluate the effect of the offset 𝛿 on TAGNet test performance (w/o the target-specific setting). We fix the number of layers at 6 and vary 𝛿 from 0 to 5. We report results for both the transductive and inductive settings in Figures 5.4 and 5.5, respectively. For the inductive setting, we chose version v1 of both datasets as the representative datasets. For both transductive datasets, we find that the performance plateaus at 𝛿 = 2. A similar trend is observed for FB15k-237 v1. Interestingly, for WN18RR v1,the performance is constant when varying 𝛿. This suggests that 75 for some datasets almost all of the important information is concentrated in paths of the shortest length. Figure 5.4 Performance varying 𝛿 on Transductive setting. Figure 5.5 Performance varying 𝛿 on Inductive setting. 5.4.4 Effect of Degree Messages We demonstrate the effect of the degree messages described in Section 5.3.3. Table 5.4 shows the performance of TAGNet when trained with and without degree messages. We report the performance on all of the inductive splits for both FB15k-237 and WN18RR. Interestingly, we observe that while there is a consistent gain on FB15k-237, it often hurts performance on WN18RR. This may imply that preserving the degree information of each node is more important on FB15k-237 than WN18RR. 76 Table 5.4 Effect of Degree Messages on Inductive Splits. Dataset Split w/o Msgs with Msgs FB15k-237 WN18RR V1 V2 V3 V4 V1 V2 V3 V4 0.594 0.684 0.653 0.648 0.815 0.803 0.544 0.737 0.596 0.698 0.675 0.661 0.818 0.781 0.465 0.718 5.5 Related Work We give a brief overview of different types of KGC methods. (1) Embedding-Based Methods: Such methods are concerned with modeling the interactions of entity and relation embeddings. TransE [14] models each fact as translation in the embedding space while DistMult [127] scores each fact via a bilinear diagonal function. ComplEx [112] extends DistMult by further modeling the embeddings in the complex space. Lastly, Nodepiece [36] attempts to improve the efficiency of embedding-based KGC methods by representing each entity embedding as a combination of a smaller set of subword embeddings. Since this method concerns embedding-based techniques, it is orthogonal to our work. (2) GNN-Based Methods: GNN methods extend traditional GNNs by further considering the relational information. CompGCN [113] encodes each message as a combination of neighboring entity-relation pairs via the use of compositional function. RGCN [100] instead considers a relation-specific transformation matrix to integrate the relation information. (3) Path-Based Methods: Path-based methods attempt to leverage the path information connecting two entities to perform KGC. NeuralLP [128] and DRUM [96] learn to weight different paths by utilizing logical rules. More recently, NBFNet [154] considers path information by learning a parameterized version of the Bellman-Ford algorithm. A similar framework, RED-GNN [143] also attempts to take advantage of dynamic programming to aggregate path information. Both A∗Net [152] and AdaProp [144] attempt to prove upon the efficiency of the previous methods by learning which nodes to propagate to. 77 5.6 Conclusion In this paper we identify two intrinsic limitations of path-based GNNs that affect the efficiency and representation quality. We tackle these issues by introducing a new method, TAGNet, which is able to efficiently propagate path information. This is realized by only aggregating paths in a fixed window for each source-target pair. We demonstrate that the complexity of TAGNet is independent of the number of layers. For future work, we plan on exploring methods to capture path information without having to perform a separate round of propagation for every individual source node. 78 CHAPTER 6 CONCLUSION AND FUTURE DIRECTIONS In this work we propose enhancing link prediction from multiple perspectives through a data-centric framework. To do so, we first study the evaluation setting of link prediction, showing that it is unrealistic. We then promote a new evaluation procedure which corresponds to a more realistic evaluation of link prediction. Next, we propose a new method, LPFormer, that can efficiently model the factors necessary for LP in a data-driven way. The design of LPFormer is based on our empirical understanding of LP performance. We then focus on KG completion, seeking to mitigate problems with bias and efficiency. We first study the problem of degree bias in KGC, finding that it differs from traditional degree bias in node classification. We then propose a new data augmentation method to alleviate this bias. Lastly, we study the inefficiency of path-based GNNs used for KGC. We find that much computation is spend on redundant or empty messages. We then propose a new method for reducing these messages. In the future, we plan to further explore LP and Graph ML in various ways: 1. Generalizable LP: To properly work with real-world data, it’s necessary to design methods that can generalize to out-of-distribution (OOD) samples. Recent work has shown [92] has shown that for LP, methods often struggle to properly generalize across covariate distribution shifts. However, there is little work exploring how to practically overcome these issues and design methods that can generalize. Furthermore, it’s unclear how current methods can generalize to links that follow different structural patterns. I plan to explore both directions in future work. 2. Trustworthy LP: It has been shown that AI is susceptible to malicious attacks on the input given to models. It is necessary to safeguard against such potential attackers given its potential to cause great harm in safety-critical and real-world applications. Recent work [60] has further shown that for the task of LP, diffusion models can be helpful in protecting against certain types of adversarial attacks. However, despite the real-world use of such protection, it is still relatively unexplored. We therefore plan on continuing this line of research by exploring how current LP models can protect against a more diverse set of adversarial attacks and whether we can design 79 more efficient models to counterattack such attacks. 3. Foundation models for graph data. Recently, we’ve seen the emergence of “foundation models” (FMs), which are a single model that can be used on a broad spectrum of downstream tasks and datasets. The key advantage of FMs are that they allow us to sidestep the need to train a new model for each dataset and task. However, despite the success in other fields like CV and NLP, FMs have yet to take hold in the graph domain. Current graph foundation models (GFMs) are limited, only working for specific domains (e.g., KG reasoning) or with natural language node features. Recent work of ours has explored creating LP models that can generalize across different datasets and domains [60], a key requirement for any FM. I plan to continue developing GFMs that can be applied to graphs of various domains and for multiple different tasks. In order to do this, they are several key questions that we need to answer: How to handle the heterogeneity of the initial node features, not only in size but in semantic meaning? How do we contend with the fact that different tasks (e.g., node classification and LP) and domains require different inductive biases and therefore different final representations to perform optimally? Can we learn to extract a shared set of information that is useful across different tasks and domains? I argue that is necessary to answer these questions if we hope to create meaningful GFMs. 80 BIBLIOGRAPHY [1] [2] Khushnood Abbas, Alireza Abbasi, Shi Dong, Ling Niu, Laihang Yu, Bolun Chen, Shi-Min Cai, and Qambar Hasan. Application of network link prediction in drug discovery. BMC bioinformatics, 22:1–21, 2021. Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social networks, 25(3):211–230, 2003. [3] Mehdi Ali, Max Berrendorf, Charles Tapley Hoyt, Laurent Vermue, Mikhail Galkin, Sahand Sharifzadeh, Asja Fischer, Volker Tresp, and Jens Lehmann. Bringing light into the dark: A large-scale evaluation of knowledge graph embedding models under a unified framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):8825–8845, 2021. [4] [5] [6] [7] [8] [9] Saadullah Amin, Stalin Varanasi, Katherine Ann Dunfield, and Günter Neumann. Lowfer: In International Conference on Machine Low-rank bilinear pooling for link prediction. Learning, pages 257–268. PMLR, 2020. Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using pagerank In 2006 47th Annual IEEE Symposium on Foundations of Computer Science vectors. (FOCS’06), pages 475–486. IEEE, 2006. Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. Neural machine translation In 3rd International Conference on Learning by jointly learning to align and translate. Representations, ICLR 2015, 2015. Ivana Balažević, Carl Allen, and Timothy M Hospedales. Tucker: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590, 2019. Albert-Laszlo Barabâsi, Hawoong Jeong, Zoltan Néda, Erzsebet Ravasz, Andras Schubert, and Tamas Vicsek. Evolution of the social network of scientific collaborations. Physica A: Statistical mechanics and its applications, 311(3-4):590–614, 2002. John S Baras and George Theodorakopoulos. Path problems in networks. Synthesis Lectures on Communication Networks, 3(1):1–77, 2010. [10] Aleksandar Bojchevski, Johannes Gasteiger, Bryan Perozzi, Amol Kapoor, Martin Blais, Benedek Rózemberczki, Michal Lukasik, and Stephan Günnemann. Scaling graph neural networks with approximate pagerank. In Proceedings of the 26th ACM SIGKDD Interna- tional Conference on Knowledge Discovery & Data Mining, pages 2464–2473, 2020. [11] Stephen Bonner, Ian P Barrett, Cheng Ye, Rowan Swiers, Ola Engkvist, Charles Tapley Hoyt, and William L Hamilton. Understanding the performance of knowledge graph embeddings in drug discovery. Artificial Intelligence in the Life Sciences, page 100036, 2022. 81 [12] JC de Borda. M’emoire sur les’ elections au scrutin. Histoire de l’Acad’emie Royale des Sciences, 1781. [13] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems, 26, 2013. [14] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems, 26, 2013. [15] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 30(1-7):107–117, 1998. [16] Andrei Z Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21–29. IEEE, 1997. [17] Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph attention networks? In International Conference on Learning Representations, 2022. [18] Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, and Tat-Seng Chua. Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences. In The world wide web conference, pages 151–161, 2019. [19] Luigi Carratino, Moustapha Cissé, Rodolphe Jenatton, and Jean-Philippe Vert. On mixup regularization. arXiv preprint arXiv:2006.06049, 2020. [20] Benjamin Paul Chamberlain, Sergey Shirobokov, Emanuele Rossi, Fabrizio Frasca, Thomas Markovich, Nils Hammerla, Michael M Bronstein, and Max Hansmire. Graph neural net- works for link prediction with subgraph sketching. arXiv preprint arXiv:2209.15486, 2022. [21] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002. [22] Jinsong Chen, Kaiyuan Gao, Gaichao Li, and Kun He. Nagphormer: A tokenized graph transformer for node classification in large graphs. In The Eleventh International Conference on Learning Representations, 2022. [23] Sanxing Chen, Xiaodong Liu, Jianfeng Gao, Jian Jiao, Ruofei Zhang, and Yangfeng Ji. Hitter: Hierarchical transformers for knowledge graph embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10395– 10407, 2021. 82 [24] Fan Chung. The heat kernel as the pagerank of a graph. Proceedings of the National Academy of Sciences, 104(50):19735–19740, 2007. [25] Gonçalo M Correia, Vlad Niculae, and André FT Martins. Adaptively sparse transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pages 2174–2184, 2019. [26] Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Princi- pal neighbourhood aggregation for graph nets. Advances in Neural Information Processing Systems, 33:13260–13271, 2020. [27] Nur Nasuha Daud, Siti Hafizah Ab Hamid, Muntadher Saadoon, Firdaus Sahran, and Nor Badrul Anuar. Applications of link prediction in social networks: A review. Jour- nal of Network and Computer Applications, 166:102716, 2020. [28] Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pages 233–240, 2006. [29] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. In Thirty-second AAAI conference on artificial intelligence, 2018. [30] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional In Proceedings of the AAAI conference on artificial 2d knowledge graph embeddings. intelligence, volume 32, 2018. [31] Yuxiao Dong, Reid A Johnson, Jian Xu, and Nitesh V Chawla. Structural diversity and homophily: A study across more than one hundred big networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 807–816, 2017. [32] Vijay Prakash Dwivedi, Chaitanya K Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43):1–48, 2023. [33] Chantat Eksombatchai, Pranav Jindal, Jerry Zitao Liu, Yuchen Liu, Rahul Sharma, Charles Sugnet, Mark Ulrich, and Jure Leskovec. Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. In Proceedings of the 2018 world wide web conference, pages 1775–1784, 2018. [34] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020. 83 [35] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. Hyperloglog: the anal- ysis of a near-optimal cardinality estimation algorithm. Discrete mathematics & theoretical computer science, (Proceedings), 2007. [36] Mikhail Galkin, Etienne Denis, Jiapeng Wu, and William L Hamilton. Nodepiece: Compo- sitional and parameter-efficient representations of large knowledge graphs. In International Conference on Learning Representations, 2021. [37] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International conference on machine learning, pages 1263–1272. PMLR, 2017. [38] David F Gleich, Kyle Kloster, and Huda Nassar. Localization in seeded pagerank. arXiv preprint arXiv:1509.00016, 2015. [39] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016. [40] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR, 2017. [41] Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, and Reza Zadeh. Wtf: The who to follow service at twitter. In Proceedings of the 22nd international conference on World Wide Web, pages 505–514, 2013. [42] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017. [43] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020. [44] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020. [45] Hong Huang, Jie Tang, Lu Liu, JarDer Luo, and Xiaoming Fu. Triadic closure pattern IEEE Transactions on Knowledge and Data analysis and prediction in social networks. Engineering, 27(12):3374–3389, 2015. [46] Zan Huang, Xin Li, and Hsinchun Chen. Link prediction approach to collaborative filtering. In Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries, pages 141– 142, 2005. 84 [47] Zexi Huang, Mert Kosan, Arlei Silva, and Ambuj Singh. Link prediction without graph neural networks. arXiv preprint arXiv:2305.13656, 2023. [48] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018. [49] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39– 43, 1953. [50] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39– 43, 1953. [51] Jinwoo Kim, Dat Nguyen, Seonwoo Min, Sungjun Cho, Moontae Lee, Honglak Lee, and Seunghoon Hong. Pure transformers are powerful graph learners. Advances in Neural Information Processing Systems, 35:14582–14595, 2022. [52] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [53] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. [54] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016. [55] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. [56] Sadamori Kojaku, Jisung Yoon, Isabel Constantino, and Yong-Yeol Ahn. Residual2vec: De- biasing graph embedding with random graphs. Advances in Neural Information Processing Systems, 34, 2021. [57] István A Kovács, Katja Luck, Kerstin Spirohn, Yang Wang, Carl Pollis, Sadie Schlabach, Wenting Bian, Dae-Kyum Kim, Nishka Kishore, Tong Hao, et al. Network-based prediction of protein interactions. Nature communications, 10(1):1240, 2019. [58] Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Infor- mation Processing Systems, 34:21618–21629, 2021. [59] Guangyao Li, Zequn Sun, Lei Qian, Qiang Guo, and Wei Hu. Rule-based data augmentation for knowledge graph embedding. AI Open, 2:186–196, 2021. [60] Hang Li, Wei Jin, Geri Skenderi, Harry Shomer, Wenzhuo Tang, Wenqi Fan, and Jiliang 85 Tang. Sub-graph based diffusion model for link prediction. In The Third Learning on Graphs Conference. [61] [62] Juanhui Li, Harry Shomer, Haitao Mao, Shenglai Zeng, Yao Ma, Neil Shah, Jiliang Tang, and Dawei Yin. Evaluating graph neural networks for link prediction: Current pitfalls and new benchmarking. arXiv preprint arXiv:2306.10453, 2023. Juanhui Li, Harry Shomer, Haitao Mao, Shenglai Zeng, Yao Ma, Neil Shah, Jiliang Tang, and Dawei Yin. Evaluating graph neural networks for link prediction: Current pitfalls and new benchmarking. Advances in Neural Information Processing Systems, 36, 2024. [63] Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. Distance encoding: Design provably more powerful neural networks for graph representation learning. Advances in Neural Information Processing Systems, 33:4465–4478, 2020. [64] Ren Li, Yanan Cao, Qiannan Zhu, Guanqun Bi, Fang Fang, Yi Liu, and Qian Li. How does knowledge graph embedding extrapolate to unseen data: a semantic evidence view. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2022, 2022. [65] David Liben-Nowell and Jon Kleinberg. The link prediction problem for social networks. In Proceedings of the twelfth international conference on Information and knowledge manage- ment, pages 556–559, 2003. [66] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. [67] Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. In Proceedings of the K-bert: Enabling language representation with knowledge graph. AAAI Conference on Artificial Intelligence, volume 34, pages 2901–2908, 2020. [68] Zemin Liu, Trung-Kien Nguyen, and Yuan Fang. Tail-gnn: Tail-node graph neural networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1109–1119, 2021. [69] Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: statistical mechanics and its applications, 390(6):1150–1170, 2011. [70] Sijie Mai, Shuangjia Zheng, Yuedong Yang, and Haifeng Hu. Communicative message passing for inductive relation reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 4294–4302, 2021. [71] Haitao Mao, Zhikai Chen, Wei Jin, Haoyu Han, Yao Ma, Tong Zhao, Neil Shah, and Jiliang Tang. Demystifying structural disparity in graph neural networks: Can one size fit all? Advances in Neural Information Processing Systems, 36, 2024. 86 [72] Haitao Mao, Juanhui Li, Harry Shomer, Bingheng Li, Wenqi Fan, Yao Ma, Tong Zhao, Neil Shah, and Jiliang Tang. Revisiting link prediction: A data perspective, 2023. [73] Sameen Maruf, André FT Martins, and Gholamreza Haffari. Selective attention for context- aware neural machine translation. In Proceedings of NAACL-HLT, pages 3092–3102, 2019. [74] Aditya Krishna Menon and Charles Elkan. Link prediction via matrix factorization. In Joint european conference on machine learning and knowledge discovery in databases, pages 437–452. Springer, 2011. [75] Grégoire Mialon, Dexiong Chen, Margot Selosse, and Julien Mairal. Graphit: Encoding graph structure in transformers. arXiv preprint arXiv:2106.05667, 2021. [76] Aisha Mohamed, Shameem Parambath, Zoi Kaoudi, and Ashraf Aboulnaga. Popularity In Conference on Uncertainty in agnostic evaluation of knowledge graph embeddings. Artificial Intelligence, pages 1059–1068. PMLR, 2020. [77] Sameh K Mohamed, Vít Nováček, and Aayah Nounu. Discovering protein drug targets using knowledge graph embeddings. Bioinformatics, 36(2):603–610, 2020. [78] Luis Müller, Mikhail Galkin, Christopher Morris, and Ladislav Rampášek. Attending to graph transformers. arXiv preprint arXiv:2302.04181, 2023. [79] Yohsuke Murase, Hang-Hyun Jo, János Török, János Kertész, and Kimmo Kaski. Structural transition in social networks: The role of homophily. Scientific reports, 9(1):4310, 2019. [80] Huda Nassar, Kyle Kloster, and David F Gleich. Strong localization in personalized pagerank vectors. In Proceedings of the 12th International Workshop on Algorithms and Models for the Web Graph-Volume 9479, pages 190–202, 2015. [81] Mark EJ Newman. Clustering and preferential attachment in growing networks. Physical review E, 64(2):025102, 2001. [82] Tu Dinh Nguyen, Dat Quoc Nguyen, Dinh Phung, et al. A novel embedding model for In Proceedings of knowledge base completion based on convolutional neural network. the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 327–333, 2018. [83] Maximilian Nickel, Xueyan Jiang, and Volker Tresp. Reducing the rank in relational factor- ization models by including observable patterns. Advances in Neural Information Processing Systems, 27, 2014. [84] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Icml, 2011. 87 [85] Vardaan Pahuja, Boshi Wang, Hugo Latapie, Jayanth Srinivasa, and Yu Su. A retrieve- and-read framework for knowledge graph link prediction. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 1992–2002, 2023. [86] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. [87] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. Advances in neural information processing systems, 30, 2017. [88] Ladislav Rampášek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35:14501–14515, 2022. [89] Sylvestre-Alvise Rebuffi, Sven Gowal, Dan Andrei Calian, Florian Stimberg, Olivia Wiles, and Timothy A Mann. Data augmentation can improve robustness. Advances in Neural Information Processing Systems, 34:29935–29948, 2021. [90] Jiawei Ren, Cunjun Yu, Xiao Ma, Haiyu Zhao, Shuai Yi, et al. Balanced meta-softmax for long-tailed visual recognition. Advances in neural information processing systems, 33:4175– 4186, 2020. [91] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461, 2009. [92] Jay Revolinsky, Harry Shomer, and Jiliang Tang. Understanding the generalizability of link predictors under distribution shifts on graphs. arXiv preprint arXiv:2406.08788, 2024. [93] Andrea Rossi, Denilson Barbosa, Donatella Firmani, Antonio Matinata, and Paolo Merialdo. Knowledge graph embedding for link prediction: A comparative analysis. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(2):1–49, 2021. [94] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2):cnab014, 2021. [95] Daniel Ruffinelli, Samuel Broscheit, and Rainer Gemulla. You can teach an old dog new tricks! on training knowledge graph embeddings. In International Conference on Learning Representations. 88 [96] Ali Sadeghian, Mohammadreza Armandpour, Patrick Ding, and Daisy Zhe Wang. Drum: End-to-end differentiable rule mining on knowledge graphs. Advances in Neural Information Processing Systems, 32, 2019. [97] Tara Safavi and Danai Koutra. Codex: A comprehensive knowledge graph completion In Proceedings of the 2020 Conference on Empirical Methods in Natural benchmark. Language Processing (EMNLP), pages 8328–8350, 2020. [98] Tara Safavi and Danai Koutra. Codex: A comprehensive knowledge graph completion In Proceedings of the 2020 Conference on Empirical Methods in Natural benchmark. Language Processing (EMNLP), pages 8328–8350, 2020. [99] Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3):e0118432, 2015. [100] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European semantic web conference, pages 593–607. Springer, 2018. [101] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018. [102] Balasubramaniam Srinivasan and Bruno Ribeiro. On the equivalence between positional In International Conference on node embeddings and structural graph representations. Learning Representations, 2019. [103] Arjun Subramonian, Jian Kang, and Yizhou Sun. Theoretical and empirical insights into the origins of degree bias in graph neural networks. arXiv preprint arXiv:2404.03139, 2024. [104] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. Rotate: Knowledge graph embedding by relational rotation in complex space. In International Conference on Learning Representations, 2019. [105] Zhiqing Sun, Shikhar Vashishth, Soumya Sanyal, Partha Talukdar, and Yiming Yang. A re-evaluation of knowledge graph completion methods. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5516–5522, 2020. [106] Jingru Tan, Changbao Wang, Buyu Li, Quanquan Li, Wanli Ouyang, Changqing Yin, and In Proceedings of the Junjie Yan. Equalization loss for long-tailed object recognition. IEEE/CVF conference on computer vision and pattern recognition, pages 11662–11671, 2020. [107] Xianfeng Tang, Huaxiu Yao, Yiwei Sun, Yiqi Wang, Jiliang Tang, Charu Aggarwal, Prasenjit Mitra, and Suhang Wang. Investigating and mitigating degree-related biases in graph convol- 89 tuional networks. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 1435–1444, 2020. [108] Zhenwei Tang, Shichao Pei, Zhao Zhang, Yongchun Zhu, Fuzhen Zhuang, Robert Hoehndorf, and Xiangliang Zhang. Positive-unlabeled learning with adversarial data augmentation for knowledge graph completion. arXiv preprint arXiv:2205.00904, 2022. [109] Komal Teru, Etienne Denis, and Will Hamilton. Inductive relation prediction by subgraph reasoning. In International Conference on Machine Learning, pages 9448–9457. PMLR, 2020. [110] Sunil Thulasidasan, Gopinath Chennupati, Jeff A Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. Advances in Neural Information Processing Systems, 32, 2019. [111] Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd workshop on continuous vector space models and their compositionality, pages 57–66, 2015. [112] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International conference on machine learning, pages 2071–2080. PMLR, 2016. [113] Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha Talukdar. Composition-based In International Conference on Learning multi-relational graph convolutional networks. Representations, 2019. [114] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. [115] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. stat, 1050:20, 2017. [116] Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization. Advances in neural information processing systems, 34:2215–2227, 2021. [117] Haorui Wang, Haoteng Yin, Muhan Zhang, and Pan Li. Equivariant and stable positional encoding for more powerful graph neural networks. arXiv preprint arXiv:2203.00199, 2022. [118] Peifeng Wang, Shuangyin Li, and Rong Pan. Incorporating gan for negative sampling in knowledge representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [119] Tao Wang, Yu Li, Bingyi Kang, Junnan Li, Junhao Liew, Sheng Tang, Steven Hoi, and Jiashi 90 Feng. The devil is in classification: A simple framework for long-tail instance segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 728–744. Springer, 2020. [120] Xiyuan Wang, Haotong Yang, and Muhan Zhang. Neural common neighbor with completion for link prediction. arXiv preprint arXiv:2302.00890, 2023. [121] Xiyuan Wang, Haotong Yang, and Muhan Zhang. Neural common neighbor with completion for link prediction. arXiv preprint arXiv:2302.00890, 2023. [122] Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella Yu. Long-tailed recognition In International Conference on Learning by routing diverse distribution-aware experts. Representations, 2021. [123] Zhitao Wang, Yong Zhou, Litao Hong, Yuanhang Zou, Hanjing Su, and Shouzhi Chen. Pairwise learning for neural link prediction. arXiv preprint arXiv:2112.02936, 2021. [124] Qitian Wu, Wentao Zhao, Zenan Li, David P Wipf, and Junchi Yan. Nodeformer: A scalable graph structure learning transformer for node classification. Advances in Neural Information Processing Systems, 35:27387–27401, 2022. [125] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020. [126] Wenhan Xiong, Thien Hoang, and William Yang Wang. Deeppath: A reinforcement learning method for knowledge graph reasoning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 564–573, 2017. [127] Bishan Yang, Scott Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of the International Conference on Learning Representations (ICLR) 2015, 2015. [128] Fan Yang, Zhilin Yang, and William W Cohen. Differentiable learning of logical rules for knowledge base reasoning. Advances in neural information processing systems, 30, 2017. [129] Yang Yang, Ryan N Lichtenwalter, and Nitesh V Chawla. Evaluating link prediction methods. Knowledge and Information Systems, 45(3):751–782, 2015. [130] Zhen Yang, Ming Ding, Chang Zhou, Hongxia Yang, Jingren Zhou, and Jie Tang. Under- standing negative sampling in graph representation learning. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1666–1676, 2020. [131] Zhilin Yang, William Cohen, and Ruslan Salakhudinov. Revisiting semi-supervised learning 91 with graph embeddings. In International conference on machine learning, pages 40–48. PMLR, 2016. [132] Liang Yao, Chengsheng Mao, and Yuan Luo. Graph convolutional networks for text classi- fication. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 7370–7377, 2019. [133] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34:28877–28888, 2021. [134] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 974–983, 2018. [135] Bo Yuan and Xiaoli Ma. Sampling+ reweighting: Boosting the performance of adaboost In The 2012 international joint conference on neural networks on imbalanced datasets. (IJCNN), pages 1–6. IEEE, 2012. [136] Seongjun Yun, Seoyoon Kim, Junhyun Lee, Jaewoo Kang, and Hyunwoo J Kim. Neo-gnns: Neighborhood overlap-aware graph neural networks for link prediction. Advances in Neural Information Processing Systems, 34:13683–13694, 2021. [137] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond In International Conference on Learning Representations, empirical risk minimization. 2018. [138] L Zhang, Z Deng, K Kawaguchi, A Ghorbani, and J Zou. How does mixup help with robustness and generalization? In International Conference on Learning Representations, 2021. [139] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. Advances in neural information processing systems, 31, 2018. [140] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. Advances in neural information processing systems, 31, 2018. [141] Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. Labeling trick: A theory of using graph neural networks for multi-node representation learning. Advances in Neural Information Processing Systems, 34:9061–9073, 2021. [142] Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. Labeling trick: A theory of using graph neural networks for multi-node representation learning. Advances in Neural Information Processing Systems, 34:9061–9073, 2021. 92 [143] Yongqi Zhang and Quanming Yao. Knowledge graph reasoning with relational digraph. In Proceedings of the ACM Web Conference 2022, pages 912–924, 2022. [144] Yongqi Zhang, Zhanke Zhou, Quanming Yao, Xiaowen Chu, and Bo Han. Adaprop: Learn- ing adaptive propagation for graph neural network based knowledge graph reasoning. In KDD, 2023. [145] He Zhao, Lan Du, and Wray Buntine. Leveraging node attributes for incomplete relational data. In International conference on machine learning, pages 4072–4081. PMLR, 2017. [146] Tianxiang Zhao, Xiang Zhang, and Suhang Wang. Graphsmote: Imbalanced node classifi- cation on graphs with graph neural networks. In Proceedings of the 14th ACM international conference on web search and data mining, pages 833–841, 2021. [147] Tong Zhao, Wei Jin, Yozen Liu, Yingheng Wang, Gang Liu, Stephan Günneman, Neil Shah, and Meng Jiang. Graph data augmentation for graph machine learning: A survey. arXiv preprint arXiv:2202.08871, 2022. [148] Tong Zhao, Gang Liu, Daheng Wang, Wenhao Yu, and Meng Jiang. Counterfactual graph learning for link prediction. arXiv preprint arXiv:2106.02172, 2021. [149] Tong Zhao, Yozen Liu, Leonardo Neves, Oliver Woodford, Meng Jiang, and Neil Shah. Data augmentation for graph neural networks. In Proceedings of the aaai conference on artificial intelligence, volume 35, pages 11015–11023, 2021. [150] Boyan Zhou, Quan Cui, Xiu-Shen Wei, and Zhao-Min Chen. Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9719–9728, 2020. [151] Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. Predicting missing links via local information. The European Physical Journal B, 71:623–630, 2009. [152] Zhaocheng Zhu, Xinyu Yuan, Louis-Pascal Xhonneux, Ming Zhang, Maxime Gazeau, and Jian Tang. Learning to efficiently propagate for reasoning on knowledge graphs. arXiv preprint arXiv:2206.04798, 2022. [153] Zhaocheng Zhu, Zuobai Zhang, Louis-Pascal Xhonneux, and Jian Tang. Neural bellman- ford networks: A general graph neural network framework for link prediction. Advances in Neural Information Processing Systems, 34:29476–29490, 2021. [154] Zhaocheng Zhu, Zuobai Zhang, Louis-Pascal Xhonneux, and Jian Tang. Neural bellman- ford networks: A general graph neural network framework for link prediction. Advances in Neural Information Processing Systems, 34:29476–29490, 2021. 93 APPENDIX A EVALUATING GRAPH NEURAL NETWORKS FOR LINK PREDICTION: CURRENT PITFALLS AND NEW BENCHMARKING A.1 Common Neighbor Distribution In Figure 2.1 we demonstrate the common neighbor (CN) distribution among positive and negative test samples for ogbl-collab, ogbl-ppa, and ogbl-citation2. These results demonstrate that a vast majority of negative samples have no CNs. Since CNs is a typically good heuristic, this makes it easy to identify most negative samples. We further present the CN distribution of Cora, Citeseer, Pubmed, and ogbl-ddi in Figure A.1. The CN distribution of Cora, Citeseer, and Pubmed are consistent with our previous observations on the OGB datasets in Figure 2.1. We note that ogbl-ddi exhibits a different distribution with other datasets. As compared to the other datasets, most of the negative samples in ogbl-ddi have common neighbors. This is likely because ogbl-ddi is considerably denser than the other graphs. As shown in Table A.1, the average node degree in ogbl-ddi is 625.68, significantly larger than the second largest dataset ogbl-ppa with 105.25. Thus, despite the random sampling of negative samples, the high degree of node connectivity within the ogbl-ddi graph predisposes a significant likelihood for the occurrence of common neighbors. We also present the CN distributions under the HeaRT setting. The plots for Cora, Citeseer, Pubmed are shown in Figure A.2. The plots for the OGB datasets are shown in Figure A.3. We observe that the CN distribution of HeaRT is more aligned with the positive samples. This allows for a fairer evaluation setting by not favoring models that use CN information. A.2 Additional Definitions A.2.1 Evaluation Metrics In this section we define the various evaluation metrics used. Given a single positive sample and 𝑀 negative samples, we first score each sample and then rank the positive sample among the negatives. The rank is then given by rank𝑖. I.e., a rank of 1 indicates that the positive sample 94 (a) Cora (b) Citeseer (c) Pubmed (d) ogbl-ddi Figure A.1 Common neighbor distribution for the positive and negative test samples for the Cora, Citeseer, Pubmed, and ogbl-ddi under the existing evaluation setting. has a higher score than all negatives. The hope is that the positive sample ranks above most or all negative samples. Various metrics make use of this rank. We use 𝑁 to denote the number of positive samples. Hits@K. It measures whether the true positive is within the top K predictions or not: Hits@K = 1 𝑁 (cid:205)𝑁 𝑖=1 1(rank𝑖 ≤ K). rank𝑖 is the rank of the 𝑖-th sample. The indicator function 1 is 1 if rank𝑖 ≤ K, and 0 otherwise. Mean Reciprocal Rank (MRR). It is the mean of the reciprocal rank over all positive samples: MRR = 1 𝑁 (cid:205)𝑁 𝑖=1 1 rank𝑖 , where rank𝑖 is the rank of the 𝑖-th sample. AUC. It measures the likelihood that a positive sample is ranked higher than a random negative sample: AUC = (cid:205) 𝑖 ∈ D0 (cid:205) 𝑗 ∈ D1 1(rank𝑖 24h >24h 59.95 ± 2.52 47.93 ± 3.18 13.26 14.96 25.64 0 13.26 11.22 ± 1.91 9.33 ± 2.83 0.16 ± 0.0 11.17 ± 2.93 OOM 19.37 ± 2.65 OOM 21.81 ± 4.3 26.33 ± 2.63 26.16 ± 1.24 40.29 ± 2.22 40.1 ± 1.06 OOM OOM 19.67 21.83 38.81 0 19.67 19.22 ± 1.69 21.08 ± 3.92 0.26 ± 0.03 21.04 ± 3.11 OOM 31.3 ± 2.36 OOM 36.88 ± 4.06 38.18 ± 1.32 37.95 ± 1.45 53.35 ± 1.77 52.09 ± 1.99 OOM OOM 77.99 77.99 77.99 >24h 78 82.8 ± 0.13 70.8 ± 12.0 74.16 ± 0.1 98.01 ± 0.04 OOM 97.48 ± 0.03 OOM 94.61 ± 0.11 97.79 ± 0.07 97.05 ± 0.07 97.97 ± 0.03 97.22 ± 0.78 OOM OOM ogbl-citation2 Hits@50 77.99 77.99 77.99 >24h 78 92.33 ± 0.1 74.48 ± 10.42 86.59 ± 0.08 99.03 ± 0.02 OOM 98.75 ± 0.03 OOM 95.0 ± 0.12 98.86 ± 0.04 98.75 ± 0.03 99.02 ± 0.02 98.2 ± 0.71 OOM OOM Hits@100 77.99 77.99 77.99 >24h 78 96.44 ± 0.03 75.5 ± 10.13 93.14 ± 0.06 99.48 ± 0.02 OOM 99.3 ± 0.02 OOM 95.37 ± 0.14 99.38 ± 0.03 99.41 ± 0.02 99.5 ± 0.01 98.77 ± 0.6 OOM OOM 103 A.7 Additional Details on HeaRT As described in Section 2.4.2, given a positive sample (𝑎, 𝑏), we seeks to generate 𝐾 negative samples to evaluate against. The negative samples are drawn from the set of possible corruptions of (𝑎, 𝑏), i.e, 𝑆(𝑎, 𝑏) (see Eq. (2.3)). Multiple heuristics are used to determine which 𝐾 negative samples to use. Furthermore, the negative samples are split evenly between both nodes. That is, we generate 𝐾/2 negative samples that contain either node 𝑎 and 𝑏, respectively. This process is illustrated in Figure 2.2. The rest of this section is structured as follows. In Section A.7.1 we describe how we use multiple heuristics for estimating the difficulty of negative samples. Then in Section A.7.2 we describe how we combine the ranks given by different heuristic methods. A.7.1 Determining Hard Negative Samples We are first tasked with how to choose the negative samples. As discussed and shown in Section 2.4.2, we want to select the negative samples from 𝑆(𝑎, 𝑏) such that they are non-trivial to classify. Hence, as inspired by the candidate generation process in real-world recommender systems [41, 33], we aim to select a set of ’hard’ negative samples that are more relevant to the source node. The candidate generation process is typically based on some primitive and simple link prediction heuristics. These heuristics can be also treated as link prediction methods (see Tables 2.1 and 2.2). We use multiple heuristics that capture a variety of different information. Most link prediction heuristics can be categorized into two main categories: local heuristics and global heuristics [69]. Local heuristics attempt to capture the local neighborhood information that exists near the node pair while global heuristics attempt to use the whole graph structure. To capture the local information we use resource allocation (RA) [151], a CN-based approach. Existing results show that RA can achieve strong performance on most datasets (see Tables 2.1 and 2.2). To measure the global information we use the personalized pagerank score (PPR) [15]. Random walk based methods are commonly used for candidate generation [41, 33]. Lastly, we further include the cosine feature similarity for the Cora, Citeseer, and Pubmed datasets. This is due to the strong performance of 104 a MLP on those datasets. By combining these heuristics, we are able to generate a diverse set of negative samples for each positive sample. For each heuristic we then rank all the possible negative samples. We first denote the score of a heuristic 𝑖 for a pair of nodes 𝑎 and 𝑏 as ℎ𝑖 (𝑎, 𝑏). Let’s say we want to rank all negative samples that contain a node 𝑎, i.e., (𝑎, ∗). The rank across all nodes is given by: 𝑅𝑖 = ArgSort ℎ𝑖 (𝑎, 𝑣), 𝑣∈ ¯𝑉 (A.2) where 𝑅𝑖 denotes the ranking for heuristic 𝑖 and and ¯𝑉 is a subset of the set of nodes in the graph 𝑉. We apply a filtering process to exclude all positive training samples, self-loops, and the sample itself under consideration from being selected as negative samples. Additionally, when choosing negative samples for the test samples, we disallow validation samples to be chosen as well. As such, we only consider a subset of nodes ¯𝑉 ∈ 𝑉. This is analogous to the filtered setting used in KGC [13]. However, we adopt a distinct filtering strategy for the ogbl-collab dataset, which is a dynamic collaboration graph. Specifically, positive training samples are not excluded in the generation of negative samples for validation and test. Similarly, positive validation samples are not omitted when creating negative test samples. This approach is justified for ogbl-collab which is a dynamic graph, as prior collaboration between authors does not necessarily indicates future collaborations. Further details and discussions are provided in Appendix A.9. We now have all possible negative samples ranked according to multiple heuristics. However, it is unclear how to choose the negative samples from multiple ranked lists. In the next subsection we detail how we combine the ranks according to each heuristic. This will give us a final ranking, of which we can choose the top 𝐾/2 as the negative samples for that node. A.7.2 Combining Heuristic Ranks In this subsection we tackle the problem of combing the negative sample ranks given by multiple heuristics. More concretely, say we use 𝑚 heuristics and rank all the samples according to each. We want to arrive at a combined ranking 𝑅total that is composed of each rank, 𝑅total = 𝜙(𝑅1, 𝑅2, · · · , 𝑅𝑚). (A.3) 105 Algorithm A.1 Generating Negative Samples of Form (𝑎, ∗) Require: 𝑎 = Node to generate samples for ¯𝑉 = Possible nodes to use for negative samples H = {ℎ1, ℎ2, · · · , ℎ𝑚} 1: for 𝑖 ∈ |H | do 2: 𝑅𝑖 = ArgSort ℎ𝑖 (𝑎, 𝑣) 𝑣 ∈ ¯𝑉 3: end for 4: for 𝑣 ∈ ¯𝑉 do 5: 6: end for 𝑅total(𝑎, 𝑣) = min (𝑅1(𝑎, 𝑣), 𝑅2(𝑎, 𝑣), · · · , 𝑅𝑚(𝑎, 𝑣)) 7: 𝑅 𝑓 = ArgSort 𝑣 ∈ ¯𝑉 𝑅total(𝑎, 𝑣) 8: return 𝑅 𝑓 [: 𝐾/2] ⊲ Set of 𝑚 heuristics ⊲ Sort by each heuristic individually ⊲ Combine the rankings ⊲ Sort by combined ranking ⊲ Return the top K/2 ranked nodes We model 𝜙 via Borda’s method [12]. Let 𝑅𝑖 (𝑎, 𝑣) be the rank of the node pair (𝑎, 𝑣) for heuristic 𝑖. The combined rank 𝑅total(𝑎, 𝑣) across 𝑚 ranked lists is given by: 𝑅total(𝑎, 𝑣) = 𝑔 (𝑅1(𝑎, 𝑣), 𝑅2(𝑎, 𝑣), · · · , 𝑅𝑚 (𝑎, 𝑣)) , (A.4) where 𝑔 is an aggregation function. We set 𝑔 = min(·). This is done as it allows us to capture a more distinct set of samples by selecting the “best" for each heuristic. This is especially true when there is strong disagreement between the different heuristics. A final ranking is then done on 𝑅total to select the top nodes, 𝑅 𝑓 = ArgSort 𝑅total, (A.5) 𝑣∈ ¯𝑉 where 𝑅 𝑓 is the final ranking. The highest 𝐾/2 nodes are then selected from 𝑅 𝑓 . Lastly, we note that for some nodes there doesn’t exist sufficient scores to rank 𝐾/2 total nodes. In this case the remaining nodes are chosen randomly. The full generation process for a node 𝑎 is detailed in Algorithm A.1. A.8 Additional Results Under HeaRT We present additional results of Cora, Citeseer, Pubmed, and OGB datasets under HeaRT in Table A.8 and Table A.9. These results include other hit metrics not found in the main tables. 106 Table A.8 Additional results on Cora, Citeseer, and Pubmed(%) under HeaRT. We highlight the results ranked first, second, and third as green, blue, and orange, respectively. Models Hits1 CN AA RA Shortest Path Katz Node2Vec MF MLP GCN GAT SAGE GAE SEAL BUDDY Neo-GNN NCN NCNC NBFNet PEG 3.98 5.31 5.31 0.57 4.64 5.69 ± 0.81 1.46 ± 0.8 5.48 ± 0.99 7.59 ± 0.61 5.03 ± 0.81 5.48 ± 0.97 9.72 ± 0.73 3.89 ± 2.04 5.88 ± 0.60 5.71 ± 0.41 4.85 ± 0.81 4.78 ± 0.71 5.31 ± 1.16 6.98 ± 0.57 Cora Hits3 10.25 12.71 12.52 2.85 11.95 15.1 ± 0.99 5.46 ± 1.67 14.15 ± 1.56 17.46 ± 0.82 13.66 ± 0.67 15.43 ± 1.07 19.24 ± 0.76 10.82 ± 4.04 13.76 ± 1.03 13.89 ± 0.82 14.46 ± 0.98 14.72 ± 1.24 14.95 ± 0.72 14.93 ± 0.61 Hits100 Hits1 38.71 38.9 38.52 55.6 59.96 77.21 ± 2.34 59.68 ± 3.41 77.0 ± 1.02 85.47 ± 0.52 80.87 ± 1.32 81.61 ± 0.96 79.66 ± 0.95 61.9 ± 13.97 82.46 ± 1.79 80.28 ± 1.08 84.14 ± 1.24 85.62 ± 0.83 76.24 ± 0.68 82.52 ± 1.28 2.2 3.96 4.18 0.22 3.74 9.63 ± 0.82 2.68 ± 0.92 10.44 ± 0.82 9.27 ± 0.99 8.02 ± 1.21 8.37 ± 1.62 13.81 ± 0.82 5.08 ± 1.31 10.09 ± 0.50 6.81 ± 0.73 16.77 ± 2.05 11.14 ± 0.82 5.95 ± 1.06 9.93 ± 0.6 Citeseer Hits3 9.45 12.09 11.21 3.52 11.87 23.5 ± 1.37 7.25 ± 0.98 26.46 ± 1.24 23.19 ± 0.98 20.09 ± 0.82 23.74 ± 1.62 27.71 ± 1.34 13.68 ± 1.32 26.11 ± 1.26 17.8 ± 1.19 30.51 ± 0.97 27.21 ± 0.96 14.53 ± 1.19 21.91 ± 0.59 Hits100 Hits1 33.63 33.85 34.07 53.85 55.82 84.46 ± 1.86 53.25 ± 2.91 86.83 ± 1.36 89.1 ± 2.13 86.83 ± 1.09 92.33 ± 0.68 85.49 ± 1.37 68.94 ± 2.3 92.66 ± 0.92 85.51 ± 1.01 90.42 ± 0.98 92.73 ± 1.16 72.66 ± 0.95 90.15 ± 1.43 0.47 0.74 0.72 0 0.74 0.75 ± 0.14 1.13 ± 0.24 1.28 ± 0.22 2.09 ± 0.31 1.14 ± 0.16 3.03 ± 0.46 1.48 ± 0.23 1.47 ± 0.32 2.24 ± 0.17 1.90 ± 0.24 1.13 ± 0.18 2.73 ± 0.49 >24h 0.88 ± 0.18 Pubmed Hits3 1.49 1.87 1.78 0.02 2.12 2.4 ± 0.32 3.25 ± 0.44 4.33 ± 0.28 5.58 ± 0.27 3.06 ± 0.36 8.19 ± 1.0 4.05 ± 0.39 4.71 ± 0.68 5.93 ± 0.21 6.07 ± 0.47 3.95 ± 0.24 7.05 ± 0.72 >24h 2.61 ± 0.39 Hits100 19.29 20.37 20.04 21.57 32.78 52.27 ± 0.65 50.56 ± 1.11 76.34 ± 0.79 73.59 ± 0.53 67.06 ± 0.69 79.47 ± 0.53 59.79 ± 0.67 65.81 ± 2.43 72.01 ± 0.46 76.57 ± 0.58 71.46 ± 0.97 79.22 ± 0.96 >24h 64.95 ± 1.81 Table A.9 Additional results on OGB datasets(%) under HeaRT. We highlight the results ranked first, second, and third as green, blue, and orange, respectively. Models Hits50 Hits100 Hits50 Hits100 Hits50 Hits100 Hits50 Hits100 ogbl-collab ogbl-ddi ogbl-ppa ogbl-citation2 CN AA RA Shortest Path Katz Node2Vec MF MLP GCN GAT SAGE GAE SEAL BUDDY Neo-GNN NCN NCNC NBFNet PEG 30.52 33.74 36.68 33.77 39.18 28.56 ± 0.17 30.83 ± 0.22 28.88 ± 0.32 35.29 ± 0.49 32.92 ± 1.41 33.48 ± 1.40 OOM 33.57 ± 0.84 39.04 ± 0.11 36.11 ± 2.36 34.53 ± 0.98 34.96 ± 3.80 OOM 30.12 ± 0.63 42.80 45.20 46.42 45.85 48.80 41.84 ± 0.25 43.23 ± 0.34 46.83 ± 0.33 50.83 ± 0.21 46.71 ± 0.84 48.33 ± 0.49 OOM 43.06 ± 1.09 50.49 ± 0.09 49.25 ± 0.81 45.69 ± 0.42 46.93 ± 2.04 OOM 45.40 ± 0.66 70.12 71.08 76.39 0 70.12 98.38 ± 0.7 95.52 ± 0.72 N/A 97.65 ± 0.68 98.15 ± 0.24 99.17 ± 0.11 28.29 ± 13.65 82.42 ± 3.37 97.81 ± 0.31 83.45 ± 11.03 98.43 ± 0.22 >24h >24h 84.21 ± 9.2 86.53 87.36 90.96 0 86.53 99.91 ± 0.01 99.54 ± 0.08 N/A 99.85 ± 0.06 99.93 ± 0.02 99.98 ± 0.01 48.34 ± 15.0 92.63 ± 2.05 99.93 ± 0.01 94.7 ± 4.82 99.96 ± 0.01 >24h >24h 95.76 ± 3.48 57.56 86.51 80.53 58.87 87.55 81.93 58.88 86.84 81.65 >24h 1.4 1.34 54.97 86.51 80.53 61.22 ± 0.16 81.88 ± 0.06 69.94 ± 0.06 29.64 ± 7.3 89.75 ± 1.9 83.29 ± 3.35 61.29 ± 0.07 22.01 ± 0.01 5.36 ± 0.0 70.77 ± 0.34 89.62 ± 0.23 81.48 ± 0.48 OOM OOM OOM 71.91 ± 0.1 89.46 ± 0.13 81.84 ± 0.24 OOM OOM OOM 65.11 ± 2.33 92.45 ± 0.26 87.34 ± 0.49 67.47 ± 0.32 88.36 ± 0.32 82.5 ± 0.51 81.21 ± 1.39 62.14 ± 0.51 88.31 ± 0.19 89.37 ± 0.28 93.11 ± 0.27 71.56 ± 0.03 72.85 ± 0.9 94.72 ± 0.18 91.0 ± 0.24 OOM OOM OOM OOM OOM OOM 68.04 69.39 68.83 >24h 67.56 77.11 ± 0.13 65.87 ± 8.37 76.94 ± 0.1 85.43 ± 0.18 OOM 85.86 ± 0.09 OOM 77.64 ± 2.43 81.94 ± 0.26 79.13 ± 0.42 84.01 ± 0.05 86.35 ± 0.51 OOM OOM 107 Table A.10 Results on ogbl-collab under HeaRT when excluding the positive train/validation samples of being negative samples during testing. We highlight the results ranked first, second, and third as green, blue, and orange, respectively. MRR Hits20 Hits50 Hits100 CN AA RA Shortest Path Katz Node2Vec MF MLP GCN GAT SAGE GAE SEAL BUDDY Neo-GNN NCN NCNC NBFNet PEG 12.60 16.40 28.14 46.71 47.15 12.10 ± 0.20 26.86 ± 1.74 12.61 ± 0.66 18.28 ± 0.84 10.97 ± 1.16 20.89 ± 1.06 OOM 22.53 ± 3.51 32.42 ± 1.88 21.90 ± 0.65 17.51 ± 2.50 19.02 ± 5.32 OOM 15.68 ± 1.10 27.51 32.65 41.16 46.56 48.66 25.85 ± 0.21 38.44 ± 0.07 23.05 ± 0.89 32.90 ± 0.66 29.58 ± 2.42 33.83 ± 0.93 OOM 36.48 ± 2.55 45.62 ± 0.52 38.40 ± 0.29 37.07 ± 2.97 35.67 ± 6.78 OOM 29.74 ± 0.95 38.39 42.61 46.9 46.97 51.07 35.49 ± 0.22 43.62 ± 0.08 35.32 ± 0.74 43.17 ± 0.36 42.07 ± 1.51 43.02 ± 0.63 OOM 43.5 ± 1.75 50.57 ± 0.18 46.93 ± 0.17 45.89 ± 1.11 44.76 ± 4.64 OOM 38.71 ± 0.17 47.4 50.25 51.78 48.11 54.28 46.12 ± 0.34 51.75 ± 0.14 51.09 ± 0.37 54.93 ± 0.14 53.45 ± 0.64 54.38 ± 0.27 OOM 49.25 ± 1.39 55.63 ± 0.68 53.81 ± 0.19 52.36 ± 0.33 52.41 ± 2.09 OOM 49.34 ± 0.70 A.9 Additional Investigation on ogbl-collab As introduced in Section A.7.1, we adopt a different strategy to generate the hard negative samples for ogbl-collab which is a dynamic collaboration graph. In this dataset, nodes represent authors and edges represent a collaboration between two authors. Each edge further includes an attribute that specifies the year of collaboration. Specifically, each edge takes the form of (Author 1, Year, Author 2). The task is to predict collaborations in 2019 (test) based on those until 2017 (training) and 2018 (validation). Contrary to other datasets, we do not exclude the positive training samples when generating the negative samples for validation and test. We note that we also do not exclude positive validation edges when generating the negatives for test. In simpler terms, when creating negative samples for testing, both positive samples from training and validation are considered. This means that negative samples during testing could present in the training and validation positive samples. This approach is reasonable and well-aligned with the real-world scenario in the context of collaboration graphs. Specifically, authors who collaborated in the past might not do so in the future. For 108 instance, just because the positive sample (Author 2, 2017, Author 3) exists, it does not imply that (Author 2, 2019, Author 3) is also true. However, this is implied to be true if we exclude positive train/validation samples from appearing as negative samples during testing. We validate this approach by contrasting it with the strategy that excludes train/validation data when generating hard negatives. The results are presented in Table A.10. We observed that under this setting, both the Shortest Path and Katz perform considerably well on ogbl-collab. Specifically, the MRR gap between the second-ranked method (Shortest Path) and the third (BUDDY) is 14.29. We found that it is due to the ogbl-collab being a dynamic graph. Of the positive samples in the test set, around 46% also appear as positive samples in the training set. In particular, an edge (Author 1, 2017, Author 2) in the training data may also “appear” in the test data in the form of (Author 1, 2019, Author 2). This is because two authors who collaborated in the past often tend to collaborate again in the future. As such, when evaluating the test sample (Author 1, 2019, Author 2), there exists path of length 1 between the two authors in the graph. Furthermore, this phenomenon is common among positive samples but not observed among negatives. This is because we exclude the positive training samples when generating the negative samples for evaluation. As a result of this exclusion, the presence of a direct link (i.e., a shortest path of length 1) between two authors suggest a positive sample, while its absence often corresponds to a negative sample. As such, it provides an easy “shortcut” to distinguish positive and negative samples during testing. This explains why methods like Shortest Path and Katz can achieve good performance on ogbl-collab when excluding positive train/validation samples of being negative samples during testing. On the contrary, when allowing positive train/validation samples to also be negative samples for HeaRT, the results on ogbl-collab, shown in Tables 2.5 and A.9, indicate that Shortest Path does not maintain its superior performance as observed in Table A.10. Additionally, the overall results under HeaRT are inferior to the ones in Table 2.5. For instance, while all the MRR values in Table 2.5 exceed 10, the highest MRR in Table A.10 is approximately 6. This indicates that excluding those positive samples from being negative samples disproportionately helps the Shortest Path. 109 A.10 Dataset Licenses The license for each dataset can be found in Table A.11. Table A.11 Dataset Licenses. Datasets License Cora Citeseer Pubmed NLM License NLM License NLM License ogbl-collab MIT License MIT License MIT License ogbl-citation2 MIT License ogbl-ddi ogbl-ppa A.11 Limitation One potential limitation of HeaRT lies in the generation of customized negative samples for each positive sample. This design may result in an increased number of negative samples compared to the existing setting. Although this provides a more realistic evaluation, it could have an impact on the efficiency of the evaluation process, especially in scenarios where a significant number of positive samples exist. Nonetheless, this limitation does not detract from the potential benefits of HeaRT in providing a more realistic and meaningful link prediction evaluation setting. Furthermore, as each evaluation node pair is independent, it offers scope for parallelization, mitigating any potential efficiency concerns to a large extent. Future work can investigate ways to optimize this process. A.12 Social Impact Our method HeaRT harbors significant potential for positive societal impact. By aligning the evaluation setting more closely with real-world scenarios, it enhances the applicability of link pre- diction research. This not only contributes to the refinement of existing prediction methods but also stimulates the development of more effective link prediction methods. As link prediction has far- reaching implications across numerous domains, from social network analysis to recommendation systems and beyond, improving its performance and accuracy is of paramount societal importance. We also carefully consider the broader impact from various perspectives such as fairness, security, and harm to people. No apparent risk is related to our work. 110 APPENDIX B LPFORMER: AN ADAPTIVE GRAPH TRANSFORMER FOR LINK PREDICTION B.1 Existing Formulations of Pairwise Encodings In this section we give an overview of existing formulations of pairwise encodings using in DP-MPNNs. The standard formulation of DP-MPNNs is given in Eq. 3.1 where 𝑠(𝑎, 𝑏) is the pairwise encoding. We briefly describe other existing solutions below: NCN [121]: NCN only considers the CNs of the target link (𝑎, 𝑏) by summing the node represen- tation of each. The pairwise encoding, 𝑠(𝑎, 𝑏), is written as: 𝑠(𝑎, 𝑏) = ∑︁ h𝑢, 𝑢∈N CN (𝑎,𝑏) (B.1) where h𝑢 is the node representation encoded by a MPNN. NCNC [121]: NCNC extends NCN by further considering the 1-hop neighbors of the node pair that aren’t CNs. To account for the difference, they are weighted by the probability of they themselves being CNs of the other node in the pair. This is given for a target link (𝑎, 𝑏) as: where 𝑠(𝑎, 𝑏) = ∑︁ 𝑢∈V 𝑤(𝑎, 𝑏, 𝑢) h𝑢, 𝑤(𝑎, 𝑏, 𝑢) = 1, when 𝑢 ∈ N CN (𝑎,𝑏) NCN( 𝐴, 𝑋, 𝑏, 𝑢) when 𝑢 ∈ N (𝑎) NCN( 𝐴, 𝑋, 𝑎, 𝑢) when 𝑢 ∈ N (𝑏) 0, else    .    (B.2) (B.3) This weighting scheme ensures that CNs play a larger role in the pairwise information than non-CNs. BUDDY [121]: BUDDY considers counting the number of nodes that correspond to different labels given by the double radius node labeling trick [142]. We first define the number of nodes that are a distance 𝑑𝑎 and 𝑑𝑏 from nodes 𝑎 and 𝑏 as A𝑎𝑏 [𝑑𝑎, 𝑑𝑏]. We further define the number of nodes 111 where max(𝑑𝑢, 𝑑𝑣) > 𝑘 as 𝛽𝑎𝑏 [𝑑]. The pairwise encoding concatenates the counts belonging to all combination of 𝑑 = 1 · · · 𝑘. The counts are estimated using subgraph sketching algorithms [35, 16] and are denoted ˆA and ˆB. The pairwise encoding for a target link (𝑎, 𝑏) is given by the following where [𝑘] = {1 · · · 𝑘 }: 𝑠 ˆA (𝑎, 𝑏) = ∥ ˆA𝑎𝑏 [𝑑𝑎, 𝑑𝑏], 𝑑𝑎,𝑑𝑏∈[𝑘] 𝑠 ˆ𝛽 (𝑎, 𝑏) = ∥ ˆ𝛽𝑎𝑏 [𝑑], 𝑑∈[𝑘] 𝑠(𝑎, 𝑏) = 𝑠 ˆA (𝑎, 𝑏) ∥ 𝑠 ˆ𝛽 (𝑎, 𝑏). (B.4) (B.5) (B.6) Neo-GNN [121]: Neo-GNN considers the higher-order neighbor overlap between two nodes. This is done by first learning a structural representation for each node 𝑖, 𝑥𝑠𝑡𝑟𝑢𝑐𝑡 𝑖 . This is given by: 𝑥𝑠𝑡𝑟𝑢𝑐𝑡 𝑖 = 𝑓1 (cid:169) (cid:173) (cid:171) ∑︁ 𝑓2 (cid:0)𝐴𝑖 𝑗 (cid:1)(cid:170) (cid:174) (cid:172) . (B.7) 𝑗 ∈N (𝑖) To consider the 𝐿-hop structural information, the structural representations are diffused over 𝐿 hops and weighted by a hyperparameter 𝛽: 𝑍 = MLP (cid:32) 𝐿 ∑︁ 𝛽𝑙−1 𝐴𝑙 𝑋 𝑠𝑡𝑟𝑢𝑐𝑡 (cid:33) , 𝑙=1 where 𝑋 = diag(𝑥𝑠𝑡𝑟𝑢𝑐𝑡). The pairwise encoding 𝑠(𝑎, 𝑏) is the dot product of both the final representations, 𝑠(𝑎, 𝑏) = 𝑧𝑇 𝑎 𝑧𝑏. (B.8) (B.9) (B.10) B.2 Special Cases of the General Pairwise Encoding In this section we demonstrate that multiple popular heuristics and pairwise encodings can be formulated as special cases of the general pairwise encoding given in Eq. (3.2). Common Neighbors (CNs) [81]: The CNs of a pair of nodes (𝑎, 𝑏) is defined the overlapping 1-hop neighbors of both nodes: N CN (𝑎,𝑏) = N (𝑎) ∩ N (𝑏). (B.11) 112 Eq. (3.2) is equal to the CNs when ℎ(𝑎, 𝑏, 𝑢) = 1 and 𝑤(𝑎, 𝑏, 𝑢) is: 𝑤(𝑎, 𝑏, 𝑢) =    1, when 𝑢 ∈ N (𝑎) ∩ N (𝑏) 0, else .    (B.12) Adamic-Adar (AA) [2]: AA is defined as the reciprocal log-degree weighted CN score where 𝑑𝑢 is the degree of node 𝑢: AA(𝑎, 𝑏) = ∑︁ 𝑢∈N CN (𝑎,𝑏) 1 log(𝑑𝑢) . (B.13) Eq. (3.2) can be rewritten as the AA when ℎ(𝑎, 𝑏, 𝑢) = 1/log(𝑑𝑢) and 𝑤(𝑎, 𝑏, 𝑢) is equal to Eq. (B.12). Resource Allocation (RA) [151]: RA is defined as the reciprocal degree weighted CN score: RA(𝑎, 𝑏) = ∑︁ 𝑢∈N CN (𝑎,𝑏) . 1 𝑑𝑢 (B.14) Eq. (3.2) can be rewritten as the AA when ℎ(𝑎, 𝑏, 𝑢) = 1/𝑑𝑢 and 𝑤(𝑎, 𝑏, 𝑢) is equal to Eq. (B.12). Katz Index [49]: The Katz index is a global structural measure. It is defined as weighted summation of the number of paths of different lengths connecting 𝑎 and 𝑏. It is given by the following where the decay weight 𝛽 ∈ [0, 1], This is equivalent to Eq. (3.2) when: Katz(𝑎, 𝑏) = 𝑤(𝑎, 𝑏, 𝑢) = ∞ ∑︁ 𝑙=1 ∞ ∑︁ 𝑙=1 𝛽𝑙 𝐴𝑙 𝑎,𝑏. 𝛽𝑙 𝑒𝑇 𝑎 𝐴𝑙, where 𝑒𝑖 ∈ B|V | is a one-hot vector for a node 𝑖. We further set, ℎ(𝑎, 𝑏, 𝑢) = 𝑒𝑇 𝑏 , when 𝑢 = 𝑏 0, else    .    (B.15) (B.16) (B.17) Personalized Pagerank (PPR) Score [15]: The personalized pagerank score is the pagerank score localized to a root node 𝑢. The localization is via a teleportation probability 𝛼 that transports the 113 random walk back to the root node. We show that Eq. (3.2) can be rewritten as the PPR score when setting ℎ(𝑎, 𝑏, 𝑢) equal to (B.17) and, following [24], setting 𝑤(𝑎, 𝑏, 𝑢) to: 𝑤(𝑎, 𝑏, 𝑢) = 𝛼 ∞ ∑︁ 𝑙=0 (1 − 𝛼)𝑙 𝑒𝑇 𝑎 (𝐷−1 𝐴)𝑙 . (B.18) Feature Similarity: The feature similarity of the pair of nodes (𝑎, 𝑏) is expressed by dis(x𝑎, x𝑏) where x𝑎 are the node features of node 𝑎 and dis(·) is a distance function (e.g., euclidean distance). This can be rewritten as Eq. (3.2) by substituting: 𝑤(𝑎, 𝑏, 𝑢) = dis(x𝑎, x𝑢), (B.19) and ℎ(𝑎, 𝑏, 𝑢) = 𝑒𝑇 𝑏 where 𝑒𝑖 ∈ B|V | is a one-hot vector for a node 𝑖. NCN [121]: The pairwise encoding used in NCN is defined as the summation of the representations for the CNs of a link. Eq. (3.2) can be rewritten as NCN when 𝑤(𝑎, 𝑏, 𝑢) is equal to Eq. (B.12). ℎ(𝑎, 𝑏, 𝑢) is equal to the node representation 𝑢 encoded by a MPNN, i.e., ℎ(𝑎, 𝑏, 𝑢) = h𝑢 where 𝐻 = MPNN( 𝐴, 𝑋). NCNC [121]: NCNC extends NCNC by further weighting the 1-hop (non-CN) by their probability of linking to the other nodes. Given Eq. (3.2), the weight 𝑤(𝑎, 𝑏, 𝑢) is equal to following where 1-hop neighbors are weighted by their probability of linking with the other node: 𝑤(𝑎, 𝑏, 𝑢) = 1, when 𝑢 ∈ N CN (𝑎,𝑏) NCN( 𝐴, 𝑋, 𝑏, 𝑢) when 𝑢 ∈ N (𝑎) NCN( 𝐴, 𝑋, 𝑎, 𝑢) when 𝑢 ∈ N (𝑏) 0, else    .    (B.20) NCN( 𝐴, 𝑋, 𝑎, 𝑢) is the probability of 𝑎 and 𝑢 being linked using the NCN model. We further define ℎ(𝑎, 𝑏, 𝑢) = h𝑢. Neo-GNN [136]: The pairwise encoding used in Neo-GNN considers the higher-order neighbor- hood overlap between two nodes. The formulation is given in Section B.2. When 𝑙 = 1, it can be 114 expressed using Eq. (3.2) by setting: and ℎ(𝑎, 𝑏, 𝑢) = 𝑓1 (cid:169) (cid:173) (cid:171) ∑︁ 𝑣∈N (𝑢) 𝑓2 ( 𝐴𝑢𝑣)(cid:170) (cid:174) (cid:172) 2 , 𝑤(𝑎, 𝑏, 𝑢) = 1, when 𝑢 ∈ N CN (𝑎,𝑏) 0, else    .    (B.21) (B.22) B.3 Proof of Proposition 1 Proposition 1. Consider a target link (𝑎, 𝑏) and a node 𝑢 ∈ V \ {𝑎, 𝑏}. The PPR [15] score of a root node 𝑖 and target node 𝑗 with teleportation probability 𝛼 is denoted by ppr(𝑖, 𝑗). Let 𝑟 𝑘 𝑎 (𝑢) be the probability of a walk of length 𝑘 beginning at node 𝑎 and terminating at 𝑢. We define 𝑟 𝑘 𝑎,𝑏 (𝑢) := 𝑟 𝑘 𝑏 (𝑢). We also define a weight 𝛾 𝑘 := 𝛼(1 − 𝛼) 𝑘 for all walks of length 𝑘. The PPR scores, 𝑝 𝑝𝑟 (𝑎, 𝑢) and 𝑝 𝑝𝑟 (𝑏, 𝑢), along with the random walk probabilities of disparate 𝑎 (𝑢) + 𝑟 𝑘 lengths, are interconnected through the following relationship. Γ(𝑎, 𝑏, 𝑢) = ppr(𝑎, 𝑢) + ppr(𝑏, 𝑢) = ∞ ∑︁ 𝑘=0 𝛾 𝑘𝑟 𝑘 𝑎,𝑏 (𝑢). Proof. Per [24], the PPR vector for a root node 𝑠, pr𝑠, is equivalent to: pr𝑠 = 𝛼 ∞ ∑︁ 𝑘=0 (1 − 𝛼) 𝑘𝑊 𝑘 𝑥𝑠, (3.6) (B.23) where 𝑊 is a the random walk matrix and 𝑥𝑠 is a preference vector that is a one-hot vector for element 𝑠. We note that pr𝑠 (𝑡) represents the landing probability of node 𝑡 given the root node 𝑠. 𝑠 = 𝑊 𝑘 𝑥𝑠 ∈ RV represents As such, by definition, pr𝑠 (𝑡) = ppr(𝑠, 𝑡). Furthermore, it is clear that 𝑟 𝑘 the probability of a walk of length 𝑘 beginning at node 𝑠 and stop all other nodes, individually. Also, the probabilities of all walks of length 𝑘 are weighted by 𝛾 𝑘 = 𝛼(1 − 𝛼) 𝑘 . Γ (𝑎, 𝑏, 𝑢) can be obtained by first taking the sum of the PPR vectors for nodes 𝑎 and 𝑏, pr𝑎 + pr𝑏 = 𝛼 pr𝑎,𝑏 = 𝛼 ∞ ∑︁ 𝑘=0 ∞ ∑︁ 𝑘=0 (1 − 𝛼) 𝑘𝑊 𝑘 𝑥𝑎 + 𝛼 ∞ ∑︁ 𝑘=0 (1 − 𝛼) 𝑘𝑊 𝑘 𝑥𝑏, (1 − 𝛼) 𝑘𝑊 𝑘 (𝑥𝑎 + 𝑥𝑏) , (B.24) 115 where pr𝑎,𝑏 = pr𝑎 + pr𝑏. From this, we can express Γ(𝑎, 𝑏, 𝑢) as: Γ(𝑎, 𝑏, 𝑢) = ppr(𝑎, 𝑢) + ppr(𝑏, 𝑢), = pr𝑎,𝑏 (𝑢), = pr𝑎 (𝑢) + pr𝑏 (𝑢), (B.25) which as shown in Eq. (B.24) is equivalent to the probability of a walk that originates from either node 𝑎 or 𝑏 and terminates at node 𝑢. This completes the proof. B.4 Attention Formulation For a target link (𝑎, 𝑏), LPFormer attends to the nodes in the set ¯V (𝑎, 𝑏). The attention mechanism used in LPFormer is defined in Section 5.3 as follows where 𝑤(𝑎, 𝑏, 𝑢) is the attention weight of 𝑢 to the target link and ¯V (𝑎, 𝑏) = V \ {𝑎, 𝑏}: ˜𝑤(𝑎, 𝑏, 𝑢) = 𝜙 (cid:16) 𝑤(𝑎, 𝑏, 𝑢) = (cid:17) , h𝑎, h𝑏, h𝑢, rpe(𝑎,𝑏,𝑢) exp( ˜𝑤(𝑎, 𝑏, 𝑢)) (cid:205)𝑣∈ ¯V (𝑎,𝑏) exp( ˜𝑤(𝑎, 𝑏, 𝑢)) . (B.26) The function 𝜙(·) is modeled via the attention mechanism defined in GATv2 [17]. We define 𝑎 ∈ R2𝑑′ and 𝑊 ∈ R𝑑×𝑑′ . The raw attention weights are then given by: ˜𝑤(𝑎, 𝑏, 𝑢) = a𝑇 LeakyReLU (cid:104) 𝑊 h𝑎 ∥ 𝑊 h𝑏 ∥ 𝑊 h𝑢 ∥ rpe(𝑎,𝑏,𝑢) (cid:105) . (B.27) The final attention weights, 𝑤(𝑎, 𝑏, 𝑢), are given by passing ˜𝑤(𝑎, 𝑏, 𝑢) through a softmax activation layer. B.5 Additional Experimental Details B.5.1 Planetoid splits We note that for each of Cora, Citeseer, Pubmed we use a fixed split. This follows the recent work of [61]. [61] observe that for Cora, Citeseer, Pubmed there exists no unified data split between studies. They find that while recent work [20, 121] use 10 random splits, prior work [154, 115] use a fixed split and train over 10 random seeds. Furthermore, there exists discrepancies in the preprocessing between those works that use the random splits. [20] only use the largest connected 116 component of each dataset while [121] use the whole dataset. This makes any comparison of the published results difficult. Due to these discrepancies, we use the performance on the fixed split given by [61], as it’s the only split where all methods are evaluated and compared under the same setting. B.5.2 Omission of ogbl-ddi under the Existing Evaluation We further omit the results of ogbl-ddi in Table 4.1. This is due to the observation made by [61] that there exists a poor relationship between the validation and test performance. This extends to recent pairwise MPNNs, including NCN [121], Neo-GNN [136], and BUDDY [20]. This makes tuning on the validation set difficult, as it doesn’t guarantee good test performance. Due to this, they observe that when tuning on a fixed set of hyperparameter ranges, they are unable to achieve comparable results to the reported performance. Often they observe that the performance is actually much lower. Due to these concerns we believe ogbl-ddi is not suitable for the task of transductive link prediction and don’t report the performance. For more details and discussion, please see Appendix D in [61]. However, they show that this problem does not afflict ogbl-ddi under the newly proposed HeaRT [61] evaluation setting. As such, we further include the results for our method under HeaRT in Table 3.5. B.5.3 Computation of the PPR Matrix We compute the PPR matrix via the efficient approximation algorithm introduced by [5]. The estimation is controlled by a tolerance parameter 𝜖. The parameter 𝜖 controls both the speed of computation and the sparsity of the solution (i.e., a higher value of 𝜖 will produce a sparser PPR matrix). We use: 𝜖 = 1𝑒−7 for Cora and Citeseer, 𝜖 = 5𝑒−5 for ogbl-collab and ogbl-ppa, 𝜖 = 1𝑒 − 5 for Pubmed, and 𝜖 = 5𝑒−3 for ogbl-Citation2. The value of 𝜖 is chosen as a trade-off between accuracy and sparsity to allow for ease of storage in GPU memory. B.5.4 Splitting Target Links by LP Factor In Section 3.4.3 we demonstrate the performance on samples that correspond to a single LP factor. In this section we further detail the algorithm used to determine the set of samples corresponding to each factor. We consider the three main factors: local structural information, 117 global structural information, and feature proximity. We measure each using a single representative heuristic: CNs [81] for local information, PPR [15] for global information, and cosine feature similarity for feature proximity. For each sample, we check if the score is only high in one heuristic. In this way, it tells us that there is a dominant factor present in the pairwise information. This determination is done by comparing the the heuristic scores of each target link against a threshold value. For a LP factor 𝑖 and target link (𝑎, 𝑏), we denote the heuristic score as 𝑠𝑖 (𝑎, 𝑏). The threshold value for factor 𝑖 is represented by ˆ𝑠𝑖 and is chosen such that it corresponds to a higher score. We desire ˆ𝑠𝑖 to be a higher score such that any score ≥ than it indicates that a plethora of pairwise information exists corresponding to factor 𝑖. This is done by setting the threshold equal to the 𝑝-th percentile value for that heuristic among all target links. For example, for CNs, the 80th percentile score on one dataset may be 9. The value of 𝑝 is chosen to be high (e.g., 80%) due to the aforementioned reasoning. Given these inputs, for each target link we compare the score for factor 𝑖 against the threshold value of that factor. Continuing our example, if (𝑎, 𝑏) only has 2 CNs, it is below the previously defined threshold. We only consider a sample as “belonging” to a single factor when it is 𝑠𝑖 (𝑎, 𝑏) ≥ ˆ𝑠𝑖 is true for one only one factor 𝑖. So if the heuristic score for (𝑎, 𝑏) is below the 𝑝-th percentile threshold for CNs and PPR but above for feature similarity, then feature proximity will be considered the dominant LP factor. However, if it’s above the threshold for both local and structural information, it will not be assigned to any group. This is done as we want to isolate links that only highly express one LP factor. This allows us to better understand how certain methods can model that specific factor. The detailed algorithm is given in Algorithm B.1. We note that each target link may not belong to a category. This can be due to there being no or many dominant LP factor. We further set the percentile equal to 90% on all datasets except for ogbl-collab for which we use 80%. These values were chosen as we wanted the percentile to be suitably high such that we are confident that the corresponding factor is relevant to the target link. Furthermore, we use a lower value for ogbl-collab as we found it produced a more even distribution of links by factor. 118 (a) Pubmed (b) ogbl-ppa Figure B.1 Performance for target links when there is only one LP factor strongly expressed. Results are on (a) Pubmed, (b) ogbl-ppa. We note that due the quality of features used, we omit the feature proximity factor for ogbl-ppa from our analysis. B.5.5 Additional Results for the LP Factor Experiments In Section 3.4.3 we observed the performance of various methods on target links where only a single LP factor is expressed. This is done through the use of heuristic scores. We further demonstrate the results on the Pubmed and ogbl-ppa datasets. Of note is that for ogbl-ppa the initial node features are one-hot vectors that signify the species that the protein belongs to. We observe that due to the sparseness of these features, feature proximity measures are unable to properly predict any target links on their own. As such, the factor corresponding to feature proximity is not expressed. We therefore exclude that factor for this analysis on ogbl-ppa. The results for both Pubmed and ogbl-ppa datasets are given in Figure B.1. As shown earlier in Figure 3.3, LPFormer can most consistently perform well across each factor. This suggests that LPFormer is best able to both model a variety of factors and adapt accordingly for each target link. B.5.6 Performance on Heterophilic Datasets In this section we evaluate LPFormer on multiple heterophilic datasets. Heterophily refers to the tendency of dissimilar nodes to be connected. This is as opposed to homophily, in which nodes with similar attributed are more likely to be connected. Since most graphs used for benchmark 119 datasets tend to contain homophilic patterns, heterophilic graphs present an interesting challenge regarding the effectiveness of graph-based methods. For a more detailed discussion on heterophilic graphs, please see [71]. We test on two prominent heterophilic datasets, Squirrel and Chameleon [94]. The statistics for each are in Table B.1. We limit our comparison to those LP methods that tend achieve the best results, including GCN, BUDDY, and NCNC. In Table B.2, we report the MRR over five random seeds. Note that we test under the original evaluation setting and not HeaRT. We observe that LPFormer can achieve a large increase over other methods, with a 14% and 9% increase in performance on Squirrel and Chameleon, respectively. These results indicate the superior ability of LPFormer to accurately model LP on heterophilic graphs, as compared to other methods. Table B.1 Heterophilic Dataset Statistics. Squirrel Chameleon #Nodes #Edges Split Ratio 5201 198,353 85/5/10 2277 31,371 85/5/10 Table B.2 Results on Heterophilic Datasets. Method GCN BUDDY NCNC LPFormer % Improvement Squirrel 22.77 ± 4.54 9.69 ± 0.99 32.37 ± 5.46 36.77 ± 2.77 14% Chameleon 20.74 ± 8.08 6.30 ± 2.40 26.24 ± 3.37 28.61 ± 6.68 9% B.5.7 More Efficiently Incorporating the PPR Scores In Figure 3.4 we compare the training time between LPFormer and NCNC. We observe that on the denser datasets, ogbl-ppa and ogbl-ddi, LPFormer is considerably more efficient. Furthermore, on ogbl-collab, both methods have a fast runtime. However, we find that LPFormer struggles on ogbl-citation2 in comparison to NCNC. We observe that this is due to the need of the PPR matrix, which while sparse, requires a large amount of memory and processing time. In the future, we plan 120 Algorithm B.1 Determining Samples by LP Factor Require: CN(·) = Maps (𝑖, 𝑗) to # of CNs of the pair PPR(·) = Maps (𝑖, 𝑗) to PPR score of the pair FS(·) = Maps (𝑖, 𝑗) to feature cosine similarity of the pair 𝑝 = Percentile used to determine whether a factor is present Etest = Positive test links 1: // Compute the score corresponding to the 𝑝-th percentile for each heuristic 2: ˆ𝑠CN = Percentile( 𝑝, {𝐶𝑁 (𝑖, 𝑗) | (𝑖, 𝑗) ∈ Etest}) 3: ˆ𝑠FS = Percentile( 𝑝, {𝐹𝑆(𝑖, 𝑗) | (𝑖, 𝑗) ∈ Etest}) 4: ˆ𝑠PPR = Percentile( 𝑝, {𝑃𝑃𝑅(𝑖, 𝑗) | (𝑖, 𝑗) ∈ Etest}) 5: Create empty lists 𝐿CN, 𝐿PPR, and 𝐿FS 6: for (𝑖, 𝑗) ∈ Etest do 7: 8: 9: link-cn = CN(𝑖, 𝑗) link-fs = FS(𝑖, 𝑗) link-ppr = PPR(𝑖, 𝑗) // Assign sample to corresponding list based on scores if link-cn ≥ ˆ𝑠CN and link-fs < ˆ𝑠FS and link-ppr < ˆ𝑠PPR then else if link-cn < ˆ𝑠CN and link-fs ≥ ˆ𝑠FS and link-ppr < ˆ𝑠PPR then else if link-cn < ˆ𝑠CN and link-fs < ˆ𝑠FS and link-ppr ≥ ˆ𝑠PPR then Append(𝐿CN, (𝑖, 𝑗)) Append(𝐿FS, (𝑖, 𝑗)) 10: 11: 12: 13: 14: 15: 16: 17: 18: end for 19: return 𝐿CN, 𝐿PPR, 𝐿FS end if Append(𝐿PPR, (𝑖, 𝑗)) to fix this problem by performing a simple and efficient pre-processing step. Specifically, before training, we can iterate over all target links and extract the relevant PPR scores. This would obviate the need to store the PPR matrix and determine the nodes for each link. Furthermore, this only needs to be done once before tuning the model. This would greatly reduce the storage and time needed to train LPFormer on all datasets and is an avenue we plan to explore in the future. 121 APPENDIX C TOWARD DEGREE BIAS IN EMBEDDING-BASED KNOWLEDGE GRAPH COMPLETION C.1 Dataset Statistics The statistics for each dataset can be found in Table C.1. Table C.1 Dataset Statistics. Statistic FB15K-237 NELL-995 CoDEx-M #Entities #Relations #Train #Validation #Test 14,541 237 272,115 17,535 20,466 74,536 200 149,678 543 2,818 17,050 51 185,584 10,310 10,311 C.2 Infrastructure All experiments were run on a single 32G Tesla V100 GPU and implemented using PyTorch [86]. C.3 Parameter Settings Each model is trained for 400 epochs on FB15K-237, 250 epochs on CoDEx-M, and 300 epochs on NELL-995. The embedding dimension is set to 200 for both methods except for the dimension of the relation embeddings in TuckER which is is tuned from {50, 100, 200}. The batch size is set to 128 and the number of negative samples per positive sample is 100. The learning rate is tuned from {1𝑒−5, 5𝑒−5, 1𝑒−4, 5𝑒−4, 1𝑒−3}, the decay from {0.99, 0.995, 1}, the label smoothing from {0, 0.1}, and the dropout from {0, 0.1, 0.2, 0.3, 0.4, 0.5 }. For KG-Mixup we further tune the degree threshold from {3, 5, 10}, the number of samples generated from {5, 10}, and the loss weight for the synthetic samples from {1e-2, 1e-1, 1 }. Lastly, we tune the stochastic weight averaging (SWA) initial learning rate from { 1e-5, 5e-4 }. The best hyperparameter values for ConvE and TuckER using KG-Mixup are shown in Figures C.2 and C.3, respectively. 122 C.4 Preliminary Study Results using ConvE In Section 4.3 we conduct a preliminary study using TuckER [7] on the FB15K-237 dataset [111]. In this section we further include the corresponding results when using the ConvE [29] embedding models. The plots can be found in Figure C.1. We note that they show a similar pattern to those displayed by TuckER in Figure 4.2. Table C.2 Hyperparameter values for ConvE on each dataset. Hyperparameter Learning Rate LR Decay Label Smoothing Dropout #1 Dropout #2 Dropout #3 Degree Threshold # Generated Synth Loss Weight SWA LR FB15K-237 NELL-995 CoDEx-M 5𝑒−5 None 0.1 0.4 0.3 0.1 5 5 1 1𝑒−5 1𝑒−5 None 0.1 0.1 0.2 0.3 5 5 1 1𝑒−5 1𝑒−4 None 0 0.2 0.5 0.2 5 5 1 5𝑒−4 Table C.3 Hyperparameter values for TuckER on each dataset. Hyperparameter Learning Rate LR Decay Label Smoothing Dropout #1 Dropout #2 Dropout #3 Rel Dim Degree Threshold # Generated Synth Loss Weight SWA LR FB15K-237 NELL-995 CoDEx-M 5𝑒−5 None 0 0.3 0.3 0.2 100 25 5 1𝑒−2 1𝑒−5 1𝑒−5 0.995 0 0.3 0.5 0.5 100 5 5 1 5𝑒−4 5𝑒−5 0.99 0 0.3 0.4 0.5 200 5 5 1 5𝑒−4 C.5 Expected Calibration Error Expected calibration error [40] is a measure of model calibration that utilizes the model accuracy and prediction confidence. Following [40], we first split our data into 𝑀 bins and define the accuracy 123 (a) In and Out Degree Analysis (b) Tail In-Degree Analysis (c) Controlling for Degrees Figure C.1 MRR when predicting the tail for ConvE on FB15K-237 when varying the (a) in-degree and out-degree of the head and tail entity, (b) tail-relation and other-relation in-degree, and (c) other-relation in-degree for smaller sub-ranges of the tail-relation degree. (acc) and confidence (conf) on one bin 𝐵𝑚 as: acc(𝐵𝑚) = 1 |𝐵𝑚 | |𝐵𝑚| ∑︁ 1( ˆ𝑦𝑖 = 𝑦𝑖), 𝑖=1 |𝐵𝑚| ∑︁ 𝑖=1 ˆ𝑝𝑖, conf(𝐵𝑚) = 1 |𝐵𝑚 | (C.1) (C.2) where 𝑦𝑖 is the true label for sample 𝑖, ˆ𝑦𝑖 is the predicted label, and ˆ𝑝𝑖 the prediction probability for sample 𝑖. A well-calibrated model should have identical accuracy and confidence for each bin. ECE is thus defined by [40] as: ECE = 𝑀 ∑︁ 𝑚=1 |𝐵𝑚 | 𝑛 |acc(𝐵𝑚) − conf(𝐵𝑚)|, (C.3) where 𝑛 is the total number of samples across all bins. As such, a lower ECE indicates a better calibrated model. For KGC we split the samples into bin by the tail-relation degree. Furthermore to calculate the accuracy score (acc) we utilize His@10 to denote a correct prediction. For the confidence score (conf) we denote 𝜎( 𝑓 (ℎ, 𝑟, 𝑡)) as the prediction probability where 𝑓 (ℎ, 𝑟, 𝑡) is a KG embedding score function and 𝜎 is the sigmoid function. When calculating the ECE over all samples Eq. (C.3) is unchanged. When calculating ECE for just one bin of samples (e.g. low degree triples) it is defined as: ECE𝑚 = |acc(𝐵𝑚) − conf(𝐵𝑚)|, (C.4) where ECE𝑚 is the expected calibration error for just bin 𝑚. 124 C.6 Proof of Theorem 1 In this section we prove Theorem 1. Following recent work [19, 138] we examine the regular- izing effect of the mixup parameter 𝜆. This is achieved by approximating the loss 𝑙𝜃 on a single mixed sample ˜𝑒 using the first-order quadratic Taylor approximation at the point 𝜏 = 1 − 𝜆 near 0. Assuming 𝑙𝜃 is differentiable, we can approximate 𝑙𝜃, with some abuse of notation as: 𝑙𝜃 (𝜏) = 𝑙𝜃 (0) + 𝑙′ 𝜃 (0)𝜏. (C.5) We consider the case where 𝑙𝜃 is the binary cross-entropy loss. Since the label 𝑦𝑖 𝑗 = 1 is always true, it can be written as, where 𝜎 is the sigmoid function and ˜𝑒 = (cid:0) ˜ℎ, ˜𝑟, 𝑡(cid:1) is the mixed sample. Since 𝜏 = 1 − 𝜆, we can 𝑙𝜃 ( ˜𝑒) = 𝑙𝑜𝑔 𝜎 ( 𝑓 ( ˜𝑒)) , (C.6) rewrite the mixed sample as: ˜𝑒 = ((1 − 𝜏)ℎ𝑖 + 𝜏ℎ 𝑗 , (1 − 𝜏)𝑟𝑖 + 𝜏𝑟 𝑗 , 𝑡). (C.7) As such, setting 𝜏 = 0 doesn’t mix the two samples resulting in ˜𝑒 = (ℎ𝑖, 𝑟𝑖, 𝑡). The term 𝑙𝜃 (0) is therefore equivalent to the standard loss L (𝜃) over the samples 𝑆. We can now compute the first derivative in Eq. (C.5). We evaluate 𝑙′ 𝜃 via the chain rule, 𝑙′ 𝜃 = 𝜕 𝑙𝑜𝑔 𝜎 ( 𝑓 ( ˜𝑒)) 𝜕𝜎 ( 𝑓 ( ˜𝑒)) · 𝜕𝜎 ( 𝑓 ( ˜𝑒)) 𝜕 𝑓 ( ˜𝑒) 𝜕 𝑓 ( ˜𝑒) 𝜕𝜏 , · where 𝜕 𝑓 ( ˜𝑒) 𝜕𝜏 is evaluated via the multivariable chain rule to: 𝜕 𝑓 ( ˜𝑒) 𝜕𝜏 = (cid:18) 𝜕 𝑓 ( ˜𝑒) 𝜕𝑥 ˜ℎ 𝜕𝑥 ˜ℎ 𝜕𝜏 + 𝜕 𝑓 ( ˜𝑒) 𝜕𝑥 ˜𝑟 𝜕𝑥 ˜𝑟 𝜕𝜏 + 𝜕 𝑓 ( ˜𝑒) 𝜕𝑥𝑡 𝜕𝑥𝑡 𝜕𝜏 (cid:19) . Evaluating 𝑙′ 𝜃: 𝑙′ 𝜃 = 1 𝜎 ( 𝑓 ( ˜𝑒)) · 𝜎 ( 𝑓 ( ˜𝑒)) (1 − 𝜎 ( 𝑓 ( ˜𝑒))) · 𝜕 𝑓 ( ˜𝑒) 𝜕𝜏 , = (1 − 𝜎 ( 𝑓 ( ˜𝑒))) · = (1 − 𝜎 ( 𝑓 ( ˜𝑒))) 𝜕 𝑓 ( ˜𝑒) 𝜕𝜏 (cid:20) 𝜕 𝑓 ( ˜𝑒) 𝜕𝑥 ˜ℎ , (𝑥ℎ 𝑗 − 𝑥ℎ𝑖 ) + 𝜕 𝑓 ( ˜𝑒) 𝜕𝑥 ˜𝑟 (𝑥𝑟 𝑗 − 𝑥𝑟𝑖 ) (cid:21) , 125 (C.8) (C.9) (C.10) (C.11) (C.12) where the term related to 𝑥𝑡 in Eq. (C.9) cancels out since 𝜕𝑥𝑡/𝜕𝜏 = 0. Since we are evaluating the expression near 𝜏 = 0, we only consider the original sample in the score function, reducing the above to the following where Δℎ = (𝑥ℎ 𝑗 − 𝑥ℎ𝑖 ), and Δ𝑟 = (𝑥𝑟 𝑗 − 𝑥𝑟𝑖 ), 𝑙′ 𝜃 (0) = (1 − 𝜎 ( 𝑓 (𝑒𝑖))) (cid:20) 𝜕 𝑓 (𝑒𝑖) 𝜕𝑥 ˜ℎ Δℎ + 𝜕 𝑓 (𝑒𝑖) 𝜕𝑥 ˜𝑟 (cid:21) . Δ𝑟 Plugging in 𝑙𝜃 (0) and 𝑙′ 𝜃 (0) into Eq. (C.5) and rearranging the terms we arrive at: LMix(𝜃) = L (𝜃) + R1(𝜃) + R2(𝜃), where R1(𝜃) and R2(𝜃) are defined over all samples 𝑆 as: R1(𝜃) = R2(𝜃) = 𝜏 |𝑆| 𝜏 |𝑆| |𝑆| ∑︁ 𝑘 ∑︁ 𝑖=1 |𝑆| ∑︁ 𝑗=1 𝑘 ∑︁ 𝑖=1 𝑗=1 (1 − 𝜎 ( 𝑓 (𝑒𝑖))) (1 − 𝜎 ( 𝑓 (𝑒𝑖))) 𝜕 𝑓 (𝑒𝑖)𝑇 𝜕𝑥 ˜ℎ Δℎ, 𝜕 𝑓 (𝑒𝑖)𝑇 𝜕𝑥 ˜𝑟 Δ𝑟, with 𝜏 = E𝜆∼D𝜆 (1 − 𝜆) where D𝜆 = Beta(𝛼, 𝛼). (C.13) (C.14) (C.15) (C.16) 126 APPENDIX D DISTANCE-BASED PROPAGATION FOR EFFICIENT KNOWLEDGE GRAPH REASONING D.1 Proof Details of Theorem 2 We prove Theorem 2 via induction on the path length 𝑙. We denote all nodes a distance 𝑙 from the source node 𝑠 as 𝑉 𝑙 𝑠 . The path length offset is represented by 𝛿. Lastly, for convenience we split the constraints in Eq. (5.6) into two: a node constraint and an edge constraint. We formulate it as the following where Node𝛿 (𝑠, 𝑜, 𝑡) represents the node constraint and EdgeC𝛿 (𝑠, 𝑜, 𝑢) the edge constraint: Node𝛿 (𝑠, 𝑜, 𝑡) = 𝑡 − 𝛿 ≤ dist(𝑠, 𝑜) ≤ 𝑡, EdgeC𝛿 (𝑠, 𝑜, 𝑢) = dist(𝑠, 𝑢) < dist(𝑠, 𝑜) + 𝛿 (D.1) (D.2) Base Case (𝑙=1): We want to show for all 𝑙 = 1 hop neighbors of 𝑠, 𝑜 ∈ 𝑉 1 𝑠 , their final representation 𝑥𝐹 𝑞 (𝑠, 𝑜) aggregates all path representations in the range [0, 1+𝛿]. To be true, the embedding 𝑥𝐹 𝑞 (𝑠, 𝑜) must satisfy two conditions: • Condition 1: The final embedding 𝑥𝐹 𝑞 (𝑠, 𝑜), contains all paths representations of length less than or equal to 1 + 𝛿 between 𝑠 and 𝑜. • Condition 2: The final embedding 𝑥𝐹 𝑞 (𝑠, 𝑜) contains no other path information. Condition 1: For it to be true, a node 𝑜 ∈ 𝑉 1 𝑠 must aggregate all edges of the form (𝑢, 𝑟, 𝑜) where 𝑢 belongs to the set: 𝑈 (0,𝛿) 𝑠,𝑜 = {𝑢 | (𝑢, 𝑟, 𝑜) ∈ E𝑜, 𝑢 ∈ {𝑉 0 𝑠 , 𝑉 1 𝑠 , · · · , 𝑉 𝛿 𝑠 }}, (D.3) where E𝑜 represents all edges where 𝑜 is the target node. It’s intuitive that all paths starting at 𝑠 of length ∈ [0, 𝛿 + 1] must pass through the nodes in the set 𝑈 (0,𝛿) 𝑠,𝑜 Theorem 3 that 𝑜 will aggregate all nodes in the set 𝑈 (0,𝛿) 𝑠,𝑜 . in order to reach 𝑜. We prove in Condition 2: We want to demonstrate that the representation of node 𝑜 aggregates no other path information such that 𝑥 (𝛿+1) 𝑞 (𝑠, 𝑜). This is true as per the node constraint (Eq. (D.1)) the (𝑠, 𝑜) = 𝑥𝐹 𝑞 representation of a node 𝑜 stops updating after iteration 𝑘 = 1 + 𝛿. 127 Inductive Step: We assume that for all m-hop neighbors of 𝑠, 𝑜 ∈ 𝑉 𝑚 𝑠 , their final representation 𝑥𝐹 𝑞 (𝑠, 𝑜) aggregates all path representations of length between [𝑚, 𝑚 + 𝛿]. This is achieved by a node 𝑜 aggregating all edges (𝑢, 𝑟, 𝑜) where 𝑢 belongs to the set: 𝑈 (𝑚−1,𝑚−1+𝛿) 𝑠,𝑜 = {𝑢 | (𝑢, 𝑟, 𝑜) ∈ E𝑜, 𝑢 ∈ {𝑉 𝑚−1 𝑠 , · · · , 𝑉 𝑚−1+𝛿 𝑠 }}, (D.4) as all such paths must pass through these nodes. We note that this implies that: • The set of nodes 𝑈 (𝑚−1,𝑚−1+𝛿) must themselves only contain all path representations of lengths 𝑠,𝑜 [𝑚 − 1, 𝑚 − 1 + 𝛿] when aggregated by 𝑜 ∈ 𝑉 𝑚 𝑠 . • The set of nodes 𝑈 (𝑚−1,𝑚−1+𝛿) 𝑠,𝑜 must obtain such path information by iteration 𝑘 = 𝑚 − 1 + 𝛿. This must be true as per the node constraint 𝑜 will last update at iteration 𝑘 = 𝑚 + 𝛿. We now want to show for all (𝑚 + 1) hop neighbors of 𝑠, 𝑜 ∈ 𝑉 𝑚+1 𝑠 , their final representation 𝑥𝐹 𝑞 (𝑠, 𝑜) aggregates all path representations of of length between [𝑚 + 1, 𝑚 + 1 + 𝛿]. This requires showing that 𝑥𝐹 𝑞 (𝑠, 𝑜) (1) contains all paths representations between [𝑚 + 1, 𝑚 + 1 + 𝛿] between 𝑠 and 𝑜 and (2) it contains no other path information. Condition 1: For 𝑜 ∈ 𝑉 𝑚+1 𝑠 to aggregate all paths of length between 𝑚 + 1 and 𝑚 + 1 + 𝛿, their representation must aggregate all edges (𝑢, 𝑟, 𝑜) where 𝑢 belongs to the set: 𝑈 (𝑚,𝑚+𝛿) 𝑠,𝑜 = {𝑢 | (𝑢, 𝑟, 𝑜) ∈ E𝑜, 𝑢 ∈ {𝑉 𝑚 𝑠 , · · · , 𝑉 𝑚+𝛿 𝑠 }}. (D.5) Such edges are aggregated by 𝑜 ∈ 𝑉 𝑚+1 • From the inductive step we know that nodes 𝑈 (𝑚−1,𝑚−1+𝛿) via the edge constraint. Furthermore, = 𝑈 (𝑚,𝑚+𝛿) 𝑠,𝑣 𝑠,𝑜 𝑠 \ 𝑉 𝑚+𝛿 𝑠 have already aggregated all path representations of lengths [𝑚 − 1, 𝑚 − 1 + 𝛿] by iteration 𝑘 = 𝑚 + 𝛿. • From both constraints we know that ∀𝑢 ∈ 𝑉 𝑚+𝛿 𝑠 will only contain all path representations of length 𝑚 + 𝛿 (i.e. shortest path) by iteration 𝑘 = 𝑚 + 𝛿. As such, after aggregating the nodes in the set 𝑈 (𝑚,𝑚+𝛿) all paths representations between 𝑚 and 𝑚 + 𝛿. Per the node constraint, ∀𝑜 ∈ 𝑉 𝑚+1 iteration 𝑘 = 𝑚+1+𝛿. Therefore by aggregating 𝑈 (𝑚,𝑚+𝛿) 𝑥 (𝑚+1+𝛿) 𝑞 (𝑠, 𝑜) will contain all path representations between length 𝑚 + 1 and 𝑚 + 1 + 𝛿. the representation 𝑥 (𝑚+𝛿) 𝑠,𝑜 𝑠,𝑜 𝑞 𝑠 (𝑠, 𝑢) will contain last update at at iteration 𝑘 = 𝑚+1+𝛿, the representation 128 Condition 2: Lastly, we want to show that ∀𝑜 ∈ 𝑉 𝑚+1 𝑠 the final representation 𝑥𝐹 𝑞 (𝑠, 𝑜) will only contain path representations of length 𝑚 + 1 to 𝑚 + 1 + 𝛿. This is true as per the node constraint the representation of a node 𝑜 ∈ 𝑉 𝑚+1 𝑥 (𝑚+1+𝛿) 𝑞 (𝑠, 𝑜) = 𝑥𝐹 𝑠 𝑞 (𝑠, 𝑜). As such, the final representation only aggregates paths of length between last updates at iteration 𝑘 = 𝑚 + 1 + 𝛿. Therefore 𝑚 + 1 and 𝑚 + 1 + 𝛿. Theorem 3. We are given a source node 𝑠, query 𝑞, and target node 𝑜 which is a 1-hop neighbor of 𝑠. The final representation of a 1-hop neighbor 𝑜, x𝐹 𝑞 (𝑠, 𝑜), will at minimum aggregate all path representations whose path length is between 1 and 1 + 𝛿. It therefore at least contains the path information, 1+𝛿 (cid:202) (cid:202) | 𝑝| (cid:204) 𝜂 = 𝑤(𝑒𝑖). 𝑙=1 𝑝∈𝑃𝑙 𝑠,𝑜 𝑖=1 (D.6) This is equivalent to stating that 𝑜 will aggregate all nodes in the following set by iteration 𝑘 = 1+ 𝛿, 𝑈 (0,𝛿) 𝑠,𝑜 = {𝑢 | (𝑢, 𝑟, 𝑜) ∈ E𝑜, 𝑢 ∈ {𝑉 0 𝑠 , 𝑉 1 𝑠 , · · · , 𝑉 𝛿 𝑠 }}. (D.7) We prove this Theorem via induction on the layer iteration 𝑘 in our algorithm D.1 (denoted their as 𝑙). Base Case (𝑘=1): We want to first show that after one iteration, the representation of a 1-hop neighbor 𝑥1 𝑞 (𝑠, 𝑜) aggregates all paths of length 1 from the source. This is achieved by 𝑥1 𝑞 (𝑠, 𝑜) aggregating all edges connecting 𝑜 to 𝑠, i.e. (𝑠, 𝑟, 𝑜). Such edges are aggregated by 𝑜 as both the edge and node constraints are satisfied: EdgeC𝛿 (𝑠, 𝑜, 𝑠) = 0 < 1 + 𝛿, NodeC𝛿 (𝑠, 𝑜, 1) = 1 − 𝛿 ≤ 1 ≤ 1. (D.8) (D.9) Inductive Step: We assume that at some iteration 𝑘 = 𝑛, s.t. 𝑛 < 1 + 𝛿, the representation 𝑥𝑛 𝑞 (𝑠, 𝑜) for 𝑜 ∈ 𝑉 1 𝑠 aggregates all path representations up to a length 𝑛 from the source. This is achieved by aggregating all edges that contain nodes in the set: 𝑈 (0,𝑛−1) 𝑠,𝑜 = {𝑢 | (𝑢, 𝑟, 𝑜) ∈ E𝑜, 𝑢 ∈ {𝑉 0 𝑠 , 𝑉 1 𝑠 , · · · , 𝑉 𝑛−1 𝑠 }}. (D.10) 129 𝑞 (𝑠, 𝑜) contains all path representations up to length 𝑛, then it follows that Since we assume that 𝑥𝑛 ∀𝑢 ∈ 𝑈 (0,𝑛−1) 𝑛 − 1. As such, by node 𝑜 aggregating 𝑈 (0,𝑛−1) their corresponding representation 𝑥𝑛 𝑠,𝑜 𝑠,𝑜 𝑞 (𝑠, 𝑜) must also contain all paths up to length it extend the length of each path by 1. We want to prove that at iteration 𝑘 = 𝑛 + 1, the representation 𝑥 (𝑛+1) 𝑞 (𝑠, 𝑜) aggregates all path representations up to a length 𝑛 + 1 from the source. This is achieved by aggregating all edges that contain the nodes in the set: 𝑈 (0,𝑛) 𝑠,𝑜 = {𝑢 | (𝑢, 𝑟, 𝑜) ∈ E𝑜, 𝑢 ∈ {𝑉 0 𝑠 , 𝑉 1 𝑠 , · · · , 𝑉 𝑛 𝑠 }}. (D.11) Per the previous inductive step, we assumed that the representations 𝑥𝑛 𝑞 (𝑠, 𝑜) ∀𝑜 ∈ 𝑉 𝑛 𝑠 contain all path representations up to length 𝑛. Furthermore we noted that at iteration 𝑘 = 𝑛, the representations for each node in the set 𝑈 (0,𝑛−1) Since 𝑈 (0,𝑛) 𝑠,𝑜 = 𝑈 (0,𝑛−1) 𝑠,𝑜 Thereby when 𝑥 (𝑛+1) 𝑠 , this implies that 𝑈 (0,𝑛) (𝑠, 𝑜) aggregates the nodes in 𝑈 (0,𝑛) 𝑠,𝑜 must also contain all path representations up to a length 𝑛 − 1. contain all path representations up to length 𝑛. it aggregates all path representations up to ∪ 𝑉 𝑛 𝑠,𝑜 𝑠,𝑜 𝑞 a length 𝑛 + 1. A node 𝑜 ∈ 𝑉 1 𝑠 will aggregate such nodes at iteration 𝑘 = 𝑛 + 1 per both constraints. This proves by induction that for 𝑜 ∈ 𝑉 1 𝑠 , their representation 𝑥 (1+𝛿) 𝑞 (𝑠, 𝑜) aggregates all path representations of length less than or equal to 1 + 𝛿. D.2 Further Details on TAGNet D.2.1 TAGNet Algorithm The algorithm for TAGNet, with a fixed 𝛿, is presented in Algorithm D.1. D.2.2 Time Complexity Analysis Per the constraints in Eq. (5.6), each node can be updated at most 𝛿 + 1 times and each edge can be aggregated at most 𝛿 + 1 times. The shortest path distance from a source node 𝑠 to all other nodes can be calculated in linear time via a breadth-first search. The worst-case complexity for the standard version of TAGNet is therefore: (cid:16) 𝑂 (𝛿 + 1) · (cid:16) |𝑉 |𝑑2 + |𝐸 |𝑑 (cid:17)(cid:17) . (D.12) Of note is that the worst case-complexity is independent of the number of layers. This allows for much deeper propagation. 130 Algorithm D.1 TAGNet Algorithm (fixed 𝛿) Require: 𝑠 = Source node 𝑞 = Query relation 𝑇 = Max Number of Layers x = Embeddings 𝛿 = Offset Agg-Degree = Whether to include degree msgs 1: Initialize: 𝑥 (0) (𝑠,𝑜) = 0, ∀𝑜 ∈ V 𝑥 (0) (𝑠,𝑜) = x𝑞 2: for 𝑡 = 1...𝑇 do 3: 4: 5: 6: for 𝑜 ∈ V do if 𝑡 − 𝛿 ≤ dist(𝑠, 𝑜) ≤ 𝑡 then C(𝑠, 𝑜, 𝑡) = {(𝑢, 𝑟, 𝑜) ∈ E (𝑜) | dist(𝑠, 𝑢) < dist(𝑠, 𝑜) + 𝛿} Msgs = {x(𝑡 −1) | (𝑢, 𝑟, 𝑜) ∈ C(𝑠, 𝑜, 𝑡)} (𝑠,𝑢) ⊙ x(𝑡 ) 𝑟 7: 8: 9: if Agg-Degree then 𝜌𝑜 = 𝑏𝑜 − |Msgs| (cid:110) Msgs = Msgs ∪ (cid:111) 𝜌𝑣 · x(𝑡 ) deg end if x(𝑡 ) (𝑠,𝑜) = Aggregate{ Msgs } end if 10: 11: 12: 13: 14: end for 15: return x(dist(𝑠,𝑜)+ 𝛿 ) end for (𝑠,𝑜) for all 𝑜 ∈ V We further discuss the complexity when utilizing degree messages and a target-specific 𝛿. As noted in Section 5.3.3, the inclusion of degree messages is equivalent to aggregating an additional edge each iteration. As such, it doesn’t effect the model complexity. Furthermore, when utilizing a target-specific 𝛿, an additional (𝛿 + 1) · 𝑑2 operations are added to calculate the attention scores. This is equivalent to updating each one node one additional time and therefore also has no effect on the model complexity. D.2.3 TAGNet + A∗Net We further experiment with combining the pruning strategy of both A∗Net and TAGNet. This is achieved by taking the intersection of the edge sets produced by both methods for a node pair (𝑠, 𝑜) at iteration 𝑡. This is because we only want to aggregate an edge if it is not pruned by both 131 methods. For TAGNet, the edge set C(𝑠, 𝑜, 𝑡) is defined as in Eq. (5.6). We further denote the edge set for A∗Net as A (𝑠, 𝑜, 𝑡). Adapting Eq. (5.7) we arrive at: x(𝑡) 𝑞 (𝑠, 𝑜) = (cid:169) (cid:173) (cid:171) (cid:202) (𝑣,𝑟,𝑜)∈C(𝑠,𝑜,𝑡)∩A (𝑠,𝑜,𝑡) x(𝑡−1) 𝑞 (𝑠, 𝑣) ⊗ w𝑞 (𝑣, 𝑟, 𝑜)(cid:170) (cid:174) (cid:172) ⊕ x(0) 𝑞 (𝑠, 𝑜). (D.13) The performance and efficiency when combining both methods is detailed in Section 5.4.1 and 5.4.2, respectively. Lastly, we note that we don’t consider combining with the pruning strategy in AdaProp [144] due to its strong similarity with that of A∗Net. D.3 Experimental Settings D.3.1 Datasets We conduct experiments on both the transductive and inductive settings. For the transductive setting, we consider FB15K-237 [111] and WN18RR [30]. For the inductive setting, where the train and test entities are disjoint, we consider the splits generated by [109] from both FB15K-237 and WN18RR. Of note is that we omit the NELL-995 [126] dataset from both sets of our experiments. This is due to concerns raised by [98], where they argue that most of the triples in NELL-995 are either meaningless or trivial. The statistics for all the transductive and inductive datasets are given in Tables D.1 and D.2, respectively. Table D.1 Statistics for Transductive Datasets. Statistic FB15K-237 WN18RR #Entities #Relations #Train #Validation #Test 14,541 237 272,115 17,535 20,466 40,943 11 86,835 3,034 3,134 D.3.2 Baselines In the transductive setting, following [154], we consider a variety of different models. For embedding-based methods we consider TransE [14] (performance from [82]), DistMult [127], ComlEx [112]. For GNN methods we include R-GCN [100] (performance on WN18RR taken from [154]) and CompGCN [113]. For path-based methods we include DRUM [96], NBFNet [154], 132 Table D.2 Statistics for Inductive Datasets. Dataset #Relations FB15k-237 WN18RR v1 v2 v3 v4 v1 v2 v3 v4 180 200 215 219 9 10 11 9 #Entities 1,594 2,608 3,668 4,707 2,746 6,954 12,078 3,861 Train #Query 4,245 9,739 17,986 27,203 5,410 15,262 25,901 7,940 #Fact #Entities Validation #Query #Fact #Entities 4,245 9,739 17,986 27,203 5,410 15,262 25,901 7,940 1,594 2,608 3,668 4,707 2,746 6,954 12,078 3,861 489 1,166 2,194 3,352 630 1,838 3,097 934 4,245 9,739 17,986 27,203 5,410 15,262 25,901 7,940 1,093 1,660 2,501 3,051 922 2,757 5,084 7,084 Test #Query 205 478 865 1,424 188 441 605 1,429 #Fact 1,993 4,145 7,406 11,714 1,618 4,011 6,327 12,334 RED-GNN [143], A∗Net [152], and AdaProp [144]. We note that for AdaProp the original results from [144] utilize 7 and 8 layers for FB15k237 and WN18RR, respectively (see Table 7 in [144]). For other methods such as TAGNet, NBFNet, and A∗NET, the number of layers is fixed at 6. To facilitate a fair comparison, we run AdaProp on both datasets using 6 layers. We utilize the official source code 1 and the published hyperparameters. For the inductive setting, following [109, 154], we include GraIL [109], CoMPILE [70], and NeuralLP [128] in addition to NBFNet and A∗Net. We note that embedding methods aren’t applicable to the inductive setting as the train and test entities are disjoint. For NBFNet, the results on the inductive FB15k-237 splits are reported by us while the results for the WN18RR splits are from [152]. This is because we observed that we can achieve better performance for NBFNet on the FB15k-237 splits than what was reported in [152]. Lastly, as with the transductive setting, we run AdaProp with 6 layers to facilitate a fair comparison between it and other path-based GNNs. We also set the hidden dimension to 32 as is with all other path-based GNNs. D.3.3 Evaluation Metrics In the transductive setting, we report the mean reciprocal rank (MRR), Hits@1, and Hits@10 following the filtered setting as described in [14]. For the inductive setting, following [144, 152], we only report the Hits@10. D.3.4 Hyperparameter Settings We list the parameters settings for TAGNet. Under the fixed-𝛿 formulation it is trained for 20 and 16 epochs on the transductive and inductive setting, respectively. For the specific-𝛿 formulation, 1https://github.com/LARS-research/AdaProp 133 we train for 25 and 20 epochs on the transductive and inductive setting, respectively, as we’ve found it takes longer to converge. For all transductive and inductive experiments in Table 5.1 and 5.2 we set the maximum number of layers to 6 and the hidden dimension to 32. This is to facilitate a fair comparison with NBFNet and A∗Net. Furthermore the transductive batch size is fixed at 16. The number of negative samples is tuned from {128, 512, 1024, 2048}, the dropout from the range [0, 0.7], the learning rate decay from {0.9, 0.95, 1}, the weight decay from [1e-8, 1e-3], and the adversarial temperature from {0.5, 1}. For the target specific setting we further test on setting 𝑔 as its own function or as equal to the score function, 𝑔 = 𝑓 . We further tune the softmax temperature for attention from {0.5, 1, 5}. For the inductive setting we further tune the batch size from {16, 32, 64, 128} and the learning rate from [1e-4, 1e-2]. Lastly, for all experiments, the offset 𝛿 is tuned from {1, 2, 3}. D.3.5 Implementation Details The framework is implemented with PyTorch [86]. All experiments were run on a single 32G Tesla V100 GPU. We train TAGNet with the binary cross-entropy loss optimized via the Adam optimizer [52]. We follow [128] and augment the graph by including reciprocal edges, such that for an edge (ℎ, 𝑟, 𝑡), its reciprocal edge (𝑡, 𝑟 −1, ℎ) is included. In this scenario 𝑟 −1 is considered a distinct relation from 𝑟. D.4 Additional Analysis on TAGNet In this section we take a closer look as to what kind of messages are pruned by TAGNet. As noted in Section 5.3.1 we strive to limit the number of empty and redundant messages. We first analyze how well TAGNet can prune both of those messages. We then examine the reason why some datasets may prune more empty or redundant messages. We first analyze the number of empty and redundant messages pruned for both transductive datasets. We report the results in Table D.3 as a % of the total number of pruned messages. E.g., For FB15k-237 51% of the total number of pruned messages are empty messages. For simplicity, we limit this study to the the best versions of each model, i.e. 𝛿 = 2 for FB15K-237 and 𝛿 = 3 for WN18RR. We find that on FB15k-237, the messages pruned are split evenly between empty and 134 redundant messages. On the other hand, for WN18RR over 90% of the messages pruned are empty messages. Table D.3 % of Messages Pruned that are either Empty or Redundant. Dataset % Empty % Redundant FB15k-237 WN18RR 51% 91% 49% 9% An obvious question is: Why does the composition of pruned messages differ between datasets? We believe this can be explained via two properties of each datasets, the density and distance distribution. We measure the sparsity via the mean degree, which is shown in Table D.4. We do this as graphs with a low mean degree will contain few connections between nodes, resulting in fewer paths between different nodes and thereby fewer redundant paths. Furthermore, there will be a lower chance of a node visiting another node already on the path, as most nodes are linked to only a handful of nodes. We further show the distance distribution of the test samples, i.e., the % of test samples that are a distance 𝑘 from each other, in Table D.5. This is because when nodes are typically far from each other, the target nodes will aggregate many empty messages. Using Figure 5.1a as an example, the source and node 7 are a distance 3 from each other. Because of this, in the first two iterations NBFNet will propagate node 6 to node 7, even though node 6 contains no information. However, this is less of an issue between nodes of shorter distances as there fewer iterations needed to reach it. From this, we hypothesize that graphs that feature, on average, a larger distance between nodes will propagate more empty messages. Table D.4 Mean Degree of Transductive Datasets. Dataset Mean Degree FB15k-237 WN18RR 18.7 2.1 From the results in Table D.4 and D.5 we make the following observations: (a) WN18RR is much sparser than FB15k-237. The higher density of FB15k-237 leads to many more paths and subsequent opportunities to visit a node already on the path. The opposite is true for WN18RR as 135 Table D.5 Distance Distribution of Test Samples on the Transductive Datasets. Distance FB15k-237 WN18RR 1 2 3 4 5 6+ 0% 73% 26% 0.2% 0.005% 0% 35% 9% 21% 7% 9% 18% since the average degree is low, few paths exist in the graph. This results in many more redundant paths existing in FB15k-237 as compared to WN18RR. (b) For FB15k-237, the vast majority of test samples are close to each other. This leads to less empty messages. However, for WN18RR the distance covers a much wider range. For example, over 33% of test samples have a distance of 4+ between them. This is only true for 0.205% of samples on FB15k-237. This helps explain why TAGNet mostly prunes empty messages on WN18RR, as the larger distance between nodes leads to many messages that contain no information. 136