OUT-OF-DISTRIBUTION GENERALIZATION IN GRAPH NEURAL NETWORKS: PROBLEMS, METHODS AND APPLICATIONS By Yiqi Wang A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy 2022 ABSTRACT Graphs are one of the most natural representations of many real-world data, such as social networks, chemical molecules, and transportation networks. Graph neural networks (GNNs) are deep neural networks (DNNs) that are specially designed for graphs and have aroused great research interest. Recently, GNNs have been theoretically and empirically proven to be effective in learning graph representations and have been widely applied in many scenarios, such as recommendation and drug discovery. Despite its great success in numerous graph-related tasks, GNNs still face a tremendous challenge in terms of out-of-distribution generalization. Specifically, it has been observed that significant performance gaps for GNNs exist between the training graph set and the test graph set in some graph-related tasks. In addition, graph samples can be very diverse, even though coming from the same dataset. They can be different from each other in not only node attributes but graph structures, which makes the out-of-distribution generalization problem in GNNs more challenging and complex than that in traditional deep learning-based methods. Apart from the out-of-distribution generalization problem, GNNs also come across other kinds of challenges when applied in different application scenarios, such as data sparsity and knowledge transfer in the recommendation task. In this dissertation, we aim at alleviating the out-of-distribution generalization problem in GNNs. In particular, two novel frameworks are proposed to improve GNN’s out-of-distribution generalization ability from two perspectives, i.e., a novel training perspective, and an advanced learning perspective. Meanwhile, we design a novel GNN-based method to solve the data sparsity challenge in the recommendation application. In addition, we propose an adaptive pre-training framework based on the new GNN-based recommendation method and thus increase the abilities of GNNs in terms of generalization and knowledge transfer in the real-world application of recommendations. ACKNOWLEDGEMENTS First and foremost, I would like to express my most sincere gratitude and appreciation to my dearest advisor, Dr. Jiliang Tang, for his careful guidance, great patience, constant encouragement, and strong support. During my Ph.D. study, I have come across some unexpected but severe challenges, which made me almost lose faith in the pursuit of the Ph.D. degree. It is Dr.Tang, who lifts me out of the pit of despair, out of the mud and the mire. I have learned so much from him, including the passion, persistence, and skills of research and the faith and courage to make the impossible possible. For me, Dr. Tang is not only an excellent advisor but a dearest friend and a lifelong role model. I would like to express my deepest appreciation to my committee members, Dr. Anil K. Jain, Dr. Pan-Ning Tan, Dr. Kenneth A. Frank, and Dr. Yunhao Liu for their insightful comments and helpful suggestions. I was very fortunate to collaborate with Dr. Anil K. Jain on several research projects, and I am really impressed by his broad knowledge, brilliant insights, and dedication to research. I took the course Computational Foundations in AI and Machine Learning from Dr. 
Pan-Ning Tan in the fall of 2019 and it prepares me with useful mathematical knowledge that benefits my Ph.D. study a lot. Dr. Kenneth A. Frank kindly offers numerous insightful suggestions and comments for my research and helps broaden my research vision from a novel perspective. Dr. Yunhao Liu also provides me with lots of valuable advice in terms of research and career. Thank you all again, my dear committee members. I was fortunate to work with amazing colleagues and mentors: Dr. Chaozhuo Li, Dr. Xing Xie, Dr. Yiding Liu, Dr. Shuaiqiang Wang, and Dr. Dawei Yin. I really enjoyed the wonderful and productive time with you. Thank you all for your careful guidance and valuable suggestions. Also, I am very grateful to have had so many supportive and encouraging friends and colleagues during my Ph.D. life. I would like to express my thanks to all my dear colleagues and friends from the Data Science and Engineering Lab: Dr. Yao Ma, Wei Jin, Dr. Xiaorui Liu, Dr. Tyler Derr, Dr. Wenqi Fan, Dr. Xiangyu Zhao, Dr. Hamid Karimi, Dr. Zhiwei Wang, Dr. Haochen Liu, Juanhui Li, Yaxin Li, Hua Liu, Jamell Dacon, Wentao Wang, Dr. Xiaoyang Wang, Han Xu, Hongzhi Wen, Harry Shomer, Jie Ren, Jiayuan Ding, Haoyu Han, Yuxuan Wan, Pengfei He, Yingqian Cui, Haitao iii Mao, Hang Li, Kaiqi Yang, Dr. Xin Wang, and Dr. Jiangtao Huang. Thank you all for your joyful companionship and support. Finally, I would like to thank my family, for their unconditional love and support. iv TABLE OF CONTENTS CHAPTER 1 INTRODUCTION .................................................................................................. 1 CHAPTER 2 TEST-TIME TRAINING FOR GRAPH NEURAL NETWORKS ....................... 5 CHAPTER 3 CUSTOMIZED GRAPH NEURAL NETWORKS.............................................. 29 CHAPTER 4 LOCALIZED GRAPH COLLABORATIVE FILTERING ................................. 50 CHAPTER 5 AN ADAPTIVE GRAPH PRE-TRAINING FRAMEWORK FOR LOCALIZED COLLABORATIVE FILTERING ................................................ 68 CHAPTER 6 CONCLUSION................................................................................................... 102 BIBLIOGRAPHY ....................................................................................................................... 105 APPENDIX A TEST-TIME TRAINING FOR GRAPH NEURAL NETWORKS ................. 116 APPENDIX B CUSTOMIZED GRAPH NEURAL NETWORKS ......................................... 121 APPENDIX C LOCALIZED GRAPH COLLABORATIVE FILTERING ............................. 124 APPENDIX D AN ADAPTIVE GRAPH PRE-TRAINING FRAMEWORK FOR LOCALIZED COLLABORATIVE FILTERING ........................................... 131 v CHAPTER 1 INTRODUCTION Graphs are a kind of common data format, which can naturally denote many real-world data, including biological molecules [9, 39, 126], social networks [19, 122] and transportation networks [111]. There are numerous real-world applications that can be formalized as tasks over graphs. For example, recommendations [108, 25] can be viewed as the link prediction task over user-item bipartite graphs. The property prediction of proteins [9] can be regarded as graph classification tasks over protein graphs. To address these graph tasks, it is important to learn good representations for nodes and graphs. Deep Neural Networks (DNNs) have shown great capacity in representation learning, and have achieved tremendous success in many applications, such as natural language processing [15] and computer vision [33]. 
Graph Neural Networks (GNNs) are a successful generalization of DNNs over graphs, which have been empirically and theoretically proven to be powerful in graph representation learning. GNNs usually consist of information aggregation among nodes and feature transformation to refine node representations, and graph pooling to summarize a graph representation. In graph tasks, GNNs are typically trained from training samples, and then applied to make predictions among test samples, following the common paradigm adopted in DNNs. One important condition where this paradigm works effectively is that the training samples and the test samples come from the same distribution. However, this condition usually can not be satisfied in real scenarios. Graph data can be especially diverse since samples in graph data can be various in not only the node attributes but also the graph structures. For example, graph samples from one dataset can be very different in terms of graph size, and node samples from one graph can differ from each other greatly in regard to degrees [103]. Such gaps between training samples and test samples can bring significant performance degradation of GNN models. In this dissertation, we mainly focus on solving this challenge in GNNs for graph tasks. Specifically, we tackle this problem from two perspectives, i.e., a training perspective and a learning perspective. In addition, we also focus on addressing the challenges in applying GNNs in real-world recommendation scenarios. In particular, we propose a novel localized GNN-based recommendation method to overcome the data sparsity 1 issue. Furthermore, we propose an adaptive pre-training framework based on the new GNN-based recommendation method and thus increase the abilities of GNNs in terms of generalization and knowledge transfer in the real-world application of recommendation. 1.1 Dissertation Contributions The major contributions of this dissertation are summarized as follows. • Numerous recent works have observed that there exist significant performance gaps of GNNs between the training phase and the test phase. We seek to alleviate this problem by designing a novel training paradigm for GNNs. Specifically, instead of training GNNs from the training graph samples and then utilizing the GNNs to directly make predictions for test samples, we propose to further carefully train the GNNs in the test time. We introduce the first test-time training framework for GNNs (GT3) to enhance the model out-of-distribution generalization capacity for the graph classification task. In particular, we design a novel test-time training strategy with a two-level self-supervised learning task to adjust the GNN model for each test graph sample. Also, an effective adaptation constraint is proposed to prevent undesirable excessive feature distortion. Extensive experiments are conducted to validate the effectiveness of the proposed method, and we also provide a theoretical analysis to demonstrate the benefits of test-time training for GNNs. • The out-of-distribution generalization challenge in GNNs has attracted emerging research interest. We seek to explore and analyze this challenge from the perspective of graph data diversity. We investigate the data diversity in terms of graph structure and their influence on GNNs performance. 
The investigation inspires us to develop a novel customized graph neural network framework, i.e., Customized-GNN, which can generate a sample-specific model for this graph based on its structure, rather than a unified GNN model for graphs with diverse structures. In addition, the proposed Customized-GNN is very general and flexible that can be applied to numerous existing GNN models. Comprehensive experiments on numerous graph datasets from various domains have verified the effectiveness of the proposed framework. 2 • GNNs have been widely applied in recommendation tasks and have achieved tremendous success. However, it has been observed that most GNN-based recommendation methods suffer greatly from the data sparsity issue, and unfortunately, this issue is quite common in real-world recommendation scenarios. To solve this limitation, we study a novel perspective to build a GNN-based recommendation method, Localized Graph Collaborative Filtering (LGCF). Specifically, instead of learning embeddings for each user and item, we aim at learning some recommendation-related patterns in the localized graph induced by the target user and item. Extensive experiments on various datasets validate the effectiveness of LGCF, especially in sparse scenarios. Furthermore, empirical results demonstrate that LGCF provides complementary information to the embedding-based recommendation model. Thus, it can be utilized to further boost recommendation performance in other scenarios. • Data sparsity is one of the most common problems in many real-world applications, such as machine translation, image classification, and recommendation. Pre-training techniques have been proven effective to address data sparsity issues in the domains of natural language processing and computer vision by knowledge transfer. However, it is not trivial to design an effective pre-training framework for the GNN-based method in recommendation tasks. To solve the challenges that different recommendation scenarios not sharing the same users and items and that use-item interaction graphs possessing diverse properties, we design an adaptive GNN pre-training framework (ADAPT) based on LGCF for recommendations, which is capable of capturing the common knowledge among different user-item bipartite graphs and preserving the uniqueness of each graph. The proposed framework does not require transferring user/item embeddings and is able to capture both the common knowledge across different graphs and the uniqueness of each graph simultaneously. Extensive experimental results have demonstrated the effectiveness and superiority of ADAPT. 3 1.2 Dissertation Structure The remainder of this dissertation is organized as follows. In Chapter 2, we aim at enhancing the model out-of-distribution generalization capacity for the graph classification task based on a novel training paradigm by introducing the test-time training framework for GNNs. In Chapter 3, we focus on bridging the graph classification gap between the training graphs and test graphs from a new learning perspective by developing a customized graph neural network framework, which can generate a sample-specific model for a test graph based on its structure. In these two chapters, we aim at alleviating the out-of-distribution generalization problem in GNNs from two different perspectives, i.e., the training perspective and the learning perspective. Though from various perspectives, these proposed frameworks are very flexible and can be applied in many existing GNN methods. 
In Chapter 4, we propose a novel GNN-based recommendation method to solve the data sparsity issue, which is designed to learn useful recommendation information from a localized graph, rather than to learn effective embeddings for all the users and items. In Chapter 5, we further propose an adaptive pre-training framework based on the proposed GNN-based recommendation method, which enables effective knowledge transfer among different recommendation scenarios. In these two chapters, we aim at addressing the practical challenges of applying GNNs in real-world recommendation tasks. In particular, we focus on improving the performance of GNN-based recommendation methods in data-sparse scenarios and on further increasing the abilities of GNNs in terms of generalization and knowledge transfer in the real-world application of recommendation. Finally, we conclude the dissertation and discuss further promising research directions in Chapter 6.

CHAPTER 2
TEST-TIME TRAINING FOR GRAPH NEURAL NETWORKS

Graph Neural Networks (GNNs) have made tremendous progress in the graph classification task. However, a performance gap between the training set and the test set has often been noticed. To bridge the gap, in this chapter we introduce the first test-time training framework for GNNs to enhance the model generalization capacity for the graph classification task. In particular, we design a novel test-time training strategy with self-supervised learning to adjust the GNN model for each test graph sample. Experiments on benchmark datasets have demonstrated the effectiveness of the proposed framework, especially when there are distribution shifts between the training set and the test set. We have also conducted exploratory studies and theoretical analysis to gain a deeper understanding of the rationality of the design of the proposed graph test-time training framework (GT3).

2.1 Introduction

Many real-world data can be naturally represented as graphs, such as social networks [19, 122] and molecules [9, 39, 126]. Graph neural networks (GNNs) [7, 62, 68], a successful generalization of deep neural networks (DNNs) over the graph domain, typically refine node representations via information aggregation among nodes and feature transformation; the node representations can then be summarized into a graph representation via a graph pooling operation. GNNs have been empirically and theoretically proven to be effective in graph representation learning, and have achieved revolutionary progress in various graph-related tasks, including node classification [52, 62, 106], graph classification [120, 69, 126] and link prediction [51].

Graph classification is one of the most important tasks on graphs, which aims to predict properties of entire graphs. There are many real-world scenarios for graph classification, such as predicting molecular properties to help drug discovery [17], forecasting the movie type based on actor/actress connection graphs [35], and predicting the collaboration topic and research interests of an academic collaboration graph [52]. In the graph classification task, we typically train a graph neural network on the training graph samples and then utilize the trained model to make predictions for the test graph samples.

Table 2.1: Graph classification results (ROC-AUC, %) for ogbg-molhiv (results from the paper [39]).

Method   Training      Validation    Test
GCN      88.65±1.01    83.73±0.78    74.18±1.22
GIN      93.89±2.96    84.1±1.05     75.2±1.30
This training strategy works effectively when the training samples and the test samples come from the same distribution. However, this assumption may not hold in many real scenarios. Often, there could exist gaps between the distributions of the training data and the test data, and the model performance will degrade significantly due to such gaps [101, 75]. For example, as reported by the OGB paper [39] (and shown in Table 2.1), there exist significant gaps between the performance of both the GCN [52] model and the GIN [120] model on the training set and the test set when predicting whether HIV virus replication is inhibited by a molecule. Besides, in the graph domain, graph samples can be very diverse in terms of their graph structures. As demonstrated in Figure 2.1, two graph samples from the same dataset can be very different from each other in terms of structural properties. This also brings new challenges for the distribution shift problem in graphs. Although recognized by existing works [39, 18], this problem has rarely been explored for GNNs.

Figure 2.1: Two graph samples from the D&D dataset [74], which present very different structural properties.

Recently, test-time training [63, 101], a newly-proposed paradigm, has been successfully applied to address the distribution shift issue for images [6, 101]. Specifically, it adapts the model trained on the training set by solving a self-supervised learning task, such as image rotation, specifically for every single image at test time. This paradigm has great potential to solve the distribution shift problem for GNNs since it not only aims at improving model generalization but also can be designed to specifically adapt a GNN model for each single graph sample with unique structure information.

However, it is challenging to design a test-time training framework for graph neural networks over graph data. First, what self-supervised learning task should be designed for test-time training on graph data? Graph data usually consists of not only node attribute information but also abundant structure information. Thus, it is important to blend these two kinds of graph information into the design of the self-supervised learning task. Second, how to mitigate the potential severe feature space distortion during test time? The representation space can be severely distorted in directions that are undesirable for the main task during test-time training [63], since the model is adapted solely based on one sample and a self-supervised learning task. This problem can be even more severe for graph data, as graph samples can vary greatly from each other in both node attributes and graph structures. Third, as the first work to design a test-time training framework for GNNs, it is desirable to understand the rationality of test-time training over GNNs.

In order to solve these challenges, we propose Graph Test-Time Training with Constraint (GT3), a test-time training framework with model adaptation constraints and carefully designed self-supervised learning tasks, specifically for graph neural networks. Our key contributions are summarized as follows:

• To the best of our knowledge, this work is the first to design a test-time training framework for graph neural networks for graph classification tasks.
• We propose to use a two-level self-supervised learning task in the test-time training framework for GNNs, consisting of both local (node-node level) contrastive learning and global (node-graph level) contrastive learning, which can fully exploit the attribute and structure information of each graph.

• We have designed an effective adaptation constraint in the test-time training process for GNNs, which can help prevent excessive feature space distortion and increase the framework's robustness and performance.

• We have conducted exploratory studies and theoretical analysis to understand the rationality of the design of the proposed framework.

2.2 Preliminary

In this section, we will introduce the key notations used in this paper by defining the graph classification problem and reviewing the general structure of Graph Neural Networks (GNNs).

2.2.1 Graph Classification

Graph classification is one of the most common graph-level tasks. It aims at predicting properties of graphs. There is a wide range of applications for graph classification, such as predicting molecular properties to assist drug discovery [17] and forecasting whether a protein functions as an enzyme [126]. Suppose that there is a training set $\mathcal{G} = \{G_1, G_2, \ldots, G_n\}$ consisting of $n$ graphs. For each graph $G_i \in \mathcal{G}$, its structure information, node attribute information and label information can be denoted as $G_i = (\mathbf{A}_i, \mathbf{X}_i, y_i)$, where $\mathbf{A}_i$ is the graph adjacency matrix, $\mathbf{X}_i$ represents the node attributes and $y_i$ is the label. The goal of graph classification is to learn a classification model $f(\cdot; \theta)$ based on $\mathcal{G}$ which can be used to predict the label of a new graph $G'$.

2.2.2 Graph Neural Networks

Graph neural networks (GNNs) have successfully extended deep neural networks to graph domains and have been widely applied to numerous graph tasks, including node classification [52, 106], graph classification [126, 120] and link prediction [51]. Generally, graph neural networks consist of several layers, and typically each GNN layer refines node representations by transforming, propagating, and aggregating node representations across the graph. According to [24], the node representation update process of node $v$ in the $l$-th GNN layer of an $L$-layer GNN can be formulated as follows:

$\mathbf{a}_v^{l} = \text{AGGREGATE}^{l}\left(\left\{\mathbf{h}_u^{(l-1)}, u \in \mathcal{N}_v\right\}\right), \quad l \in [L],$  (2.1)

$\mathbf{h}_v^{l} = \text{COMBINE}^{l}\left(\mathbf{h}_v^{(l-1)}, \mathbf{a}_v^{l}\right), \quad l \in [L],$  (2.2)

where $\text{AGGREGATE}$ is the aggregation function aggregating information from the neighborhood nodes $u \in \mathcal{N}_v$, and $\text{COMBINE}$ is the combination function updating the node representation in the $l$-th layer based on the aggregated information $\mathbf{a}_v^{l}$ and the representation of $v$ from the previous layer $(l-1)$. For simplicity, we summarize the node representation update process in the $l$-th GNN layer, consisting of Eqs 2.1 and 2.2, as follows:

$\mathbf{H}^{l} = \text{GNN}\left(\mathbf{A}, \mathbf{H}^{(l-1)}; \theta^{l}\right), \quad l \in [L],$  (2.3)

where $\mathbf{A}$ is the adjacency matrix, $\mathbf{H}^{l}$ denotes the representations of all the nodes in the $l$-th GNN layer, and $\theta^{l}$ indicates the learnable parameters in the $l$-th layer. Note that $\mathbf{H}^{0}$ is the input node attribute matrix $\mathbf{X}$. For graph classification, an overall graph representation is summarized from the node representations of the $L$-th GNN layer by a readout function, which will be utilized by the prediction layer to predict the label of the graph:

$\mathbf{h}_G = \text{READOUT}\left(\mathbf{H}^{L}\right).$  (2.4)
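To make this generic formulation concrete, the following minimal sketch implements Eqs. 2.1–2.4 for a small dense adjacency matrix. The mean aggregation, the additive COMBINE followed by a ReLU, and the sum READOUT are illustrative assumptions of this sketch rather than the specific operators adopted later in this chapter.

```python
import torch

def gnn_layer(A, H, W):
    """One message-passing layer (Eqs. 2.1-2.2): mean-aggregate neighbor
    representations, then combine them with the node's own representation."""
    deg = A.sum(dim=1, keepdim=True).clamp(min=1.0)   # node degrees
    agg = (A @ H) / deg                               # AGGREGATE: mean over neighbors
    return torch.relu((H + agg) @ W)                  # COMBINE: linear transform + nonlinearity

def gnn_forward(A, X, weights):
    """Stack L layers (Eq. 2.3) and summarize a graph representation (Eq. 2.4)."""
    H = X                                             # H^0 is the input attribute matrix X
    for W in weights:
        H = gnn_layer(A, H, W)
    return H.sum(dim=0)                               # READOUT: sum-pooling over nodes

# Toy usage: a 4-node graph with 3-dimensional node attributes.
A = torch.tensor([[0., 1., 1., 0.],
                  [1., 0., 1., 0.],
                  [1., 1., 0., 1.],
                  [0., 0., 1., 0.]])
X = torch.randn(4, 3)
weights = [torch.randn(3, 8), torch.randn(8, 8)]
h_G = gnn_forward(A, X, weights)                      # graph-level representation h_G
```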
2.3 The Proposed Test-time Training Framework for GNNs

In this section, we introduce the proposed framework of Graph Test-Time Training with Constraint (GT3) and detail its key components. We first discuss the overall process, then detail the two-level SSL task, and finally describe the adaptation constraint that mitigates the potential issue of feature distortion in GT3.

2.3.1 An Overview

Suppose that the graph neural network for the graph classification task consists of $L$ GNN layers, and the learnable parameters in the $l$-th layer are denoted as $\theta^{l}$. The overall model parameters can be denoted as $\theta_{main} = (\theta^{1}, \theta^{2}, \ldots, \theta^{L}, P_m)$, where $P_m$ denotes the parameters of the prediction layer. In this work, we mainly focus on the graph classification task, which is named the main task in the test-time training framework. Given the training graph set $\mathcal{G} = \{G_1, G_2, \ldots, G_n\}$ and its corresponding graph label set $\mathcal{Y} = \{y_1, y_2, \ldots, y_n\}$, the main task is to solve the following empirical risk minimization problem:

$\min_{\theta_{main}} \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}_m(\mathbf{A}_i, \mathbf{X}_i, y_i; \theta_{main}), \quad G_i \in \mathcal{G},$  (2.5)

where $\mathcal{L}_m(\cdot)$ denotes the loss function for the main task.

In addition to the main task, self-supervised learning (SSL) tasks also play a vital role in the test-time training framework. As informed by [63], the SSL task should be as informative as possible. Also, the more correlated the SSL task and the main task are, the more likely the test-time training will benefit the main task. Given the great success that contrastive learning methods have achieved as auxiliary tasks in graph-related tasks, in this work we carefully design a two-level graph contrastive learning task as the SSL task (details in Section 2.3.2). For simplicity, the loss for any self-supervised learning task is denoted as $\mathcal{L}_s(\cdot)$.

With the main task and the SSL task, an overview of the proposed framework GT3 is shown in Figure 2.2. It consists of three key processes: (1) the training process (Figure 2.2(a)); (2) the test-time training process (Figure 2.2(b)); and (3) the inference process (Figure 2.2(c)).

Figure 2.2: An Overview of the Proposed Framework GT3: (a) The Training Process; (b) The Test-time Training Process; (c) The Inference Process.

The Training Process. Given the training graphs $\mathcal{G}$, the training process is to train the model parameters with the main task and the SSL task. The key challenge is how to integrate the main task and the SSL task. To tackle this challenge, we conduct an empirical study where we train two GNN models separately by the main task and the SSL task, and we find that these two GNN models extract similar features in their first few layers. More details can be found in Section 2.5.2. This empirical finding paves a way for us to integrate the main task and the SSL task: we allow them to share the parameters in the first few layers, as shown in Figure 2.2(a). We denote the shared model parameters as $\theta_e = (\theta^{1}, \theta^{2}, \ldots, \theta^{K})$, where $K \in \{1, 2, \ldots, L\}$. The shared component is named the shared graph feature extractor in this work. The model parameters that are specific to the main task are denoted as $\theta_m = (\theta^{K+1}, \ldots, \theta^{L}, P_m)$, described as the graph classification head. Correspondingly, the SSL task has its own self-supervised learning head, denoted as $\theta_s = (\theta'^{K+1}, \ldots, \theta'^{L}, P_s)$, where $P_s$ is the prediction layer for the SSL task.
To summarize, the model parameters for the main task can be denoted as $\theta_{main} = (\theta_e, \theta_m)$, the model parameters for the SSL task can be represented as $\theta_{self} = (\theta_e, \theta_s)$, and those for the whole framework can be denoted as $\theta_{overall} = (\theta_e, \theta_m, \theta_s)$. In the training process, we train the whole framework by minimizing a weighted combination of $\mathcal{L}_m(\cdot)$ and $\mathcal{L}_s(\cdot)$ as follows:

$\min_{\theta_e, \theta_m, \theta_s} \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}_m(\mathbf{A}_i, \mathbf{X}_i, y_i; \theta_e, \theta_m) + \gamma\, \mathcal{L}_s(\mathbf{A}_i, \mathbf{X}_i; \theta_e, \theta_s), \quad G_i \in \mathcal{G},$  (2.6)

where $\gamma$ is a hyper-parameter balancing the main task and the self-supervised task in the training process.

The Test-time Training Process. Given a test graph $G_t$, the test-time training process fine-tunes the model via the SSL task, as illustrated in Figure 2.2(b). Specifically, given a test graph sample $G_t = (\mathbf{A}_t, \mathbf{X}_t)$, the shared graph feature extractor and the self-supervised learning head are fine-tuned by minimizing the SSL task loss as follows:

$\min_{\theta_e, \theta_s} \mathcal{L}_s(\mathbf{A}_t, \mathbf{X}_t; \theta_e, \theta_s).$  (2.7)

Suppose that $\theta_{overall}^{*} = (\theta_e^{*}, \theta_m^{*}, \theta_s^{*})$ are the optimal parameters obtained by minimizing the overall loss of Eq 2.6 in the training process. In the test process, given a specific test graph sample $G_t$, the proposed framework further updates $\theta_e^{*}$ and $\theta_s^{*}$ to $\theta_e^{*\prime}$ and $\theta_s^{*\prime}$ by minimizing the SSL loss of Eq 2.7 over the test graph sample $G_t$.

The Inference Process. The inference process predicts the label of the test graph $G_t$ based on the model fine-tuned by test-time training, as demonstrated in Figure 2.2(c). In particular, the label of $G_t$ is predicted via the GNN model $f(\cdot; \theta_e^{*\prime}, \theta_m^{*})$ as:

$\hat{y}_t = f(\mathbf{X}_t, \mathbf{A}_t; \theta_e^{*\prime}, \theta_m^{*}).$  (2.8)
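Putting the three processes together, the following PyTorch-style sketch outlines the procedure. The module and loss callables (`shared`, `cls_head`, `ssl_head`, `main_loss`, `ssl_loss`) are hypothetical placeholders standing in for the shared graph feature extractor, the two heads, and the losses $\mathcal{L}_m$ and $\mathcal{L}_s$; `graphs` is assumed to be an iterable of $(\mathbf{A}_i, \mathbf{X}_i, y_i)$ triples. The adaptation constraint introduced in Section 2.3.3 and the criterion for stopping the adaptation are omitted here.

```python
import copy
import torch

def joint_train(shared, cls_head, ssl_head, graphs, main_loss, ssl_loss,
                gamma=0.5, epochs=100, lr=1e-3):
    """The training process (Eq. 2.6): minimize L_m + gamma * L_s over the training graphs."""
    opt = torch.optim.Adam([*shared.parameters(), *cls_head.parameters(),
                            *ssl_head.parameters()], lr=lr)
    for _ in range(epochs):
        for A, X, y in graphs:                       # each training graph G_i = (A_i, X_i, y_i)
            H = shared(A, X)                         # shared graph feature extractor
            loss = main_loss(cls_head(H), y) + gamma * ssl_loss(ssl_head(H), A, X)
            opt.zero_grad()
            loss.backward()
            opt.step()

def test_time_train_and_predict(shared, cls_head, ssl_head, A_t, X_t, ssl_loss,
                                steps=10, lr=1e-4):
    """Test-time training (Eq. 2.7) on a single test graph, followed by inference (Eq. 2.8)."""
    shared_t = copy.deepcopy(shared)                 # adapt copies so the trained parameters
    ssl_t = copy.deepcopy(ssl_head)                  # remain intact for the next test graph
    opt = torch.optim.Adam([*shared_t.parameters(), *ssl_t.parameters()], lr=lr)
    for _ in range(steps):                           # fine-tune extractor + SSL head only
        loss = ssl_loss(ssl_t(shared_t(A_t, X_t)), A_t, X_t)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                            # the classification head stays fixed
        return cls_head(shared_t(A_t, X_t)).argmax(dim=-1)
```

Adapting deep copies of the shared extractor and the SSL head reflects that each test graph is adapted independently, always starting from the jointly trained parameters $\theta_{overall}^{*}$.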
2.3.2 Self-supervised Learning Task for GT3

An appropriate and informative SSL task is the key to the success of test-time training [63]. However, it is challenging to design an appropriate SSL task for test-time training for GNNs. First, graph data is intrinsically different from images; some common properties such as rotation invariance do not exist in graph data, thus most SSL tasks commonly adopted by existing test-time training methods are invalid for graph data. Second, most existing SSL tasks for graph classification are inter-graph contrastive learning tasks based on the differences among diverse graph samples [99, 31]. These are not applicable for test-time training, where we aim at specifically adapting the model for every single graph during the test time. Third, graph data usually consists of not only node attribute information but also abundant and unique structure information. It is important to fully leverage these two kinds of graph information in the design of the SSL task, since an informative SSL task is one of the keys to the success of test-time training.

Figure 2.3: The Self-supervised Learning Task for GT3 (global contrastive learning over View 0 and View 1, and local contrastive learning over View 2 and View 3).

Motivated by the success of contrastive learning in graph domains, we propose to use a two-level SSL task from both local and global perspectives for GT3, which fully exploits the graph information from both the node-node level and the node-graph level. The proposed self-supervised learning task is based on graph characteristics, instead of the properties of Euclidean data. In addition, the proposed SSL task is not built upon the differences among distinct graphs, thus it can be naturally applied to a single graph. We empirically validate that the global (node-graph level) contrastive learning task and the local (node-node level) contrastive learning task in the proposed two-level SSL task are both necessary for GT3, since they are complementary to each other across datasets. More details about this empirical study can be found in Section 2.5.2. Next, we detail the global contrastive learning task and the local contrastive learning task.

2.3.2.1 Global Contrastive Learning

Global contrastive learning aims at helping node representations capture global information of the whole graph. Overall, global contrastive learning is based on maximizing the mutual information between the local node representations and the global graph representation. Specifically, as shown in Figure 2.3, given an input graph sample $G$, different views can be generated via various types of data augmentations. For global contrastive learning, two views are adopted: one is the raw view $View_0$, where no changes are made to the original graph; the other is the augmented view $View_1$, where node attributes are randomly shuffled among all the nodes in the graph. With these two graph views, we can get two corresponding node representation matrices $\mathbf{Z}_0$ and $\mathbf{Z}_1$ via a shared GNN model. After that, a global graph representation $\mathbf{g}_0$ is summarized from the node representation matrix $\mathbf{Z}_0$ of the raw view $View_0$ via a multilayer perceptron (MLP). In our work, a single linear layer is adopted as this MLP. Following [107], the positive samples in the global contrastive learning consist of node-graph representation pairs where both the node representation and the graph representation come from the raw view $View_0$. The negative samples consist of node-graph representation pairs where the node representations come from $View_1$ and the graph representation comes from $View_0$. A discriminator $\mathcal{D}$ is employed to compute the probability score for each pair, and the score should be higher for the positive pairs and lower for the negative pairs. In our work, we set $\mathcal{D}(\mathbf{Z}_{si}, \mathbf{g}_0) = \mathbf{Z}_{si} \cdot \mathbf{g}_0$, where $\mathbf{Z}_{si}$ denotes the representation of node $i$ from $View_s$. The objective function for global contrastive learning is summarized as follows:

$\mathcal{L}_g = -\frac{1}{2N}\sum_{i=1}^{N}\left(\log \mathcal{D}(\mathbf{Z}_{0i}, \mathbf{g}_0) + \log\left(1 - \mathcal{D}(\mathbf{Z}_{1i}, \mathbf{g}_0)\right)\right),$  (2.9)

where $N$ denotes the number of nodes in the input graph.

2.3.2.2 Local Contrastive Learning

In global contrastive learning, the model aims at capturing the global information of the whole graph in the node representations. In other words, the model attempts to learn an invariant graph representation at a global level and blend it into the node representations. However, graph data consists of nodes with various attributes and distinct structural roles. To fully exploit the structure information of a graph, in addition to the invariant graph representation at the global level, it is also important to learn invariant node representations at a local level. To achieve this, we propose local contrastive learning, which is based on distinguishing different nodes across different augmented views of a graph.

As shown in Figure 2.3, given an input graph $G$, we can generate two views of the graph via data augmentation. Since local contrastive learning is to distinguish whether two nodes from different views are the same node of the input graph, the adopted data augmentation should not make large changes to the input graph. To meet this requirement, we adopt two adaptive graph data augmentation mechanisms – adaptive edge dropping and adaptive node attribute masking.
Specifically, the edges are dropped based on the edge importance in the graph: the more important an edge is, the less likely it is to be dropped. Likewise, the node attributes are masked based on the importance of each dimension of the node attributes: the more important the attribute dimension is, the less likely it is to be masked out. Intuitively, the higher degree a node holds, the more important a role it plays in the whole graph. Therefore, in this work, the importance of both edges and node attributes is computed based on the node degree. In particular, the importance score of an edge is computed as the average degree of its connected nodes, and the importance score of an attribute dimension is the average, over all nodes, of the product of its norm and the degree of the corresponding node in the whole graph.

After the described data augmentation, two views of an input graph are generated and serve as the input of the shared GNN model. The outputs are two node representation matrices $\mathbf{Z}_2$ and $\mathbf{Z}_3$ corresponding to the two views. Note that the basic goal of local contrastive learning is to distinguish whether two nodes from the augmented views are the same node in the input graph. Therefore, $(\mathbf{Z}_{2i}, \mathbf{Z}_{3i})$, $i \in \{1, \ldots, N\}$, denotes a positive pair, where $N$ is the number of nodes, while $(\mathbf{Z}_{2i}, \mathbf{Z}_{3j})$ and $(\mathbf{Z}_{2i}, \mathbf{Z}_{2j})$ ($i, j \in \{1, \ldots, N\}$ and $i \neq j$) denote an inter-view negative pair and an intra-view negative pair, respectively. Inspired by InfoNCE [79, 141], the objective for a positive node pair $(\mathbf{Z}_{2i}, \mathbf{Z}_{3i})$ is defined as follows:

$\ell_c(\mathbf{Z}_{2i}, \mathbf{Z}_{3i}) = \log \dfrac{h(\mathbf{Z}_{2i}, \mathbf{Z}_{3i})}{h(\mathbf{Z}_{2i}, \mathbf{Z}_{3i}) + \sum_{j \neq i} h(\mathbf{Z}_{2i}, \mathbf{Z}_{3j}) + \sum_{j \neq i} h(\mathbf{Z}_{2i}, \mathbf{Z}_{2j})},$  (2.10)

where $h(\mathbf{Z}_{2i}, \mathbf{Z}_{3j}) = e^{\cos\left(g(\mathbf{Z}_{2i}),\, g(\mathbf{Z}_{3j})\right)/\tau}$, $\cos(\cdot)$ denotes the cosine similarity function, $\tau$ is a temperature parameter, and $g(\cdot)$ is a two-layer perceptron (MLP) used to further enhance the model expressive power. The node representations refined by this MLP are denoted as $\mathbf{Z}_2'$ and $\mathbf{Z}_3'$, respectively.

Apart from the contrastive objective described above, a decorrelation regularizer is also added to the overall objective function of local contrastive learning, in order to encourage different representation dimensions to capture distinct information. The decorrelation regularizer is applied over the refined node representations $\mathbf{Z}_2'$ and $\mathbf{Z}_3'$ as follows:

$\ell_d(\mathbf{Z}_2') = \left\|\mathbf{Z}_2'^{\top}\mathbf{Z}_2' - \mathbf{I}\right\|^2.$  (2.11)

The overall objective loss function of local contrastive learning is denoted as follows:

$\mathcal{L}_l = -\frac{1}{2N}\sum_{i=1}^{N}\left(\ell_c(\mathbf{Z}_{2i}, \mathbf{Z}_{3i}) + \ell_c(\mathbf{Z}_{3i}, \mathbf{Z}_{2i})\right) + \frac{\beta}{2}\left(\ell_d(\mathbf{Z}_2') + \ell_d(\mathbf{Z}_3')\right),$  (2.12)

where $\beta$ is the balance parameter.

2.3.2.3 The Overall Objective Function

The overall objective loss function of the SSL task for GT3 is a weighted combination of the global contrastive learning loss and the local contrastive learning loss:

$\mathcal{L}_s = \mathcal{L}_g + \alpha \mathcal{L}_l,$  (2.13)

where $\alpha$ is the parameter balancing the local contrastive learning and the global contrastive learning. By minimizing the SSL loss defined in Eq 2.13, the model not only captures the global graph information but also learns invariant and key node representations.
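A compact sketch of this two-level SSL objective is given below. It assumes the node-embedding matrices from the four views have already been produced: `Z0`/`Z1` from the raw and attribute-shuffled views together with the graph summary `g0`, and the MLP-refined `Z2p`/`Z3p` from the two adaptively augmented views. The sigmoid on the discriminator scores, the row normalization before the InfoNCE terms, and the identity-target form of the decorrelation term are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def global_contrast_loss(Z0, Z1, g0):
    """Eq. 2.9: score (node, graph) pairs from the raw view as positives and pairs whose
    node embeddings come from the shuffled view as negatives; a sigmoid keeps the
    dot-product scores in (0, 1) so the log terms are well defined."""
    pos = torch.sigmoid(Z0 @ g0)
    neg = torch.sigmoid(Z1 @ g0)
    return -0.5 * (torch.log(pos + 1e-8) + torch.log(1.0 - neg + 1e-8)).mean()

def _nce(a, b, tau):
    """Per-node terms -l_c(a_i, b_i) of Eq. 2.10 for L2-normalized embeddings a, b."""
    inter = torch.exp(a @ b.T / tau)                 # h(a_i, b_j): inter-view similarities
    intra = torch.exp(a @ a.T / tau)                 # h(a_i, a_j): intra-view similarities
    denom = inter.sum(dim=1) + intra.sum(dim=1) - intra.diag()
    return -torch.log(inter.diag() / denom)

def local_contrast_loss(Z2p, Z3p, tau=0.5, beta=1e-3):
    """Eqs. 2.10-2.12: symmetric InfoNCE over the two augmented views plus a
    decorrelation term pushing the embedding correlation matrix toward the identity."""
    a, b = F.normalize(Z2p, dim=1), F.normalize(Z3p, dim=1)
    nce = 0.5 * (_nce(a, b, tau) + _nce(b, a, tau)).mean()
    eye = torch.eye(a.size(1))
    decor = ((a.T @ a - eye) ** 2).sum() + ((b.T @ b - eye) ** 2).sum()
    return nce + 0.5 * beta * decor

def ssl_loss(Z0, Z1, g0, Z2p, Z3p, alpha=1.0):
    """Eq. 2.13: L_s = L_g + alpha * L_l."""
    return global_contrast_loss(Z0, Z1, g0) + alpha * local_contrast_loss(Z2p, Z3p)
```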
2.3.3 Adaptation Constraint

Directly applying test-time training could lead to severe representation distortion, since the model could overfit to the SSL task for a specific test sample [63], which harms the model performance on the main task. This problem may be even riskier for test-time training on graph data, since graph samples can vary from each other in not only node attributes but also graph structures. To mitigate this issue, we propose to add an adaptation constraint to the objective function during the test-time training process. The core idea is to put a constraint on the embedding spaces output by the shared graph feature extractor for the training graph samples and the test graph sample. In this way, there is an additional constraint pushing the embedding of the test graph sample to be close to the embedding distribution of the training graph samples, which is generated under the guidance of the main task, in addition to just fitting the SSL task.

Figure 2.4: The Adaptation Constraint in the Test-time Training Process.

The shared graph feature extractor consists of the first $K$ GNN layers. Let us denote the node embeddings output by the shared graph feature extractor for the training graphs $\{G_1, G_2, \ldots, G_n\}$ in the training process as $\{\mathbf{H}_1, \mathbf{H}_2, \ldots, \mathbf{H}_n\}$. Once the training process completes, we first get graph embeddings via a readout function, $\mathbf{h}_i = \text{READOUT}(\mathbf{H}_i)$, and then summarize two statistics of these graph embeddings: the empirical mean $\mu = \frac{1}{n}\sum_{i=1}^{n}\mathbf{h}_i$ and the covariance matrix $\Sigma = \frac{1}{n-1}\left(\mathbf{H}^{\top}\mathbf{H} - \frac{1}{n}(\mathbf{1}^{\top}\mathbf{H})^{\top}(\mathbf{1}^{\top}\mathbf{H})\right)$, where $\mathbf{H}$ is the matrix stacking the graph embeddings $\{\mathbf{h}_1, \ldots, \mathbf{h}_n\}$ as rows. During the test-time training process, given an input test graph $G_t$, we also summarize the embedding statistics of $G_t$ and its augmented views, and denote them as $\mu_t$ and $\Sigma_t$. The adaptation constraint forces the embedding statistics of the test graph sample to be close to those of the training graph samples, and can be formally defined as:

$\mathcal{L}_c = \|\mu - \mu_t\|_2^2 + \|\Sigma - \Sigma_t\|^2.$  (2.14)
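The statistics in Eq. 2.14 can be computed with a few lines of tensor code, as sketched below. Mean-pooling is used here as the readout and the covariance follows the empirical formula above; the function and variable names are illustrative rather than taken from the original implementation.

```python
import torch

def embedding_stats(H_list):
    """Empirical mean and covariance of a set of graph embeddings, where each element of
    H_list is the (num_nodes, d) node-embedding matrix of one graph produced by the
    shared graph feature extractor."""
    h = torch.stack([H.mean(dim=0) for H in H_list])   # READOUT: mean-pooling per graph
    n = h.size(0)
    mu = h.mean(dim=0)
    ones = torch.ones(1, n)
    sigma = (h.T @ h - (ones @ h).T @ (ones @ h) / n) / max(n - 1, 1)
    return mu, sigma

def adaptation_constraint(mu_train, sigma_train, H_test_views):
    """Eq. 2.14: keep the embedding statistics of the test graph (and its augmented
    views) close to those of the training graphs."""
    mu_t, sigma_t = embedding_stats(H_test_views)
    return ((mu_train - mu_t) ** 2).sum() + ((sigma_train - sigma_t) ** 2).sum()
```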
2.4 Theoretical Analysis

In this section, we explore the rationality of test-time training for GNNs from a theoretical perspective. The goal is to demonstrate that the test-time training framework can be beneficial for GNNs in the graph classification task. Next, we first describe some basic notions, and then develop two supportive theorems to achieve this goal.

For a graph classification task, suppose that there are $C$ classes and we use an $L$-layer simplified graph convolution (SGC) [115] with a sum-pooling layer to learn the graph representation, i.e.,

$\mathbf{g} = f_{GNN}(\mathbf{A}, \mathbf{X}; \theta) = \text{pool}(\mathbf{A}^{L}\mathbf{X}\mathbf{W}_1)\mathbf{W}_2 = \mathbf{1}^{\top}\mathbf{A}^{L}\mathbf{X}\theta,$  (2.15)

where $\mathbf{A}$ and $\mathbf{X}$ are the input graph adjacency matrix and node attributes, $\mathbf{1}$ is a vector with all elements equal to 1, and $\theta = \mathbf{W}_1\mathbf{W}_2$ denotes the model parameters. Note that we use SGC for this theoretical analysis since SGC has a similar filtering pattern as GCN but its architecture is simpler. Then, the softmax function is applied over the graph representation $\mathbf{g}$ to get the prediction probability $\mathbf{p}_c$:

$\mathbf{p}_c = \dfrac{e^{\mathbf{g}_c}}{\sum_{i=1}^{C} e^{\mathbf{g}_i}},$  (2.16)

where $\sum_{c=1}^{C}\mathbf{p}_c = 1$. The cross-entropy loss is adopted as the optimization objective for the SGC model, i.e.,

$\mathcal{L}(\mathbf{A}, \mathbf{X}, y; \theta) = -\sum_{c=1}^{C} \mathbb{1}_{y=c}\log(\mathbf{p}_c),$  (2.17)

where $y$ is the graph ground-truth label.

Theorem 1. Let $\mathcal{L}(\mathbf{A}, \mathbf{X}, y; \theta)$ be defined based on Eqs 2.15, 2.16 and 2.17, i.e., the cross-entropy loss for softmax classification for an $L$-layer SGC with sum-pooling on a graph $(\mathbf{A}, \mathbf{X}, y)$. It is convex and $\beta$-smooth in $\theta$, and $\|\nabla_{\theta}\mathcal{L}(\mathbf{A}, \mathbf{X}, y; \theta)\| \le G$ for all $\theta$, where $G$ is a positive constant.

Proof Sketch. We demonstrate that the cross-entropy loss for softmax classification is convex and $\beta$-smooth by proving that its Hessian matrix is positive semi-definite and that the eigenvalues of the Hessian matrix are smaller than a positive constant. Since $f_{GNN}$ defined in Eq 2.15 is a linear mapping, it does not change these properties, which completes the proof of Theorem 1. Proof details can be found in Appendix A.1.1.

Theorem 2. Let $\mathcal{L}_m(x, y; \theta)$ denote a supervised task loss on an instance $(x, y)$ with parameters $\theta$, and let $\mathcal{L}_s(x; \theta)$ denote a self-supervised task loss on the instance $x$, also with parameters $\theta$. Assume that for all $x, y$, $\mathcal{L}_m(x, y; \theta)$ is differentiable, convex and $\beta$-smooth in $\theta$, and both $\|\nabla_{\theta}\mathcal{L}_m(x, y; \theta)\| \le G$ and $\|\nabla_{\theta}\mathcal{L}_s(x; \theta)\| \le G$ for all $\theta$, where $G$ is a positive constant. With a fixed learning rate $\eta = \frac{\epsilon}{\beta G^2}$, for every $x, y$ such that

$\langle \nabla_{\theta}\mathcal{L}_m(x, y; \theta), \nabla_{\theta}\mathcal{L}_s(x; \theta) \rangle > \epsilon,$  (2.18)

we have

$\mathcal{L}_m(x, y; \theta) > \mathcal{L}_m(x, y; \theta(x)),$  (2.19)

where $\theta(x) = \theta - \eta\nabla_{\theta}\mathcal{L}_s(x; \theta)$. The detailed proof is provided in Appendix A.1.2.

Theorem 2 suggests that if the objective function of the main task is differentiable, convex and $\beta$-smooth, there exists an SSL task meeting certain requirements such that test-time training via this SSL task makes the main task loss decrease.

Remark. From Theorem 1, we know that the objective loss function of the SGC model for graph classification is differentiable, convex and $\beta$-smooth. Combined with Theorem 2, we can conclude that test-time training can be beneficial for GNN performance on graph classification.

2.5 Experiments

In this section, we validate the effectiveness and explore the rationality of the proposed GT3 via empirical studies. In the following subsections, the experimental settings are first introduced. Then, we conduct a layer-wise feature space exploration of GNN models trained for different tasks to further validate the rationality of the design of GT3. Next, the classification performance comparison of the proposed GT3 and baseline methods is discussed. Finally, we perform an ablation study and a hyper-parameter sensitivity exploration.

2.5.1 Experimental Settings

In this study, we implement the proposed GT3 on two well-known GNN models: the Graph Convolutional Network (GCN) [52] and the Graph Isomorphism Network (GIN) [120]. Empirically, we implement four-layer GCN models and three-layer GIN models with max-pooling. Layer normalization is adopted in the implementations. The hyper-parameters are selected based on the validation set from the following ranges:

• Hidden Embedding Dimension: 32, 64, 128
• Batch Size: 8, 16, 32, 64
• Learning Rate: [1e-5, 5e-3]
• Dropout Rate: 0.0, 0.1, 0.2, 0.3, 0.4, 0.5

For the balance hyper-parameters $\alpha$, $\beta$, and $\gamma$ in the SSL task, we determine their values based on the joint training of the SSL task and the main task on the validation set. As for performance evaluation, we conduct experiments on data splits generated by 3 random seeds and model initializations of 2 random seeds on DD, ENZYMES, and PROTEINS, and report the average and variance of the classification accuracy. For ogbg-molhiv, we conduct an empirical study on the fixed official data split with model initializations of 2 random seeds and report the average performance in terms of ROC-AUC. We conduct experiments on four common graph datasets as follows:

• DD [74]: a dataset consisting of protein structures of two categories, where each node denotes an amino acid.
• ENZYMES [74]: it consists of enzymes of six categories, where each enzyme is represented by its tertiary structure.
• PROTEINS [74]: a dataset containing protein structures of two categories, where each node denotes a secondary structure element.
• ogbg-molhiv [39]: it includes molecules of two categories (carrying the HIV virus or not), where each node is an atom.

In order to simulate the scenario where the test set distribution is different from that of the training set, graph samples from DD, ENZYMES, and PROTEINS are split into two groups based on their graph size (number of nodes). Next, we randomly select 80% of the graphs from the small group as the training set and the other 20% of the graphs as the validation set. The test set consists of graphs randomly chosen from the larger group, and the number of test graphs is the same as that of the validation set. For ogbg-molhiv, we adopt the official scaffold splitting, which also aims at separating graphs with different structures into different sets. For simplicity, the data splits described above are denoted as the OOD data split. To validate the reasonability of the OOD data split, we examine the training, validation, and test performance of GCN, and the results are shown in Table 2.2. Note that the performance on ogbg-molhiv is copied from [39]. From the table, we indeed observe a big performance gap.

Table 2.2: Graph Classification Performance Gap of GCN on the Training Set, Validation Set, and Test Set based on the OOD Data Split. Note that the performance metric for ogbg-molhiv is ROC-AUC(%) and that for the others is Accuracy(%).

Dataset        Training       Validation     Test
DD             92.8±7.13      73.82±3.36     49.43±15.30
ENZYMES        87.43±12.34    65.22±3.76     37.43±5.03
PROTEINS       77.5±2.07      77.78±1.93     68.46±7.17
ogbg-molhiv    88.65±1.01     83.73±0.78     74.18±1.22

2.5.2 Layer-wise Representation Exploration

In existing work, the test-time training framework is designed to share the first few layers of Convolutional Neural Networks (CNNs). This design is very natural since it is well recognized that the first few layers of CNNs capture similar basic patterns even when they are trained for different tasks. However, it is unclear how GNNs behave across different tasks. To answer this question, we explore the representation similarity of GNNs trained for different tasks. To be specific, we train four GCN models with exactly the same architecture on ENZYMES and PROTEINS for four different tasks: the graph classification task defined in Eq 2.5, the proposed GT3 SSL task with the loss described in Eq 2.13, and the SSL tasks with the local contrastive loss (Eq 2.12) and the global contrastive loss (Eq 2.9). Then we leverage Centered Kernel Alignment (CKA) [54] as a similarity index to evaluate the representation similarity of the different GCN models layer by layer.

Figure 2.5: CKA Values of Representations from Different Layers for Various Tasks on (a) ENZYMES and (b) PROTEINS. Note that "M" denotes the graph classification task; "C" indicates the proposed GT3 SSL task; "G" is the global contrastive learning task, and "L" is the local contrastive learning task.

The similarity results are demonstrated in Figure 2.5, where "M" denotes the graph classification task, "C" indicates the proposed GT3 SSL task, "G" is the global contrastive learning task and "L" is the local contrastive learning task. "M-C" represents the representation pair of "M" and "C" to be compared, and the representation pairs for the other tasks are denoted similarly. We can make the following observations.
First, the representations for different tasks tend to be more similar to each other in the lower layers. Second, the representation similarity for the same task pair may differ across datasets. The representations of the graph classification task and the global contrastive learning task are more similar to each other than those of the graph classification task and the local contrastive learning task on ENZYMES, while the observation is the opposite on PROTEINS. Third, the representation similarity between local contrastive learning and global contrastive learning is relatively lower, which reveals that they may learn knowledge that is complementary to each other. These observations motivate us (1) to share the first few layers of GNNs in test-time training; and (2) to develop the two-level SSL task for GT3, which is capable of capturing graph information from both the local and global levels.

2.5.3 Performance Comparison

In this subsection, we compare the graph classification performance of GCN and GIN under three mechanisms: (1) RAW, where the model is trained only by the main task; (2) JOINT, where the model is jointly trained by the main task and the proposed two-level SSL task; and (3) GT3, where the model is learned by test-time training. The comparison results are shown in Figure 2.6. Note that the performance of RAW on ogbg-molhiv is copied from [39].

Figure 2.6: Performance Comparison of GCN and GIN on Four Datasets ((a) DD, (b) ENZYMES, (c) PROTEINS, (d) ogbg-molhiv) based on the OOD Data Split. Note that the performance metric for ogbg-molhiv is ROC-AUC(%) and that for the others is Accuracy(%).

From the figure, we observe that the joint training of the proposed two-level SSL task and the main task can slightly improve the model performance in some cases. However, the proposed GT3 consistently performs better than both RAW and JOINT, which demonstrates the effectiveness of the proposed GT3. To further demonstrate the performance improvement from GT3, we summarize the relative performance improvement of GT3 over RAW. The results are shown in Table 2.3, where the performance improvement can be up to 43.3%.

Table 2.3: Relative Performance Improvement of GT3 over RAW.

Model   DD       ENZYMES   PROTEINS   ogbg-molhiv
GCN     +43.3%   +7.81%    +5.25%     +3.13%
GIN     +6.09%   +13.67%   +1.63%     +5.21%

In addition to the performance comparison on the OOD data split, we have also conducted a performance comparison on a random data split, where we randomly split each dataset into 80%/10%/10% for the training/validation/test sets, respectively. The results demonstrate that GT3 can also improve the classification performance in most cases; more results can be found in Appendix A.2.

2.5.4 Ablation Study

In this subsection, we conduct an ablation study to understand the impact of key model components, including the SSL task and the adaptation constraint, on the performance of the proposed GT3. The ablation study is conducted on DD, ENZYMES, and PROTEINS with the GCN model. Specifically, we study the effect of the adaptation constraint, global contrastive learning, and local contrastive learning by respectively eliminating each of them from GT3. These three variants of GT3 are denoted as GT3-w/o-constraint, GT3-w/o-global, and GT3-w/o-local. The performance is shown in Figure 2.7(a). We observe that the removal of each component degrades the model's performance, which demonstrates the necessity of all three components.

Figure 2.7: (a) The Ablation Study of GT3 and (b) The Effect of Different Shared Layers.

2.5.5 The Effect of Shared Layers

We have also explored the effect of the number of shared layers on the performance of GT3.
To be specific, we vary the number of shared layers of the GCN model in GT3 as {0, 1, 2, 4}. Note that the model where the number of shared layers is zero denotes the model trained only by the main task. The graph classification performance on DD, ENZYMES, and PROTEINS is shown in Figure 2.7(b). From the figure, we can observe that most of the GT3 implementations with different shared layers can improve the classification performance over their baseline counterparts, while an appropriate choice of shared layers can boost the model performance significantly on some datasets such as DD. 2.6 Related Work Test-Time Training. Test-time training with self-supervision is proposed in [101], which aims at improving the model generalization under distribution shift via solving a self-supervised task for test samples. This framework has empirically demonstrated its effectiveness in bridging the performance gap between the training set and test set in the image domain [6, 21, 101]. There are also works combining the test-time training and meta-learning learning [6], and attempting to apply test-time training in reinforcement learning [28]. Apart from these applied researches, a recent work tries to explore when test-time training thrives from a theoretical perspective [63]. Graph Neural Networks. Graph Neural Network is a successful generalization of Deep Neural Network over graph data. It has been empirically and theoretically proven to be very powerful in graph representation learning [120], and has been widely used in various applications, such as recommendation [35, 19], social network analysis [102] and natural language processing [43, 11]. GNNs typically refine node representations by feature aggregation and transformation. Most existent GNN models can be summarized into a message-passing framework consisting of message passing and feature update [24]. There is also another perspective unifying many GNN models as methods to solve a graph signal denoising problem [67]. Overall, GNN is an active research field and there are a lot of works focusing on improving the performance, robustness, and scalability of the GNN models [12, 46, 62]. 26 2.7 Limitations To the best of our knowledge, GT3 is the first test-time training framework specially designed for GNNs. While we have carefully designed each component of GT3, there still exist some limitations in the proposed framework at its current stage. First, it is very challenging to determine when to stop the model adaptation process during the test time. In other words, how to select the optimal model during the test-time training process is very hard, since there is no validation set. Although we have tried some heuristic methods to define the stopping criterion for model adaptation during the test-time training process, none of them work effectively on all the test samples from different datasets. After the empirical study, we end up with a simple stopping criterion, that is, to cease the model adaptation process after a fixed number of training steps. However, this criterion is very crude and may fail to find an optimal model for some test samples, especially when the test samples are very diverse. Second, although we have theoretically illustrated that the test-time training framework can be beneficial for GNNs on the graph classification task, we have not provided a principled method with theoretical guarantees, to develop self-supervised learning tasks that can benefit different main tasks in the graph test-time training framework. 
In other words, for different graph-related tasks, we need to design new self-supervised learning tasks for the test-time training based on intuitions and some preliminary study. 2.8 Conclusion In this work, we design a test-time training framework for GNNs (GT3) for the graph classification tasks, with the aim to bridge the performance gap between the training set and the test set when there is a data distribution shift. As the first work to combine test-time training and GNNs, we empirically explore the rationality of the test-time training framework over GNNs via a layer-wise representation comparison of GNNs for different tasks. In addition, we validate that the GNN performance on graph classification can be benefited from test-time training from a theoretical perspective. Furthermore, we demonstrate the effectiveness of the proposed GT3 on extensive experiments and understand the impact of the model components via ablation study. As future work, it is worth studying test-time training for GNNs for other tasks on graphs such as node classification. 27 Also, it is interesting to explore how to automatically determine the number of shared layers and choose appropriate SSL tasks for various main tasks on different datasets. 28 CHAPTER 3 CUSTOMIZED GRAPH NEURAL NETWORKS Graph Neural Networks (GNNs) have greatly advanced the task of graph classification. Typically, we first build a unified GNN model with graphs in a given training set and then use this unified model to predict labels of all the unseen graphs in the test set. However, graphs in the same dataset often have dramatically distinct structures, which indicates that a unified model may be sub-optimal given an individual graph. In this chapter, different from the training perspective proposed in Chapter 2, we propose to develop customized graph neural networks for graph classification from a novel learning perspective. Specifically, we propose a novel customized graph neural network framework, i.e., Customized-GNN. Given a graph sample, Customized-GNN can generate a sample-specific model for this graph based on its structure. Meanwhile, the proposed framework is very general that can be applied to numerous existing graph neural network models. Comprehensive experiments on various graph classification benchmarks demonstrate the effectiveness of the proposed framework. 3.1 Introduction Graphs are natural representations for many real-world data such as social networks [27, 52, 122], biological networks [9, 95] and chemical molecules [24, 91]. A crucial step to perform downstream tasks on graph data is to learn better representations. Deep neural networks have demonstrated great capabilities in representation learning for Euclidean data and thus have advanced numerous fields including speech recognition [77], computer vision [33], and natural language processing [15]. However, they cannot be directly applied to graph-structured data since graphs have complex topological structures. Recently, graph neural networks (GNNs) have generalized deep neural networks to graph data. GNNs typically update node representations by transforming, propagating, and aggregating node features across the graph. They have boosted the performance of many graph-related tasks such as node classification [52, 27], link prediction [22, 105, 134], and graph classification [69, 58, 23, 126]. Graph classification is one of the most important and prevalent graph-related tasks [18], and in 29 (a) Node size distribution. (b) A graph with 31 nodes. 
(c) A graph with 302 nodes. (d) Classification accuracy. Figure 3.1: An illustrative example of varied structural information and its impact on the performance of graph neural network-based graph classification. (a) demonstrates the distribution of graph size (i.e., the number of nodes) for protein graphs in the D&D dataset, where the graph size varies dramatically from 30 to 5, 748. ; (b) and (c) show two graphs sampled from the D&D dataset, which present very different structural properties; (d) shows classification performance of three models trained on training sets with various graph sizes. this work, we aim to advance graph neural networks for the graph classification task. There are numerous real-world applications for graph classification. For example, proteins can be denoted as graphs [16], and the task to infer whether a protein functions as an enzyme or not can be regarded as a graph classification task; and it can also be applied to forecast Alzheimer’s disease progression in which individual brains are represented as graphs [97]. Unlike data samples in classification tasks in other domains such as computer vision [89] and natural language processing [55], graph samples in the graph classification task are described not only by the input (node) features but 30 their graph structures. Both the input node features and the graph structures play crucial roles in the graph classification tasks [126, 69, 23]. In reality, graphs in the same data set can present significantly different structural properties. Figure 3.1(a) demonstrates the distribution of graph size (i.e., the number of nodes) for protein graphs in the D&D dataset [16], where the graph size varies dramatically from 30 to 5, 748. We further illustrate two graphs sampled from the D&D dataset in Figures 3.1(b) and 3.1(c), respectively. These two graphs present very different structural properties such as the number of nodes, graph shapes, and diameters. The above investigations indicate that graphs in the same data set could have dramatically distinct structural properties. It naturally raises the question – whether we should treat these graphs differently? To investigate this question, we take the graph size as the representative structure property and demonstrate how it affects the graph classification performance. Specifically, we divide graphs from D&D into two groups based on their graph size – one for graphs with a small number of nodes and the other for graphs with a large number of nodes. Then, we split each group into a training set and a test set. Next, we train three GNN models 1 based on three training sets – the small one, the large one, and the overall (a combination of the small and large one), separately. Then, we test their performance on the two test sets. The results are shown in Figure 3.1(d). In the test set with small/large graph sizes, the model trained on the training set with the same graph size can significantly outperform the other two models. Investigations and consistent observations on more settings can be found in Section 3.2. These investigations suggest that a unified model is not optimal for graphs with diverse structure properties and efforts are desired to consider the structure-property difference among graphs. Hence, in this paper, we aim to learn “customized models” for graphs with different structural properties. A natural way is to divide the dataset into different splits according to structure properties and train a model for each split. 
However, we face enormous challenges to achieve this goal in practice. First, there are potentially several structure properties (graph size, density, etc.) affecting the performance, and we have no explicit knowledge about how the graphs should be split according to these properties. 1The GNN model for graph classification uses GCN [52] as the filtering operation and max-pooling as the pooling operation. 31 Second, dividing the dataset leads to small training sets for the splits, which may not be sufficient to train satisfactory models. To address these challenges, we propose a novel graph neural network framework, Customized-GNN, for graph classification. The Customized-GNN framework is trained on all graphs in the given training set (without splitting) and is able to produce customized GNN models for each individual graph. Specifically, we design an adaptor, which is able to smoothly adjust a general GNN model to a specific one according to the structural properties of a given graph. The general GNN model and the adaptor are learned during the training stage simultaneously utilizing all graphs. Our major contributions are listed as follows: 1) We empirically observed that graphs in a given dataset could have dramatically distinct structural properties. Furthermore, it is not optimal to train a unified model for graphs with various structure properties for a graph classification task; 2) We propose a framework, Customized-GNN, which is able to generate a customized GNN model for each graph sample based on its structural properties. The proposed framework is general and can be directly applied to many existing graph neural network models; 3) We designed and conducted comprehensive experiments on numerous graph datasets from various domains to verify the effectiveness of the proposed framework. 3.2 Preliminary Data Analysis In Figure 3.1, we have demonstrated that graphs in D&D have varied properties, which affected the performance of GNNs for graph classification. In this section, we aim to further investigate this phenomenon by answering the following two questions – (1) can the observations on D&D be extended to other datasets? and (2) whether incorporating these properties into the models can facilitate the performance? We choose four representative graph datasets from different domains for this study including D&D [16], ENZ [95], PROT [9] and RE-BI [122]. We checked the properties such as node size and edge size. Similar to D&D, graphs in all datasets present very diverse properties. More details about these datasets can be found in Appendix B.1. Following the same setting as D&D, we divide each data into two groups according to the node size, i.e., large training and test (denoted as 32 “L-training" and “L-test") and small training and test (indicated as “S-training" and “S-test"). The results are demonstrated in Table 3.1. From the table, we make consistent observations with these in D&D – models trained on one property group (e.g., L-training) cannot perform well on the other property group (e.g., S-test). Table 3.1: Graph classification accuracy on different node-size sets. Accuracy (%) DD ENZ PROT RE-BI S-test L-test S-test L-test S-test L-test S-test L-test S-training 66.2 55.7 45.0 20.0 76.5 51.4 88.6 46.6 L-training 47.2 77.8 27.0 39.0 45.6 78.6 29.3 83.1 To answer the second question, we divide D&D into several subsets based on the node size and then divide each subset into a sub-training set (80%) and a sub-test set (20%). 
We train models on different sub-training sets separately and then test their performance on all the sub-test sets. Specifically, we have trained four models on four different training sets from D&D, which are training 1, training 2, training 3 containing graph samples with node sizes from different ranges, and training which is the combination of training 1, 2 and 3. Then, we test four models on four test sets, i.e., test 1, test 2, test 3 and test which is the combination of test 1, 2 and 3. Statistics about these training sets are summarized in Table 3.2. The performance of four models on the test sets is illustrated in Table 3.2. We note that the model trained on a specific training set performs much better on the corresponding test set that shares the same node size range than the other test sets. This suggests the potential to incorporate the structure properties into the model training. In addition, though training 1, training 2, and training 3 have much fewer training samples, the models trained on specific training sets can achieve better performance on the corresponding test sets compared to these trained from the entire training set (or training). This indicates that a unified graph neural network that is trained from the entire training set is not optimal for graphs with various structure properties in the test set. Discussion. Via the preliminary data analysis, we have established: (1) graphs in real-world data present distinct structure properties that tend to impact the graph classification performance of GNNs; and (2) incorporating the difference has the potential to boost the graph classification 33 Table 3.2: The classification accuracy of the models trained from four training sets in the D&D dataset and Statistics for four training sets. Accuracy(%) Training set Node size range #Graphs test 1 test 2 test 3 test training 1 [0,200] 369 76.1 41.2 26.7 50.9 training 2 [200,400] 392 43.5 75.3 82.2 62.4 training 3 [400,2000] 180 23.9 75.3 88.9 56.8 training [0,2000] 941 64.1 61.9 75.6 66.2 Input Structure Customized Adaptor Graphs Property GNNs s1 yˆ1 g1 GNN1 y1 s2 yˆ2 Adaptor L g2 GNN2 y2 … … … … gi si yˆi yi GNNi Shared GNN Figure 3.2: An overview of the proposed customized graph neural networks. Given a set of graph samples, the adaptor networks take their structure properties as input and generate corresponding adaptor parameters, which are used to adapt a shared GNN model (this could be any GNN model that works for the graph classification task) to a model specific for each graph sample. Each adapted model incorporates the structure information of a graph, and thus, is customized for this graph sample to make label predictions. With the predicted label and the ground truth label, we can calculate the overall loss, which is used to guide the optimization of the adaptor networks and the shared GNN model. performance. These observations lay the foundations of the model design in the next section. 3.3 The Proposed Framework In this section, we introduce the proposed framework Customized-GNN that has been designed for graphs with inherently distinct structure properties. 34 3.3.1 The Overall Design As mentioned in earlier sections, graphs in real-world data inherently present distinct structural properties. Thus, we are desired to build distinct GNN models for them. To achieve this goal, we face tremendous challenges. First, we have no explicit knowledge about how the graph structure properties will influence graph neural network models. 
Second, if we separately train different models for graphs with different structure properties, we have to split the training set for each model; as a consequence, the training data for each model could be very limited. For example, in the extreme case where each graph has unique structural properties, we only have one training sample for the corresponding model. Third, even if we can train distinct GNN models well for different graphs, during the test stage, for an unlabelled graph with unseen structural properties, it is hard to decide which trained model we should adopt to make the prediction. In this work, we propose a customized graph neural network framework, i.e., Customized-GNN, which can tackle the aforementioned challenges simultaneously. An overview of the architecture of Customized-GNN is demonstrated in Figure 3.2. The basic idea of Customized-GNN is that it generates customized adaptor parameters for each graph sample g_i via an adaptor network with the graph structure properties as input. These generated adaptor parameters are used to adapt a shared GNN model, denoted as GNN (this could be any GNN model that works for the graph classification task), to a model specific for the graph sample g_i. The adapted model GNN_i incorporates the structure information of graph g_i, and thus is customized for the graph sample g_i. With the proposed Customized-GNN framework, the first challenge is handled, since the influence is implicitly modeled by the adaptor networks, which can customize the shared GNN model to a graph sample-specific one. Furthermore, Customized-GNN can be trained on the entire training set without splitting it according to the graphs' structure properties. This not only solves the second challenge but also ensures that the trained model can preserve common knowledge from the entire training set. The third challenge is also automatically addressed by the Customized-GNN framework. Given an unseen graph g_j, the Customized-GNN framework first takes its graph structure information as input and generates adaptor parameters. Then, these generated adaptor parameters can be used to customize the general GNN model to a customized one, GNN_j, to predict the label of g_j. Next, we introduce details about the adaptor network, the process of adapting a shared model to a specific one for a given graph, and the time complexity analysis of the proposed framework.

3.3.2 The Adaptor Network
The goal of the adaptor network is to generate the adaptor parameters for a given graph. From the preliminary data analysis, we have the intuition that the customized GNN model for a specific graph sample should be correlated to its structural properties. However, there is no explicit knowledge about how these structural properties influence graph neural network models. To model this implicit mapping function, we propose to utilize a powerful neural network to generate the model adaptor parameters from the observed structure information of a given sample. In addition, graph neural networks often consist of several subsequent filtering and pooling layers, which can be viewed as different GNN blocks; such blocks are shown in Figure 3.3. The graph structure properties of a given sample may have different influences on different GNN blocks. Hence, we introduce one adaptor network per GNN block to generate adaptor parameters for that block. Specifically, we first extract a vector s_i to denote the structure information of a given graph g_i (a small sketch of one possible construction of s_i follows).
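To make this concrete, the sketch below shows one way to assemble such a structure vector from an adjacency matrix; the helper name and the NumPy-based implementation are illustrative, but the three features (node count, edge count, density) are the ones later used as s_i in the experiments of Section 3.4.

```python
import numpy as np

def structure_vector(adj: np.ndarray) -> np.ndarray:
    """Build a simple structure-property vector s_i for one graph.

    `adj` is assumed to be a symmetric {0,1} adjacency matrix without
    self-loops. The features here are the number of nodes, the number of
    edges, and the graph density; other statistics could be appended in
    the same way.
    """
    n = adj.shape[0]                       # number of nodes
    m = int(adj.sum() // 2)                # number of undirected edges
    density = 2.0 * m / (n * (n - 1)) if n > 1 else 0.0
    return np.array([n, m, density], dtype=np.float32)

# Example: a 4-node path graph.
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1
print(structure_vector(A))                 # [4. 3. 0.5]
```

Richer descriptors (e.g., diameter or degree statistics) could be appended to this vector without changing the rest of the framework.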
We will discuss more details about s_i in the experiment section. As shown in Figure 3.3, the adaptor networks take the structure information s_i as input and generate the adaptation parameters for each block. In the case where there are K blocks in the graph neural network, we have K independent adaptor networks corresponding to these blocks. Note that these adaptor networks share the same input s_i while their outputs are different. Specifically, the adaptor network for the j-th block can be expressed as follows:

φ_i^j = h_j(s_i; Ω_j),   j = 1, ..., K,    (3.1)

where Ω_j denotes the parameters of the j-th adaptor network and φ_i^j denotes its output, which will be used to adapt the j-th learning block. The adaptor network h_j can be modeled using any function.

Figure 3.3: An overview of the GNN adaptation process. Given a specific graph sample g_i, the adaptor networks (one per GNN block) take its graph structure properties s_i as input and generate corresponding adaptor parameters to adapt each shared GNN block, yielding a customized GNN for g_i. This customized GNN is then used to make a prediction for g_i.

In this work, we utilize feed-forward neural networks due to their strong capability. According to the universal approximation theorem [38], a feed-forward neural network can approximate any nonlinear function. For convenience, we summarize the process of the adaptor networks with s_i as input below:

Φ_i = H(s_i; Ω),    (3.2)

where Φ_i contains the generated adaptation parameters of all the GNN blocks for graph g_i and Ω denotes the parameters of all the adaptor networks.

3.3.3 The Adapted Graph Neural Network
Any existing graph neural network model can be adapted by the Customized-GNN framework to generate sample-specific models based on the structure information. Therefore, we first generally introduce the GNN model for graph classification and describe how to adapt it given a specific sample. Then, we illustrate how to adapt specific GNN models.

3.3.3.1 A General Adapted Framework
A typical GNN framework for graph classification contains two types of layers, i.e., the filtering layer and the pooling layer. The filtering layer takes the graph structure and node representations as input and generates refined node representations as output. The pooling layer takes the graph structure and node representations as input to produce a coarsened graph with a new graph structure and new node representations. A general GNN framework for graph classification contains p pooling layers, each of which follows f stacked filtering layers. Hence, there are K = p * f learning blocks in the GNN framework. A graph-level representation can be obtained from these layers and further utilized to perform the prediction. Given a graph sample g_j, we need to adapt each of the layers according to its adaptor parameters generated from the adaptor network. Via this process, we can generate a GNN model GNN_j specific to g_j. Without loss of generality, when introducing a filtering layer or a pooling layer, we use an adjacency matrix A ∈ R^(n×n) and node representations X ∈ R^(n×d) to denote the input of these layers, where n is the number of nodes and d is the dimension of node features. Then, the operation of a filtering layer can be described as follows:

X_new = f(A, X; θ_f),    (3.3)

where θ_f denotes the parameters in the filtering layer and X_new ∈ R^(n×d_new) denotes the refined node representations with dimension d_new generated by the filtering layer.
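The following is a minimal PyTorch sketch of the per-block adaptor networks in Eqs. (3.1)–(3.2), together with a FiLM-style adaptation of a block's weight matrix of the kind used later in Section 3.4. The layer sizes, the module names, and the exact form of the adaptation operator are assumptions made for illustration, not the dissertation's implementation.

```python
import torch
import torch.nn as nn

class BlockAdaptor(nn.Module):
    """One adaptor network h_j(s_i; Omega_j): maps the structure vector s_i
    to a pair (gamma, beta) of length d_new used to adapt the j-th block."""
    def __init__(self, s_dim: int, d_new: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * d_new),      # phi_i^j has dimension 2 * d_new
        )
        self.d_new = d_new

    def forward(self, s_i: torch.Tensor):
        phi = self.net(s_i)
        gamma, beta = phi[..., :self.d_new], phi[..., self.d_new:]
        return gamma, beta

def adapt_weight(W: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor):
    """Adapt a filtering weight W (d_in x d_new) with one graph's (gamma, beta).
    gamma and beta are broadcast across the rows of W (one scale and one shift
    per output dimension); combining the multiplicative and additive terms is
    one plausible reading of the FiLM operator in Eq. (3.11)."""
    return W * gamma + beta

# K adaptor networks for K learning blocks, all fed the same s_i.
K, s_dim, d_in, d_new = 3, 3, 16, 16
adaptors = nn.ModuleList([BlockAdaptor(s_dim, d_new) for _ in range(K)])
shared_W = [torch.randn(d_in, d_new) for _ in range(K)]   # shared GNN block weights

s_i = torch.tensor([50.0, 120.0, 0.1])      # structure vector of one graph
adapted = [adapt_weight(W, *adaptors[j](s_i)) for j, W in enumerate(shared_W)]
print([W.shape for W in adapted])
```

In this sketch the shared weights stay the same for every graph, while the adaptor output changes with s_i, which is exactly the division of labor described above.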
Assuming φ_f is the corresponding adaptor parameter for this filtering layer, we adapt the model parameter θ_f of this filtering layer as follows:

θ̃_f = θ_f ◇ φ_f,    (3.4)

where θ̃_f is the adapted model parameter that has the same dimension as the original model parameter θ_f, and ◇ is the adaptation operator. The adaptation operator can have various designs, which can be determined according to the specific GNN model. We will provide the details of the adaptation operator when we introduce concrete examples in the following subsections. Then, with the adapted model parameters, we can define the adapted filtering layer as follows:

X_new = f(A, X; θ_f ◇ φ_f).    (3.5)

On the other hand, the process of a pooling layer can be described as follows:

A_new, X_new = p(A, X; θ_p),    (3.6)

where θ_p denotes the parameters of the pooling layer, A_new ∈ R^(n_new×n_new) with n_new < n is the adjacency matrix of the newly generated coarsened graph, and X_new ∈ R^(n_new×d_new) is the learned node representations for the coarsened graph. Similarly, we adapt the model parameters of the pooling layer with the adaptation operator, i.e., θ_p ◇ φ_p, where φ_p is the adaptor parameter generated for the pooling layer. As a concrete example, the pooling operation of DiffPool [126] can be described as follows:

S = softmax(f_0(A, X; θ_f0)),    (3.13)
X_new = S^T Z,    (3.14)
A_new = S^T A S,    (3.15)

where f_0 is a filtering layer embedded in the pooling layer and S ∈ R^(n×n_new) is a soft-assignment matrix, which softly assigns each node to a supernode to generate a coarsened graph. Specifically, the structure and the node representations of the coarsened graph are generated by Eq. (3.15) and Eq. (3.14), respectively, where Z ∈ R^(n×d_new) is the output of the filtering layers. To adapt the pooling layer, we only need to adapt Eq. (3.13), which follows the same way as introduced in Eq. (3.10), as it is also a filtering layer. We refer to the adapted DiffPool model as Customized-DiffPool.

3.3.4 Training and Test via the Customized Framework
Given a graph sample g_i with the adjacency matrix A_i and the feature matrix X_i, the Customized-GNN framework performs the classification task as follows:

ỹ_i = GNN(A_i, X_i; Θ_GNN ◇ H(s_i; Ω)).    (3.16)

During training, we are given a set G = {(g_i, y_i)} of N graphs as training samples, where each graph g_i is associated with a ground-truth label y_i. Then, the objective function of Customized-GNN can be represented as follows:

min_{Ω, Θ_GNN}  Σ_{i=1}^{N} L(y_i, GNN(A_i, X_i; Θ_GNN ◇ H(s_i; Ω))),    (3.17)

where N is the number of training samples and L is a loss function. In this work, we use cross-entropy as the loss function and adopt ADAM [50] to optimize the objective. During the test phase, the label of a given sample g_t can be inferred using Eq. (3.16). Specifically, the graph structure information s_t of the sample is first utilized as the input of the adaptor network H(·; Ω) to identify its distribution information, which is then utilized to adapt the shared model parameters Θ_GNN to generate a sample-specific model GNN_t. This sample-specific model finally performs the classification for this sample.

3.3.5 Time Complexity Analysis
In this subsection, we analyze the additional time required to calculate the adaptation parameters and perform the adaptation. Specifically, we use the FiLM adaptation operator as an example of the adaptor network. For convenience, the dimension of the output node features in all layers is assumed to be the same, d. The dimension of the output of the adaptation network, φ_f, is 2d. Furthermore, we assume that the input of the adaptation network, i.e., the graph property information s_i, has dimension s.
Then, the time complexity to generate the adaptation parameters for a single block using Eq. (3.1) is O(2d·s) = O(d·s). Furthermore, the time required to adapt the parameters of a single block with Eq. (3.11) is O(d^2). Hence, for a graph neural network with K learning blocks, the time complexity to calculate the adaptation parameters and perform the adaptation for all learning blocks is O(K·d·s + K·d^2). Note that the time complexity of a single filtering operation in Eq. (3.9) is O(m·d + n·d^2), where m denotes the number of edges and n is the number of nodes. Therefore, the total time complexity of K learning blocks without adaptation is O(K·m·d + K·n·d^2). Furthermore, s is typically small (much smaller than m); hence, the additional time complexity introduced by the adaptation operation is rather small.

3.4 Experiments
In this section, we conduct comprehensive experiments to verify the effectiveness of the proposed Customized-GNN framework. We first describe the experimental settings. Then, we evaluate the performance of the framework by comparing the original GCN, DiffPool, and gPool models with the GCN, DiffPool, and gPool models adapted by the Customized-GNN framework. Next, we analyze the importance of different components in the adaptor operator. Finally, we conduct case studies to further facilitate our understanding of the proposed method.

3.4.1 Experimental Settings
We carried out graph classification tasks on seven datasets. More details about these datasets can be found in Appendix B.1. Next, we describe the baselines. In Section 3.3, we apply the proposed framework to adapt three graph neural network models: a basic graph convolutional network (GCN) [52] and two state-of-the-art graph classification models, DiffPool [126] and gPool [22]. The corresponding adapted versions are Customized-GCN, Customized-DiffPool, and Customized-gPool, respectively. Our evaluation purpose is to examine whether the proposed framework can boost the performance of existing models by adapting them to their corresponding customized versions. Thus, (1) to validate the effectiveness of the proposed model, we compare Customized-GCN, Customized-DiffPool, and Customized-gPool with GCN, DiffPool, and gPool; and (2) we have not chosen the models in [88, 69, 128] as baselines here, but the proposed framework can be directly applied to adapt them as well. Note that in this work, we construct a set of simple structural features s_i of g_i, such as the number of nodes, the number of edges, and the graph density; however, it is flexible to include other, more complex features in the proposed framework. Furthermore, we create baselines that directly concatenate the graph structure properties s_i to the output graph embedding of the GCN, DiffPool, and gPool models. Correspondingly, we call these three methods Concat-GCN, Concat-Diff, and Concat-gPool. In addition, we develop the baseline methods Multi-GCN, Multi-Diff, and Multi-gPool. They learn multiple graph convolutional networks for graph samples with different structural information. More details of these baselines can be found in Appendix B.2.

3.4.2 Graph Classification Performance Comparison
In this subsection, we first perform the comparison following the traditional setting. To further demonstrate the advantage of the proposed frameworks, we show their adaptability when the properties of test graphs are different from those of training graphs.
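Before turning to the concrete comparison settings, it may help to see the training procedure of Section 3.3.4 (Eq. (3.17)) as a loop. The toy sketch below collapses the shared GNN to a single linear block and uses random data, so it only illustrates how the adaptor parameters Ω and the shared parameters Θ_GNN are optimized jointly with cross-entropy and ADAM; none of the names or sizes come from the dissertation's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a shared "GNN" reduced to one linear block, plus an adaptor
# that rescales it from the structure vector s_i.
d_feat, d_hid, n_cls, s_dim = 8, 16, 2, 3
shared = nn.Linear(d_feat, d_hid)                 # stands in for Theta_GNN
classifier = nn.Linear(d_hid, n_cls)
adaptor = nn.Linear(s_dim, d_hid)                 # stands in for Omega

params = list(shared.parameters()) + list(classifier.parameters()) + list(adaptor.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def forward_one(x_mean, s_i):
    gamma = adaptor(s_i)                          # adaptation parameters for this graph
    h = F.relu(shared(x_mean) * gamma)            # adapted block (FiLM-like scaling)
    return classifier(h)

# Fake "graphs": mean-pooled node features, structure vectors, labels.
data = [(torch.randn(d_feat), torch.rand(s_dim), torch.randint(0, n_cls, ()).item())
        for _ in range(32)]

for epoch in range(5):
    total = 0.0
    for x_mean, s_i, y in data:
        logits = forward_one(x_mean, s_i)
        loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([y]))
        opt.zero_grad(); loss.backward(); opt.step()
        total += loss.item()
    print(f"epoch {epoch}: loss {total / len(data):.3f}")
```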
Following the setting in [126], for each graph dataset, we randomly shuffle the dataset and then split 90% of the data as the training set and the remaining 10% as a test set. We train all the models on the training set and evaluate their performance on the test set with accuracy as the measure. We repeat this process with different data shuffling and initialization seeds for 4 times and report the average performance and standard variance. In terms of the implementation details, the GCN/Customized-GCN model consists of 3 filtering layers and a single max-poling layer; the hidden dimension of each filtering layer is 20, and ReLU [76] activation is applied after each filtering layer. For DiffPool/Customized-DiffPool and gPool/Customized-gPool, we set ? = 2, 5 = 3 and the dimension of hidden filtering layer 20. We adopt fully-connected networks to implement the adaptor networks in the Customized-GNN frameworks. Its input dimension is the same as the dimension of the graph’s structural information. The results are shown in Table 3.3. We notice that the Concat- and the Multi-versions of the GNN models can, in some cases, achieve comparable or even better performance than their corresponding 43 Table 3.3: Comparisons of graph classification performance in terms of accuracy. Datasets Accuracy (%) COLLAB ENZ PROT DD RE-BI RE-5K NCI109 GCN 67.9±1.4 50.4±3.0 77.0±2.3 79.3±5.3 82.6±4.9 50.7±1.3 74.9±2.7 Concat-GCN 68.4±1.4 52.5±5.1 78.4±1.9 77.6±3.2 80.7±3.5 50.7±1.0 75.6±1.2 Multi-GCN-2 68.3±1.4 47.0±1.8 79.5±1.3 77.1±2.4 80.6±3.5 50.3±1.9 74.2±1.9 Multi-GCN-3 67.0±1.4 44.6±5.4 79.9±2.2 76.7±3.5 77.5±7.3 48.5±2.1 75.8 ±1.5 Customized-GCN 71.3±1.0 55.4±4.6 78.8±3.2 79.6±3.9 91.5±1.6 53.3±1.3 76.7±1.1 DiffPool 70.6±1.2 57.9±2.5 78.6±2.5 81.5±4.1 89.6±1.1 56.2±1.1 77.5±0.7 Concat-Diff 70.7±0.7 60.0±1.7 77.7±2.5 81.3±2.9 91.1±1.7 54.9±1.4 78.0±0.5 Multi-Diff-2 70.7±0.6 56.3±1.3 80.0±1.2 79.3±2.9 89.9±2.5 53.7±0.6 76.8±0.7 Multi-Diff-3 70.8±1.1 52.5±0.8 80.9±1.7 80.6±2.3 88.8±0.7 53.4±2.4 78.5±1.2 Customized-DiffPool 73.6±0.5 57.9±7.2 78.6±2.9 80.6±2.6 95.1±1.6 55.8±1.1 78.2±0.9 gPool 69.4±2.2 53.8±3.2 77.3±3.0 78.9±5.5 88.9±1.6 51.3±0.6 77.1±1.2 Concat-gPool 69.7±0.5 57.1±1.8 79.1±2.2 78.0±2.6 88.5±1.3 50.9±2.2 76.3±0.7 Multi-gPool-2 69.0±1.9 50.8±5.1 79.7±1.0 79.5±2.7 84.0±3.2 49.3±2.3 73.5±2.1 Multi-gPool-3 68.9±1.6 46.2±3.2 80.6±0.8 80.0±3.6 83.1±4.5 48.9±1.8 75.2±1.9 Customized-gPool 72.3±1.0 62.9±3.6 80.6±1.6 80.0±3.1 91.1±0.7 53.3±1.4 76.5±1.9 original versions. This indicates that utilizing the graph structure properties has the potential to help improve the model performance. However, the performance of these variants is not so stable across different datasets, which means that these simple methods are not suitable for all datasets. For example, the Concat- versions may work well on datasets where the label is directly related to the graph structure properties but fail on those datasets where graph structure properties have a more implicit impact on the labels. On the other hand, the performance of the Multi-version of the GNN models is heavily dependent on how the data is split into different groups. It is not practical to find good splits manually. Furthermore, simply training different models for different graphs can lead to unsatisfactory performance because less training data is available for each model. 
In contrast, our proposed Customized models learn sample-wise adaptation, which automatically finds suitable models for different data samples according to their graph structure properties. Compared with the original GCN, DiffPool, and gPool, the corresponding Customized models achieve better performance on most of the datasets. This demonstrates that the sample-wise adaptation performed by the Customized-GNN framework can boost the performance of GNN frameworks.

Adaptability Study. To further show the adaptability of the proposed framework to new graphs with different structures, we order graphs according to their node sizes in non-decreasing order. Then, we use the first 80% of the data as a training set and the remaining 20% as a test set. The purpose of this setting is to simulate the case where the structures of graphs in the test set are different from those in the training set. We only show the results on the ENZ and RE-BI datasets in Table 3.4, since observations on other datasets are consistent. We note that (1) GCN, DiffPool, and gPool cannot work properly in this setting; and (2) the customized frameworks perform much better under this setting. These results demonstrate the ability of the learned Customized-GNNs to adapt GNNs to graphs with new properties.

Table 3.4: Adaptability study (note that Cust-X denotes Customized-X). Accuracy (%).
           Cust-GCN   GCN    Cust-DiffPool   DiffPool   Cust-gPool   gPool
  ENZ        22.2     20.5       25.6          22.2        35.0       24.8
  RE-BI      70.2     50.4       78.6          52.7        80.0       59.9

3.4.3 Ablation Study
In this subsection, we investigate the effectiveness of different components of the adaptor operator in Eq. (3.11) used in our model. Specifically, we want to investigate whether γ_f and β_f play important roles in the adaptor operator by defining the following variants of Customized-GCN. Customized-GCN_γ: a variant with only the element-wise multiplication operation, where instead of Eq. (3.11) the adaptation process is expressed as W ◇ φ_f = W ⊙ br(γ_f, d_new); and Customized-GCN_β: a variant with only the element-wise addition operation, where instead of Eq. (3.11) the adaptation process is W ◇ φ_f = W + br(β_f, d_new). Here br(·, d_new) denotes the broadcasting operation used in Eq. (3.11), which expands the adaptor parameters to match the dimension of W. Following the previous experimental setting, we compared Customized-GCN with its variants. The results are presented in Table 3.5. We observe that both Customized-GCN_γ and Customized-GCN_β can outperform the original GCN model. It indicates that both the γ term and the β term are effective for the adaptation, and utilizing either one of them can already adapt the original model in a reasonable manner. We also note that the Customized-GCN model outperforms both Customized-GCN_γ and Customized-GCN_β on most of the datasets. It demonstrates that the adaptation effects of the γ and β terms are complementary to each other, and combining them can further enhance the performance.

Table 3.5: Ablation study. Accuracy (%).
                       COLLAB   ENZ    PROT   DD     RE-BI   RE-5K   NCI109
  GCN                   69.9    51.8   76.6   77.2   81.9    50.4    75.7
  Customized-GCN_γ      70.8    52.3   77.6   78.1   85.2    51.7    76.0
  Customized-GCN_β      71.2    54.0   77.9   78.0   88.8    51.9    77.1
  Customized-GCN        73.2    55.9   77.9   79.3   90.4    52.9    77.1

3.4.4 Case Study
(a) Embeddings with node size. (b) Embeddings with edge size. (c) Embeddings with density. (d) Sample demonstration. Figure 3.4: Case Study.
(a) depicts the model embeddings and (b) demonstrates model embeddings for graphs that are mistakenly classified by GCN, but correctly classified by Customized GCN, and (c) and (d) illustrate the embeddings for graphs extracted by GCN and Customized-GCN, respectively. To further illustrate the effectiveness of the proposed framework, we conducted case studies on D&D. First, we visualize the distribution of embeddings of sample-specific model parameters 46 (a) GCN. (b) Customized-GCN. Figure 3.5: Embedding Visualization. for different graph samples with various node sizes, edge sizes, and densities. Specifically, we take the parameters of the first filtering layer of each sample-specific Customized-GCN framework and then utilize t-sne [70] to project these parameters to 3-dimensional embeddings. We visualize these 3-d embeddings in the form of scatter plot as shown in Figure 3.4(a), 3.4(b) and 3.4(c). Note that in these figures, the red triangle denotes the embedding of the parameters (i.e. W) of the original GCN model (the one before adaptation). For each point in these figures, we use color to represent the scale of values in terms of node size (or edge size, density). Specifically, a deeper red color indicates a larger value, while a deeper blue color indicates a smaller value. We make some observations from Figure 3.4(a), 3.4(b) and 3.4(c). First, the proposed Customized-GCN framework indeed generates distinct models for different graph samples that are different from the original model. Second, the points with similar colors stay closely with each other, which means that graphs with similar structural information share similar models. In addition, in Figure 3.4(d), we illustrate the sample-specific model parameters for seven samples with a different number of nodes. They are misclassified by the original GCN model but correctly classified by the proposed Customized-GCN framework. It is obvious that Customized-GCN has generated seven different GCN models for these graph samples, each of which can successfully predict the label for the corresponding sample. We further visualize the graph embeddings before the classification layer, 47 extracted by the model GCN and Customized-GCN. These embeddings from the two models are then projected to a 2-dimensional space via PCA and shown in Figure 3.5(a) and 3.5(b), respectively. We observe that the embeddings from different categories are better separated by Customized-GCN. This demonstrates that, compared to the original GCN, the proposed framework can get more distinct embeddings, and thus can achieve better classification performance. 3.5 Related Work Graph Neural Networks have recently drawn great interest due to its strong representation capacity in graph-structured data in many real-world applications. Generally, graph neural networks can be divided into two categories: the spectral approaches and the non-spectral approaches. The spectral methods aim at defining the parameterized filters based on graph spectral theory by using graph Fourier transform and graph Laplacian [10, 13, 59, 52], and the non-spectral methods aim at defining parameterized filters based on nodes’ spatial relations by aggregating information from neighboring nodes directly [27, 106]. Graph neural networks have advanced a wide variety of tasks including node classification [52, 27], link prediction [94, 134, 22] and graph classification [126, 69]. In the task of graph classification, one of the most important step is to get a good graph-level representation. 
A straight-forward way is to directly summarize the graph representation by globally combining the node representations [17]. Recently, there are some works investigating learning hierarchical graph representations by leveraging deterministic graph clustering algorithms [13, 20]. There also exist end-to-end models aiming at learning hierarchical graph representations, such as DiffPool [126]. MuchGNN [138] proposed to learn a set of graph channels at each layer to shrink the graph hierarchically. Furthermore, some methods [22, 57, 133] propose principles to select the most important : nodes to form a coarsened graph in each network layer. EigenPooling [69] is based on graph Fourier transform and is able to capture the local structural information. In [128], conditional random fields (CRF) are used to design the pooling operation. 48 3.6 Limitations In this section, we discuss the limitations of the proposed framework. The proposed customized framework could suffer from scalability issue. The model size of the adaptor networks increases with the model size of the shared GNN, and the model size of the whole proposed framework would be rather large when the GNN model is large. This scalability issue has not become an obstacle for most existing GNN models, since the depth of these GNN models is often small currently. However, with the development of advanced research on GNNs, more and more complex GNN models have been designed, such as the GNN-based transformers, which makes the scalability issue of the customized GNN increasingly important. 3.7 Conclusion In this paper, we propose a general graph neural network framework, Customized-GNN, to deal with graphs that have various graph structure properties. Comprehensive experiments demonstrated that the Customized-GNN framework can effectively adapt both flat and hierarchical GNNs to enhance their performance. Future research directions include better modeling the adaptor networks, considering more complex properties, and adapting more existing graph neural networks models. 49 CHAPTER 4 LOCALIZED GRAPH COLLABORATIVE FILTERING User-item interactions in recommendations can be naturally denoted as a user-item bipartite graph. Given the success of graph neural networks (GNNs) in graph representation learning, GNN-based Collaborative Filtering (CF) methods have been proposed to advance recommender systems. These methods often make recommendations based on the learned user and item embeddings. However, we found that they do not perform well with sparse user-item graphs which are quite common in real-world recommendations. Therefore, in this chapter, we introduce a novel perspective to build GNN-based CF methods for recommendations which leads to the proposed framework of Localized Graph Collaborative Filtering (LGCF). One key advantage of LGCF is that it does not need to learn embeddings for each user and item, which is challenging in sparse scenarios. Alternatively, LGCF aims at encoding useful CF information into a localized graph and making recommendations based on such graph. Extensive experiments on various datasets validate the effectiveness of LGCF, especially in sparse scenarios. Furthermore, empirical results demonstrate that LGCF provides complementary information to the embedding-based CF model which can be utilized to boost recommendation performance. 4.1 Introduction The rapid development of information technology has facilitated an explosion of information, and it has accentuated the challenge of information overload. 
Recommender systems aim to mitigate the information overload problem by suggesting a small set of items for users to meet their personalized interests. They have been widely adopted by various online services such as e-commerce and social media [78]. The key to building a personalized recommender system lies in modeling users’ preference on items based on their historical interactions (e.g., ratings and clicks), known as collaborative filtering (CF) [98]. In recommendation systems, the user-item interactions can naturally form a bipartite graph and GNNs have shown great potential to further improve the CF performance [110, 35]. GNNs are capable of capturing the CF signals encoded in 50 the graph topology. Specifically, high-order user-item interactions can be explicitly captured by the stacked GNN layers, which contributes to learning expressive representations. For example, Neural Graph Collaborative Filtering (NGCF) [110] stacks several embedding propagation layers to explicitly encode the high-order connectivity in the user-item graph into the embedding process. LightGCN [35] simplifies the design for GNNs for recommendation by only including neighborhood aggregation while eliminating other operations such as feature transformation. The key advantage of GNN-based CF methods is to learn both the embeddings and a GNN model by explicitly capturing the structural information in the user-item interaction graph. However, we empirically found that the improvement may no longer exist when the bipartite graph is very sparse. In Figure 4.1, we illustrate how the sparsity of the user-item bipartite graph affects the performance of representative CF models including MF, NGCF, and LightGCN. Specifically, we adjust the sparsity of the user-item graphs from two different commodity categories in the Tianchi dataset [2] by randomly removing edges from the original graph while not disrupting the graph connectivity. We use G 2 {1, 2, 3, 4, 5} (i.e., the x-axis) to denote the sparsity of a bipartite graph where the larger G is, the sparser the graph is. As shown in Figure 4.1, the performance of all the methods drops when the graph sparsity increases, in which GNN-based methods (NGCF and LightGCN) yield more dramatic performance reduction compared with MF. The inferior performance of GNN-based methods under the sparse setting could be attributed to the difficulty to learn high-quality user/item embeddings from limited user-item interactions. However, the user-item interaction graphs are often sparse in real-world recommendation scenarios [3]. Thus, in this paper, we study how to alleviate the challenge of sparsity while enjoying the merits of high-order topological connectivity. Note that in this work, we focus on addressing the recommendation tasks solely based on the historical interactions following the previous works [110, 35]. Our motivation lies in learning CF signals from local structures of the user-item interaction graph, and the proposed model is called Localized Graph Collaborative Filtering (LGCF). Particularly, given a user and an item, LGCF makes predictions about their interaction based on their local structural context in the bipartite graph, rather than the user and item embeddings. Further comparisons and discussions between the traditional GNN-based 51 (a) Tianchi-2798696. (b) Tianchi-4527720. Figure 4.1: The Performance of MF, NGCF, and LightGCN vs. the Graph Sparsity on Two Tianchi datasets. CF methods and the proposed LGCF can be found in Appendix C.1. 
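The sparsification protocol just described — dropping observed interactions while keeping the bipartite graph connected — can be sketched as follows. The networkx-based helper and the number of removed edges are illustrative assumptions, not the exact procedure used to produce the five sparsity levels in Figure 4.1.

```python
import random
import networkx as nx

def sparsify_keep_connected(G: nx.Graph, n_remove: int, seed: int = 0) -> nx.Graph:
    """Randomly drop up to `n_remove` edges from a user-item graph without
    disconnecting it, mimicking the sparsity protocol described above."""
    rng = random.Random(seed)
    H = G.copy()
    edges = list(H.edges())
    rng.shuffle(edges)
    removed = 0
    for u, v in edges:
        if removed == n_remove:
            break
        H.remove_edge(u, v)
        if nx.is_connected(H):
            removed += 1
        else:                      # removing this edge would disconnect the graph
            H.add_edge(u, v)
    return H

# Tiny bipartite example: users u0-u2, items i0-i3.
G = nx.Graph([("u0", "i0"), ("u0", "i1"), ("u1", "i1"), ("u1", "i2"),
              ("u2", "i2"), ("u2", "i3"), ("u0", "i2")])
H = sparsify_keep_connected(G, n_remove=2)
print(G.number_of_edges(), "->", H.number_of_edges())
```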
To achieve the goal of LGCF, we face two main challenges: (1) how to construct the localized graph given two target nodes, and (2) how to capture the important CF information from the localized graph. To solve these two challenges, LGCF first samples a set of nodes for the two target nodes from their neighborhoods and then generates a localized graph with this node set and their edges. Next, LGCF adopts a GNN model to extract a graph representation for each localized graph. To better capture the graph topological information, LGCF generates a label for each node according to its topological importance in the localized graph and takes the set of node labels as input attributes. Finally, LGCF makes predictions about the target user and item based on their corresponding localized graph representation. The contributions of this paper are summarized as follows: (1) We study a new perspective to build GNN-based CF models for recommendations. Specifically, we aim at learning the recommendation-related patterns in the localized graphs induced from the input bipartite graph, instead of learning user and item embeddings; (2) We propose Localized Graph Collaborative Filtering (LGCF), which utilizes a GNN model and a minimum-distance-based node labeling function to generate a graph representation for each user-item pair. In this way, we can encode the localized graph topological information and high-order connectivity into the graph representation simultaneously; (3) We validate the effectiveness of LGCF on numerous real-world datasets. Especially, LGCF can achieve significantly better performance than representative GNN-based CF methods when user-item bipartite graphs are sparse.

Figure 4.2: An Overview of the Proposed Framework LGCF, consisting of four stages applied to the input bipartite graph: localized graph construction, positional node labelling, localized graph representation learning, and the scoring function.

4.2 Problem Statement
A CF-based recommendation dataset can be formulated as a bipartite graph G = (U, I, E), in which U = {u_1, u_2, ..., u_n} denotes a set of n users, I = {i_1, i_2, ..., i_m} is the set of m items, and E = {e_0, e_1, ..., e_l} (e.g., e_j = (u_{e_j}, i_{e_j})) indicates the edge set describing the historical interactions between users and items. Following previous works [110, 35], we assume that users and items have no content information. Given a candidate pair (u_j, i_k) consisting of a target user u_j and a potential item i_k, we need to calculate a preference score s_{j,k} to indicate how likely this potential item should be recommended to the target user. To solve this problem, we aim at learning a desirable score mapping function score(·), which generates a preference score s_{j,k} = score(u_j, i_k | E) to make recommendations based on the historical interactions E.

4.3 The Proposed Framework
In this section, we first give an overall description of the proposed framework, then detail its key modules, next introduce the optimization objective, and finally discuss integration strategies with existing GNN-based CF methods. Note that compared to existing GNN-based recommendation methods, the model size of LGCF is much smaller and is independent of the number of users and items. More details about the model size analysis can be found in Appendix C.2. The goal of LGCF is to learn CF-related knowledge for each user-item pair from its local structures in an interaction bipartite graph.
To achieve this goal, it provides solutions to tackle the two aforementioned obstacles – a local structure extraction method to construct the localized graph for the user-item pair, and a powerful model to capture the CF-related patterns from the localized graph. An overview of LGCF is illustrated in Figure 4.2. Specifically, given the input bipartite graph, LGCF first constructs the localized graphs centered on the target user and the target item. To preserve edges with rich CF information in the localized graph, we develop a localized graph construction method to efficiently generate the localized graphs. After that, LGCF labels the nodes within the localized graphs according to their topological positions relative to the target nodes. Then a GNN-based graph representation learning module embeds the high-order connectivity along with the node positional annotations simultaneously. Through this process, LGCF eliminates the traditional embedding-based strategy and effectively captures the critical CF-related patterns. Finally, the scoring function module calculates the preference score based on the learned graph representation. To sum up, there are three key modules in this framework: (1) the localized graph construction module, which extracts a localized graph covering most edges related to a given user-item pair; (2) the localized graph representation learning module, which learns recommendation-related representations from the localized graph; and (3) the scoring function module, which computes a preference score based on the learned graph representation. Next, we will detail each module.

4.3.1 Localized Graph Construction
The localized graph construction module is designed to extract the localized graph covering the most important edges (i.e., collaborative filtering information) for a given user-item pair. We need to extract a localized graph for each user-item pair in both the training process and the inference process; thus we aim to include the most representative edges for each pair while taking scalability into consideration. To achieve these goals, we propose a localized graph construction method as shown in Figure 4.3, which can be divided into three steps. (1) Step 1 – random walk: we first take advantage of a random walk with restart (RWR) [104] method to sample neighboring nodes for the user node and the item node. By using RWR, we can sample neighboring nodes that are different hops away from the center node. As a consequence, we can get a representative localized graph that captures abundant context information. Meanwhile, RWR is a sampling method in essence, so it naturally improves the scalability of the proposed LGCF.

Figure 4.3: An Illustration of Localized Graph Extraction (random walk, trace union, and graph extraction for a target pair (u_1, i_5)).

Specifically, given a user-item bipartite graph G and a user-item pair (u, i), we perform a random walk on G from the user node u and the item node i, respectively. Starting from the original node, each walk travels to one of the connected nodes iteratively with a uniform probability. In addition, there is a pre-defined probability for each walk to return to the starting node at each step. In this way, more neighboring nodes can be included in the walk. Via RWR, we can get two traces t_u and t_i for the user node u and the item node i, respectively. Each trace can then be denoted as a node subset, i.e., V_u and V_i.
(2) Step 2 – trace union: next, we merge V_u and V_i into a union node subset V_ui. (3) Step 3 – graph extraction: with this union node subset V_ui, we can extract a localized graph from the original user-item graph. Specifically, all the nodes in V_ui and all the edges among these nodes are extracted from the original graph to build a localized graph, which is denoted as SG_ui. To conclude, SG_ui consists of neighboring nodes that are different hops away from u and i and thus preserves important collaborative filtering information for the given user-item pair (u, i); a code sketch of this three-step construction is given below.

4.3.2 Localized Graph Representation Learning
This module aims at learning an overall representation of each user-item pair based on its extracted localized graph. We expect to encode the CF information and also the high-order connectivity related to a user-item pair into this representation. Therefore, we propose to use a graph neural network model, since it can naturally capture the structural information and explicitly encode the high-order connectivity into the representation. Also, to further encode the topological information, we adopt a node labeling method to generate a positional label for each node based on its distance towards the user-item pair [129], where the node labels are considered as the input graph node attributes. Next, we will introduce the node labeling method and the adopted GNN model.

4.3.2.1 The Node Labeling Method
Following the common setting of previous works [35, 110], we do not assume that node attributes are available for LGCF. Instead, we generate a label for each node to indicate its unique role in the localized graph. Specifically, there are three expectations for the generated labels. First, it is important to distinguish the target user and item from other nodes through the generated node labels, as the model should be aware of the target user-item pair. Second, contextual nodes, excluding the target user and item, play different roles in the recommendation prediction considering their different relative positions towards the target user and item; thus, we need to distinguish these nodes via the generated node labels based on their positions. Finally, for the consistency of node labels, the closer a node is to both the target user and target item, the more similar its label should be to that of the target ones. Therefore, we propose to use a node labeling method based on each node's distance to the user-item pair via Double-Radius Node Labeling (DRNL) [129] to generate a label for each node, and we take the generated node labels as the input node attributes for the GNN model. The key motivations of DRNL are to distinguish the target user node and item node from other nodes while preserving their relative positional relations. Specifically, we first assign label 1 to the target user node and the target item node to distinguish them from other nodes. Next, we assign labels to other nodes based on their minimum distances toward the two target nodes on the extracted localized graph. Intuitively, the closer a user or item node is to the target nodes, the more similar its label should be to that of the target nodes. For a specific node x on the graph, we evaluate its distance toward the target user and the target item by summing up its minimum distances to these two nodes. Since we set the label of the target user and the target item as 1, we will also assign these nearby nodes labels with small values.
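A compact sketch of the three-step construction above is given here, assuming a networkx graph; the walk length and restart probability are illustrative defaults rather than the values used in the experiments.

```python
import random
import networkx as nx

def rwr_nodes(G: nx.Graph, start, walk_len: int = 20,
              restart_p: float = 0.15, seed: int = 0) -> set:
    """Step 1: random walk with restart from `start`; return the visited node set."""
    rng = random.Random(seed)
    visited, cur = {start}, start
    for _ in range(walk_len):
        if rng.random() < restart_p:
            cur = start                      # restart at the starting node
        else:
            nbrs = list(G.neighbors(cur))
            if not nbrs:
                break
            cur = rng.choice(nbrs)
        visited.add(cur)
    return visited

def localized_graph(G: nx.Graph, u, i) -> nx.Graph:
    """Steps 1-3: sample V_u and V_i by RWR, take their union V_ui, and return
    the induced subgraph SG_ui used by LGCF for the pair (u, i)."""
    V_u = rwr_nodes(G, u, seed=0)
    V_i = rwr_nodes(G, i, seed=1)
    V_ui = V_u | V_i                         # step 2: trace union
    return G.subgraph(V_ui).copy()           # step 3: graph extraction

# Toy user-item bipartite graph.
G = nx.Graph([("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u2", "i3"),
              ("u3", "i3"), ("u3", "i5"), ("u1", "i5")])
SG = localized_graph(G, "u1", "i5")
print(sorted(SG.nodes()), SG.number_of_edges())
```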
Coming back to the labeling scheme: given two nodes x and y, if the distance between x and the target nodes is smaller than that of y, the label value of x is supposed to be smaller than that of y. If the distances are the same, the node with a smaller minimum distance to the target user or the target item should have a label with a smaller value. DRNL adopts a hashing function f_l(·) satisfying the criteria described above to compute node labels. The node labeling function f_l(x) for each node x is summarized as follows:

f_l(x) = 1,                               if x = u or x = i,
f_l(x) = 1 + min(d_u, d_i) + (d/2)^2,     otherwise,                    (4.1)

where u and i denote the target user node and item node, respectively, d_u (d_i) represents the minimum distance between nodes x and u (i), and d = d_u + d_i is the sum of the minimum distances, which should be odd due to the nature of bipartite graphs.

4.3.2.2 The GNN Model
For each user-item pair (u, i), we have extracted its localized graph SG_ui, which can be denoted as SG_ui = {A_ui, X_ui}, where A_ui is the adjacency matrix of SG_ui and X_ui is the input node attributes generated by the above node labeling method. It is natural to take advantage of a GNN model to learn an overall representation for each pair based on its localized graph SG_ui, since GNNs have shown great potential in capturing complex patterns on graphs [120]. There are typically two key operations in a GNN model – the graph filtering operation to refine node representations and the graph pooling operation to abstract the graph-level representation from the node representations. In this work, we adopt graph convolutional filtering [52] as the graph filtering operation. Its process can be described as follows:

X^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) X^l W),    (4.2)

where X^l denotes the node representations in the l-th GNN layer, W is the transformation matrix to be learned, Ã = A_ui + I represents the adjacency matrix with self-loops, D̃ is the diagonal degree matrix of Ã with D̃_jj = Σ_k Ã_jk, and σ(·) is the activation function. The node labels are set to be the input node representations X^0. The graph pooling operation adopted by LGCF is sum pooling, which sums up the representations of all nodes to obtain the graph-level representation. Its process can be denoted as follows:
It takes the relative order of the positive node pairs and negative node pairs where the positive node pairs are existent user-item node pairs observed in graphs, while the negative node pairs are 58 non-existent user-item node pairs generated by negative sampling. Specifically, BPR assumes that the prediction scores of the positive node pairs should be larger than these of the corresponding negative ones. The objective function of our model is as follows: ’ min ln f(BD8 BD8 0 ), (4.6) ⇥⌧ # # ,w 0 (D,8,8 )2O 0 0 where O = {(D, 8, 8 )|(D, 8) 2 ⇢, (D, 8 ) 2 ⇢ } denotes the training data for the graph ⌧, ⇢ is the set of the existent interactions between users and items in ⌧ and ⇢ indicates the set of the non-existent interactions. In this work, we adopt ADAM [50] to optimize the objective. 4.3.5 Integration with Embedding-based CF Methods LGCF aims at capturing the collaborative filtering information from a localized perspective by encoding the localized CF information into a subgraph-level representation, while embedding-based GNN methods such as NGCF and LightGCN are to learn a set of user and item embeddings in the latent space based on all the user-item interactions. Therefore, LGCF provides a new perspective to building GNN-based recommendations. As a consequence, LGCF is likely to provide complementary information to existing GNN-based methods and integrating it with existing methods has the potential to boost the recommendation performance. In this subsection, we discuss strategies to combine LGCF with existing GNN-based methods. In particular, we propose two strategies, i.e., LGCF-emb and LGCF-ens. Before we detail these strategies, we first denote the refined user D and item 8 embedding generated by embedding-based GNN methods as hD and h8 , and the representation outputted by LGCG for a user-item pair (D, 8) as xD8 . In this work, we focus on how to integrate LGCF with LightGCN given the competitive performance of LightGCN. 4.3.5.1 LGCF-emb In LGCF-emb, we propose to directly concatenate the embeddings generated by LGCF and LightGCN, and then the scoring function takes the concatenated embedding as input to generate a score. Finally, the LGCF model and the LightGCN model are jointly trained by optimizing the objective function in Eq. 4.6. The concatenation process is summarized as follows: hD8 = (hD ⇤ h8 )||xD8 , (4.7) 59 where ⇤ denotes element-wise multiplication and || represents the concatenation operation. 4.3.5.2 LGCF-ens In LGCF-ens, we propose to combine LGCF and LightGCN in an ensemble way. Specifically, we conduct a weighted combination of the scores generated by LGCF and LightGCN. In addition, instead of joint training, we train LGCF and LightGCN, separately. The final score for a user-item pair (D, 8) can be computed as follows: BD8 = B2>A4(xD8 ) + _ · (hDT h8 ), (4.8) where B2>A4() denotes the scoring function in Eq 4.5 and _ can be either a predefined hyper-parameter or learnable parameter. 4.4 Experiment In this section, we conduct extensive experiments on various real-world datasets to validate the effectiveness of the proposed LGCF. Via experiments, we aim to answer the following questions: Q1: How does LGCF perform compared with the state-of-the-art CF models on the sparse setting? Q2: How does LGCF perform compared with the state-of-the-art CF models on the normal setting? Q3: Can LGCF be complementary to the embedding-based GNN CF models? Q4: How does the sparsity of data affect the performance of LGCF and other CF models? 
4.4 Experiment

In this section, we conduct extensive experiments on various real-world datasets to validate the effectiveness of the proposed LGCF. Via experiments, we aim to answer the following questions: Q1: How does LGCF perform compared with the state-of-the-art CF models in the sparse setting? Q2: How does LGCF perform compared with the state-of-the-art CF models in the normal setting? Q3: Can LGCF be complementary to the embedding-based GNN CF models? Q4: How does the sparsity of data affect the performance of LGCF and other CF models? In the following subsections, we first introduce the experimental settings, then design experiments to answer the above questions, and finally probe further into the workings of LGCF.

4.4.1 Experimental Settings

In our experiments, we have tested the proposed framework on twelve datasets, which come from three different real-world recommendation scenarios: Tianchi [2], Amazon [71], and MovieLens [30]. Specifically, we construct seven datasets corresponding to seven item categories from Tianchi and denote each of them as Tianchi-C, where C indicates the item category, two datasets from two item categories from Amazon, and two datasets from two item categories from MovieLens.

Table 4.1: Densities of the Sparse User-item Bipartite Graphs. Here "Density" is computed specifically for the interaction graph as Density = #Edges / (#Users × #Items).

Thus, we can denote the parameters for a GNN model as Θ_GNN = {W_conv^1, ..., W_conv^L}. With the above definitions, the goal of the pre-training task is to train a model from graphs {G_1, G_2, ..., G_N} that can help build a GNN model G(·; Θ_GNN^t) for the graph G_t from the downstream recommendation task.

5.2.2 An Overall Design of ADAPT

An overview of ADAPT is demonstrated in the top subfigure of Figure 5.1. It consists of two key components – a meta-LGNN model and a GNN adaptor. One key challenge of pre-training GNNs for recommendations is that user-item interaction graphs in different recommendation tasks have distinct user/item sets; as a result, it is hard to directly transfer user/item embeddings across different recommendation graphs in the way pre-trained word embeddings are transferred in NLP. To solve this challenge, we propose to use the meta-LGNN model based on a new collaborative filtering method [112]. It provides a new perspective for GNN-based recommendations, where the recommendation prediction is made based on the local structure surrounding a user and an item, rather than their embeddings, via a GNN model. Therefore, it does not need to learn embeddings of users and items, which is naturally more flexible for pre-training, compared to most existing GNN-based recommendation methods.

Figure 5.1: The top subfigure shows an overview of the proposed framework ADAPT. There are two key components in ADAPT – a meta-LGNN model and a GNN adaptor. The GNN adaptor is designed to adapt the meta-LGNN model to a customized GNN model for the interaction graph of a given recommendation task by considering its distinct properties. With these two components, ADAPT first trains a GNN adaptor and a meta-LGNN model simultaneously in the pre-training phase. Then, given the target graph from the downstream recommendation task in the fine-tuning phase, we generate a customized GNN model based on the properties of the target graph via the GNN adaptor and the meta-LGNN model. The bottom subfigure illustrates the adaptation process. Given a specific graph G_m, the GNN adaptor takes its graph property vector p_m as input and outputs customized adapting parameters Φ_m for G_m. Suppose that the meta-LGNN model consists of L GNN layers; we denote its parameters as Θ_meta-LGNN = {W_conv^1, ..., W_conv^L}, where W_conv^l ∈ R^{d_{l-1}×d_l} and d_l denotes the dimension of the node embeddings output by the l-th GNN layer. The GNN adaptor generates L adapting parameters corresponding to these GNN layers for G_m, denoted as Φ_m = {φ_m^1, ..., φ_m^L}, where φ_m^l ∈ R^{2 d_{l-1} d_l}. For the graph convolution in the l-th GNN layer of the customized GNN model for graph G_m, its parameters W_conv^{l,m} are obtained by applying φ_m^l to W_conv^l. Note that the adaptation processes in pre-training and fine-tuning are the same.
The success behind pre-training is that there exists common knowledge that can be transferred from the pre-training data to the downstream task. On the one hand, there are similar patterns of user-item local structures in different recommendation graphs, which are captured by the proposed meta-LGNN model. On the other hand, there could also exist different patterns among these graphs, since different graphs can present distinct properties. Therefore, ADAPT provides the GNN adaptor, which can adapt the meta-LGNN model to a customized GNN model for the interaction graph of a given recommendation task by considering its distinct properties. With these two components, ADAPT first trains a GNN adaptor and a meta-LGNN model simultaneously in the pre-training phase. Then, given the target graph from the downstream recommendation task in the fine-tuning phase, we generate a customized GNN model based on the properties of the target graph via the GNN adaptor and the meta-LGNN model. In the following subsections, we first introduce the proposed GNN recommendation method for localized collaborative filtering in ADAPT. Next, we describe the pre-training process and the fine-tuning process of ADAPT in detail.

5.2.3 GNN-based Recommendation Method for Localized Collaborative Filtering

Graph neural network (GNN) models have been demonstrated to be effective and superior in facilitating recommendation tasks [123, 110, 35]. Most existing GNN-based recommendation methods aim at learning a set of user and item embeddings via a GNN model. Given that user-item interaction graphs from different recommendation tasks have distinct user/item sets, these GNN-based recommendation methods are impractical to pre-train, since user/item embeddings cannot be transferred across different graphs. To take advantage of the power of both GNN methods and pre-training techniques, we propose to adopt a new GNN-based localized collaborative filtering method (i.e., meta-LGNN) to build the basic recommendation model in our pre-training framework; it does not require learning user/item embeddings and makes predictions based on local structures extracted from the interaction graphs. Next, we first briefly introduce the most common existing GNN-based recommendation approach, and then illustrate meta-LGNN. We use H = {h_{u_1}, ..., h_{u_{|U|}}, h_{i_1}, ..., h_{i_{|I|}}} to denote the user and item embeddings the model aims to learn. Θ_GNN = {W_conv^1, ..., W_conv^L} represents the parameters to be learned for an L-layer GNN model, where W_conv^l denotes the parameters for the graph convolution in layer l.
Generally, for a user u in a given user-item bipartite graph, the graph convolution first aggregates item embeddings from the user's neighborhood and then updates the target user embedding based on its original embedding and the aggregated embeddings.

Figure 5.2: The proposed GNN recommendation method based on localized collaborative filtering. Given a target user u and a target item i, we first extract a local graph at these two target nodes from the user-item interaction graph, of which the structure is denoted as A_sub(u,i). Next, we generate a positional attribute for each node based on its minimum distances towards the target nodes, and these node attributes are denoted as X_sub(u,i). Then, we can leverage a GNN model to get the graph representation for this attributed graph. Finally, we can compute a recommendation score s_ui based on the local graph representation via a scoring function to make the final prediction for the given user and item pair.

The update process of item embeddings works similarly to that for a user node. For illustration, we formalize the user node embedding update process via the graph convolution in the l-th layer of the GNN as follows:

\[ \mathbf{h}_{u}^{l} = f_{conv}\big(\{\mathbf{h}_{p}^{l-1} : p \in \mathcal{N}(u)\}, \mathbf{h}_{u}^{l-1}; \mathbf{W}_{conv}^{l}\big), \tag{5.1} \]

where f_conv represents the graph convolution operation and N(u) denotes the one-hop neighborhood of user u, i.e., the items p that u has interacted with. Note that h_u^0 = h_u and h_i^0 = h_i. The refined user embedding h_u^L and item embedding h_i^L output by the final layer of the GNN model are used to represent each user and item in the bipartite graph, and typically the inner product between the user embedding h_u^L and the item embedding h_i^L is used to make the recommendation prediction. Overall, these GNN-based recommendation methods aim at learning the user/item embeddings H and the GNN model parameters Θ_GNN simultaneously. Since interaction graphs from different recommendation tasks have distinct sets of users and items, it is clearly not feasible to transfer the learned embeddings, i.e., H.

In this paper, in order to ease pre-training for GNN-based recommendations, we build the meta-LGNN model based on a new collaborative filtering method [112]. The intuition is that the key collaborative filtering information for recommendation is encoded in the historical interactions. Given a specific user and item pair, their key collaborative information is also encoded in their historical interactions. Thus it is reasonable to make a recommendation prediction based on the local structure consisting of their historical interactions. The framework of meta-LGNN is shown in Figure 5.2. For a target user and item pair, we first extract a local graph at these two target nodes from the user-item interaction graph. Next, we generate a positional attribute for each node based on its minimum distances toward the target nodes. Then, we can leverage a GNN model to get the graph representation for this attributed graph. Finally, we can compute a recommendation score based on the local graph representation via a scoring function to make the final prediction for the given user and item pair. Given a target user u and a target item i in a user-item interaction graph G, we perform two random walks [80] starting from these two nodes on G, separately.
Note that the random walk strategy we adopt in our work uses a restarting mechanism, and thus this process can be regarded as a neighbor sampling process, from which we obtain two sampled neighboring node sets N_u and N_i for u and i, respectively. Next, we merge these two node sets to get an overall node set N_(u,i) = N_u ∪ N_i, and finally we obtain a local graph G_sub(u,i) induced by the set N_(u,i) to reveal the local structure around the given pair (u, i). Note that the reason why we only sample some neighboring nodes is to ensure the scalability of our model on large graphs. Apart from the graph structure A_sub(u,i), we also need its node attributes X_sub(u,i), so that we can use a GNN model G(·; Θ) to get a meaningful representation for G_sub(u,i). For model transferability, we propose to use Double-Radius Node Labeling (DRNL) [129] to generate a positional attribute for each user or item in a graph, rather than using the existing node attributes, which are very likely to be missing or to differ across recommendation tasks. The setting without node attributes is also commonly adopted by previous related works [53, 110, 35], and it is straightforward for the proposed ADAPT framework to incorporate node attributes if they are available. DRNL labels each node based on its minimum distances towards the target user u and the target item i on the local graph G_sub(u,i). Given a user/item node t, its positional attribute x_t can be calculated as:

\[ x_t = 1 + \min(d_u, d_i) + (d/2)^2, \tag{5.2} \]

where d_u denotes the minimum distance between u and t, d_i denotes the minimum distance between i and t, and d = d_u + d_i is the sum of the two minimum distances. Note that we set x_u = 1 and x_i = 1, so that we can distinguish the target user and item from the other nodes. Now that we have the graph structure and the node positional attributes of G_sub(u,i), we can use a GNN model G(·; Θ) to encode it into a meaningful representation. The node embedding update process in each GNN layer also follows Eq. (5.1). We use H^L_sub(u,i) ∈ R^{n×d} to denote the user-item embedding matrix output by the final GNN layer, where L is the number of GNN layers, n is the number of users and items in G_sub(u,i), and d denotes the embedding dimension. Then, the GNN model leverages a pooling operation to get the graph representation for G_sub(u,i) as follows:

\[ \mathbf{h}_{(u,i)} = \mathrm{pool}\big(\mathbf{A}_{sub(u,i)}, \mathbf{H}^{L}_{sub(u,i)}\big). \tag{5.3} \]

Note that the positional attributes are used as the input node features, i.e., H^0_sub(u,i) = X_sub(u,i), which means that there is no need to learn the user/item embeddings. Finally, a scoring function is applied to compute a score for the user-item pair (u, i) to make a prediction. Overall, meta-LGNN aims at learning only the GNN model parameters Θ_GNN while eliminating the user/item embeddings H.
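The local-graph construction described above can be sketched in a few lines with NetworkX. The function names, the walk length, the restart probability, and the handling of nodes unreachable from a target are illustrative assumptions rather than the exact choices used in this chapter.

```python
import random
import networkx as nx

def extract_local_graph(G, u, i, walk_len=20, restart_p=0.5):
    """Sample a local graph around a target (user, item) pair with two
    random walks with restart, as described in Section 5.2.3."""
    def walk(start):
        visited, cur = {start}, start
        for _ in range(walk_len):
            nbrs = list(G.neighbors(cur))
            if random.random() < restart_p or not nbrs:
                cur = start                      # restart at the target node
            else:
                cur = random.choice(nbrs)
            visited.add(cur)
        return visited

    nodes = walk(u) | walk(i)                    # N_(u,i) = N_u ∪ N_i
    return G.subgraph(nodes).copy()

def drnl_attributes(G_sub, u, i):
    """Positional attributes of Eq. (5.2): x_t = 1 + min(d_u, d_i) + (d/2)^2,
    with the two target nodes labeled 1."""
    du = nx.single_source_shortest_path_length(G_sub, u)
    di = nx.single_source_shortest_path_length(G_sub, i)
    attrs = {}
    for t in G_sub.nodes:
        if t in (u, i):
            attrs[t] = 1.0
        elif t not in du or t not in di:
            attrs[t] = 0.0                       # simplifying choice for unreachable nodes
        else:
            d = du[t] + di[t]
            attrs[t] = 1 + min(du[t], di[t]) + (d / 2) ** 2
    return attrs
```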
5.2.4 The ADAPT Pre-training

In this subsection, we introduce the overall pre-training process of the proposed ADAPT. In the pre-training phase, one key challenge is how to learn the common transferable knowledge across multiple interaction graphs while simultaneously capturing their differences. To solve this challenge, we equip ADAPT with a GNN adaptor. The basic idea of model adaptation is to adapt the meta-LGNN model to a customized GNN model specifically for each recommendation graph based on its graph properties. In other words, we aim at customizing similar GNN models for recommendation graphs sharing similar graph properties. One key assumption underlying the designed model is that the graph properties of a recommendation graph are related to its collaborative filtering patterns. Thus, it is important to choose appropriate graph properties that can indicate collaborative filtering information. Intuitively, recommendation graphs holding similar structural properties are more likely to have analogous recommendation patterns. For example, large online-shopping platforms may share more similar collaborative filtering patterns with each other than with small online-shopping platforms, and recommendation domains where purchasing behaviors are frequent may have different recommendation patterns from domains where users are inactive. More details about the design of graph properties can be found in Section 5.2.4.1. In the pre-training process, both the GNN adaptor and the meta-LGNN model are optimized simultaneously. Next, we detail the GNN adaptor component, the adaptation process, and the pre-training process.

5.2.4.1 The GNN Adaptor

The goal of the GNN adaptor is to generate customized adapting parameters to adapt the meta-LGNN model to a given graph. Given a specific graph G_m, the GNN adaptor takes its graph property vector p_m as input and outputs customized adapting parameters Φ_m for G_m. It is expected that the chosen graph properties are able to indicate important collaborative filtering information. To achieve this goal, we carefully select several graph structural properties. Intuitively, the collaborative filtering pattern is likely to be related to the recommendation graph size, i.e., the number of users and items in the recommendation graph. The ratio between the number of users and that of items, as well as the frequency of user-item interactions, is also important for recommendation patterns. Thus, we include the number of nodes, the number of edges, the user-item ratio, and the graph density in the graph properties. In addition, we include the degree assortativity coefficient, which may indicate whether active users prefer popular items or niche items in a specific recommendation domain. Likewise, the Robins-Alexander clustering coefficient, the number of connected components, and the global efficiency are included to indicate the interaction pattern of a recommendation graph. To conclude, to fully exploit the interaction pattern in a recommendation graph, we utilize eight normalized graph structural properties, namely the number of nodes, the number of edges, the user-item ratio, the graph density, the degree assortativity coefficient, the Robins-Alexander clustering coefficient, the number of connected components, and the global efficiency, as the graph property input of the GNN adaptor. Note that there are also other alternatives that could serve as the input of the GNN adaptor, such as graph representations extracted by graph embedding methods or unsupervised graph learning methods, which can also capture important graph information. However, these alternatives may introduce additional model complexity and computational overhead. Thus, we prefer to use the eight graph properties introduced above, which are intuitive and empirically effective.
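The eight structural properties listed above can all be computed with standard NetworkX routines, as in the sketch below. The function name, the user_nodes argument, and the omission of the normalization step are assumptions made for illustration.

```python
import networkx as nx
from networkx.algorithms import bipartite

def graph_property_vector(G, user_nodes):
    """Compute the eight structural properties used as the adaptor input p_m.
    Normalization (e.g., min-max across the pre-training graphs) is omitted."""
    n_users = len(user_nodes)
    n_items = G.number_of_nodes() - n_users
    return [
        G.number_of_nodes(),                              # number of nodes
        G.number_of_edges(),                              # number of edges
        n_users / max(n_items, 1),                        # user-item ratio
        G.number_of_edges() / max(n_users * n_items, 1),  # bipartite graph density
        nx.degree_assortativity_coefficient(G),           # degree assortativity
        bipartite.robins_alexander_clustering(G),         # Robins-Alexander clustering
        nx.number_connected_components(G),                # connected components
        nx.global_efficiency(G),                          # global efficiency
    ]
```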
As discussed before, suppose that the meta-LGNN model consists of L GNN layers; we denote its parameters as Θ_meta-LGNN = {W_conv^1, ..., W_conv^L}, where W_conv^l ∈ R^{d_{l-1}×d_l} and d_l denotes the dimension of the node embeddings output by the l-th GNN layer. The GNN adaptor generates L adapting parameters corresponding to these GNN layers for G_m, denoted as Φ_m = {φ_m^1, ..., φ_m^L}, where φ_m^l ∈ R^{2 d_{l-1} d_l}. The GNN adaptor can be modeled using any function of p_m. Specifically, for the graph convolution in the l-th GNN layer for G_m, the GNN adaptor generates its corresponding adapting parameters as follows:

\[ \phi_{m}^{l} = \mathrm{ada}(\mathbf{p}_{m}; \omega_{l}), \tag{5.4} \]

where ω_l denotes the parameters of the GNN adaptor function ada(·) corresponding to the graph convolution in the l-th GNN layer. In our work, we implement ada(·) as a feed-forward neural network. Overall, we can summarize the adapting parameter generation process for G_m as follows:

\[ \Phi_{m} = \mathrm{ADA}(\mathbf{p}_{m}; \Omega), \tag{5.5} \]

where ADA(·) consists of the adaptor functions for all GNN layers and Ω = {ω_1, ..., ω_L} represents the parameters of all the GNN adaptors.

5.2.4.2 The Adaptation Process

In our work, the meta-LGNN model can be an arbitrary GNN model, including GCN [52], GIN [120], and others [126, 69, 22]. The meta-LGNN model can be directly applied to a given user-item interaction graph. However, as aforementioned, there exist both similarities and differences among the user-item interaction graphs in different recommendation tasks. Thus, to preserve similarities and differences simultaneously, we do not directly apply the same meta-LGNN model to all the recommendation graphs in the pre-training process. Instead, we utilize the GNN adaptor to generate a customized GNN model for each recommendation graph based on the meta-LGNN model and its graph properties. The adaptation process is illustrated in the bottom subfigure of Figure 5.1. Specifically, for the graph convolution in the l-th GNN layer of the customized GNN model for graph G_m, its parameters are adapted as follows:

\[ \mathbf{W}_{conv}^{l,m} = \mathbf{W}_{conv}^{l} \circ \phi_{m}^{l}, \tag{5.6} \]

where ∘ denotes an adaptation operation. In the proposed model, we adopt FiLM [81] as the adaptation operation. Specifically, we split the adapting parameters φ_m^l ∈ R^{2 d_{l-1} d_l} into two parts and reshape them as γ_m^l ∈ R^{d_{l-1}×d_l} and β_m^l ∈ R^{d_{l-1}×d_l}. Then, we can formalize the adaptation of W_conv^l for G_m as follows:

\[ \mathbf{W}_{conv}^{l,m} = \mathbf{W}_{conv}^{l} \odot \gamma_{m}^{l} + \beta_{m}^{l}, \tag{5.7} \]

where ⊙ denotes the element-wise multiplication between two matrices. By applying the adaptation operation to the L GNN layers of the meta-LGNN model, we obtain the customized GNN model G(·; Θ_GNN^m) for G_m, where Θ_GNN^m = {W_conv^{1,m}, ..., W_conv^{L,m}}. The overall adaptation for G_m can be summarized as:

\[ \Theta_{GNN}^{m} = \Theta_{meta\text{-}LGNN} \circ \Phi_{m}. \tag{5.8} \]

Then, we utilize the customized GNN model to generate the graph embedding of the local graph G_sub(u,i) extracted from G_m for any user-item pair (u, i) as follows:

\[ \mathbf{h}_{(u,i)} = \mathrm{pool}\big(\mathbf{A}_{sub(u,i)}, G(\mathbf{A}_{sub(u,i)}, \mathbf{X}_{sub(u,i)}; \Theta_{GNN}^{m})\big), \quad \forall (u,i) \in E_{m}. \tag{5.9} \]
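A minimal PyTorch sketch of the FiLM-style adaptation in Eqs. (5.4) and (5.7) is given below. The chapter only states that ada(·) is a feed-forward network, so the two-layer MLP, the hidden width, the class name LayerAdaptor, and the dimensions in the usage example are assumptions.

```python
import torch
import torch.nn as nn

class LayerAdaptor(nn.Module):
    """Maps the graph property vector p_m to FiLM parameters (gamma, beta)
    that modulate the shared meta-LGNN weight W_conv^l (Eqs. 5.4 and 5.7)."""

    def __init__(self, prop_dim, d_in, d_out, hidden=64):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        self.net = nn.Sequential(
            nn.Linear(prop_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * d_in * d_out),   # phi_m^l of size 2*d_{l-1}*d_l
        )

    def forward(self, p_m, W_conv):
        phi = self.net(p_m)
        gamma, beta = phi.chunk(2, dim=-1)          # split phi_m^l into two halves
        gamma = gamma.view(self.d_in, self.d_out)   # reshape to d_{l-1} x d_l
        beta = beta.view(self.d_in, self.d_out)
        return W_conv * gamma + beta                # Eq. (5.7): FiLM modulation

# Hypothetical usage for one layer with 8 graph properties and 64-d embeddings.
adaptor = LayerAdaptor(prop_dim=8, d_in=64, d_out=64)
W_meta = torch.randn(64, 64)                        # shared W_conv^l
p_m = torch.randn(8)                                # property vector of graph G_m
W_custom = adaptor(p_m, W_meta)                     # customized W_conv^{l,m}
```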
5.2.4.3 The Pre-training Process

In the pre-training phase, suppose that there are N user-item interaction graphs {G_1, G_2, ..., G_N} from different recommendation tasks. For any graph G_m, we can generate a customized GNN model G(·; Θ_GNN^m) based on the meta-LGNN model and the graph properties p_m via the GNN adaptor. For any user-item pair (u, i) ∈ E_m, we utilize G(·; Θ_GNN^m) to compute a graph representation h_(u,i) for its corresponding local graph G_sub(u,i), and then we use a scoring function to compute a recommendation score for (u, i) based on h_(u,i):

\[ s_{(u,i)} = \mathrm{score}(\mathbf{h}_{(u,i)}; \mathbf{P}) = \sigma(\mathbf{h}_{(u,i)} \cdot \mathbf{P}), \tag{5.10} \]

where P denotes the linear transformation weights to be learned in the scoring function. We adopt the pairwise BPR loss [90] in ADAPT. The BPR loss is one of the most popular objective functions in recommendation tasks; it measures the relative order of positive node pairs and negative node pairs. The positive node pairs are user-item interactions observed in graphs, while the negative node pairs are non-existent user-item interactions generated by negative sampling. Specifically, BPR assumes that the recommendation score of a positive node pair should be higher than those of the corresponding negative ones. The objective function L_pre of our model in the pre-training phase is formulated as follows:

\[ \min_{\Omega,\,\Theta_{meta\text{-}LGNN},\,\mathbf{P}} \; -\sum_{(u,i,i') \in \mathcal{O}} \ln\sigma\big(s_{(u,i)} - s_{(u,i')}\big), \tag{5.11} \]

where O = ∪_m {(u, i, i') | (u, i) ∈ E_m, (u, i') ∈ E_m^-} denotes the pre-training data for the graph set {G_1, G_2, ..., G_N}, E_m represents the set of existent interactions between users and items in G_m, and E_m^- is the set of non-existent interactions. The overall pre-training process is summarized in Algorithm 1. Given a set of interaction graphs used for pre-training {G_1, G_2, ..., G_N}, a batch size b, and the number of interactions sampled for pre-training N_samples, we first randomly initialize the meta-LGNN model, the GNN adaptor, and the scoring function. Next, in each training epoch (lines 3 to 12), a random graph G_m is sampled from {G_1, G_2, ..., G_N}, the GNN adaptor generates its corresponding adapting parameters Φ_m based on the structural properties p_m, and the customized GNN model G(·; Θ_GNN^m) is generated specifically for G_m. Furthermore, we randomly sample b user-item interactions from the pre-training graph G_m and generate their negative counterparts, and then the meta-LGNN model, the GNN adaptor, and the scoring function are updated by minimizing the BPR loss on these samples. The training process continues until the objective function converges.

Algorithm 1 The Pre-training Process of ADAPT
Input: A set of interaction graphs from different recommendation tasks {G_1, G_2, ..., G_N}; batch size b; sample amount N_samples (note that N_samples < Σ_m |E_m|).
Output: The meta-LGNN model G(·; Θ_meta-LGNN); the GNN adaptor ADA(·; Ω); the scoring function score(·; P).
1: Initialize Θ_meta-LGNN, Ω, and P randomly;
2: while not converged do
3:   for batch = 1, ..., N_samples / b do
4:     Sample a graph G_m from {G_1, G_2, ..., G_N};
5:     Compute Φ_m = ADA(p_m; Ω);
6:     Compute Θ_GNN^m = Θ_meta-LGNN ∘ Φ_m;
7:     Sample b user-item interactions (u, i) ∈ E_m;
8:     Sample b negative interactions (u, i') ∉ E_m;
9:     Compute loss = −Σ ln σ(s_(u,i) − s_(u,i'));
10:    Compute grad = backward(loss);
11:    Update(Θ_meta-LGNN, Ω, P; grad);
12:   end for
13: end while
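For illustration, one inner iteration of Algorithm 1 (lines 4-11) could look like the sketch below. The objects graphs, meta_lgnn, adaptor, and score_fn, and the methods they expose, are hypothetical stand-ins for the components described in Sections 5.2.3-5.2.4, not an actual API from this work.

```python
import random
import torch

def pretrain_step(graphs, meta_lgnn, adaptor, score_fn, optimizer, batch_size=32):
    """One inner iteration of Algorithm 1, written as a hedged sketch.
    `optimizer` is assumed to cover the parameters of all three components."""
    G_m = random.choice(graphs)                         # line 4: sample a graph
    phi_m = adaptor(G_m.property_vector)                # line 5: adapting params
    theta_m = meta_lgnn.adapt(phi_m)                    # line 6: customized GNN

    pos = G_m.sample_interactions(batch_size)           # line 7: (u, i) in E_m
    neg = G_m.sample_negatives(pos)                     # line 8: (u, i') not in E_m

    s_pos = score_fn(theta_m.encode_local_graphs(pos))  # Eq. (5.10)
    s_neg = score_fn(theta_m.encode_local_graphs(neg))
    loss = -torch.log(torch.sigmoid(s_pos - s_neg)).sum()  # line 9: BPR loss

    optimizer.zero_grad()
    loss.backward()                                     # line 10
    optimizer.step()                                    # line 11: update Θ, Ω, P
    return loss.item()
```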
5.2.5 The ADAPT Fine-tuning

In the fine-tuning phase, we also adopt the BPR loss. However, in this phase, we aim at optimizing the parameters of the customized GNN model. The formulation is listed below:

\[ \min_{\Theta_{GNN}^{t}} \; -\sum_{(u,i,i') \in \mathcal{O}_{t}} \ln\sigma\big(\hat{y}_{(u,i)} - \hat{y}_{(u,i')}\big), \tag{5.12} \]

where ŷ_(u,i) denotes the recommendation score predicted by the customized GNN model, O_t = {(u, i, i') | (u, i) ∈ E_t, (u, i') ∈ E_t^-} denotes the training data for the target graph G_t, E_t represents the set of existent interactions between users and items in G_t, and E_t^- is the set of non-existent interactions generated manually. With the GNN adaptor and the meta-LGNN model from the pre-training phase, we can conduct model fine-tuning specifically for a target recommendation graph G_t. We design two fine-tuning strategies and illustrate them in Figure 5.3.

Figure 5.3: The fine-tuning strategies. We design two fine-tuning strategies, direct fine-tuning and joint fine-tuning, denoted as ADAPT-D and ADAPT-J, respectively. In ADAPT-D, we generate the customized GNN model Θ_GNN^t based on the input graph via the GNN adaptor and the meta-LGNN model, and we fine-tune the customized GNN model Θ_GNN^t. In ADAPT-J, we fine-tune both the meta-LGNN model Θ_meta-LGNN and the GNN adaptor Ω.

• The first fine-tuning strategy is named direct fine-tuning. As shown in the top subfigure of Figure 5.3, given a target downstream graph G_t, we first compute the customized adapting parameters Φ_t = ADA(p_t; Ω) based on its property vector p_t via the GNN adaptor ADA(·; Ω), and then generate the customized GNN model Θ_GNN^t = Θ_meta-LGNN ∘ Φ_t. Finally, we fine-tune Θ_GNN^t by optimizing the objective function described in Eq. (5.12). We denote ADAPT with this fine-tuning strategy as ADAPT-D.

• The second fine-tuning strategy is named joint fine-tuning. As shown in the bottom subfigure of Figure 5.3, given a target downstream graph G_t, we fine-tune both the meta-LGNN model Θ_meta-LGNN and the GNN adaptor Ω on G_t. This fine-tuning strategy is inspired by test-time training [100], which fine-tunes the model on a test sample during the test process so that the model can be better adapted to the test data. We use ADAPT-J to indicate ADAPT with the joint fine-tuning strategy.
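The difference between the two strategies is essentially which parameters receive gradients during fine-tuning, as in the sketch below. The modules customized_gnn, meta_lgnn, and adaptor are the same hypothetical objects as in the earlier pre-training sketch, and the learning rate is an illustrative choice.

```python
import torch

# ADAPT-D (direct fine-tuning): freeze the adaptor and the shared meta-LGNN
# weights, generate the customized model once, and update only its parameters.
optimizer_d = torch.optim.Adam(customized_gnn.parameters(), lr=1e-3)

# ADAPT-J (joint fine-tuning): keep adapting on the target graph and update
# both the meta-LGNN weights and the adaptor, in the spirit of test-time training.
optimizer_j = torch.optim.Adam(
    list(meta_lgnn.parameters()) + list(adaptor.parameters()), lr=1e-3
)
```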
5.2.6 Time Complexity Analysis

In this subsection, we analyze the additional time complexity introduced by the GNN adaptor in the ADAPT framework. For convenience, we assume the dimension of the output features from each meta-LGNN layer is d. Correspondingly, the dimension of the output features from each GNN adaptor layer is 2d. The graph property vector p_m of a given graph G_m is assumed to be p-dimensional. Then, the time cost of generating the adaptation parameters in each ADAPT layer is O(d · p) and the time complexity of the adaptation process is O(d^2). Thus, for an ADAPT framework consisting of L layers, the overall time complexity caused by adaptation is O(L · p · d + L · d^2). The time complexity of the filtering operation in each layer of the meta-LGNN model is O(N_e · d + N_n · d^2), where N_e and N_n denote the number of edges and the number of nodes of the input graph, respectively. Hence, the overall time complexity of the meta-LGNN model is O(L · N_e · d + L · N_n · d^2). Typically, p is significantly smaller than N_e. Therefore, the additional time complexity introduced by the GNN adaptor in the ADAPT framework is comparatively low.

5.3 Experiments

In this section, we conduct extensive experiments to validate the effectiveness of the proposed ADAPT. We first introduce the experimental settings. Then we illustrate how the sparsity of the recommendation graphs affects the performance of existing GNN-based recommendation methods. Next, we evaluate the performance of ADAPT and representative baselines on various real-world datasets. We conduct an ablation study to understand the importance of the GNN adaptor. Finally, we investigate the impact of graph sparsity and of the number of pre-training graphs on the performance of ADAPT.

5.3.1 Experimental Settings

In this work, we conduct experiments on datasets from two real-world applications: Tianchi [2] and MovieLens [1]. Specifically, we construct numerous user-item interaction graphs for different recommendation scenarios based on the item categories of these two applications. For each application, we select some interaction graphs as pre-training graphs and some as the target downstream graphs. The statistics of the pre-training graphs are demonstrated in Figure 5.4 and those of the downstream graphs are summarized in Table 5.1. As shown in Figure 5.4, we select 60 interaction graphs from Tianchi and 6 interaction graphs from MovieLens for pre-training, and these graphs show very diverse properties. We have also summarized the graph property statistics of the 60 pre-training graphs from Tianchi in Appendix D.1.

Figure 5.4: Properties of the pre-training graphs: (a) 60 pre-training graphs from Tianchi; (b) 6 pre-training graphs from MovieLens. Note that each circle represents a graph and its size denotes the density of the graph.

Table 5.1: Statistics of the downstream graphs from Tianchi and MovieLens. Note that we do not include all the graph properties in this table, since some properties vary with the graph data split method.
Dataset           #Users  #Items  User-item ratio  #Edges
Tianchi-174490    2,267   285     7.95             3,826
Tianchi-61626     2,305   764     3.02             3,270
Tianchi-3937919   1,846   158     11.68            2,980
Tianchi-2798696   2,579   94      27.44            2,988
MovieLens-War     2,970   89      33.37            4,240

We briefly introduce the Tianchi dataset and the MovieLens dataset as follows:
• Tianchi: This is a user behavior dataset from Alibaba (one of the biggest online shopping platforms in China), which consists of millions of user-item interactions, such as clicking, liking, and purchasing. Each interaction record includes a user ID, an item ID, a behavior type, an item category ID, and an interaction timestamp. In our work, we use Tianchi-C to denote the user-item interaction graph whose items belong to category C.
• MovieLens: This is a movie rating dataset from MovieLens (a popular movie recommendation website), which consists of user rating records for thousands of movies from different categories. Each rating record includes a user ID, a movie ID, a rating, and a movie category. In our work, we use MovieLens-X to denote the user-movie interaction graph whose movies come from category X.

We mainly compare the proposed method with baseline methods from two groups. The first group includes GNN-based recommendation methods and a classic CF method. In particular, we select NGCF and LightGCN since they are two of the most representative GNN-based recommendation methods, and LightGCN is one of the state-of-the-art models. MF is chosen because it is one of the most classic and popular recommendation methods. The second group includes existing pre-training methods for GNNs. Although there are a few pre-training methods for GNNs [40, 41, 84], the majority of them are not designed specifically for recommendations. Thus, we adapt a recent method, GCC, for recommendations as the representative baseline, since it relies only on topological information and can be pre-trained on multiple graphs as the proposed ADAPT does. We have not chosen the pre-training strategies proposed in [40] because there are no node attributes or graph labels in our scenarios. Similarly, GPT-GNN [41] has not been included because it also requires node attributes and cannot be applied across multiple graphs. The details of these baselines are presented in Appendix D.2.
Note that since we do not focus on the cold-start problem, we do not include the pre-training work in [29] as a baseline. For the downstream recommendation tasks, we divide the corresponding user-item interactions into three sets: the training set, the validation set, and the test set. The validation set and the test set each contain 5 percent of the total samples. To avoid the cold-start problem, we constrain all the users and items in the validation set and the test set to appear in the training set. In the fine-tuning phase, we use the validation set to select the best model and then report its performance on the test set. For each user-item interaction in the validation or the test set, we generate 49 non-existent user-item interactions for the user. In the model evaluation, we first compute a recommendation score for each interaction, and then we rank these scores. We calculate the Hit Rate (HR) of the model prediction based on whether the score of the real user-item interaction is in the top 5 among all the 50 user-item interactions. For each experiment, we average the model performance over 5 seeds and report it in terms of HR. More implementation details can be found in Appendix D.3.

5.3.2 Preliminary Study

In this subsection, we study how the performance of two representative GNN-based recommendation methods, i.e., NGCF and LightGCN, is affected by the sparsity of interaction graphs. Specifically, we increase the sparsity of interaction graphs by randomly removing parts of the training edges while fixing the test edges. The results are shown in Figure 5.5, where x ∈ {1, 2, 3, 4} (i.e., the x-axis) denotes the graph sparsity and a larger x indicates a sparser graph. Both LightGCN and NGCF show a significant decreasing trend in recommendation performance as graph sparsity increases. This empirically demonstrates that GNN-based recommendation methods suffer from data sparsity, and it motivates us to take advantage of pre-training techniques to alleviate this problem.

Figure 5.5: The performance of NGCF and LightGCN vs. the graph sparsity on two Tianchi datasets, (a) Tianchi-61626 and (b) Tianchi-174490 (the reported performance is measured with HR in %).
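As a concrete reference for the hit-rate protocol of Section 5.3.1, which is used in the preliminary study above and in all following comparisons, the metric can be sketched as follows. The function name and the tie-breaking in favor of the positive interaction are assumptions for illustration.

```python
import torch

def hit_rate_at_k(pos_score, neg_scores, k=5):
    """HR for one held-out interaction: the true (u, i) pair is ranked against
    its 49 sampled negatives and a hit is counted when it lands in the top k
    of the 50 candidates. Ties are broken in favor of the positive pair."""
    all_scores = torch.cat([pos_score.view(1), neg_scores])   # 50 scores
    rank = (all_scores > pos_score).sum().item() + 1          # 1-based rank
    return 1.0 if rank <= k else 0.0

# Hypothetical usage: average over all held-out interactions (and over seeds).
hits = [hit_rate_at_k(torch.tensor(0.9), torch.rand(49)) for _ in range(100)]
hr_at_5 = sum(hits) / len(hits)
```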
5.3.3 Performance Comparison

We compare the proposed ADAPT and the baselines described above on five datasets, including four item categories from Tianchi and one movie category from MovieLens. The recommendation methods are divided into two groups: 1) the training-from-scratch group consists of MF, NGCF, and LightGCN, which are directly trained from scratch on the target downstream graphs; and 2) the pre-training & fine-tuning group consists of GCC and the proposed ADAPT with two fine-tuning strategies, i.e., ADAPT-D and ADAPT-J. To ease the comparison, we also show the best performance of the methods within these two groups and compute the performance gain of the pre-training & fine-tuning group over the training-from-scratch group. Note that, in order to simulate the data scarcity scenarios in real-world recommendation applications, after the validation set and the test set are determined, we randomly remove some interactions from the remaining interactions under the constraint of introducing no isolated users or items. Specifically, we tailor two training sets: one retains 60 percent of the remaining interactions, and the other consists of 40 percent of the interactions. The comparison results for these two scenarios are shown in Table 5.2 and Table 5.3, respectively. To compare the proposed ADAPT framework with more graph pre-training methods, we have also pre-trained the meta-LGNN model following the InfoGraph and MVGRL methods, and we compare their performance on the Tianchi downstream datasets in Table 5.4.

Table 5.2: The recommendation performance comparison with 60% as training (the reported performance is measured with HR in %).
Datasets          MF           NGCF         LightGCN     Best (scratch)   GCC           ADAPT-D      ADAPT-J      Best (pre-train)   Gain
Tianchi-174490    33.51±3.71   46.39±3.07   47.22±3.08   47.22±3.08       37.17±15.6    60.94±1.46   58.64±2.85   60.94±1.46         +29.05%
Tianchi-61626     17.30±3.14   28.34±5.52   34.11±3.52   34.11±3.52       33.37±3.4     50.06±3.18   49.81±3.16   50.06±3.18         +46.77%
Tianchi-3937919   21.75±3.95   26.71±4.59   40.2±3.31    40.2±3.31        35.57±12.12   57.72±2.31   59.6±1.97    59.6±1.97          +48.25%
Tianchi-2798696   30.81±5.22   35.97±2.71   28.99±3.86   35.97±2.71       45.1±3.4      47.38±2.15   46.84±0.28   47.38±2.15         +31.72%
MovieLens-War     62.94±4.25   63.41±3.06   62.94±4.07   63.41±3.06       64.93±5.13    73.93±1.18   75.99±0.72   75.99±0.72         +19.84%

Table 5.3: The recommendation performance comparison with 40% as training (the reported performance is measured with HR in %).
Datasets          MF           NGCF         LightGCN     Best (scratch)   GCC           ADAPT-D      ADAPT-J      Best (pre-train)   Gain
Tianchi-174490    30.73±4.35   42.36±2.96   37.64±3.24   42.36±2.96       33.40±1.36    57.70±2.49   60.07±3.2    60.07±3.20         +41.82%
Tianchi-61626     16.32±3.21   23.50±4.89   28.83±4.48   28.83±4.48       32.27±2.32    49.94±3.75   52.39±1.27   52.39±1.27         +81.70%
Tianchi-3937919   20.60±3.84   25.57±3.62   35.44±1.70   35.44±1.70       34.9±17.0     49.13±8.98   58.79±3.37   58.79±3.37         +65.9%
Tianchi-2798696   28.86±4.12   33.29±1.98   27.11±4.89   33.29±1.98       46.17±2.53    48.66±2.86   46.58±2.10   48.66±2.86         +46.16%
MovieLens-War     61.14±4.52   64.17±2.50   56.78±4.86   64.17±2.50       50.76±13.65   71.85±1.76   71.47±1.18   71.85±1.76         +11.97%

Table 5.4: The recommendation performance comparison with graph pre-training methods on Tianchi datasets (the reported performance is measured with HR in %).
Split   Dataset           GCC           InfoGraph     MVGRL        ADAPT
40%     Tianchi-174490    33.40±1.36    33.3±19.45    47.8±13.7    60.07±3.20
40%     Tianchi-61626     32.27±2.32    34.97±10.2    22.45±4.44   52.39±1.27
40%     Tianchi-3937919   34.9±17.0     32.21±21.8    46.8±11.5    58.79±3.37
40%     Tianchi-2798696   46.17±2.53    31.95±8.28    43.4±1.55    48.66±2.86
60%     Tianchi-174490    37.17±15.6    32.56±22.5    53.7±18.1    60.94±1.46
60%     Tianchi-61626     33.37±3.4     31.90±11.9    35.54±6.84   50.06±3.18
60%     Tianchi-3937919   35.57±12.12   32.6±25.6     48.59±7.19   59.6±1.97
60%     Tianchi-2798696   45.1±3.4      31.1±12.36    47.0±2.14    47.38±2.15

We can make the following observations:
• NGCF and LightGCN often perform better than MF. This observation is consistent with previous observations in [110, 35].
• The proposed ADAPT frameworks achieve great performance improvement over the recommendation models trained from scratch on the target downstream data. This demonstrates the effectiveness of ADAPT in alleviating the data scarcity problem in recommendation tasks.
• ADAPT achieves significantly better performance than GCC, InfoGraph, and MVGRL. Compared to these pre-training methods, ADAPT is designed specifically for the pre-training task for recommendations, with the recommendation-oriented BPR objective and the adaptor to capture the differences among pre-training graphs. More investigations on the importance of the adaptor will be discussed in the following subsection.
• Both the direct fine-tuning strategy and the joint fine-tuning strategy are effective. The joint fine-tuning strategy is empirically shown to be more effective than the direct fine-tuning strategy in most cases. This observation could indicate that fine-tuning the adaptor at test time has the potential to benefit the performance of the downstream recommendation task, which is consistent with the observation in [100].
• The performance gain of ADAPT over the best baseline performance is more significant under the 40% setting than under the 60% setting. This observation suggests that pre-training is a promising solution to tackle the data sparsity problem in recommendations.

5.3.4 Ablation Study

We conduct an ablation study to investigate the effectiveness of the GNN adaptor. The results are shown in Table 5.5. We use ADAPT-w/o-adaptor to denote the ADAPT framework without the GNN adaptor. We pre-train two model instances of ADAPT-w/o-adaptor, with 60 pre-training graphs and with 1 pre-training graph, respectively. We use ADAPT-w/o-adaptor-scratch to indicate that ADAPT-w/o-adaptor is trained from scratch on only the downstream graph. ADAPT-best denotes the ADAPT framework with whichever fine-tuning strategy achieves the better performance.

Table 5.5: Ablation study of the GNN adaptor (the reported performance is measured with HR in %). The first three columns are pre-trained and fine-tuned; the last column is trained from scratch.
Datasets          ADAPT-best (60 graphs)   ADAPT-w/o-adaptor (60 graphs)   ADAPT-w/o-adaptor (1 graph)   ADAPT-w/o-adaptor-scratch
Tianchi-174490    60.07±3.2                32.25±9.43                      43.04±14.56                   53.92±10.58
Tianchi-61626     52.39±1.27               44.78±9.83                      43.81±6.36                    49.57±3.98
Tianchi-3937919   58.79±3.37               38.12±7.29                      48.59±7.89                    58.25±5.89
Tianchi-2798696   48.66±2.86               47.36±1.46                      45.64±1.50                    47.25±2.15
Average           54.98                    40.63                           45.27                         52.24

We can make the following observations:
• The overall performance of ADAPT-best is significantly better than that of ADAPT-w/o-adaptor and ADAPT-w/o-adaptor-scratch. This validates (1) the effectiveness of the GNN adaptor for pre-training and (2) the importance of pre-training.
• The ADAPT-w/o-adaptor instance pre-trained on 60 pre-training graphs performs even worse than the instance pre-trained on a single pre-training graph. This shows that it is necessary to capture the differences among multiple pre-training graphs, and the GNN adaptor in the proposed ADAPT can capture these differences in the pre-training process.
• Neither of the ADAPT-w/o-adaptor instances can beat ADAPT-w/o-adaptor-scratch. This indicates that pre-training is not always helpful; it can even introduce performance degradation if not carefully designed.

To further demonstrate the importance of pre-training, we check whether ADAPT can transfer knowledge from the pre-training graphs to the downstream recommendation task. To achieve this goal, we directly use the models generated from pre-training without fine-tuning. In particular, we compare the performance of three different variants of the proposed GNN recommendation method for localized collaborative filtering: a randomly-initialized model, the meta-LGNN model from the pre-training phase, and the customized-GNN model generated from the meta-LGNN model and the GNN adaptor specifically for the target dataset. The results are demonstrated in Table 5.6. Overall, the meta-LGNN model performs better than the randomly-initialized model, and the performance of the customized-GNN model is significantly better than that of the other two variants.
These observations demonstrate that the meta-LGNN model learns some useful knowledge from the pre-training process and that the GNN adaptor effectively adapts the meta-LGNN model to the customized one. The customized-GNN model performs remarkably better than the randomly-initialized model, which provides direct evidence that ADAPT successfully transfers knowledge from the pre-training graphs to the downstream recommendation task.

Table 5.6: Performance comparison of three different variants of the proposed GNN recommendation method for localized collaborative filtering (the reported performance is measured with HR in %).
Datasets          random-ini   meta-LGNN   customized-GNN
Tianchi-174490    16.44        13.09       28.17
Tianchi-61626     12.64        16.20       26.99
Tianchi-3937919   16.11        12.89       30.33
Tianchi-2798696   11.68        19.33       17.59
Average           14.22        15.38       25.77

5.3.5 Further Probing

In this subsection, we further probe the proposed framework by exploring the following two questions: 1) how does the sparsity of the target downstream graph affect ADAPT's performance? and 2) how does the number of pre-training graphs influence ADAPT's performance?

5.3.5.1 Sparsity Analysis

In Figure 5.6, we show how the performance of ADAPT and two GNN-based recommendation methods, NGCF and LightGCN, changes with the graph sparsity. Specifically, we gradually increase the graph sparsity by randomly removing some interactions from the training set under the constraint of introducing no isolated users or items, while the test set and the validation set remain unchanged. It is observed that the performance of the proposed ADAPT is much more stable as the graph sparsity increases, compared to NGCF and LightGCN. This observation further demonstrates the appeal of pre-training for mitigating the data sparsity problem in recommendations.

Figure 5.6: The performance of the proposed ADAPT with varied sparsity of the target downstream graph on (a) Tianchi-174490, (b) Tianchi-61626, and (c) MovieLens-War. Note that the x-axis denotes the graph sparsity; the larger the sparsity, the sparser the target downstream graph.

5.3.5.2 The Number of Graphs for Pre-training

In order to explore the influence of the number of pre-training graphs on the model performance, we pre-train four ADAPT models on one single graph, 6 graphs, 60 graphs, and 300 graphs, respectively. Then we fine-tune these models on four target downstream graphs. Note that for the ADAPT models pre-trained on multiple graphs, we report the best performance between the two fine-tuning strategies. For the ADAPT model pre-trained on one single graph, it is not applicable to pre-train the GNN adaptor, and thus we report its performance without the GNN adaptor. As shown in Figure 5.7, the model performance first increases as the number of pre-training graphs increases and then decreases consistently on all the datasets when the number of graphs reaches 300. This demonstrates that increasing the number of pre-training graphs is potentially beneficial for the downstream performance. However, too many pre-training graphs may introduce noise that could hurt the downstream performance.

Figure 5.7: Performance variation of the proposed ADAPT with respect to the number of pre-training graphs on (a) Tianchi-174490, (b) Tianchi-61626, (c) Tianchi-3937919, and (d) Tianchi-2798696.
5.3.5.3 Investigation of Node Attribute Alternatives

Following the commonly adopted scenarios [53, 110, 35], and also for the feasibility and transferability of the proposed ADAPT, we do not assume that the same set of node attributes exists across different recommendation graphs. Instead, we propose to use Double-Radius Node Labeling (DRNL) [129] to generate positional attributes, which are able to encode structural information, to serve as node attributes. The proposed ADAPT is very flexible with respect to the input node attributes: if the same set of node attributes does exist across different graphs, we can easily incorporate these raw node attributes by either concatenating them with the positional attributes or directly using them as input node attributes. To further investigate other alternatives for the input node attributes, we have also attempted to use the DeepWalk [82] embedding of each node as the input node attributes, and we compare its performance with the DRNL method on the Tianchi datasets in Table 5.7. We can observe that the DeepWalk node attributes perform worse than the DRNL node attributes in most cases, but they beat their counterpart in one case. This shows that DRNL is a good attribute generation method for the proposed ADAPT given its effectiveness and simplicity, and that there may exist potential to improve the model performance by further investigating other node attribute alternatives.

Table 5.7: Performance comparison of ADAPT with the DRNL node labeling method and the DeepWalk node labeling method on Tianchi datasets with 40% as training.
Method           Tianchi-174490   Tianchi-61626   Tianchi-3937919   Tianchi-2798696
ADAPT-DRNL       60.07±3.20       52.39±1.27      58.79±3.37        48.66±2.86
ADAPT-DeepWalk   44.16±5.27       26.99±5.7       36.7±8.49         55.03±4.65

5.3.5.4 Visualization of Customized Models

To further explore the model adaptation, we visualize the customized models for the four downstream Tianchi datasets. Specifically, we extract the parameters of the first GNN layer of the four customized models and then use t-SNE [70] to project them into a 3-dimensional space. These 3-dimensional embeddings are visualized in Figure 5.8, from which we can observe that the GNN adaptor indeed generates different customized models for different recommendation graphs. In addition, for graphs with similar graph properties, their corresponding customized models are more similar than those of graphs with relatively different properties.

Figure 5.8: Visualization of the customized models for different downstream graphs on the Tianchi dataset: (a) GNN layer 1 weight; (b) GNN layer 1 bias.

5.3.5.5 Test on a Larger Dataset

We further test the proposed ADAPT on a downstream graph that is relatively larger than the other downstream graphs. This dataset is from the romance category of MovieLens and consists of 20,407 users, 376 items, and 30,496 edges. We compare the model performances with 60% of the edges as the training set. The results are shown in Table 5.8, where the proposed ADAPT method also achieves the best performance.

Table 5.8: Performance comparison on the MovieLens-Romance dataset (with 20,407 users, 376 items, and 30,496 edges) with 60% as training (the reported performance is measured by HR in %).
MF           NGCF        LightGCN     GCC          ADAPT-D     ADAPT-J
72.19±1.44   76.6±2.64   78.54±1.19   71.63±4.29   79.01±1.1   76.8±5.05

5.4 Related Work

Our work is related to graph neural networks, pre-training for graph neural networks, recommendation models based on graph neural networks, inductive recommendation, and cross-domain recommendation.
Next, we briefly review representative methods from each category. 5.4.1 Graph Neural Networks In recent years, increasing attention and efforts have been devoted into graph neural networks, which successfully extend deep neural networks to graph data. Graph neural networks are theoretically and empirically demonstrated to be very powerful in graph representation learning [52, 27], and have achieved great success in various applications from different domains, such as natural language processing [124, 43, 11], computer vision [83, 114] and recommendation [123, 125, 110, 96, 35, 117, 47, 137]. There are mainly two groups of graph neural networks: the spectral-based methods and the spatial-based methods. In [93], the first graph neural network is proposed from the spatial perspective to solve both graph and node level tasks, which aggregates information from neighboring nodes for each node in every layer. Subsequently, Bruna et al. [10] propose to generalize the convolution operation to the graph domain based on graph Laplacian theory from the spectral perspective. Following this work, Defferrard et al [13] propose ChebNet, which uses chebyshev polynomials to modulate the graph Fourier coefficients for different graph signals. Next, Kipf and Welling [52] further simplify ChebNet via some assumptions and propose graph convolutional networks (GCNs). It is remarkable that despite developed from the spectral perspective, GCNs can also be well illustrated in a spatial way. From then on, multiple spatial-based graph neural networks have been proposed. A comprehensive overview about GNNs can be found in recent surveys [119, 135] and books [68]. In addition, there emerge some work trying to further explore the rationale behind GNNs, such as Ma et al [67] propose that most existent GNNs can be unified 96 as graph signal denoising. In addition to the aforementioned work mainly focusing on graph convolution operation, there are also numerous works targeting at graph pooling operation, which summarizes graph representation from node representations and plays an essential role in graph representation learning. There are simple pooling methods, such as directly averaging all the node representations as the graph representation [17] or adding a virtual node that connects to all the nodes in the graph and then taking its node representation as the graph representation [60]. An increasing number of hierarchical pooling methods have been proposed to learn graph representation hierarchically, which are believed to be able to better capture the graph structure information. Specifically, DiffPool [126] is proposed to learn a differentiable soft cluster assignment for each node at every GNN layer. Inspired by encoder-decoder model, graph U-Net [22] is designed to consist of graph pooling (gPool) and graph unpooling (gUnpool) operations, where gPool can adaptively select important nodes based on the importance scores computed on a learnable projection vector. Furthermore, RepPool [58] proposes a learnable way to better integrate non-selected nodes, which can better preserve the information of both the important nodes and normal nodes .In addition, EigenPooling [69] is designed from the perspective of graph Fourier transform and it can leverage both the node features and the local structures. 
5.4.2 Pre-training for Graph Neural Networks Inspired by the great success of pre-training techniques in multiple domains such as computer vision and natural language processing [72, 86, 15, 14, 32, 44, 45], many researchers have also started to explore how to a apply pre-training appropriately in GNN models [40, 84, 41, 65, 29]. Hu et al. [40] propose a pre-training strategy based on self-supervised learning, which focuses on pre-training a GNN model at both the node level and graph level, so that the pre-trained model can learn effective node representations and graph representations simultaneously. Qiu et al. propose GCC [84], a GNN pre-trainng framework based on contrastive learning, which aims at capturing transferable topological knowledge across multiple graphs. GPT-GNN [41] is a GNN pre-training framework built on generative model. It aims at capturing both the structure information and the semantic information via generating node attributes and edges 97 alternatively. In addition, some work focus on bridging the gap between GNN pre-training and GNN fine-tuning. For example, L2P-GNN [65] is proposed to mimic fine-tuning during the pre-training process via a dual-adaption mechanism at both the node level and graph level. Besides, there also emerges GNN pre-training work specially for alleviating the cold-start problem in recommendation tasks [29], where researchers propose to use embedding reconstruction as the pre-training task, so that the cold-start user/item can get a good representation from the reconstruction knowledge. 5.4.3 Graph Neural Networks for Recommendation Systems Recommendation tasks are to predict a user’s preference based on its historical interactions with various items. Historical user-item interactions can be directly denoted as a bipartite graph, and thus it is very natural to apply graph neural networks to recommendation tasks. In fact, there exist numerous works focusing on exploring GNNs for recommendation systems and many of them have achieved promising performance [125, 110, 35, 19, 109]. PinSage [125] is designed especially for recommendation tasks based on GraphSage, and can be directly applied to a web-scale recommendation tasks. For each user or item node, it utilizes a random-walk based sampling method to sample some neighboring nodes for information aggregation. NGCF [110] is designed to explicitly capture high-order connectivity in user-item interaction graphs via embedding propagation. Later on, He et al. propose LightGCN [35], which simplifies the design of GNNs especially for recommendation tasks via eliminating feature transformation and non-linear activation function, and has achieved significant performance improvement. Furthermore, Wu et al. [116] explore to improve the performance of GNN models for recommendations by incorporating self-supervised learning techniques. This work can be supplementary to most supervised GNN models for recommendations, which aim at improving user and item representations via self-discrimination. These GNN-based recommendation methods only relying on the user-item interaction graphs. There are also works focusing on utilizing GNNs to deal with side information, such as social networks and knowledge graphs, to facilitate recommendation performance. For examples, GraphRec [19] is proposed to use two graph attention networks to learn user embeddings and item embeddings, and the user embedding is learned from both the social graph and interaction graph. 
KGAT [109] is proposed to 98 integrate the user-item interaction graph and the knowledge graph into one unified graph by viewing the user-item interaction as one type of the relations, and to use attentive embedding propagation layers to refine embeddings over this unified graph. 5.4.4 Inductive Recommendation and Cross-Domain Recommendation Inductive recommendations focus on recommendation tasks in inductive scenarios, where there are emerging users or items that do not appear in the training process. Most methods [53, 36, 110, 35] that leverage one-hot identification representation as input to learn user/item embeddings that typically cannot be applied in inductive scenarios. Some item-based methods [92, 48] aim at making predictions based on similarities between the new items and a fixed item set, which can recommend existent items to new users. Recently, there are also a few models proposed especially for inductive recommendation where there are both new users and items. IGCN [118] is one of such methods. It initiates the embeddings of the new users and items by leveraging the embeddings of some pre-defined core items/users and global item/user representation. CF-GCN [87] learns embeddings of the new users and items via fusing embeddings of existent users/items of different layers. Most of these methods aim at learning representations for new items/users via their interactions with items/users seen in the training phase. The aforementioned inductive scenario is quite different from the pre-training scenario we discuss about at this work. The inductive scenario aims at handling new items/users in an existent recommendation graph, while the proposed pre-training scenario targets at transferring common knowledge among different recommendation graphs. Furthermore, the methods [92, 48, 118, 87] for the inductive scenario are not applicable in the proposed pre-training scenario, since it requires interactions between new items/users between the core items/users seen in the training phase, but there is no such common core items/users among different recommendation graphs in the proposed pre-training scenario. Cross-domain recommendation is also proposed to alleviate the data scarcity problem in recommendation in sparser domains via transferring knowledge from richer domains. There are different scenarios in cross-domain recommendation [140], among which single-target scenario is most similar with the proposed pre-training scenario, where richer data from source domains are 99 leveraged to help recommendation in a target domain with sparser data. In this scenario, it is assumed that there are common users or items among source domains and the target domain [131, 139, 61], or there exist attributes of the same modality among these domains, such as user ratings [8] and semantic properties [56, 130]. Typical techniques adopted in cross-domain recommendation include transfer learning [73, 136, 34], domain adaptation [49, 127], multi-task learning [4, 64] and so on. Although sharing similar goals, cross-domain recommendation and the proposed pre-training recommendation are quite different. In the proposed pre-training recommendation scenario, there are no common users or items among different domains. Also, no user/item attributes of the same modality are provided. Therefore, most cross-domain recommendation methods are not applicable to the studied scenario. 
5.5 Conclusion
In this paper, we propose an adaptive graph pre-training framework for localized collaborative filtering, ADAPT, which can effectively help alleviate the data scarcity challenge in recommendation tasks. There are two key components in the proposed ADAPT: the meta-LGNN and the GNN adaptor. Specifically, the meta-LGNN is a novel GNN-based recommendation method designed from a new perspective of local structure. It aims at encoding the collaborative filtering information into the graph representation of the neighboring structure of a given user-item pair, and it does not require learning user/item embeddings. Thus, it is more flexible for pre-training compared to existing GNN-based recommendation methods. The GNN adaptor is the key to effectively taking advantage of multiple graphs for pre-training. It can capture the difference of each graph and adapt the meta-LGNN model to a customized GNN model accordingly. Overall, the proposed ADAPT is able to transfer common knowledge from the pre-training graphs to the target downstream recommendation graph and to capture the uniqueness of each graph in the meanwhile. Extensive experiments have validated the effectiveness and superiority of the proposed ADAPT in GNN pre-training for recommendations.
There are a few directions we can further explore in the future. First, the current implementation of the meta-LGNN is based on GCN, which is a vanilla model in graph representation learning. The performance of the meta-LGNN could be further improved by replacing the GCN model with other advanced GNN models such as GIN. Second, the node labeling method currently adopted by the meta-LGNN treats user and item nodes equally. In further exploration, we may consider distinguishing the user and item nodes in the node labeling method. Third, the designs of the adaptor model and the adaptation operation are also flexible, and there are alternative models and operations that can be further explored within the proposed framework. Last but not least, instead of pre-training the proposed framework on numerous independent user-item interaction graphs, we may consider how to leverage some available shared user or item nodes among these graphs.
CHAPTER 6 CONCLUSION
Graph Neural Networks have been proven effective in graph representation learning and have achieved revolutionary results in many graph-related tasks, including graph classification, node classification, and link prediction. GNNs have aroused great research interest, and many novel GNN models have been designed and applied in various applications. While significant progress has been made in regard to GNNs, their out-of-distribution generalization problem has not been effectively solved, which hinders their effective application in real-world scenarios. In this dissertation, we focus on the out-of-distribution generalization of GNNs and propose to improve their out-of-distribution generalization ability from two perspectives, i.e., a novel training perspective and an advanced learning perspective. In addition, we aim at addressing the data sparsity challenge in applying GNNs to recommendation tasks and improving the abilities of GNNs in terms of generalization and knowledge transfer in the real-world application of recommendations. In this chapter, we summarize our proposed methods and discuss some promising future research directions.
6.1 Summary
To bridge the performance gap of GNNs between the training set and the test set, we propose a novel training paradigm for GNNs, the test-time training paradigm. In other words, the GNN model is further trained for the test graph samples at test time. Specifically, we propose the first test-time training framework for GNNs (GT3) to improve the out-of-distribution generalization ability of GNNs in the graph classification task. A two-level self-supervised learning task is carefully designed to adjust the GNN model for each test graph sample. Also, we propose an additional adaptation constraint to prevent undesirable excessive feature distortion during test-time training. Extensive experiments have been conducted to validate the effectiveness of the proposed GT3, and we also provide a theoretical analysis to demonstrate the benefits of test-time training for GNNs.
Apart from the training perspective, we also propose Customized-GNN, which aims at improving GNNs' out-of-distribution generalization ability from a novel learning perspective. Specifically, we first explore the graph data diversity in terms of graph structure and then investigate the influence of this data diversity on GNNs' performance. Inspired by the investigation, we propose to take the diverse graph structure properties into account in GNNs' training process. The proposed Customized-GNN can generate a sample-specific GNN for each graph sample based on its graph structure. The effectiveness of Customized-GNN is validated via comprehensive experiments. In addition, the proposed framework is very general and flexible and thus can be applied to numerous existing GNN models.
In addition to improving GNNs' out-of-distribution generalization ability via two general frameworks, we also focus on addressing the challenges of applying GNNs in the real-world application of recommendations. More specifically, we aim at improving the abilities of GNNs in terms of generalization and knowledge transfer in the recommendation scenario. To achieve this goal, we first propose a novel GNN-based recommendation method, LGCF, which is able to learn the recommendation knowledge from the localized graph induced from a target user and item. Unlike most other GNN-based recommendation methods, LGCF does not require learning effective embeddings for each user and item. Thus, it is essentially more suitable for recommendation scenarios with data sparsity, and more importantly, it paves the way for transferring recommendation knowledge among different recommendation scenarios. Next, based on LGCF, we further design an adaptive pre-training framework, ADAPT, which is able to transfer common collaborative filtering knowledge among different user-item graphs and preserve the uniqueness of each graph simultaneously.
6.2 Future Directions
Although GNNs have achieved promising progress in numerous graph-related tasks, the out-of-distribution generalization of GNNs is a vital problem that calls for more dedication and exploration from different perspectives. We discuss some possible future research directions as follows:
• Generalization challenges brought by node diversity: The two methods we designed in this dissertation mainly focus on graph-level tasks where the training graph samples and the test graph samples are diverse in terms of graph structure. The scenario where the distributions of node attributes vary between the training set and the test set has not been considered yet.
However, this scenario is also quite common in real-world applications. For example, in social networks, different social platforms may collect different demographic information from their users. Thus, to further solve the out-of-distribution generalization problem of GNNs in real-world applications, it is necessary to take the distribution difference of node attributes into consideration, instead of only considering the graph structure diversity. Also, it is observed that some node properties, such as the degree and the local subgraph surrounding the target node, play important roles in node-level tasks [103, 42]. Therefore, it is also promising to take node property diversity into consideration to further improve the out-of-distribution generalization ability of GNNs.
• Generalization challenges in different applications: There exist various unique challenges and diverse domain knowledge for GNNs in different applications. In this dissertation, we focus on improving GNNs' performance and out-of-distribution generalization ability for the recommendation task. However, there are many other applications, such as molecular prediction and single-cell analysis. In order to fulfill GNNs' potential in these applications, we need to target their unique challenges and consider their domain knowledge to further improve the out-of-distribution generalization ability of GNNs in various applications.
BIBLIOGRAPHY
[1] Movielens. https://grouplens.org/datasets/movielens/. Accessed May 23, 2021.
[2] User behavior data from taobao for recommendation. https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1. Accessed May 23, 2021.
[3] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE transactions on knowledge and data engineering, 17(6):734–749, 2005.
[4] Deepak Agarwal, Bee-Chung Chen, and Bo Long. Localized factor models for multi-context recommendation. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 609–617, 2011.
[5] Charu C Aggarwal. Model-based collaborative filtering. In Recommender systems, pages 71–138. Springer, 2016.
[6] Alexander Bartler, Andre Bühler, Felix Wiewel, Mario Döbler, and Bin Yang. Mt3: Meta test-time training for self-supervised test-time adaption. arXiv preprint arXiv:2103.16201, 2021.
[7] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[8] Shlomo Berkovsky, Tsvi Kuflik, and Francesco Ricci. Cross-domain mediation in collaborative filtering. In International Conference on User Modeling, pages 355–359. Springer, 2007.
[9] Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
[10] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
[11] Deng Cai and Wai Lam. Graph transformer for graph-to-sequence learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7464–7471, 2020.
[12] Ming Chen, Zhewei Wei, Bolin Ding, Yaliang Li, Ye Yuan, Xiaoyong Du, and Ji-Rong Wen. Scalable graph neural networks via bidirectional propagation. Advances in neural information processing systems, 33:14556–14566, 2020. [13] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016. 105 [14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. [15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. [16] Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology, 330(4):771–783, 2003. [17] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015. [18] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. arXiv preprint arXiv:1912.09893, 2019. [19] Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. Graph neural networks for social recommendation. In The World Wide Web Conference, pages 417–426, 2019. [20] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. Splinecnn: Fast geometric deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 869–877, 2018. [21] Yang Fu, Sifei Liu, Umar Iqbal, Shalini De Mello, Humphrey Shi, and Jan Kautz. Learning to track instances without video annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8680–8689, 2021. [22] Hongyang Gao and Shuiwang Ji. Graph u-nets. In international conference on machine learning. PMLR, 2019. [23] Hongyang Gao, Yi Liu, and Shuiwang Ji. Topology-aware graph pooling networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [24] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263–1272. JMLR. org, 2017. [25] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247, 2017. [26] Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States), 2008. 106 [27] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024–1034, 2017. [28] Nicklas Hansen, Rishabh Jangir, Yu Sun, Guillem Alenyà, Pieter Abbeel, Alexei A Efros, Lerrel Pinto, and Xiaolong Wang. Self-supervised policy adaptation during deployment. arXiv preprint arXiv:2007.04309, 2020. [29] Bowen Hao, Jing Zhang, Hongzhi Yin, Cuiping Li, and Hong Chen. 
Pre-training graph neural networks for cold-start users and items representation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 265–273, 2021. [30] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4), 2015. [31] Kaveh Hassani and Amir Hosein Khasahmadi. Contrastive multi-view representation learning on graphs. In International Conference on Machine Learning, pages 4116–4126. PMLR, 2020. [32] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4918– 4927, 2019. [33] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016. [34] Ming He, Jiuling Zhang, Peng Yang, and Kaisheng Yao. Robust transfer learning for cross- domain collaborative filtering using multiple rating patterns approximation. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 225–233, 2018. [35] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 639–648, 2020. [36] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pages 173–182, 2017. [37] Thomas Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems (TOIS), 22(1):89–115, 2004. [38] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991. [39] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In NeurIPS, 2020. 107 [40] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265, 2019. [41] Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. Gpt-gnn: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1857–1867, 2020. [42] K. Huang and M. Zitnik. Graph meta learning via local subgraphs. arXiv e-prints, 2020. [43] Lianzhe Huang, Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. Text level graph neural network for text classification. arXiv preprint arXiv:1910.02356, 2019. [44] Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes imagenet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016. [45] Xiaodong Jiang, Ronghang Zhu, Sheng Li, and Pengsheng Ji. Co-embedding of nodes and edges with graph neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. [46] Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. Graph structure learning for robust graph neural networks. arXiv preprint arXiv:2005.10203, 2020. [47] Yuanyuan Jin, Wei Zhang, Xiangnan He, Xinyu Wang, and Xiaoling Wang. Syndrome-aware herb recommendation with multi-graph convolution network. 
In 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020, pages 145–156. IEEE, 2020. [48] Santosh Kabbur, Xia Ning, and George Karypis. Fism: factored item similarity models for top-n recommender systems. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 659–667, 2013. [49] Heishiro Kanagawa, Hayato Kobayashi, Nobuyuki Shimizu, Yukihiro Tagami, and Taiji Suzuki. Cross-domain recommendation via deep domain adaptation. In European Conference on Information Retrieval, pages 20–29. Springer, 2019. [50] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [51] Thomas N Kipf and Max Welling. Variational graph auto-encoders. ArXiv preprint, 2016. [52] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. [53] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009. 108 [54] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019. [55] Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. Text classification algorithms: A survey. Information, 10(4):150, 2019. [56] Anil Kumar, Nitesh Kumar, Muzammil Hussain, Santanu Chaudhury, and Sumeet Agarwal. Semantic clustering-based cross-domain recommendation. In 2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pages 137–141. IEEE, 2014. [57] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-attention graph pooling. arXiv preprint arXiv:1904.08082, 2019. [58] Juanhui Li, Yao Ma, Yiqi Wang, Charu Aggarwal, Chang-Dong Wang, and Jiliang Tang. Graph pooling with representativeness. In 2020 IEEE International Conference on Data Mining (ICDM), pages 302–311. IEEE, 2020. [59] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convolutional neural networks. In Thirty-second AAAI conference on artificial intelligence, 2018. [60] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015. [61] Meng Liu, Jianjun Li, Guohui Li, and Peng Pan. Cross domain recommendation via bi- directional transfer graph collaborative filtering networks. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 885–894, 2020. [62] Xiaorui Liu, Wei Jin, Yao Ma, Yaxin Li, Hua Liu, Yiqi Wang, Ming Yan, and Jiliang Tang. Elastic graph neural networks. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, Proceedings of Machine Learning Research, 2021. [63] Yuejiang Liu, Parth Kothari, Bastien van Delft, Baptiste Bellot-Gurlet, Taylor Mordan, and Alexandre Alahi. Ttt++: When does self-supervised test-time training fail or thrive? Advances in Neural Information Processing Systems, 34, 2021. [64] Yichao Lu, Ruihai Dong, and Barry Smyth. Why i like it: multi-task learning for recom- mendation and explanation. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 4–12, 2018. [65] Yuanfu Lu, Xunqiang Jiang, Yuan Fang, and Chuan Shi. 
Learning to pre-train graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 4276–4284, 2021. [66] Dongsheng Luo, Wei Cheng, Wenchao Yu, Bo Zong, Jingchao Ni, Haifeng Chen, and Xiang Zhang. Learning to drop: Robust graph neural network via topological denoising. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 779–787, 2021. 109 [67] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. arXiv preprint arXiv:2010.01777, 2020. [68] Yao Ma and Jiliang Tang. Deep learning on graphs. Cambridge University Press, 2021. [69] Yao Ma, Suhang Wang, Charu C Aggarwal, and Jiliang Tang. Graph convolutional networks with eigenpooling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 723–731, 2019. [70] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9:2579–2605, 2008. [71] Julian McAuley. Amazon product data. http://jmcauley.ucsd.edu/data/amazon/. Accessed May 23, 2021. [72] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. [73] Orly Moreno, Bracha Shapira, Lior Rokach, and Guy Shani. Talmud: transfer learning for multiple domains. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 425–434, 2012. [74] Christopher Morris, Nils M Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. Tudataset: A collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663, 2020. [75] Vaishnavh Nagarajan, Anders Andreassen, and Behnam Neyshabur. Understanding the failure modes of out-of-distribution generalization. arXiv preprint arXiv:2010.15775, 2020. [76] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML- 10), pages 807–814, 2010. [77] Ali Bou Nassif, Ismail Shahin, Imtinan Attili, Mohammad Azzeh, and Khaled Shaalan. Speech recognition using deep neural networks: A systematic review. IEEE Access, 7:19143–19165, 2019. [78] Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. Embedding-based news recommendation for millions of users. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1933–1942, 2017. [79] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ArXiv preprint, 2018. [80] Karl Pearson. The problem of the random walk. Nature, 72(1867):342–342, 1905. [81] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. 110 [82] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD, pages 701–710, 2014. [83] Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. Learning human-object interactions by graph parsing neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 401–417, 2018. [84] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 
Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1150–1160, 2020. [85] Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. Pre- trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26, 2020. [86] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. [87] Rahul Ragesh, Sundararajan Sellamanickam, Vijay Lingam, Arun Iyer, and Ramakrishna Bairi. User embedding based neighborhood aggregation method for inductive recommendation. 2021. [88] Ekagra Ranjan, Soumya Sanyal, and Partha Pratim Talukdar. Asap: Adaptive structure aware pooling for learning hierarchical graph representations. arXiv preprint arXiv:1911.07979, 2019. [89] Waseem Rawat and Zenghui Wang. Deep convolutional neural networks for image classifica- tion: A comprehensive review. Neural computation, 29(9):2352–2449, 2017. [90] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618, 2012. [91] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. In NeurIPS, 2020. [92] B. Sarwar. Item-based collaborative filtering recommendation algorithms. In Proc. the 10th International World Wide Web Conference (WWW10), Hong Kong, May 1-5 (2001), 2001. [93] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfar- dini. The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008. [94] Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. Schnet: A continuous-filter convolutional neural network for modeling quantum interactions. In Advances in neural information processing systems, pages 991–1001, 2017. 111 [95] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011. [96] Heng-Shiou Sheu and Sheng Li. Context-aware graph embedding for session-based news recommendation. In Fourteenth ACM conference on recommender systems, pages 657–662, 2020. [97] Tzu-An Song, Samadrita Roy Chowdhury, Fan Yang, Heidi Jacobs, Georges El Fakhri, Quanzheng Li, Keith Johnson, and Joyita Dutta. Graph convolutional neural networks for alzheimer’s disease classification. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pages 414–417. IEEE, 2019. [98] Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative filtering techniques. Advances in artificial intelligence, 2009, 2009. [99] Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000, 2019. [100] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248. PMLR, 2020. [101] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A Efros, and Moritz Hardt. 
Test-time training for out-of-distribution generalization. 2019. [102] Qiaoyu Tan, Ninghao Liu, and Xia Hu. Deep representation learning for social network analysis. Frontiers in big Data, 2:2, 2019. [103] Xianfeng Tang, Huaxiu Yao, Yiwei Sun, Yiqi Wang, Jiliang Tang, Charu Aggarwal, Prasenjit Mitra, and Suhang Wang. Investigating and mitigating degree-related biases in graph convoltuional networks. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 1435–1444, 2020. [104] Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In Sixth international conference on data mining (ICDM’06), pages 613–622. IEEE, 2006. [105] Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha Talukdar. Composition-based multi-relational graph convolutional networks. In International Conference on Learning Representations, 2020. [106] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR, 2018. [107] Petar Velickovic, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. ICLR (Poster), 2(3):4, 2019. 112 [108] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 1235–1244, 2015. [109] Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. Kgat: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 950–958, 2019. [110] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. Neural graph collaborative filtering. In Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval, pages 165–174, 2019. [111] Xiaoyang Wang, Yao Ma, Yiqi Wang, Wei Jin, Xin Wang, Jiliang Tang, Caiyan Jia, and Jian Yu. Traffic flow prediction via spatial temporal graph neural network. In Proceedings of The Web Conference 2020, pages 1082–1092, 2020. [112] Yiqi Wang, Chaozhuo Li, Mingzheng Li, Wei Jin, Yuming Liu, Hao Sun, Xing Xie, and Jiliang Tang. Localized graph collaborative filtering. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM), pages 540–548. SIAM, 2022. [113] Yiqi Wang, Chaozhuo Li, Zheng Liu, Mingzheng Li, Jiliang Tang, Xing Xie, Lei Chen, and Philip S Yu. An adaptive graph pre-training framework for localized collaborative filtering. ACM Transactions on Information Systems (TOIS), 2022. [114] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog), 38(5):1–12, 2019. [115] Felix Wu, Amauri H. Souza Jr., Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks. In ICML, Proceedings of Machine Learning Research, 2019. [116] Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. Self-supervised graph learning for recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 726–735, 2021. [117] Le Wu, Lei Chen, Pengyang Shao, Richang Hong, Xiting Wang, and Meng Wang. Learning fair representations for recommendation: A graph-based perspective. 
In Jure Leskovec, Marko Grobelnik, Marc Najork, Jie Tang, and Leila Zia, editors, WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021, pages 2198–2208. ACM / IW3C2, 2021. [118] Yunfan Wu, Qi Cao, Huawei Shen, Shuchang Tao, and Xueqi Cheng. Inductive representation based graph convolution network for collaborative filtering. 2021. [119] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 2020. 113 [120] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. [121] Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. Deep matrix factorization models for recommender systems. In IJCAI, volume 17, pages 3203–3209. Melbourne, Australia, 2017. [122] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015. [123] Jheng-Hong Yang, Chih-Ming Chen, Chuan-Ju Wang, and Ming-Feng Tsai. Hop-rec: high- order proximity for implicit recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 140–144, 2018. [124] Liang Yao, Chengsheng Mao, and Yuan Luo. Graph convolutional networks for text classification. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 7370–7377, 2019. [125] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In KDD, 2018. [126] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4800–4810, 2018. [127] Feng Yuan, Lina Yao, and Boualem Benatallah. Darec: Deep domain adaptation for cross- domain recommendation via transferring rating patterns. arXiv preprint arXiv:1905.10760, 2019. [128] Hao Yuan and Shuiwang Ji. Structpool: Structured graph pooling via conditional random fields. In International Conference on Learning Representations, 2020. [129] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. arXiv preprint arXiv:1802.09691, 2018. [130] Qian Zhang, Peng Hao, Jie Lu, and Guangquan Zhang. Cross-domain recommendation with semantic correlation in tagging systems. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019. [131] Qian Zhang, Jie Lu, Dianshuang Wu, and Guangquan Zhang. A cross-domain recommender system with kernel-induced knowledge transfer for overlapping entities. IEEE transactions on neural networks and learning systems, 30(7):1998–2012, 2018. [132] Yian Zhang, Alex Warstadt, Haau-Sing Li, and Samuel R Bowman. When do you need billions of words of pretraining data? arXiv preprint arXiv:2011.04946, 2020. [133] Zhen Zhang, Jiajun Bu, Martin Ester, Jianfeng Zhang, Chengwei Yao, Zhi Yu, and Can Wang. Hierarchical graph pooling with structure learning. arXiv preprint arXiv:1911.05954, 2019. 114 [134] Zhen Zhang, Hongxia Yang, Jiajun Bu, Sheng Zhou, Pinggang Yu, Jianwei Zhang, Martin Ester, and Can Wang. Anrl: Attributed network representation learning via deep neural networks. In IJCAI, volume 18, pages 3155–3161, 2018. 
[135] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering, 2020.
[136] Lili Zhao, Sinno Jialin Pan, Evan Wei Xiang, Erheng Zhong, Zhongqi Lu, and Qiang Yang. Active transfer learning for cross-system recommendation. In Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
[137] Yu Zheng, Chen Gao, Xiangnan He, Yong Li, and Depeng Jin. Price-aware recommendation with graph convolutional networks. In 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020, pages 133–144. IEEE, 2020.
[138] Kaixiong Zhou, Qingquan Song, Xiao Huang, Daochen Zha, Na Zou, and Xia Hu. Multi-channel graph neural networks. arXiv preprint arXiv:1912.08306, 2019.
[139] Feng Zhu, Yan Wang, Chaochao Chen, Guanfeng Liu, and Xiaolin Zheng. A graphical and attentional framework for dual-target cross-domain recommendation. In IJCAI, pages 3001–3008, 2020.
[140] Feng Zhu, Yan Wang, Chaochao Chen, Jun Zhou, Longfei Li, and Guanfeng Liu. Cross-domain recommendation: challenges, progress, and prospects. arXiv preprint arXiv:2103.01696, 2021.
[141] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, pages 2069–2080, 2021.
APPENDIX A
TEST-TIME TRAINING FOR GRAPH NEURAL NETWORKS
A.1 Theorem Proof
A.1.1 Proof of Theorem 1
Denote $\mathcal{L}(y; \mathbf{g}) = -\sum_{c=1}^{C} \mathbb{1}_{y=c} \log\big( e^{g_c} / \sum_{i=1}^{C} e^{g_i} \big)$, and from Eq. 2.15 we have
$$\mathbf{g} = f_{\mathrm{GNN}}(\mathbf{A}, \mathbf{X}; \Theta) = \mathbf{1}^{T} \mathbf{A}^{L} \mathbf{X} \Theta. \tag{A.1}$$
Let $c$ denote the ground-truth class of the label $y$. Then
$$\mathcal{L}(y; \mathbf{g}) = -\sum_{c=1}^{C} \mathbb{1}_{y=c} \log\Big( \frac{e^{g_c}}{\sum_{i=1}^{C} e^{g_i}} \Big) = -\log\Big( \frac{e^{g_c}}{\sum_{i=1}^{C} e^{g_i}} \Big). \tag{A.2}$$
For the Hessian matrix $\mathbf{H}$ of Eq. A.2 with respect to $\mathbf{g}$, we have
$$\mathbf{H}_{ij} = \begin{cases} -p_i p_j, & i \neq j, \\ p_i - p_i^2, & i = j, \end{cases} \tag{A.3}$$
where $p_i$ is defined in Eq. 2.16. For all $\mathbf{a} \in \mathbb{R}^{C}$, we have
$$\mathbf{a}^{T}\mathbf{H}\mathbf{a} = \sum_{i=1}^{C} p_i a_i^2 - \Big( \sum_{i=1}^{C} a_i p_i \Big)^2 = \sum_{i=1}^{C} p_i a_i^2 - \Big( \sum_{i=1}^{C} a_i \sqrt{p_i} \sqrt{p_i} \Big)^2. \tag{A.4}$$
According to the Cauchy-Schwarz inequality, we have
$$\Big( \sum_{i=1}^{C} a_i \sqrt{p_i} \sqrt{p_i} \Big)^2 \le \Big( \sum_{i=1}^{C} a_i^2 p_i \Big)\Big( \sum_{i=1}^{C} p_i \Big). \tag{A.5}$$
Thus, we have
$$\mathbf{a}^{T}\mathbf{H}\mathbf{a} \ge \sum_{i=1}^{C} p_i a_i^2 - \Big( \sum_{i=1}^{C} a_i^2 p_i \Big)\Big( \sum_{i=1}^{C} p_i \Big). \tag{A.6}$$
According to the definition of $p_i$ in Eq. 2.16, we have
$$\sum_{i=1}^{C} p_i = 1. \tag{A.7}$$
Therefore,
$$\mathbf{a}^{T}\mathbf{H}\mathbf{a} \ge 0. \tag{A.8}$$
Thus, we have proved that the Hessian matrix of the function defined in Eq. A.2, i.e., $\mathbf{H}$, is positive semi-definite (PSD). This is equivalent to proving that $\mathcal{L}(y; \mathbf{g})$ is convex in $\mathbf{g}$ (property 1).
For all $\mathbf{b} \in \mathbb{R}^{C}$ with $\|\mathbf{b}\| = 1$, we have
$$\mathbf{b}^{T}\mathbf{H}\mathbf{b} = \sum_{i=1}^{C} p_i b_i^2 - \Big( \sum_{i=1}^{C} b_i p_i \Big)^2 \le \sum_{i=1}^{C} b_i^2 + \Big( \sum_{i=1}^{C} b_i \sqrt{p_i} \sqrt{p_i} \Big)^2. \tag{A.9}$$
According to the Cauchy-Schwarz inequality, we can arrive at
$$\mathbf{b}^{T}\mathbf{H}\mathbf{b} \le \sum_{i=1}^{C} b_i^2 + \Big( \sum_{i=1}^{C} b_i^2 \Big)\Big( \sum_{i=1}^{C} p_i \Big). \tag{A.10}$$
Since $\|\mathbf{b}\| = 1$ and $\sum_{i=1}^{C} p_i = 1$, we have
$$\mathbf{b}^{T}\mathbf{H}\mathbf{b} \le 2. \tag{A.11}$$
We have thus proved that the eigenvalues of the Hessian matrix $\mathbf{H}$ are smaller than 2. This is equivalent to proving that $\mathcal{L}(y; \mathbf{g})$ is $\beta$-smooth in $\mathbf{g}$ (property 2).
By computing the first-order gradient of $\mathcal{L}(y; \mathbf{g})$ over $\mathbf{g}$, we have
$$\big[ \nabla_{\mathbf{g}} \mathcal{L}(y; \mathbf{g}) \big]_i = \begin{cases} -1 + p_i, & i = c, \\ p_i, & i \neq c. \end{cases} \tag{A.12}$$
Then we can derive its $\ell_2$ norm:
$$\| \nabla_{\mathbf{g}} \mathcal{L}(y; \mathbf{g}) \| = \sqrt{ (-1 + p_c)^2 + \sum_{i \neq c} p_i^2 } = \sqrt{ \Big( \sum_{i \neq c} p_i \Big)^2 + \sum_{i \neq c} p_i^2 } \le \sqrt{2} \le 2. \tag{A.13}$$
We have thus proved that the $\ell_2$ norm of the first-order gradient of $\mathcal{L}(y; \mathbf{g})$ over $\mathbf{g}$ is bounded by a positive constant (property 3). So far, we have proved that for all $\mathbf{g}$ and $y$, $\mathcal{L}(y; \mathbf{g})$ is convex and $\beta$-smooth in $\mathbf{g}$, and $\| \nabla_{\mathbf{g}} \mathcal{L}(y; \mathbf{g}) \| \le G$ for all $\mathbf{g}$, where $G$ is a positive constant. According to Eq. A.1, $f_{\mathrm{GNN}}$ is a linear transformation mapping function and does not change these three properties. Therefore, the proof of Theorem 1 is completed.
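The three properties above can also be checked numerically. The short NumPy snippet below is not part of the original analysis; the number of classes and the random logits are arbitrary choices. It verifies, for random inputs, that the Hessian of the softmax cross-entropy is positive semi-definite, that its eigenvalues do not exceed 2, and that the gradient norm never exceeds √2.

```python
import numpy as np

# Numerical sanity check (illustrative only) of the three properties proved above:
# convexity (PSD Hessian), smoothness (eigenvalues <= 2), and a bounded gradient norm.
rng = np.random.default_rng(0)
C = 6                                    # number of classes (arbitrary choice)
for _ in range(1000):
    g = rng.normal(scale=3.0, size=C)    # random logits
    c = int(rng.integers(C))             # ground-truth class
    p = np.exp(g - g.max())
    p /= p.sum()                         # softmax probabilities (the p_i above)

    grad = p.copy()
    grad[c] -= 1.0                       # dL/dg_i = p_i - 1{i = c}
    H = np.diag(p) - np.outer(p, p)      # Hessian of the cross-entropy w.r.t. g

    eig = np.linalg.eigvalsh(H)
    assert eig.min() >= -1e-10                          # property 1: convex in g
    assert eig.max() <= 2.0 + 1e-10                     # property 2: smoothness constant <= 2
    assert np.linalg.norm(grad) <= np.sqrt(2) + 1e-10   # property 3: bounded gradient norm
print("All three properties hold for 1000 random draws.")
```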
A.1.2 Proof of Theorem 2
We follow [101] to prove Theorem 2. For any $\eta$, by $\beta$-smoothness, we have
$$\mathcal{L}_m(\mathbf{x}, y; \theta(\mathbf{x})) = \mathcal{L}_m\big(\mathbf{x}, y; \theta - \eta \nabla_{\theta} \mathcal{L}_s(\mathbf{x}; \theta)\big) \le \mathcal{L}_m(\mathbf{x}, y; \theta) - \eta \big\langle \nabla_{\theta} \mathcal{L}_m(\mathbf{x}, y; \theta), \nabla_{\theta} \mathcal{L}_s(\mathbf{x}; \theta) \big\rangle + \frac{\eta^2 \beta}{2} \big\| \nabla_{\theta} \mathcal{L}_s(\mathbf{x}; \theta) \big\|^2. \tag{A.14}$$
Denote
$$\eta^* = \frac{\big\langle \nabla_{\theta} \mathcal{L}_m(\mathbf{x}, y; \theta), \nabla_{\theta} \mathcal{L}_s(\mathbf{x}; \theta) \big\rangle}{\beta \big\| \nabla_{\theta} \mathcal{L}_s(\mathbf{x}; \theta) \big\|^2}, \tag{A.15}$$
then we have
$$\mathcal{L}_m\big(\mathbf{x}, y; \theta - \eta^* \nabla_{\theta} \mathcal{L}_s(\mathbf{x}; \theta)\big) \le \mathcal{L}_m(\mathbf{x}, y; \theta) - \frac{\big\langle \nabla_{\theta} \mathcal{L}_m(\mathbf{x}, y; \theta), \nabla_{\theta} \mathcal{L}_s(\mathbf{x}; \theta) \big\rangle^2}{2 \beta \big\| \nabla_{\theta} \mathcal{L}_s(\mathbf{x}; \theta) \big\|^2}. \tag{A.16}$$
By the assumptions on the $\ell_2$ norm of the first-order gradients for the main task and the SSL task and the assumption on their inner product, we have
$$\mathcal{L}_m(\mathbf{x}, y; \theta) - \mathcal{L}_m\big(\mathbf{x}, y; \theta - \eta^* \nabla_{\theta} \mathcal{L}_s(\mathbf{x}; \theta)\big) \ge \frac{\epsilon^2}{2 \beta G^2}. \tag{A.17}$$
Denote $\eta = \frac{\epsilon}{\beta G^2}$; based on the assumptions, we have $0 < \eta \le \eta^*$. Denote $\mathbf{t} = \nabla_{\theta} \mathcal{L}_s(\mathbf{x}; \theta)$. By convexity of $\mathcal{L}_m$,
$$\begin{aligned} \mathcal{L}_m(\mathbf{x}, y; \theta(\mathbf{x})) &= \mathcal{L}_m(\mathbf{x}, y; \theta - \eta \mathbf{t}) = \mathcal{L}_m\Big(\mathbf{x}, y; \Big(1 - \frac{\eta}{\eta^*}\Big)\theta + \frac{\eta}{\eta^*}(\theta - \eta^* \mathbf{t})\Big) \\ &\le \Big(1 - \frac{\eta}{\eta^*}\Big)\mathcal{L}_m(\mathbf{x}, y; \theta) + \frac{\eta}{\eta^*}\mathcal{L}_m(\mathbf{x}, y; \theta - \eta^* \mathbf{t}) \\ &\le \Big(1 - \frac{\eta}{\eta^*}\Big)\mathcal{L}_m(\mathbf{x}, y; \theta) + \frac{\eta}{\eta^*}\Big(\mathcal{L}_m(\mathbf{x}, y; \theta) - \frac{\epsilon^2}{2 \beta G^2}\Big) \\ &= \mathcal{L}_m(\mathbf{x}, y; \theta) - \frac{\eta}{\eta^*} \cdot \frac{\epsilon^2}{2 \beta G^2}. \end{aligned} \tag{A.18}$$
Since $\frac{\eta}{\eta^*} > 0$, we have
$$\mathcal{L}_m(\mathbf{x}, y; \theta) - \mathcal{L}_m(\mathbf{x}, y; \theta(\mathbf{x})) > 0. \tag{A.19}$$
A.2 Performance Comparison on Random Data Split
Figure A.1: Performance Comparison of GCN and GIN on Three Datasets on the Random Data Split: (a) DD, (b) ENZYMES, (c) PROTEINS. Note that the performance metric for ogbg-molhiv is ROC-AUC(%) and that for others is Accuracy(%).
In addition to the performance comparison on the OOD data split, we have also conducted a performance comparison on a random data split, where we randomly split the dataset into 80%/10%/10% for the training/validation/test sets, respectively. The results are shown in Figure A.1. From it, we can observe that GT3 can also improve the classification performance in most cases, which demonstrates the effectiveness of the proposed GT3 in multiple scenarios.
APPENDIX B
CUSTOMIZED GRAPH NEURAL NETWORKS
B.1 Datasets
Some key statistics of the seven datasets used in our experiments are shown in Table B.1, and more details are introduced as follows:
• COLLAB [122] is a dataset of scientific collaboration networks, which describes the collaboration patterns of researchers from three different research fields.
• ENZ [95] is a dataset of protein tertiary structures of six classes of enzymes.
• PROT [9] is a dataset of protein structures, where each graph represents a protein and each node represents a secondary structure element (SSE) in the protein.
• D&D [16] is a dataset of protein structures. Each protein is represented as a graph, where each node represents an amino acid and each edge between two nodes denotes that they are less than 6 Ångstroms apart.
• RE-BI and RE-5K [122] are two datasets of online discussion threads crawled from different subreddits of Reddit, where each node represents a user and each edge between two users represents their interaction.
• NCI109 [95] is a dataset of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, which are provided by the National Cancer Institute (NCI).
Table B.1: The statistics of seven datasets. #Graphs denotes the number of graphs. #Class is the number of graph classes. Node size indicates the range, average and standard deviation of the number of nodes among the graphs. Edge size represents the range, average and standard deviation of the number of edges among the graphs.
Dataset   #Graphs  #Class  Node range   Node mean  Node std  Edge range    Edge mean  Edge std
COLLAB    5000     3       [32, 492]    74.5       62.3      [60, 40120]   2457.8     6439.0
ENZ       600      6       [2, 125]     32.6       14.9      [1, 149]      62.14      25.5
PROT      1113     2       [4, 620]     39.1       45.7      [5, 1049]     72.82      84.6
DD        1178     2       [30, 5748]   284.3      272.0     [63, 14267]   715.65     693.9
RE-BI     2000     2       [6, 3782]    429.6      554.0     [4, 4071]     497.8      623.0
RE-5K     4999     5       [22, 3648]   508.5      452.6     [21, 4783]    594.9      566.8
NCI109    4127     2       [4, 111]     29.6       13.6      [3, 119]      32.1       15.0
Figure B.1: The framework of Multi-GCN with 3 clusters. We group training samples into 3 clusters and train a GCN model for each cluster. Given a test sample g_i, we measure the distance between this sample and the centroids of the three clusters and utilize the corresponding model of the closest cluster to perform the prediction for this sample.
B.2 Baselines
In the experiment section, we apply the proposed framework to three graph neural networks: a basic graph convolutional network (GCN) [52] and two state-of-the-art graph classification models, DiffPool [126] and gPool [22]. The corresponding adapted versions are Customized-GCN, Customized-DiffPool, and Customized-gPool, respectively. Besides, we also develop the baseline methods Concat-GCN, Concat-Diff, Concat-gPool, Multi-GCN, Multi-Diff, and Multi-gPool. More details of the baseline methods are as follows:
• GCN [52] is originally proposed for semi-supervised node classification. It consists of a stack of GCN layers, where a new representation of each node is computed via transforming and aggregating the node representations of its neighboring nodes. Finally, a graph representation is generated from the node representations in the last GCN layer via a global max-pooling layer and then used for graph classification.
• DiffPool [126] is a recently proposed method that has achieved state-of-the-art performance on the graph classification task. It proposes a differentiable graph pooling approach to hierarchically generate a graph-level representation by coarsening the input graph level by level.
• gPool [22] is a newly proposed pooling method that has achieved state-of-the-art performance on the graph classification task. It develops a U-Net-like architecture for graph data, consisting of a graph pooling operation and an unpooling operation based on node importance values.
• Concat-GCN (or Concat-Diff, Concat-gPool) is the baseline method that directly concatenates the graph structure properties s_i to the output graph embedding of the GCN, DiffPool, or gPool model.
• Multi-GCN (or Multi-Diff, Multi-gPool) consists of several GCN (or DiffPool, gPool) models trained on different subsets of the training dataset. As shown in Figure B.1, we first cluster the data samples from the training set into different training subsets via the K-means method based on the graph structural information; a minimal sketch of this procedure is given after this list. Note that in this work, the structural information s_i of g_i includes the number of nodes, the number of edges, and the graph density. Then we train a separate model on each training subset. During the test phase, given a test graph sample, we first compute the Euclidean distance between its graph structural properties and the centroids of the different training subsets. Then, we choose the model trained on the closest training subset to predict the label of this graph sample. In this experiment, we set the number of clusters to 2 and 3, and denote the corresponding frameworks as Multi-GCN-2 (or Multi-Diff-2, Multi-gPool-2) and Multi-GCN-3 (or Multi-Diff-3, Multi-gPool-3). Note that we can also ensemble the models trained on all training subsets; however, we empirically find that this strategy does not work.
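The clustering-and-routing procedure of the Multi-GCN baselines can be summarized with the following sketch. It is illustrative rather than the implementation used in the experiments: the structural features follow the description above, while the graph container and the train_gcn routine are hypothetical placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def structural_features(graph):
    """Structural descriptor s_i = (#nodes, #edges, density), as described above.

    `graph.num_nodes` and `graph.num_edges` are hypothetical attributes of the
    graph container used here only for illustration.
    """
    n, e = graph.num_nodes, graph.num_edges
    density = 2.0 * e / (n * (n - 1)) if n > 1 else 0.0
    return np.array([n, e, density], dtype=float)

def fit_multi_gcn(train_graphs, train_labels, k, train_gcn):
    """Cluster the training graphs by structure and train one GCN per cluster.

    `train_gcn` is a placeholder for the per-cluster GCN training routine.
    """
    feats = np.stack([structural_features(g) for g in train_graphs])
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
    models = []
    for c in range(k):
        idx = np.where(kmeans.labels_ == c)[0]
        models.append(train_gcn([train_graphs[i] for i in idx],
                                [train_labels[i] for i in idx]))
    return kmeans, models

def predict_multi_gcn(graph, kmeans, models):
    """Route a test graph to the model of its nearest cluster centroid."""
    s = structural_features(graph).reshape(1, -1)
    c = int(kmeans.predict(s)[0])    # nearest centroid in Euclidean distance
    return models[c](graph)          # assume each trained model is callable on a graph
```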
APPENDIX C
LOCALIZED GRAPH COLLABORATIVE FILTERING
C.1 Further comparison and discussion over two pipelines of GNN-based CF models
The inferior performance of GNN-based methods with sparse interactions may be due to the following two reasons. Firstly, as shown in Figure C.1(a), GNN-based CF methods learn a unique embedding vector e for each user/item from all the historical interactions, which depicts the characteristics of the user/item from a global view. The quality of the learned embeddings is essential for desirable recommendations. The user-item interactions provide the supervision information that LightGCN and NGCF leverage to learn these user and item representations. Such supervision information in a sparse user-item bipartite graph is not sufficient to learn user and item embeddings of high quality, leading to undesirable recommendation performance. Secondly, the stacked GNN layers contribute to leveraging the high-order information by aggregating neighbors' representations, as shown in the middle part of Figure C.1(a). However, given limited interactions, the noises in the learned inferior embeddings may accumulate or even reinforce themselves in this stacked aggregation procedure, which could further degrade the performance [66]. However, the user-item interaction graphs are often sparse in real-world recommendation scenarios [3]. Thus, in this paper, we study how to alleviate the challenge of sparsity while enjoying the merits of high-order topological connectivity. Note that in this work, we focus on addressing the recommendation tasks solely based on the historical interactions, following the previous works [110, 35].
Our motivation lies in learning CF signals from local structures of the user-item interaction graph, and the proposed model is called Localized Graph Collaborative Filtering (LGCF). Particularly, given a user and an item, LGCF makes predictions about their interaction based on their local structural context in the bipartite graph, rather than on the user and item embeddings. Traditional user and item embeddings are eliminated, as LGCF only needs to describe the position of a node relative to the target user and item nodes in the localized graph, which is denoted by the red rectangles in Figure C.1(b).
Figure C.1: GNN-based CF Methods vs. the Proposed Localized Graph-based CF Model. (a) Pipeline of GNN-based CF Methods. (b) Pipeline of the Localized Graph-based CF Model.
Without these intractable user and item embeddings, LGCF can significantly reduce the number of trainable parameters and thus adapt to the sparse scenario. In addition, the critical CF information is also apparently preserved in the localized graph. The key CF information, i.e., the historical interactions between users and items, is denoted by edges. Therefore, a localized structure covering most edges related to a user-item pair should be able to provide sufficient information to make predictions for this user-item pair. In other words, recommendation-related patterns can be encoded by the localized structures of user-item pairs, which motivates LGCF to make recommendations by extracting these patterns. The range of the localized structures can be flexibly adjusted to incorporate proper high-order information. In fact, LGCF provides a new perspective for building GNN-based CF models for recommendations, where we do not learn a general representation for each user or item but instead aim at learning a good model to extract the recommendation-related patterns encoded by the localized structures of a user-item pair.
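To make the notion of a localized graph concrete, the sketch below extracts the local structural context of a target user-item pair from a bipartite interaction graph using NetworkX. It is a simplified illustration under assumed inputs (an undirected interaction graph and a hop limit), not the exact LGCF implementation; the distance-based node labels only stand in for the positional node labeling idea described above.

```python
import networkx as nx

def localized_graph(g, user, item, num_hops=2):
    """Extract the localized graph around a target (user, item) pair.

    The subgraph is induced by the nodes within `num_hops` hops of either
    target node, and each node is labeled by its pair of shortest-path
    distances to the target user and item (an illustrative labeling).
    """
    nodes = set(nx.single_source_shortest_path_length(g, user, cutoff=num_hops))
    nodes |= set(nx.single_source_shortest_path_length(g, item, cutoff=num_hops))
    sub = g.subgraph(nodes).copy()

    du = nx.single_source_shortest_path_length(sub, user)
    di = nx.single_source_shortest_path_length(sub, item)
    for n in sub.nodes:
        # positional label relative to the targets; unreachable nodes get -1
        sub.nodes[n]["label"] = (du.get(n, -1), di.get(n, -1))
    return sub

# Usage with a hypothetical bipartite graph (user nodes "u*", item nodes "i*"):
g = nx.Graph([("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i3"), ("u3", "i3")])
local = localized_graph(g, "u1", "i1", num_hops=2)
print(local.number_of_nodes(), local.number_of_edges())
```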
C.2 Model Size Analysis
In this subsection, we discuss and compare the model sizes of the proposed LGCF and existing representative CF methods. In this analysis, we assume that: 1) there are L layers in the GNN model for the GNN-based CF methods, and all layers have the same dimension d; 2) the dimension of the user/item embeddings for the embedding-based CF methods is d_0; and 3) there are n users and m items in the input user-item bipartite graph. For MF, it only needs to learn the user/item embeddings. Thus, its model size is the embedding size, i.e., (n + m)d_0. For NGCF, it needs to learn both the user/item embeddings and a GNN model, and thus its model size is the sum of the embedding size and the GNN model size. Note that there are two GNN models in NGCF: one for user embedding propagation and the other for item embedding propagation. The weight matrix in the l-th layer has size d^2, thus the GNN model size in total is 2Ld^2. For LightGCN, it simplifies the GNN model by eliminating feature transformation, therefore its model size is the same as that of MF. For the proposed LGCF, it does not need to learn embeddings for users and items; thus, its model size is only the sum of the GNN model size and the scoring function size. Note that we adopt a single linear layer for the scoring function, so its model size is d_L, where d_L denotes the output dimension of the GNN model and also the input dimension of the scoring function. The model sizes of LGCF, MF, NGCF and LightGCN are summarized in Table C.1.
Table C.1: Model Size Analysis.
Model      Size
MF         (n + m)d_0
NGCF       (n + m)d_0 + 2Ld^2
LightGCN   (n + m)d_0
LGCF       Ld^2 + d_L
As shown in the table, the model sizes of the baselines are all proportional to the graph size (n + m), which indicates that these models become very large when the input graph is large. However, the model size of LGCF is independent of the graph size.
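As a concrete illustration of these formulas, the snippet below plugs in hypothetical values (n = 10,000 users, m = 2,000 items, d_0 = d = d_L = 64, L = 3); the numbers are made up solely to show that the baselines scale with (n + m) while LGCF does not.

```python
# Illustrative parameter counts based on the formulas in Table C.1.
# All values below are hypothetical and chosen only to show the scaling behaviour.
n, m = 10_000, 2_000          # numbers of users and items
d0, d, L, dL = 64, 64, 3, 64  # embedding size, GNN width, number of layers, GNN output size

sizes = {
    "MF":       (n + m) * d0,                 # user/item embeddings only
    "NGCF":     (n + m) * d0 + 2 * L * d**2,  # embeddings plus two L-layer GNNs
    "LightGCN": (n + m) * d0,                 # embeddings only (no transformations)
    "LGCF":     L * d**2 + dL,                # one GNN plus a linear scoring layer
}
for name, size in sizes.items():
    print(f"{name:9s} {size:>9,d} parameters")
# MF, NGCF and LightGCN grow linearly with (n + m); LGCF does not depend on it.
```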
C.3 Datasets
We conduct extensive experiments on recommendation datasets from three scenarios. The details are introduced as follows:
• Tianchi: This is a user behavior dataset provided by Alibaba for recommendations. There are millions of user-item interactions recorded in this dataset, including clicking, purchasing and so on. Each interaction consists of a user ID, an item ID, an item category ID, a behavior type and a timestamp. Note that in our experiments, if there exists a behavior between a user and an item, we consider it as an interaction between them. Specifically, we extract seven user-item interaction datasets from different item categories and denote each of them as Tianchi-ID, where ID indicates the item category. It is a common practice, widely adopted by existing works [71, 110, 35], to extract user-item interaction datasets based on item categories from a large recommendation dataset.
• Amazon: This is a product review dataset from Amazon, which includes reviews (ratings or texts), product information (e.g., descriptions, category and so on), and user-item interactions (i.e., viewing or purchasing). We select user-item interactions to construct two datasets for items from two categories in Amazon, i.e., Amazon-Beauty and Amazon-Gift.
• MovieLens: This dataset describes 5-star ratings from MovieLens, a movie recommendation service. Each rating record consists of a user ID, a movie ID, a rating score and the category of the movie. In our experiments, we consider each rating as one interaction, regardless of its score.
In summary, we construct seven datasets corresponding to seven item categories from Tianchi (denoted as Tianchi-ID, where ID indicates the item category), two datasets from two item categories from Amazon, and two datasets from two item categories from MovieLens. The statistics of the datasets are summarized in Table C.2.
Table C.2: The Statistics of the Datasets from Tianchi, Amazon and MovieLens, respectively.
Dataset              #Users   #Items   #Edges
Tianchi-685988       4,151    539      7,492
Tianchi-2798696      2,579    94       2,988
Tianchi-4527720      1,853    137      2,860
Tianchi-174490       2,267    285      3,826
Tianchi-61626        2,305    764      3,270
Tianchi-810632       5,057    555      9,402
Tianchi-3937919      1,846    158      2,980
Amazon-Beauty        991      85       4,092
Amazon-Gift          458      148      2,966
MovieLens-War        2,970    89       4,240
MovieLens-Romance    20,407   376      33,884
C.4 Baselines and Evaluation Metrics
In the experimental parts, we compare the proposed LGCF with MF [53], NGCF [110] and LightGCN [35]. More details of the baseline methods are as follows:
• MF: MF aims at directly learning the latent user and item embeddings via minimizing the Bayesian Personalized Ranking (BPR) loss. The recommendation prediction is made based on the inner product of the user embedding and the item embedding.
• NGCF: NGCF explicitly encodes the multi-hop connectivity into the embedding learning process. It refines the user and item embeddings via propagation layers, and then makes recommendation predictions based on the refined user embedding and the refined item embedding.
• LightGCN: LightGCN is a GNN-based CF method that is customized for recommendations. Specifically, it only includes neighborhood aggregation in GNNs for collaborative filtering, while discarding other components such as feature transformation.
We randomly split the interactions in a dataset into three sets: the training set, the validation set and the test set. To avoid the cold-start setting, we further constrain that all users and items in the validation set and the test set should exist in the training set. More details about the data split will be introduced later. For each user-item interaction in the validation set or the test set, we generate 49 non-existent user-item interactions that share the same user with this user-item interaction. In other words, for the user in a user-item interaction, there are 50 candidate user-item pairs, where only one is positive and the others are negative. During the model evaluation, we compute recommendation scores for the 50 candidate user-item pairs of each user-item pair. Then, we rank these scores and calculate the Hit Rate (HR) of the positive pair in the top 5 among the 50 candidates and its Normalized Discounted Cumulative Gain (NDCG). Finally, we average HR and NDCG over all user-item interactions in the validation set or the test set. Specifically, we evaluate every model on each dataset with 10 seeds and then report the average performance. In this work, we only report the HR performance since the NDCG performance is similar and consistent with the HR performance.
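The ranking metrics described above can be computed as in the following sketch, which assumes one positive pair scored together with its 49 sampled negatives; the function and the randomly generated scores are illustrative and not the evaluation code used in the experiments.

```python
import numpy as np

def hr_and_ndcg(pos_score, neg_scores, k=5):
    """HR@k and NDCG@k for one positive pair among its sampled negatives."""
    scores = np.concatenate(([pos_score], neg_scores))   # index 0 is the positive pair
    rank = int((scores > scores[0]).sum())                # 0-based rank of the positive
    hit = 1.0 if rank < k else 0.0
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0   # DCG of a single relevant item
    return hit, ndcg

# Illustrative evaluation loop over hypothetical validation/test interactions.
rng = np.random.default_rng(0)
hits, ndcgs = [], []
for _ in range(1000):                     # 1000 made-up evaluation pairs
    pos = rng.normal(loc=0.5)             # score of the true user-item pair
    negs = rng.normal(size=49)            # scores of the 49 sampled negatives
    h, n = hr_and_ndcg(pos, negs, k=5)
    hits.append(h)
    ndcgs.append(n)
print(f"HR@5 = {np.mean(hits):.3f}, NDCG@5 = {np.mean(ndcgs):.3f}")
```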
C.5 How is LGCF complementary to LightGCN
To explore what kind of user-item pairs can benefit from integrating LGCF and LightGCN, we divide the test set in the normal scenarios into different groups based on the average degree of the user and the item of a user-item pair. Specifically, we first sort all the user-item pairs in the test set in ascending order according to their average degree and divide the sorted pairs into 5 groups. We then train models on the same training set and test them on these groups. The performance of LightGCN, LGCF, and LGCF-ens on six different datasets is shown in Figure C.2. It can be observed that the performance of all methods tends to increase when the degree increases. In most cases, for all degree groups, LGCF-ens (i.e., integrating LGCF and LightGCN) can boost the recommendation performance.
Figure C.2: Performance Analysis with Degree on (a) Tianchi-61626, (b) Tianchi-810632, (c) Tianchi-3937919, (d) Amazon-Gift, (e) MovieLens-War, and (f) MovieLens-Romance.
APPENDIX D
AN ADAPTIVE GRAPH PRE-TRAINING FRAMEWORK FOR LOCALIZED COLLABORATIVE FILTERING
D.1 Datasets
We have summarized the graph property statistics of the 60 pre-training graphs in Table D.1.
D.2 Baselines
The details of these baselines are presented as follows:
• NGCF: NGCF [110] proposes to encode the high-order connectivity of user-item interaction graphs into user/item embeddings. Specifically, it designs propagation layers to aggregate information from connected nodes. The recommendation prediction is made based on the refined user and item embeddings.
• LightGCN: LightGCN [35] is one of the state-of-the-art GNN-based recommendation methods. It tailors GNNs particularly for recommendation tasks. Specifically, it eliminates feature transformation and nonlinear activation, and only includes neighborhood aggregation in GNNs for collaborative filtering.
• MF: MF [53] is one of the most classic and popular recommendation methods. It learns a set of latent user and item embeddings directly from user-item interactions, and then makes recommendation predictions based on the inner product of the user embedding and the item embedding.
• GCC: GCC [84] is a self-supervised pre-training framework for GNNs. GCC designs a subgraph instance discrimination task as the pre-training task and utilizes contrastive learning to empower GNNs to learn transferable structural representations across multiple graphs. Note that GCC is not specially designed for recommendation tasks. To apply GCC to the recommendation scenarios, we first pre-train a GCN model via the GCC framework on
Table D.1: Graph property statistics of the 60 pre-training graphs from Tianchi.
APPENDIX D

AN ADAPTIVE GRAPH PRE-TRAINING FRAMEWORK FOR LOCALIZED COLLABORATIVE FILTERING

D.1 Datasets

We have summarized the graph property statistics of the 60 pre-training graphs in Table D.1.

Table D.1: Graph property statistics of the 60 pre-training graphs from Tianchi. Each row lists, in order: graph id, the number of nodes, the number of edges, the user-item ratio, the graph density, the degree assortativity coefficient, the Robins-Alexander clustering coefficient, the number of connected components, and the global efficiency.
1 1066 1426 2.405750799 0.006050346 -0.137080382 0.07622926 58 0.155483363
2 583 961 5.477777778 0.021658778 -0.324017825 0.088734101 15 0.25966763
3 869 1311 4.570512821 0.0117866 -0.229564535 0.075031464 29 0.255150568
4 976 1483 5.731034483 0.012307565 -0.395316067 0.0805384 42 0.21907052
5 825 996 1.272727273 0.005938965 -0.221096689 0.081102362 67 0.116534298
6 547 723 3.240310078 0.013408256 -0.240459448 0.044851669 14 0.204873494
7 442 871 3.20952381 0.02461495 -0.438782979 0.134212489 8 0.28049023
8 879 1606 4.598726115 0.014168005 -0.28871791 0.108867024 17 0.261053578
9 688 1068 3.845070423 0.013774957 -0.391439392 0.08856928 15 0.252946534
10 820 1046 6.192982456 0.012996372 -0.401747324 0.072571549 24 0.189789498
11 1613 2441 7.489473684 0.009028369 -0.364616349 0.060301114 28 0.246112989
12 1828 4100 4.69470405 0.008475505 -0.263666513 0.115984126 21 0.259046914
13 2178 3621 11.88757396 0.010665025 -0.468666839 0.117659122 10 0.233061438
14 2182 3784 7.693227092 0.007807197 -0.246347149 0.070869752 33 0.241451953
15 1455 2201 3.575471698 0.006087409 -0.203866274 0.115162664 49 0.165451187
16 3112 4364 9.103896104 0.005053078 -0.271381784 0.045556577 45 0.237112946
17 1791 2589 10.05555556 0.009810609 -0.423324113 0.062591444 15 0.247061121
18 1632 2319 7.201005025 0.008132077 -0.292355843 0.08403141 45 0.19445713
19 2133 2901 1.83643617 0.00279342 -0.173274517 0.060554047 122 0.133096639
20 2191 3995 7.834677419 0.008290721 -0.378478248 0.091360009 17 0.257314823
21 2986 4045 9.514084507 0.005271265 -0.334648858 0.018636375 16 0.349629458
22 1102 2003 5.482352941 0.01264201 -0.391299794 0.088862343 11 0.249883801
23 1667 2660 10.34013605 0.011904762 -0.262163651 0.057379615 8 0.282671889
24 1264 1858 6.479289941 0.010040258 -0.401666782 0.056050426 17 0.231782658
25 1224 2145 3.25 0.007957176 -0.236265509 0.071828792 38 0.212799377
26 1105 1858 4.846560847 0.010732192 -0.278355787 0.081693911 15 0.247897105
27 1631 3520 18.65060241 0.027396407 -0.528782764 0.140792463 6 0.284775151
28 1344 1786 6.466666667 0.008524246 -0.295103378 0.267301969 31 0.196343865
29 2400 3008 7.391608392 0.004975157 -0.249780826 0.076222519 64 0.140260529
30 2231 3660 5.639880952 0.00574821 -0.360584068 0.076936354 61 0.208970493
31 2432 4992 11.28282828 0.011285641 -0.358731221 0.099939883 24 0.276760795
32 4669 7776 4.182019978 0.002290449 -0.272368625 0.052135774 82 0.203264965
33 4144 5838 5.6304 0.00265439 -0.231825096 0.073917706 161 0.161605432
34 4224 6530 5.148471616 0.002687332 -0.180418703 0.046033635 112 0.188019456
35 4461 7107 7.764243615 0.003533065 -0.20104595 0.050613051 70 0.219329989
36 4380 6090 8.106029106 0.003247274 -0.239028273 0.062308213 117 0.20649004
37 3745 7738 3.734513274 0.00331163 -0.245603955 0.063679839 75 0.235388263
38 2168 4729 3.6825054 0.005990512 -0.238712029 0.061400362 26 0.235880829
39 4123 5174 21.90555556 0.007289994 -0.27692024 0.059894654 40 0.219759568
40 2930 4631 2.123667377 0.002478464 -0.159829692 0.03445955 130 0.164731802
41 4082 6653 3.38453276 0.002267877 -0.249515119 0.047776691 116 0.177176302
42 2406 4571 2.51754386 0.003880806 -0.245047333 0.046862868 63 0.19723984
43 2659 7717 5.581683168 0.008470725 -0.306202709 0.074841219 10 0.27057376
44 4892 7366 7.814414414 0.003060196 -0.184142643 0.040597988 30 0.251175191
45 3372 6459 5.386363636 0.00430132 -0.307159357 0.047727729 32 0.236428669
46 2413 5228 3.339928058 0.005063477 -0.311985584 0.065333207 46 0.208494542
47 3130 5607 5.804347826 0.004565217 -0.288644888 0.050126535 54 0.220249111
48 3921 5942 1.584706658 0.001629343 -0.181370456 0.026907729 161 0.160805618
49 2654 4640 2.005662514 0.002967145 -0.202301709 0.070429402 73 0.183938794
50 3379 5260 4.287949922 0.003004238 -0.271414349 0.05995576 142 0.182738023
51 5401 9717 6.060130719 0.002739853 -0.27968436 0.044058991 117 0.197594675
52 5364 9430 15.05988024 0.005613028 -0.248281597 0.056374776 29 0.276165783
53 6849 10503 3.97747093 0.001394664 -0.132708191 0.058927812 195 0.16746493
54 6479 10583 10.90992647 0.003277851 -0.253056336 0.049755793 76 0.229149938
55 5865 8956 5.495016611 0.001998801 -0.251292838 0.060734676 167 0.174011088
56 12284 19924 5.478902954 0.001011594 -0.259199682 0.038720139 332 0.16613786
57 8056 18171 8.154545455 0.002877489 -0.282159789 0.060062896 49 0.250108737
58 7927 17428 17.05694761 0.005301725 -0.366123756 0.102245138 22 0.280886077
59 6970 19142 4.813177648 0.002766413 -0.290319073 0.046690635 72 0.237782573
60 7131 14532 7.869402985 0.002856745 -0.210269642 0.063454378 101 0.226749207

D.2 Baselines

The details of these baselines are presented as follows:

• NGCF: NGCF [110] proposes to encode the high-order connectivity of user-item interaction graphs into user/item embeddings. Specifically, it designs propagation layers to aggregate information from connected nodes. The recommendation prediction is made based on the refined user and item embeddings.

• LightGCN: LightGCN [35] is one of the state-of-the-art GNN-based recommendation methods, tailoring GNNs particularly for recommendation tasks. Specifically, it eliminates feature transformation and nonlinear activation and keeps only neighborhood aggregation in GNNs for collaborative filtering.

• MF: MF [53] is one of the most classic and popular recommendation methods. It learns latent user and item embeddings directly from user-item interactions and then makes recommendation predictions based on the inner product of the user embedding and the item embedding.

• GCC: GCC [84] is a self-supervised pre-training framework for GNNs. It designs a subgraph instance discrimination task as the pre-training task and utilizes contrastive learning to empower GNNs to learn transferable structural representations across multiple graphs. Note that GCC is not specially designed for recommendation tasks. To apply GCC to the recommendation scenario, we first pre-train a GCN model via the GCC framework on the pre-training graphs and then build a scoring function model specific to recommendations in the fine-tuning phase. The fine-tuning objective function is the same as that of ADAPT.

• InfoGraph: InfoGraph [99] is an unsupervised graph learning method based on contrastive learning. It trains GNNs by maximizing the mutual information between the local-level representation and the graph-level representation. To apply InfoGraph to the proposed scenario, we pre-train a GCN model via InfoGraph on the pre-training graphs and then fine-tune the GCN model and train a recommendation scoring function model on the downstream graph.

• MVGRL: MVGRL [31] is also an unsupervised graph learning method based on contrastive learning. It contrasts the node-level representation and the graph-level representation from a graph diffusion view and a regular view. To apply MVGRL to the proposed scenario, we first pre-train a GCN model following MVGRL on the pre-training graphs and then fine-tune it on the downstream graph while also training a recommendation scoring function model.

D.3 Implementation Details

In our work, we implement the meta-LGNN model as a 3-layer graph convolutional network and the scoring function as a linear transformation layer. For the input of the GNN adaptor, we compute eight graph properties via the networkX package [26]: the number of nodes, the number of edges, the user-item ratio, the graph density, the degree assortativity coefficient, the Robins-Alexander clustering coefficient, the number of connected components, and the global efficiency. Note that it is straightforward to include more graph properties. We use the Adam [50] optimizer. The batch size is set to 256, the learning rate is chosen from [0.0005, 0.001], and the dropout rate is chosen from [0.0, 0.1, 0.2, 0.3].
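For reference, the eight graph properties that form the GNN adaptor input in D.3 can all be obtained from the networkX package. The sketch below assumes the pre-training graph is an undirected bipartite user-item graph whose nodes carry a 'bipartite' attribute (0 for users, 1 for items); that attribute convention, and reading the user-item ratio as the number of users divided by the number of items, are assumptions made for illustration rather than details confirmed by the text.

```python
import networkx as nx
from networkx.algorithms import bipartite

def graph_properties(G):
    """Compute the eight graph properties listed in D.3 for an undirected
    bipartite user-item interaction graph G."""
    users = [n for n, d in G.nodes(data=True) if d.get("bipartite") == 0]
    items = [n for n, d in G.nodes(data=True) if d.get("bipartite") == 1]
    return {
        "num_nodes": G.number_of_nodes(),
        "num_edges": G.number_of_edges(),
        "user_item_ratio": len(users) / max(len(items), 1),   # assumed to be #users / #items
        "graph_density": nx.density(G),
        "degree_assortativity": nx.degree_assortativity_coefficient(G),
        "robins_alexander_clustering": bipartite.robins_alexander_clustering(G),
        "num_connected_components": nx.number_connected_components(G),
        "global_efficiency": nx.global_efficiency(G),
    }
```

Note that nx.global_efficiency averages pairwise efficiencies via shortest-path computations, so it dominates the runtime of this feature extraction on the larger pre-training graphs.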