ADVANCED OPERATORS FOR GRAPH NEURAL NETWORKS By Yao Ma A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy 2021

ABSTRACT
ADVANCED OPERATORS FOR GRAPH NEURAL NETWORKS By Yao Ma
Graphs, which encode pairwise relations between entities, are a kind of universal data structure for many real-world data, including social networks, transportation networks, and chemical molecules. Many important applications on these data can be treated as computational tasks on graphs. For example, friend recommendation in social networks can be regarded as a link prediction task, and predicting properties of chemical compounds can be treated as a graph classification task. An essential step to facilitate these tasks is to learn vector representations either for nodes or for entire graphs. Given its great success in representation learning for images and text, deep learning offers great promise for graphs. However, compared to images and text, deep learning on graphs faces immense challenges. Graphs are irregular: nodes are unordered, and each node can have a distinct number of neighbors. Thus, traditional deep learning models cannot be directly applied to graphs, which calls for dedicated efforts to design novel deep graph models. To help meet this pressing demand, we developed and investigated novel GNN algorithms that generalize deep learning techniques to graph-structured data. Two key operations in GNNs are the graph filtering operation, which aims to refine node representations, and the graph pooling operation, which aims to summarize node representations to obtain a graph representation. In this thesis, we provide deep understandings of, or develop novel algorithms for, these two operations from new perspectives. For graph filtering operations, we propose a unified framework from the perspective of graph signal denoising, which demonstrates that most existing graph filtering operations are conducting feature smoothing. Then, we further investigate what information typical graph filtering operations can capture and how they can be understood beyond feature smoothing. For graph pooling operations, we study the procedure of pooling from the perspective of graph spectral theory and present a novel graph pooling operation. We also propose a technique to downsample nodes considering both node importance and representativeness, which leads to another novel graph pooling operation.

To my parents, siblings, and entire family for their love and support.

ACKNOWLEDGEMENTS
First and foremost, I would like to thank my supervisor, Dr. Jiliang Tang, for his invaluable advice, inspiration, encouragement, and support during the course of my Ph.D. degree. I have learned so much from him throughout these years, ranging from finding and addressing challenging research problems, writing insightful papers, and giving attractive presentations, to efficiently mentoring students and managing a research lab. Dr. Tang is not only a supervisor in academia but also a sincere friend who guided me with his knowledge and experience in many aspects of life. Dr. Tang, I cannot thank you enough. I would like to express my deepest appreciation to my dissertation committee members, Dr. Anil K. Jain, Dr. Charu C. Aggarwal, Dr. Pang-Ning Tan, Dr. Kenneth A. Frank, and Dr. Nitesh V. Chawla, for their insightful comments and helpful suggestions.
I was fortunate to work as interns at JD.com and Snap Research with amazing colleagues and mentors: Zhaochun Ren, Ziheng Jiang, Ziyi Guo, Yuan He, and Dawei Yin from JD.com; and Neil Shah, Yozen Liu, Tong Zhao, and Leonardo Neves from Snap Research. I enjoyed three wonderful and productive summers with you. I am grateful to have had the pleasure and fortune of having supportive and encouraging friends and colleagues during my Ph.D. I am thankful to all my colleagues from the Data Science and Engineering Lab: Ibrahim Ahmed, Dr. Meznah Almutairy, Aaron Brookhouse, Jamell Dacon, Dr. Wenqi Fan, Bryan Hendryx, Dr. Jiangtao Huang, Wei Jin, Cassidy Johnson, Dr. Hamid Karimi, Dr. Tyler Derr, Juanhui Li, Yaxin Li, Haochen Liu, Hua Liu, Xiaorui Liu, Daniel K. Ofori-Dankwa, Namratha Shah, Hannah Striebel, Pegah Varghaei, Chenxing Wang, Wentao Wang, Xiaoyang Wang, Dr. Xin Wang, Yiqi Wang, Dr. Zhiwei Wang, Han Xu, Hansheng Zhao, Yongqi Han, Charles L. Green, and Dr. Xiangyu Zhao. In particular, thanks to my DSE collaborators: Ibrahim Ahmed, Dr. Wenqi Fan, Dr. Jiangtao Huang, Dr. Tyler Derr, Wei Jin, Juanhui Li, Haochen Liu, Hua Liu, Xiaorui Liu, Daniel K. Ofori-Dankwa, Wentao Wang, Xiaoyang Wang, Dr. Xin Wang, Yiqi Wang, Dr. Zhiwei Wang, Han Xu, Hansheng Zhao, Yongqi Han, Charles L. Green, and Dr. Xiangyu Zhao. v I would like to extend my sincere thanks to all my collaborators from outside the DSE Lab: Dr. Anil Jain, Dr. Charu K. Aggarwal, Dr. Joseph Thekinen, Hasan Bayhan, Dr. Sinem Mollaoglu, Eric Zhao, Dr. Lingfei Wu, Dr. Tengfei Ma, Dr. Yu Rong, Dr. Tingyang Xu, Dr. Junzhou Huang, Dr. Wenbing Huang, Dr. Hong Cheng, Dr. Zitao Liu, Dr. Hui Liu, Dr. Jianping Wang, Dr. Qing Li, Dr. Xianfeng Tang, Dr. Yuan He, Dr. Caiyan Jia, Dr. Jian Yu, Debayan Deb, Tong Zhao, Yozen Liu, and Dr. Neil Shah. I look forward to continued collaboration with you. Finally, I would like to thank my family, for their unconditional love and support. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Dissertation Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Computational Tasks on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 Semi-supervised Node Classification Task . . . . . . . . . . . . . . . . . . 7 2.2.2 Graph Classification Task . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.3 An Overview of Machine Learning Approaches for Tasks on Graphs . . . . 8 2.3 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.1 Graph Filtering Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.2 Graph Pooling Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.3 General Frameworks of Graph Neural Networks . . . . . . . . . . . . . . . 11 2.3.3.1 General GNN Framework for Node-focus Tasks . . . . . . . . . . 11 2.3.3.2 General GNN Framework for Graph-focus Tasks . . . . . . . . . 
12 CHAPTER 3 A UNIFIED VIEW ON GRAPH FILTERING OPERATIONS AS GRAPH SIGNAL DENOISING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2 Representative Graph Neural Networks Models . . . . . . . . . . . . . . . . . . . 14 3.2.1 Graph Convolutional Networks (GCN) . . . . . . . . . . . . . . . . . . . . 14 3.2.2 Graph Attention Networks (GAT) . . . . . . . . . . . . . . . . . . . . . . 15 3.2.3 Personalized Propagation of Neural Predictions (PPNP) . . . . . . . . . . . 15 3.3 Graph filtering operations as graph signal denoising . . . . . . . . . . . . . . . . . 16 3.3.1 Connection to PPNP layers and APPNP layers . . . . . . . . . . . . . . . . 17 3.3.2 Connection to GCN layers . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3.3 Connection to GAT layers . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.4 Ugnn: A Unified Framework for Graph Filtering Operations via Graph Signal Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5 Ada-Ugnn: Adaptive Local Smoothing with Ugnn . . . . . . . . . . . . . . . . . 23 3.5.1 Feature Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.5.2 Feature Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.6 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6.1 Node Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6.1.1 Datasets and Experimental Settings . . . . . . . . . . . . . . . . 26 vii 3.6.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.6.3 Node Classification Accuracy For Nodes with Low-level and High-level Local Label Smoothness . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.6.3.1 Performance Comparison . . . . . . . . . . . . . . . . . . . . . 29 3.6.4 Robustness Under Adversarial Attacks . . . . . . . . . . . . . . . . . . . . 31 3.6.5 Investigation on Number of Gradient Descent Steps in ADA-UGNN . . . . 34 3.7 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 CHAPTER 4 IS HOMOPHILY A NECESSITY FOR GRAPH NEURAL NETWORKS? . 36 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.1 Homophily in Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.2 Graph Convolutional Networks . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3 Graph Convolutional Networks under Heterophily . . . . . . . . . . . . . . . . . . 39 4.3.1 When does GCN learn similar embeddings? . . . . . . . . . . . . . . . . . 40 4.3.2 How does classification performance change over the homophily-heterophily spectrum? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.3.2.1 Targeted Heterophilous Edge Addition . . . . . . . . . . . . . . 48 4.3.2.2 Introducing noise to neighborhood distributions . . . . . . . . . . 50 4.4 Revisiting GCN’s Performance on Real-world Graphs . . . . . . . . . . . . . . . . 53 4.4.1 Evaluating Node Classification on Homophilous and Heterophilous Bench- marks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.4.2 Investigating GCN performance on Homophilous and Heterophilous Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.5 Related Work . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 CHAPTER 5 EIGENPOOLING: A GRAPH POOLING OPERATION FROM THE SPECTRAL GRAPH THEORY PERSPECTIVE . . . . . . . . . . . . . . . 57 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.2 The Proposed Framework – EigenGCN . . . . . . . . . . . . . . . . . . . . . . . . 58 5.2.1 An Overview of EigenGCN . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.2.2 Graph Coarsening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.3 Eigenvector-Based Pooling – EigenPooling . . . . . . . . . . . . . . . . . . 62 5.2.3.1 Graph Fourier Transform . . . . . . . . . . . . . . . . . . . . . . 63 5.2.3.2 The Design of Pooling Operations . . . . . . . . . . . . . . . . . 64 5.3 Theoretical Analysis of EigenPooling . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.3.1 A Local View of EigenPooling . . . . . . . . . . . . . . . . . . . . . . . . 66 5.3.2 A Global View of EigenPooling . . . . . . . . . . . . . . . . . . . . . . . 67 5.3.3 Permutation Invariance of EigenGCN . . . . . . . . . . . . . . . . . . . . 70 5.4 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4.2 Baselines and Experimental Settings . . . . . . . . . . . . . . . . . . . . . 72 5.4.3 Performance on Graph Classification . . . . . . . . . . . . . . . . . . . . . 73 5.4.4 Understanding Graph Signals . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 viii CHAPTER 6 REPPOOL: A GRAPH POOLING OPERATION WITH REPRESENTA- TIVENESS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.2.1 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.2.2 Graph Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.3 The Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.3.1 An Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.3.2 Node Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 6.3.2.1 Node Importance . . . . . . . . . . . . . . . . . . . . . . . . . . 84 6.3.2.2 Node Representativeness . . . . . . . . . . . . . . . . . . . . . . 85 6.3.2.3 Node Selection Algorithm . . . . . . . . . . . . . . . . . . . . . 86 6.3.3 Coarser Graph Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.3.3.1 Generating the Feature Matrix . . . . . . . . . . . . . . . . . . . 87 6.3.3.2 Generating The Adjacency Matrix . . . . . . . . . . . . . . . . . 88 6.3.4 Time Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.4 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.4.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 6.4.3 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.4.4 Graph Classification Performance . . . . . . . . . . . . . . . . . . . . . . 92 6.4.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . 93 6.4.6 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 6.4.7 Parameter Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 CHAPTER 7 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 ix LIST OF TABLES Table 3.1: Dataset summary statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Table 3.2: Node Classification Accuracy on Various Datasets . . . . . . . . . . . . . . . . 29 Table 4.1: SSNC accuracy on two heterophilous datasets: standard GCN outperforms all heterophily-specific models on these graphs. . . . . . . . . . . . . . . . . . . . . 39 Table 4.2: GNN node classification performance (accuracy) on homophilous and het- erophilous graphs. Standard GCN or MLP+GCN methods (top) outperform or achieve comparable performance to heterophily-specific methods (bottom). . . . 55 Table 5.1: Statistics of datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Table 5.2: Performance comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Table 6.1: Statistics of the eight datasets. Graphs: total number of graphs. Nodes (avg): the average number of nodes. Edges (avg): the average number of edges. #Class: the total number of classes. . . . . . . . . . . . . . . . . . . . . . . . . 90 Table 6.2: Comparisons of graph classification performance in terms of accuracy. Note that we highlight the best accuracy result in bold. . . . . . . . . . . . . . . . . . 91 Table 6.3: Graph classification performance for RepPool and its variants. . . . . . . . . . . 93 Table A.1: 𝐾 and ℎ values for generated graphs based on Cora . . . . . . . . . . . . . . . 103 Table A.2: 𝐾 and ℎ values for generated graphs based on Citeseer . . . . . . . . . . . . . 103 Table A.3: 𝐾 and ℎ values for generated graphs based on Squirrel . . . . . . . . . . . . . 104 Table A.4: 𝐾 and ℎ values for generated graphs based on Chameleon . . . . . . . . . . . . 104 Table A.5: Extended 𝐾 and ℎ values for generated graphs based on Cora . . . . . . . . . . 105 Table A.6: Benchmark dataset summary statistics. . . . . . . . . . . . . . . . . . . . . . . . 106 x LIST OF FIGURES Figure 2.1: An overview of machine learning approaches for node-focus tasks . . . . . . . . 7 Figure 2.2: An overview of machine learning approaches for graph-focus tasks . . . . . . . 8 Figure 2.3: An overview of graph filtering operation . . . . . . . . . . . . . . . . . . . . . 9 Figure 2.4: An overview of graph pooling operation . . . . . . . . . . . . . . . . . . . . . 10 Figure 2.5: A general GNN framework for node-focus tasks . . . . . . . . . . . . . . . . . 11 Figure 2.6: A general GNN framework for graph-focus tasks . . . . . . . . . . . . . . . . . 12 Figure 3.1: Distribution of local label smoothness (homophily) on different graph datasets: note the non-homogeneity of smoothness values. . . . . . . . . . . . . . . . . . 28 Figure 3.2: Accuracy for nodes with low and high local label smoothness. . . . . . . . . . . 30 Figure 3.3: Accuracy with low label smoothness and high label smoothness nodes. 
Note the consistent improvement in low smoothness cases, enabled by adaptive local smoothing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 3.4: Distribution of local label smoothness on Cora with various attack perturbation rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 3.5: Distribution of local label smoothness on Citeseer with various attack perturbation rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 3.6: Distribution of local label smoothness on Pubmed with various attack pertur- bation rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Figure 3.7: Robustness under adversarial attacks (node classification accuracy). . . . . . . . 33 Figure 3.8: ADA-UGNN performance (test accuracy) under different numbers of gradient steps (𝐾). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Figure 4.1: A perfectly heterophilous graph on which GCN achieves perfect class separation. . . . 37 Figure 4.2: Two nodes share the same neighborhood distribution; GCN learns equivalent embeddings for 𝑎 and 𝑏. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 xi Figure 4.3: SSNC accuracy of GCN on synthetic graphs with various homophily ratios, generated by adding heterophilous edges according to pre-defined target distributions on Cora (a) and Citeseer (b): performance first decreases, then increases, forming a 𝑉-shape (see the black lines). . . . . . . . . . . . . . . . . 48 Figure 4.4: Cross-class neighborhood similarity on synthetic graphs generated from Cora; all graphs have ℎ = 0.3, but with varying neighborhood distributions as per the noise parameter 𝛾: as 𝛾 increases, the intra-class similarity and the inter-class similarity become closer to each other, indicating that the distributions for different classes become more indistinguishable. . . . . . . . . . . . . . . . . . 50 Figure 4.5: Cross-class neighborhood similarity on homophilous graphs (a) and het- erophilous graphs (b-d). The deviations between inter and intra-class similari- ties substantiate GCN’s performance on these datasets. . . . . . . . . . . . . . 55 Figure 5.1: An illustrative example of the general framework . . . . . . . . . . . . . . . . . 60 Figure 5.2: Understanding graph signals . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Figure 6.1: An illustration of node selection based on the importance by gPool [23] at the first layer. Red nodes are the selected ones. gPool might select nodes that are redundantly concentrated on some substructures while totally neglecting information of other substructures. . . . . . . . . . . . . . . . . . . . . . . . . 78 Figure 6.2: An Illustration of the proposed hierarchical graph classification architecture and the proposed pooling module. In the node selection process in (b), 4 nodes are selected step by step. Note that the red color indicates that the node has been selected. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Figure 6.3: Visualization of node selection results at first layer in RepPool and gPool. Figures (a) and (b) are outputs on the NCI1 dataset and figures (c), (d) are outputs on the PTC dataset. Each node is attached with a number indicting its index in the graph. 25% nodes are selected and highlighted in red. . . . . . . . 
93 Figure 6.4: Graph classification results by varying four parameters: pooling ratio 𝑟, learning rate 𝜇, pooling layers 𝐾 𝑝 , and GCN layers 𝐾𝑔 on three datasets: Synthie, PTC, and IMDB-BINARY. . . . . . . . . . . . . . . . . . . . . . . . 96 Figure A.1: Performance of GCN on synthetic graphs with various homophily ratio. . . . . 104 Figure A.2: The GCN’s performance approaches 100% as we keep adding edges following the specific patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Figure A.3: Neighborhood Similarity On Citeseer and Pubmed. On both graphs, the intra-class similarity is clearly higher than the inter-class ones. . . . . . . . . . 108 xii Figure A.4: Neighborhood Similarity On Squirrel, Texas and Wisconsin. The inter- class similarity on Squirrel is slightly higher than intra-class similarity for most classes, which substantiates the middling performance of GCN. Both Texas and Wisconsin are quite small, hence the cross-class similarity in these two graphs present severe bias and may not provide precise information about these graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 xiii LIST OF ALGORITHMS 1 𝐾-layer GCN As Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2 Heterophilous Edge Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3 Heterophilous Edge Addition with Noise . . . . . . . . . . . . . . . . . . . . . . . 51 4 Node Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 xiv CHAPTER 1 INTRODUCTION Recent years have witnessed a rapid growth in our ability to generate and gather data from numerous platforms in the online world and various sensors in the physical world. Graphs provide a universal representation for most of these data such as online social networks, transportation networks, and chemical molecules, where entities are denoted as nodes while their relations are denoted as edges. Hence, many real-world applications can be regarded as computational tasks on graphs. For example, for online social network platform such as Facebook, it is important to always suggest potential friends for the users to connect with. This friend recommendation process is helpful to keep the users and the entire platform active. This task of friend recommendation can be treated as a link prediction task on a graph (the social network can be regarded as a graph). In the drug discovery industry, to discover effective drugs,the researchers need to evaluate millions of different structures of chemical compounds, which is prohibitively expensive in time and resources. Predicting the properties of the chemical compounds accurately may largely reduce the search space and thus help speed up the process. The task of predicting properties of chemical compounds can be regarded as a graph classification (or regression) task, where each chemical compound can be modeled as a graph. To facilitate these computational tasks on graphs, it is essential to learn proper representations for graphs. Deep learning techniques have shown their great capability in learning representations for images and text and hence offer great promise for graphs. However, generalizing deep learning to graphs faces tremendous challenges. Unlike images and text, which are usually well-structured as grids, graphs are irregular from many perspectives. Specifically, nodes in a graph do not have meaningful orderings and they can also have distinct number of nodes in their neighborhoods. 
Thus, traditional deep learning techniques designed for regular grid data are not directly applicable to graphs, which calls for dedicated efforts to develop algorithms for graph-structured data. Recently, graph neural networks (GNNs), which generalize deep learning techniques to graph-structured data, have been attracting increasing attention. Graph neural networks have shown great capability in learning both node representations and graph representations and have thus facilitated many of the tasks on graphs mentioned above. The essential building components of GNNs include the graph filtering operations and the graph pooling operations. The graph filtering operation aims to learn good node representations, and the graph pooling operation is typically required to learn a representation for the entire graph. In this thesis, we conduct research on both graph filtering operations and graph pooling operations from new perspectives. More specifically, we provide deep understandings of and insights into existing graph filtering operations and propose novel graph pooling operations to facilitate graph classification tasks.

1.1 Dissertation Contributions
The major contributions of this thesis are summarized as follows.
• Numerous recent works have proposed various GNN models with different designs in the graph filtering operations. We raise a natural question – is there an intrinsic connection among these feature filtering operations and their assumptions? The significance of a positive answer to this question is two-fold. Firstly, it offers a new perspective to create a uniform understanding of representative graph filtering operations. Secondly, it enables us to develop a general framework that not only provides a unified view on multiple existing representative graph filtering operations, but also has the potential to inspire new ones. Our first contribution is to propose such a unified framework for graph filtering operations from the perspective of graph signal denoising. In particular, we mathematically establish that many existing graph filtering operations can be unified as the process of exactly, and sometimes approximately, addressing a graph signal denoising problem. This connection suggests that these graph filtering operations share a unified goal: to ensure feature smoothness of connected nodes.
• Much of the recent literature posits that general GNNs work well on the semi-supervised node classification task (details about this task will be introduced later in Section 2.2) because of the homophily assumption ("like attracts like"), and fail to generalize to heterophilous graphs where dissimilar nodes connect. This also coincides with our unified understanding of graph filtering operations, namely that they ensure feature smoothness of connected nodes. Hence, recent works design new architectures to overcome such heterophily-related limitations, citing poor baseline performance and new architecture improvements on a few heterophilous graph benchmark datasets as evidence for this notion. However, we empirically find that standard graph convolutional networks (GCNs, a widely adopted kind of GNN model) can actually achieve better performance than such carefully designed methods on some commonly used heterophilous graphs. This motivates us to reconsider whether homophily is truly necessary for good GNN performance. We find that this claim is not quite true, and in fact, GCNs can achieve strong performance on heterophilous graphs under certain conditions.
We carefully characterize these conditions and provide supporting theoretical understanding and empirical observations. Finally, we examine existing heterophilous graph benchmarks and reconcile how the GCN (under)performs on them based on this understanding.
• To apply graph neural networks to the graph classification task, graph pooling operations that generate the graph representation from node representations are needed. A commonly adopted graph pooling operation is to globally combine the node representations. However, rich structural information is then overlooked. Thus, a hierarchical pooling procedure with multiple pooling layers is desired to preserve the graph structure during graph representation learning. There are some recent works on hierarchically learning graph representations analogous to the pooling step in conventional convolutional neural networks (CNNs). However, the local structural information is still largely neglected during the pooling process. We introduce a graph pooling operation based on spectral graph theory, which utilizes both the node features and the local structures during the pooling process. The method not only naturally incorporates node representations and graph structure, which helps achieve state-of-the-art performance, but also has nice theoretical properties such as perfect reconstruction and guaranteed information preservation.
• The spectral-based graph pooling operation proposed above groups nodes into "supernodes" to perform the graph coarsening process. Another stream of graph pooling operations coarsens a graph by downsampling nodes. Existing downsampling-based methods typically select nodes according to learnable importance scores, i.e., nodes with higher importance scores are more likely to be sampled. However, connected nodes tend to share similar importance scores, since their features are similar (especially after graph filtering operations), which results in suboptimal sampling of nodes. The sampled nodes cannot preserve the original graph information well, which causes huge information loss during graph pooling. To remedy this issue, we propose a graph pooling operation that considers not only node importance but also node "representativeness". Specifically, a node is considered less representative if a node in its local neighborhood has already been sampled. We design a representativeness score, which, together with the importance score, is utilized to sample the nodes. Equipped with this novel sampling approach, our proposed graph pooling operation has helped GNNs achieve state-of-the-art performance in graph classification.

1.2 Dissertation Structure
The remainder of this dissertation is organized as follows. In Chapter 2, we introduce some background on graphs and graph neural networks. In Chapter 3, we propose a unified framework for graph filtering operations from the perspective of graph signal denoising. It covers many existing graph filtering operations as special cases. The contents of this chapter were previously published as "A Unified View on Graph Neural Networks as Graph Signal Denoising" [52]. In Chapter 4, we reveal that homophily is not a necessity for graph neural networks (especially the graph filtering operation) to work well on the semi-supervised node classification task. We provide empirical observations and theoretical understandings to support our findings. This chapter was previously published as "Is Homophily a Necessity for Graph Neural Networks?" [51].
In Chapter 5, we propose a novel graph pooling operation based on spectral graph theory, which nicely incorporates graph structures and node features during the pooling process. This chapter was previously published as "Graph Convolutional Networks with EigenPooling" [55]. In Chapter 6, we propose a novel node downsampling technique, which considers both node importance and representativeness. The nodes sampled by this technique are spread across the entire graph and thus capture more comprehensive information about the original graph. We further propose a downsampling-based graph pooling operation built on this technique. The materials presented in this chapter were previously published as "Graph Pooling with Representativeness" [46]. Finally, we conclude the dissertation in Chapter 7.

CHAPTER 2
BACKGROUND
In this chapter, we briefly introduce the definition of graphs and the computational tasks on graphs. We also describe the general framework of graph neural networks, including a high-level discussion of the graph filtering operation and the graph pooling operation; we introduce the general GNN frameworks after discussing these two operations. This chapter provides the groundwork for introducing our deep insights into graph filtering operations and our novel methodologies for graph pooling operations.

2.1 Graphs
Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ denote a graph, where $\mathcal{V}$ and $\mathcal{E}$ are the sets of nodes and edges, respectively. The graph connection information can also be represented as an adjacency matrix $A \in \{0, 1\}^{|\mathcal{V}| \times |\mathcal{V}|}$, where $|\mathcal{V}|$ is the number of nodes in the graph. The $(i, j)$-th element of the adjacency matrix, $A[i, j]$, is equal to $1$ if and only if nodes $i$ and $j$ are adjacent to each other; otherwise $A[i, j] = 0$. Each node $i$ is associated with a $d$-dimensional vector of node features $x_i \in \mathbb{R}^d$; the features for all nodes can be summarized as a matrix $X \in \mathbb{R}^{|\mathcal{V}| \times d}$.

2.2 Computational Tasks on Graphs
There are numerous computational tasks on graphs, which can be broadly divided into two categories: node-focus tasks and graph-focus tasks. For node-focus tasks, we are usually given a single graph, and we care about the properties of each individual node in this graph. Link prediction and node classification (or regression) are two examples of node-focus tasks. In the link prediction task, we are interested in predicting how nodes will connect with other nodes in the future. In the node classification/regression task, we aim to infer certain properties of the nodes. For graph-focus tasks, we typically deal with a set of graphs, and our goal is to infer certain properties of each individual graph. In this thesis, we mainly deal with two representative tasks: 1) the semi-supervised node classification task, which is a node-focus task; and 2) the graph classification task, which is a graph-focus task. Next, we formally describe these two tasks.

2.2.1 Semi-supervised Node Classification Task
In the semi-supervised node classification (SSNC) task, we assume that each node $i$ is associated with a label $y_i \in \mathcal{C}$, where $\mathcal{C}$ denotes the set of labels. We also denote the set of nodes with a given label $c \in \mathcal{C}$ as $\mathcal{V}_c$. We assume that labels are only given for a subset of nodes $\mathcal{V}_{label} \subset \mathcal{V}$. The goal of semi-supervised node classification is to learn a mapping $f : \mathcal{V} \to \mathcal{C}$ utilizing the graph $\mathcal{G}$, the node features $X$, and the labels for nodes in $\mathcal{V}_{label}$. Note that the task is semi-supervised since all nodes (labeled or not) are involved in the training process.
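To make this notation concrete, the following minimal NumPy sketch (purely illustrative; the variable names are ours, not from this dissertation) builds the adjacency matrix, the feature matrix, and a partially labeled node set for a toy graph, mirroring the SSNC setting described above.

```python
import numpy as np

# A toy undirected graph with |V| = 4 nodes and edges (0,1), (1,2), (2,3).
num_nodes = 4
edges = [(0, 1), (1, 2), (2, 3)]

# Adjacency matrix A in {0,1}^{|V| x |V|}: A[i, j] = 1 iff i and j are adjacent.
A = np.zeros((num_nodes, num_nodes), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1  # undirected graph, so A is symmetric

# Node feature matrix X in R^{|V| x d} (random features for illustration, d = 3).
rng = np.random.default_rng(0)
X = rng.normal(size=(num_nodes, 3))

# Semi-supervised setting: labels are known only for the subset V_label.
labels = {0: "class_a", 3: "class_b"}                      # labeled nodes
unlabeled = [i for i in range(num_nodes) if i not in labels]
print(A)
print("labeled:", labels, "unlabeled:", unlabeled)
```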
2.2.2 Graph Classification Task
In the graph classification task, we are given a set of labeled graphs $\{(\mathcal{G}_i, y_i)\}$, where each graph $\mathcal{G}_i$ is associated with a label $y_i$, and we assume each graph is sampled from a distribution denoted as $\mathcal{D}$. The task of graph classification is to take a graph (its structure information and node features) as input and predict its corresponding label. More formally, we need to learn a mapping $f : \mathcal{D} \to \mathcal{C}$, where we use $\mathcal{C}$ to denote the set of labels. To achieve good prediction performance, it is important to extract useful information from both the graph structure and the node features.
Figure 2.1: An overview of machine learning approaches for node-focus tasks
Figure 2.2: An overview of machine learning approaches for graph-focus tasks
2.2.3 An Overview of Machine Learning Approaches for Tasks on Graphs
When utilizing machine learning approaches to tackle these computational tasks on graphs, a crucial step is to learn proper representations (vectors or matrices) for either nodes or graphs. Once we obtain these representations, we can apply suitable existing machine learning algorithms to them according to the task at hand. We illustrate an overview of machine learning approaches for node-focus and graph-focus tasks in Figure 2.1 and Figure 2.2, respectively. The representation learning stage in these two figures is the key to good performance on the downstream tasks. In this thesis, we study graph neural networks, a kind of deep learning method for graphs, for the purpose of representation learning.
2.3 Graph Neural Networks
Graph neural networks (GNNs) are a prominent approach for learning representations for graph-structured data. They can be adopted to learn both node representations and graph representations and are thus suitable for both node-focus and graph-focus tasks. There are two major operations in graph neural networks: the graph filtering operation and the graph pooling operation. The goal of the graph filtering operation is to generate node features that potentially benefit the downstream tasks, utilizing both the graph structure and the input node features. The graph pooling operation, on the other hand, is utilized to generate a single graph representation by summarizing information from the node representations. Next, we briefly introduce these two operations and then describe the general frameworks of GNNs built upon these two operations. When introducing these two operations, we consider a graph $\mathcal{G}$ with adjacency matrix $A$ and feature matrix $X$ as input.
Figure 2.3: An overview of graph filtering operation
2.3.1 Graph Filtering Operation
The graph filtering operation aims to refine node features to facilitate the downstream tasks. Specifically, as shown in Figure 2.3, a graph filtering operation takes the adjacency matrix $A \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{V}|}$ and the feature matrix $X \in \mathbb{R}^{|\mathcal{V}| \times d}$ as input. After the graph filtering process, the graph structure is the same as the input, i.e., the adjacency matrix is still $A$. On the other hand, new features $X_f \in \mathbb{R}^{|\mathcal{V}| \times d_f}$, with $d_f$ denoting the dimension of the output features, are produced by the filtering operation. Typically, when generating the new features, the graph filtering operation utilizes both the structure information and the input node features. Note that the dimension of the output features $d_f$ can be different from the dimension of the input features. When we adopt GNNs for learning node representations, we usually stack several graph filtering operations. We will introduce more details on the general frameworks in Section 2.3.3.
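As a concrete (illustrative) instance of this abstraction, the NumPy sketch below implements a simple GCN-style graph filter: it keeps the graph structure fixed and produces new node features by averaging linearly transformed features over each node's self-inclusive neighborhood. The function and variable names are ours, not the dissertation's.

```python
import numpy as np

def graph_filter(A, X, W):
    """A simple GCN-style graph filtering operation: the structure A is unchanged,
    and new features X_f = A_tilde @ (X @ W) are produced, where A_tilde is the
    degree-normalized adjacency matrix with self-loops."""
    A_hat = A + np.eye(A.shape[0])                                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_tilde = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]    # D^-1/2 A_hat D^-1/2
    return A_tilde @ (X @ W)                                       # aggregate transformed features

# Example: map 3-dimensional input features to 2-dimensional output features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # a path graph on 3 nodes
X = rng.normal(size=(3, 3))                                    # d = 3
W = rng.normal(size=(3, 2))                                    # d_f = 2
X_f = graph_filter(A, X, W)
print(X_f.shape)   # (3, 2): same nodes, new feature dimension d_f
```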
2.3.2 Graph Pooling Operation
Figure 2.4: An overview of graph pooling operation
The graph filtering operation is able to learn node representations. Hence, for node-focus tasks, the graph filtering operation is sufficient. However, for graph-focus tasks, we need to learn a single representation for the entire graph. Thus, the graph pooling operation, which can summarize node representations to produce a graph representation, is necessary. Typically, the graph pooling operation takes a graph $\mathcal{G}$ as input and generates a smaller graph as output. We use Figure 2.4 as a demonstration. As shown in Figure 2.4, the output graph is a coarsened version of the input one. Specifically, a new adjacency matrix $A_p \in \{0,1\}^{N_p \times N_p}$ and a new feature matrix $X_p \in \mathbb{R}^{N_p \times d_p}$ are generated by the graph pooling operation, where $N_p < |\mathcal{V}|$ denotes the number of nodes in the coarsened graph. Every time we apply a pooling operation to a graph, we obtain a coarsened graph in return. We can further apply graph filtering and graph pooling operations to the coarsened graph until we finally obtain a single graph representation for the entire graph. We will introduce more details on the general frameworks of graph neural networks in Section 2.3.3.
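One simple way to realize such a coarsening (again only an illustrative sketch, not one of the pooling operations proposed in this dissertation, which are developed in Chapters 5 and 6) is to group nodes into supernodes via a hard assignment and derive $A_p$ and $X_p$ from the assignment matrix:

```python
import numpy as np

def assignment_pooling(A, X, assign):
    """A minimal graph pooling sketch: nodes are grouped into supernodes by a hard
    assignment vector; the coarsened graph has adjacency A_p (supernodes are connected
    if any of their members are) and features X_p (mean of member features)."""
    n, n_p = A.shape[0], assign.max() + 1
    S = np.zeros((n, n_p))                        # assignment matrix S in {0,1}^{|V| x N_p}
    S[np.arange(n), assign] = 1.0
    A_p = (S.T @ A @ S > 0).astype(int)           # inter-supernode connectivity
    np.fill_diagonal(A_p, 0)                      # drop supernode self-connections
    X_p = (S.T @ X) / S.sum(axis=0)[:, None]      # average member features per supernode
    return A_p, X_p

# Example: coarsen a 4-node path graph into N_p = 2 supernodes {0,1} and {2,3}.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
X = np.arange(8, dtype=float).reshape(4, 2)
A_p, X_p = assignment_pooling(A, X, assign=np.array([0, 0, 1, 1]))
print(A_p)   # 2 x 2 coarsened adjacency
print(X_p)   # 2 x 2 coarsened feature matrix
```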
2.3.3 General Frameworks of Graph Neural Networks
In this section, we describe the general GNN frameworks for learning representations for graphs. As we discussed earlier in Section 2.2, we need to learn node representations for node-focus tasks and graph representations for graph-focus tasks. The processes for learning these two types of representations are different from each other. Specifically, to learn node representations, graph filtering operations are sufficient, while both graph filtering operations and graph pooling operations are usually required for learning graph representations. Next, we separately introduce the GNN frameworks for learning these two types of representations.
2.3.3.1 General GNN Framework for Node-focus Tasks
Figure 2.5: A general GNN framework for node-focus tasks
A general framework for learning node representations is shown in Figure 2.5. It consists of two types of layers: the graph filtering layer and the activation layer. The graph filtering layer is a learnable layer built around a graph filtering operation. The activation layer is simply a non-linear activation function, such as ReLU or sigmoid, as commonly adopted in deep learning architectures. As shown in Figure 2.5, the framework stacks several graph filtering layers and activation layers. Typically, a graph filtering layer is directly followed by an activation layer.
2.3.3.2 General GNN Framework for Graph-focus Tasks
Figure 2.6: A general GNN framework for graph-focus tasks
Besides graph filtering layers and activation layers, the general GNN framework for graph-focus tasks also involves graph pooling layers. As shown in Figure 2.6, the entire GNN framework can be divided into several learning blocks. Each block contains a single graph pooling layer, which follows a series of graph filtering layers and activation layers. Hence, after each block, a coarsened graph is generated. This coarsened graph serves as the input to the next learning block. A single graph representation is produced as the output of this framework.

CHAPTER 3
A UNIFIED VIEW ON GRAPH FILTERING OPERATIONS AS GRAPH SIGNAL DENOISING
3.1 Introduction
In this chapter, we study the graph filtering operations proposed in recent literature. Specifically, we provide a unified view on graph filtering operations from the perspective of graph signal denoising. As we discussed in Section 2.3.3, a GNN model for node-focus tasks is usually composed of several stacked graph filtering layers. Given a graph $\mathcal{G}$ with $|\mathcal{V}|$ nodes, a graph filtering layer typically contains a feature transformation and a feature aggregation operation:
Feature Transformation: $X'_{in} = f_{trans}(X_{in})$;  (3.1)
Feature Aggregation: $X_{out} = f_{agg}(X'_{in}; \mathcal{G})$,  (3.2)
where $X_{in} \in \mathbb{R}^{|\mathcal{V}| \times d_{in}}$ and $X_{out} \in \mathbb{R}^{|\mathcal{V}| \times d_{out}}$ denote the input and output features of the GNN layer, with $d_{in}$ and $d_{out}$ as the corresponding dimensions. Note that the non-linear activation is not included in Eq. (3.2) to ease the discussion. The feature transformation operation $f_{trans}(\cdot)$ transforms the input $X_{in}$ into $X'_{in} \in \mathbb{R}^{|\mathcal{V}| \times d_{out}}$; the feature aggregation operation $f_{agg}(\cdot; \mathcal{G})$ updates the node features by aggregating the transformed node features via the graph $\mathcal{G}$. In general, different graph filtering operations share similar feature transformations (often, a single feed-forward layer), while adopting different designs for the aggregation operation. We raise a natural question – is there an intrinsic connection among these feature aggregation operations and their assumptions? In this chapter, we aim to build the connection among the feature aggregation operations in representative GNN models, including GCN [40], GAT [84], PPNP, and APPNP [42].
In particular, we mathematically establish that the aggregation operations in these models can be unified as the process of exactly, and sometimes approximately, addressing a graph signal denoising problem with Laplacian regularization [80]. This connection suggests that these aggregation operations share a unified goal: to ensure feature smoothness of connected nodes. With this understanding, we propose a general framework, Ugnn, which not only provides a straightforward, unified view of many existing aggregation operations, but also suggests various promising directions to build new aggregation operations suitable for distinct applications. To demonstrate its potential, we build an instance of Ugnn called Ada-Ugnn, which is suited for handling varying smoothness properties across nodes, and conduct experiments to show its effectiveness.
3.2 Representative Graph Neural Network Models
In this section, we introduce notations for graphs and briefly summarize several representative GNN models; in particular, we introduce their graph filtering operations. A graph can be denoted as $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ and $\mathcal{E}$ are its corresponding node and edge sets. The connections in $\mathcal{G}$ can be represented as an adjacency matrix $A \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$, with $|\mathcal{V}|$ the number of nodes in the graph. The Laplacian matrix of the graph $\mathcal{G}$ is denoted as $L$ and is defined as $L = D - A$, where $D$ is the diagonal degree matrix corresponding to $A$. There are also normalized versions of the Laplacian matrix, such as $L = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ or $L = I - D^{-1} A$. In this work, we sometimes adopt different Laplacians to establish connections between different GNNs and the graph denoising problem, clarifying the choice in the text. In this section, we generally use $X_{in} \in \mathbb{R}^{|\mathcal{V}| \times d_{in}}$ and $X_{out} \in \mathbb{R}^{|\mathcal{V}| \times d_{out}}$ to denote the input and output features of graph filtering layers. Next, we describe a few representative GNN models.
3.2.1 Graph Convolutional Networks (GCN)
Following Eq. (3.2), a single graph filtering layer in GCN (we also refer to it as a GCN layer hereafter for convenience) [40] can be written as follows:
Feature Transformation: $X'_{in} = X_{in} W$; Feature Aggregation: $X_{out} = \tilde{A} X'_{in}$,  (3.3)
where $W \in \mathbb{R}^{d_{in} \times d_{out}}$ is a feature transformation matrix, and $\tilde{A}$ is a normalized adjacency matrix which includes self-loops, defined as follows:
$\tilde{A} = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$, with $\hat{A} = A + I$ and $\hat{D} = \mathrm{diag}\big(\sum_j \hat{A}_{1,j}, \ldots, \sum_j \hat{A}_{N,j}\big)$.  (3.4)
In practice, multiple GCN layers can be stacked, where each layer takes the output of its previous layer as input. Non-linear activation functions are included between consecutive layers.
3.2.2 Graph Attention Networks (GAT)
The graph filtering layer in graph attention networks (GAT) [84] adopts the same feature transformation operation as GCN in Eq. (3.3). For convenience, hereafter, we also refer to this graph filtering layer as a GAT layer. The feature aggregation operation in a GAT layer, written node-wise for a node $i$, is:
$X_{out}[i,:] = \sum_{j \in \tilde{\mathcal{N}}(i)} \alpha_{ij} X'_{in}[j,:]$, with $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \tilde{\mathcal{N}}(i)} \exp(e_{ik})}$,  (3.5)
where $\tilde{\mathcal{N}}(i) = \mathcal{N}(i) \cup \{i\}$ denotes the neighbors (self-inclusive) of node $i$, and $X_{out}[i,:]$ is the $i$-th row of the matrix $X_{out}$, i.e., the output node features of node $i$. In this aggregation operation, $\alpha_{ij}$ is a learnable attention score that differentiates the importance of distinct nodes in the neighborhood. Specifically, $\alpha_{ij}$ is a normalized form of $e_{ij}$, which is modeled as:
$e_{ij} = \mathrm{LeakyReLU}\big(\big[X'_{in}[i,:] \,\|\, X'_{in}[j,:]\big]\, a\big),$  (3.6)
where $[\cdot \| \cdot]$ denotes the concatenation operation and $a \in \mathbb{R}^{2d}$ is a learnable vector. Similar to GCN, a GAT model usually consists of multiple stacked GAT layers.
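For concreteness, here is a small, single-head NumPy sketch of the GAT aggregation in Eqs. (3.5)–(3.6), using the split form $a = [a_1; a_2]$ that also appears in Eq. (3.22) later in this chapter; the weights are random and untrained, and all names are illustrative.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_aggregate(A, X_in_prime, a1, a2):
    """Single-head GAT aggregation (Eqs. 3.5-3.6): pre-normalized scores e_ij from the
    transformed features, softmax-normalized over each self-inclusive neighborhood."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                                                    # self-inclusive neighborhoods
    e = leaky_relu((X_in_prime @ a1)[:, None] + (X_in_prime @ a2)[None, :])  # e_ij for all pairs
    e = np.where(A_hat > 0, e, -np.inf)                                      # keep only neighbors
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)                 # attention scores per node
    return alpha @ X_in_prime                                                # weighted neighborhood sum

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X_in_prime = rng.normal(size=(3, 4))            # transformed features X'_in (d = 4)
a1, a2 = rng.normal(size=4), rng.normal(size=4)
X_out = gat_aggregate(A, X_in_prime, a1, a2)
print(X_out.shape)                              # (3, 4)
```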
3.2.3 Personalized Propagation of Neural Predictions (PPNP)
Personalized Propagation of Neural Predictions (PPNP) [42] introduces an aggregation operation based on Personalized PageRank (PPR). Specifically, the PPR matrix is defined as $\alpha(I - (1-\alpha)\tilde{A})^{-1}$, where $\alpha \in (0,1)$ is a hyper-parameter. The $ij$-th element of the PPR matrix specifies the influence of node $i$ on node $j$. The feature transformation operation is modeled as a multi-layer perceptron (MLP). The PPNP model can be written in the form of Eq. (3.2) as follows:
Feature Transformation: $X'_{in} = \mathrm{MLP}(X_{in})$; Feature Aggregation: $X_{out} = \alpha(I - (1-\alpha)\tilde{A})^{-1} X'_{in}$.  (3.7)
Unlike GCN and GAT, PPNP consists of only a single feature aggregation layer, but with a potentially deep feature transformation. Since the matrix inverse in Eq. (3.7) is costly, [42] also introduces a practical, approximate version of PPNP, called APPNP, where the aggregation operation is performed iteratively as:
$X^{(k)}_{out} = (1-\alpha)\tilde{A} X^{(k-1)}_{out} + \alpha X'_{in}, \quad k = 1, \ldots, K,$  (3.8)
where $X^{(0)}_{out} = X'_{in}$ and $X^{(K)}_{out}$ is the output of the feature aggregation operation. As proved in [42], $X^{(K)}_{out}$ converges to the solution obtained by PPNP, i.e., $X_{out}$ in Eq. (3.7). For convenience, we also call the graph filtering layers in the PPNP and APPNP models the PPNP layer and the APPNP layer, respectively.
3.3 Graph Filtering Operations as Graph Signal Denoising
In this section, we aim to establish the connections between the introduced graph filtering layers and a graph signal denoising problem with Laplacian regularization. We first introduce the problem.
Problem 1 (Graph Signal Denoising with Laplacian Regularization). Suppose that we are given a noisy signal $X \in \mathbb{R}^{|\mathcal{V}| \times d}$ on a graph $\mathcal{G}$. The goal is to recover a clean signal $F \in \mathbb{R}^{|\mathcal{V}| \times d}$, assumed to be smooth over $\mathcal{G}$, by solving the following optimization problem:
$\arg\min_{F} \; \mathcal{L} = \|F - X\|_F^2 + c \cdot \mathrm{tr}(F^\top L F).$  (3.9)
The first term guides $F$ to be close to $X$, while the second term $\mathrm{tr}(F^\top L F)$ is the Laplacian regularization that guides the smoothness of $F$ over the graph; $c > 0$ is a balancing constant. Assuming we adopt the unnormalized version of the Laplacian matrix, $L = D - A$ (with the adjacency matrix $A$ assumed to be binary), the second term in Eq. (3.9) can be written in an edge-centric or a node-centric way as:
edge-centric: $c \sum_{(i,j) \in \mathcal{E}} \|F[i,:] - F[j,:]\|_2^2$;  node-centric: $\frac{c}{2} \sum_{i \in \mathcal{V}} \sum_{j \in \tilde{\mathcal{N}}(i)} \|F[i,:] - F[j,:]\|_2^2$.  (3.10)
Clearly, from the edge-centric view, the regularization term measures the global smoothness of $F$, which is small when connected nodes share similar features. On the other hand, we can view the term $\sum_{j \in \tilde{\mathcal{N}}(i)} \|F[i,:] - F[j,:]\|_2^2$ as a local smoothness measure for node $i$, as it measures the difference between node $i$ and all its neighbors. The regularization term can then be regarded as a summation of local smoothness over all nodes. Note that the adjacency matrix $A$ is assumed to be binary when deriving Eq. (3.10). Similar formulations can also be derived for other types of Laplacian matrices. In the following subsections, we demonstrate the connections between the aggregation operations in various GNN layers and the graph signal denoising problem.
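As a quick numerical sanity check on Eq. (3.10) (purely illustrative, with made-up data), the sketch below evaluates the objective of Problem 1 with the unnormalized Laplacian and confirms that the trace regularizer equals the edge-centric sum of squared feature differences.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])   # binary adjacency
L = np.diag(A.sum(axis=1)) - A                    # unnormalized Laplacian L = D - A
X = rng.normal(size=(4, 2))                       # noisy signal
F = rng.normal(size=(4, 2))                       # candidate clean signal
c = 0.5

# Objective of Problem 1: fidelity term + Laplacian regularization.
objective = np.sum((F - X) ** 2) + c * np.trace(F.T @ L @ F)

# Edge-centric form of the regularizer (each undirected edge counted once).
edges = [(i, j) for i in range(4) for j in range(i + 1, 4) if A[i, j]]
edge_sum = c * sum(np.sum((F[i] - F[j]) ** 2) for i, j in edges)

print(np.isclose(c * np.trace(F.T @ L @ F), edge_sum))   # True: the two forms agree
print(objective)
```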
3.3.1 Connection to PPNP Layers and APPNP Layers
In this subsection, we establish the connection between the graph signal denoising problem (3.9) and the aggregation operations in the PPNP layer and the APPNP layer in Theorem 1 and Theorem 2, respectively.
Theorem 1. When we adopt the normalized Laplacian matrix $L = I - \tilde{A}$, with $\tilde{A}$ defined in Eq. (3.4), the feature aggregation operation in a PPNP layer (Eq. (3.7)) can be regarded as exactly solving the graph signal denoising problem (3.9) with $X'_{in}$ as the input noisy signal and $c = \frac{1}{\alpha} - 1$.
Proof. Note that the objective in Eq. (3.9) is convex. Hence, the closed-form solution $F^*$ that exactly solves the graph signal denoising problem can be obtained by setting its derivative to $0$:
$\frac{\partial \mathcal{L}}{\partial F} = 2(F - X) + 2cLF = 0 \;\Rightarrow\; F^* = (I + cL)^{-1} X.$  (3.11)
Given $L = I - \tilde{A}$, $F^*$ can be reformulated as:
$F^* = (I + cL)^{-1} X = \big(I + c(I - \tilde{A})\big)^{-1} X = \frac{1}{1+c}\Big(I - \frac{c}{1+c}\tilde{A}\Big)^{-1} X.$  (3.12)
The feature aggregation operation in Eq. (3.7) is equivalent to the closed-form solution in Eq. (3.12) when we set $\alpha = 1/(1+c)$ and $X = X'_{in}$. This completes the proof.
Theorem 2. When we adopt the normalized Laplacian matrix $L = I - \tilde{A}$, the feature aggregation operation in an APPNP layer (Eq. (3.8)) approximately solves the graph signal denoising problem (3.9) by iterative gradient descent with $X'_{in}$ as the input noisy signal, $c = \frac{1}{\alpha} - 1$, and stepsize $b = \frac{1}{2+2c}$.
Proof. To solve the graph signal denoising problem (3.9), we take the iterative gradient method with stepsize $b$. Specifically, the $k$-th gradient descent step on problem (3.9) is:
$F^{(k)} \leftarrow F^{(k-1)} - b \cdot \frac{\partial \mathcal{L}}{\partial F}\Big|_{F = F^{(k-1)}} = (1 - 2b - 2bc) F^{(k-1)} + 2b X + 2bc \tilde{A} F^{(k-1)},$  (3.13)
where $F^{(0)} = X$. When we set the stepsize $b = \frac{1}{2+2c}$, we have the following iterative steps:
$F^{(k)} \leftarrow \frac{1}{1+c} X + \frac{c}{1+c} \tilde{A} F^{(k-1)}, \quad k = 1, \ldots, K,$  (3.14)
which is equivalent to the iterative aggregation operation of the APPNP model in Eq. (3.8) with $X = X'_{in}$ and $\alpha = \frac{1}{1+c}$. This completes the proof.
These two connections provide a new explanation of the hyper-parameter $\alpha$ in PPNP and APPNP from the graph signal denoising perspective. Specifically, a smaller $\alpha$ indicates a larger $c$, which means the obtained $X_{out}$ is enforced to be smoother over the graph.
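Theorems 1 and 2 are easy to verify numerically. The following sketch (illustrative data and names) checks that the closed-form denoising solution of Eq. (3.11), the PPNP aggregation of Eq. (3.7), and sufficiently many APPNP iterations of Eq. (3.8) all coincide when $\alpha = 1/(1+c)$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)

# Normalized adjacency with self-loops, A_tilde = D_hat^{-1/2} A_hat D_hat^{-1/2}.
A_hat = A + np.eye(4)
d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_tilde = d_inv_sqrt @ A_hat @ d_inv_sqrt
L = np.eye(4) - A_tilde                           # normalized Laplacian L = I - A_tilde

X = rng.normal(size=(4, 3))                       # transformed features X'_in (noisy signal)
c = 3.0
alpha = 1.0 / (1.0 + c)                           # Theorem 1: alpha = 1 / (1 + c)

denoised = np.linalg.inv(np.eye(4) + c * L) @ X                       # closed form, Eq. (3.11)
ppnp = alpha * np.linalg.inv(np.eye(4) - (1 - alpha) * A_tilde) @ X   # PPNP, Eq. (3.7)

F = X.copy()                                      # APPNP iterations, Eq. (3.8)
for _ in range(200):
    F = (1 - alpha) * A_tilde @ F + alpha * X

print(np.allclose(denoised, ppnp))                # True (Theorem 1)
print(np.allclose(F, ppnp, atol=1e-6))            # True (Theorem 2, after convergence)
```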
3.3.3 Connection to GAT layers

To establish the connection between graph signal denoising and GAT layers [84], in this subsection, we adopt an unnormalized version of the Laplacian. It is defined based on the adjacency matrix with self-loops Â, i.e., L = D̂ − Â, with D̂ denoting the diagonal degree matrix of Â. Then, the denoising problem in Eq. (3.9) can be rewritten from a node-centric view as:

    arg min_F L = Σ_{i∈V} ‖F[i,:] − X[i,:]‖_2^2 + (c/2) Σ_{i∈V} Σ_{j∈Ñ(i)} ‖F[i,:] − F[j,:]‖_2^2,    (3.16)

where Ñ(i) = N(i) ∪ {i} denotes the neighbors (self-inclusive) of node i. In Eq. (3.16), the constant c is shared by all nodes, which indicates that the same level of local smoothness is enforced on all nodes. However, nodes in a real-world graph can have varied local smoothness. For nodes with low local smoothness, we should impose a relatively smaller c, while for nodes with higher local smoothness, we need a larger c. Hence, instead of a single c as in Eq. (3.16), we can consider a node-dependent c_i for each node i. The optimization problem in Eq. (3.16) can then be adjusted as:

    arg min_F L = Σ_{i∈V} ‖F[i,:] − X[i,:]‖_2^2 + (1/2) Σ_{i∈V} c_i Σ_{j∈Ñ(i)} ‖F[i,:] − F[j,:]‖_2^2.    (3.17)

We next show that the aggregation operation in GAT layers is closely connected to an approximate solution of problem (3.17) with the help of the following theorem.

Theorem 4. With adaptive stepsize b_i = 1 / Σ_{j∈Ñ(i)} (c_i + c_j) for each node i, the process of taking one step of gradient descent from X to solve problem (3.17) can be described as follows:

    F[i,:] ← b_i Σ_{j∈Ñ(i)} (c_i + c_j) X[j,:].    (3.18)

Proof. The gradient of the optimization problem in Eq. (3.17) with respect to F, focusing on a node i, can be formulated as:

    ∂L/∂F[i,:] = 2(F[i,:] − X[i,:]) + Σ_{j∈Ñ(i)} (c_i + c_j)(F[i,:] − F[j,:]),    (3.19)

where c_j appears in the second term since i is also in the neighborhood of j. Then, the gradient at X is

    ∂L/∂F[i,:] |_{F[i,:]=X[i,:]} = Σ_{j∈Ñ(i)} (c_i + c_j)(X[i,:] − X[j,:]).    (3.20)

Thus, taking one step of gradient descent starting from X with stepsize b can be described as:

    F[i,:] ← X[i,:] − b · ∂L/∂F[i,:] |_{F[i,:]=X[i,:]}
           = (1 − b Σ_{j∈Ñ(i)} (c_i + c_j)) X[i,:] + b Σ_{j∈Ñ(i)} (c_i + c_j) X[j,:].    (3.21)

Given b = b_i = 1 / Σ_{j∈Ñ(i)} (c_i + c_j), Eq. (3.21) can be rewritten as F[i,:] ← b_i Σ_{j∈Ñ(i)} (c_i + c_j) X[j,:], which completes the proof.

Eq. (3.18) resembles the aggregation operation of GAT layers in Eq. (3.5) if we treat b_i (c_i + c_j) as the attention score α_ij. Note that Σ_{j∈Ñ(i)} (c_i + c_j) = 1/b_i for all i ∈ V. Hence, (c_i + c_j) can be regarded as the pre-normalized attention score and 1/b_i as the normalization constant. We further compare b_i (c_i + c_j) with α_ij by investigating the formulation of e_ij in Eq. (3.6). Eq. (3.6) can be rewritten as:

    e_ij = LeakyReLU(X'_in[i,:] a_1 + X'_in[j,:] a_2),    (3.22)

where a_1 ∈ R^d and a_2 ∈ R^d are learnable column vectors, which can be concatenated to form a in Eq. (3.6). Comparing e_ij with (c_i + c_j), we find that they take a similar form. Specifically, X'_in[i,:] a_1 and X'_in[j,:] a_2 can be regarded as approximations of c_i and c_j, respectively. The difference between b_i (c_i + c_j) and α_ij is that the normalization in Eq. (3.18) for b_i (c_i + c_j) is achieved via summation rather than the softmax used in Eq. (3.5) for α_ij.
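To make Eq. (3.18) concrete, here is a short sketch of the adaptive-stepsize gradient step on a toy graph; the per-node smoothness constants c_i are chosen arbitrarily here, whereas in a GAT layer they would effectively come from the learned attention scores.

```python
import numpy as np

# One gradient step of problem (3.17) with adaptive stepsize b_i, i.e., Eq. (3.18).
rng = np.random.default_rng(2)
n, d = 5, 3
A = np.array([[0,1,1,0,0],[1,0,1,1,0],[1,1,0,0,1],[0,1,0,0,1],[0,0,1,1,0]], float)
A_hat = A + np.eye(n)                 # self-inclusive neighborhoods
X = rng.normal(size=(n, d))
c = rng.uniform(0.5, 2.0, size=n)     # node-dependent smoothness constants

F = np.zeros_like(X)
for i in range(n):
    nbrs = np.flatnonzero(A_hat[i])
    w = c[i] + c[nbrs]                # pre-normalized "attention" scores c_i + c_j
    b_i = 1.0 / w.sum()               # adaptive stepsize, also the normalizer
    F[i] = b_i * (w[:, None] * X[nbrs]).sum(axis=0)

print(F.round(3))                     # each row is a convex combination of neighbor features
```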
Note that since GAT layers make c_i and c_j learnable, they also include a non-linear activation in calculating e_ij. By viewing the attention mechanism in GAT layers from the perspective of Eq. (3.18), namely that c_i indicates a notion of local smoothness for node i, we can develop other ways to parameterize c_i. For example, instead of directly using the features of node i as an indicator of local smoothness as GAT layers do, we can consider the neighborhood information. In fact, we adopt this idea to design a new aggregation operation in Section 3.5.

3.4 Ugnn: A Unified Framework for Graph Filtering Operations via Graph Signal Denoising

In the previous section, we established that the aggregation operations in PPNP layers, APPNP layers, GCN layers, and GAT layers are intimately connected to the graph signal denoising problem with (generalized) Laplacian regularization. In particular, from this perspective, all of their aggregation operations aim to ensure feature smoothness: either global smoothness over the graph, as in PPNP, APPNP, and GCN layers, or local smoothness for each node, as in GAT layers. This understanding allows us to develop a unified feature aggregation operation by posing the following, more general graph signal denoising problem:

Problem 2 (Generalized Ugnn Graph Signal Denoising Problem).

    arg min_F L = ‖F − X‖_F^2 + r(C, F, G),    (3.23)

where r(C, F, G) denotes a flexible regularization term that enforces some prior over F. Note that we overload the notation C here: it can be a scalar (a global constant, as in GCN layers), a vector (node-wise constants, as in GAT layers), or even a matrix (edge-wise constants) if we want to give flexibility to each node pair.

Different choices of r(·) imply different feature aggregation operations. Besides PPNP, APPNP, GCN, and GAT, the aggregation operations of other GNN models can also be associated with Problem 2 through different regularization terms, such as PairNorm [101] and DropEdge [68], as detailed below. PairNorm and DropEdge, two recently proposed GNN enhancements for developing deeper GNN models, correspond to the following regularization terms:

    PairNorm: C_p · Σ_{(i,j)∈E} ‖F[i,:] − F[j,:]‖_2^2 − C_n · Σ_{(i,j)∉E} ‖F[i,:] − F[j,:]‖_2^2,    (3.24)

    DropEdge: Σ_{(i,j)∈E} C_ij · ‖F[i,:] − F[j,:]‖_2^2, where C_ij ∈ {0, 1}.    (3.25)

For PairNorm, C consists of C_p, C_n > 0, and the regularization term encourages connected nodes to be similar while encouraging disconnected nodes to be dissimilar. For DropEdge, C is a sparse matrix with the same shape as the adjacency matrix. For each edge (i, j), its corresponding C_ij is sampled from a Bernoulli distribution with mean 1 − q, where q is a pre-defined dropout rate. The above regularization terms are all related to Laplacian regularization. Other regularization terms can also be adopted, which may lead to novel designs of GNN layers. For example, if we aim to enforce that the clean signal is piecewise linear, we can adopt r(C, F, G) = C · ‖LF‖_1, as designed for trend filtering [82, 89].

With these discussions, we propose a unified framework (Ugnn) to design GNN layers from the graph signal denoising perspective: (1) design a graph regularization term r(C, F, G) in Problem 2 according to the specific application; (2) Feature Transformation: X'_in = f_trans(X_in); and (3) Feature Aggregation: solve Problem 2 with X = X'_in and the designed r(C, F, G).
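The following is an illustrative sketch of this three-step recipe, not the thesis implementation: the helper names are hypothetical, f_trans is stood in for by a random linear map, and the aggregation step approximately solves Problem 2 by plain gradient descent with a pluggable regularizer gradient (here the Laplacian regularizer as one example).

```python
import numpy as np

def laplacian_regularizer_grad(F, L, C):
    """Gradient of r(C, F, G) = C * tr(F^T L F) with respect to F."""
    return 2.0 * C * (L @ F)

def ugnn_layer(X_in, L, C=1.0, steps=10, lr=0.1, reg_grad=laplacian_regularizer_grad):
    # (2) Feature transformation: stands in for f_trans (an MLP in PPNP/APPNP).
    W = np.random.default_rng(3).normal(scale=0.1, size=(X_in.shape[1], X_in.shape[1]))
    X = X_in @ W
    # (3) Feature aggregation: gradient descent on ||F - X||_F^2 + r(C, F, G).
    F = X.copy()
    for _ in range(steps):
        grad = 2.0 * (F - X) + reg_grad(F, L, C)
        F = F - lr * grad
    return F

# Example: a path graph on 4 nodes with 2-dimensional features.
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
L = np.diag(A.sum(1)) - A
X_in = np.arange(8, dtype=float).reshape(4, 2)
print(ugnn_layer(X_in, L).round(3))
```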
To demonstrate the potential of Ugnn, we next introduce a new GNN model, Ada-Ugnn, by instantiating Ugnn with an r(C, F, G) that enforces adaptive local smoothness across nodes. Note that we introduce Ada-Ugnn with node classification as the downstream task.

3.5 Ada-Ugnn: Adaptive Local Smoothing with Ugnn

From the graph signal denoising perspective, PPNP, APPNP, and GCN enforce global smoothness by penalizing feature differences with a constant C for all nodes. However, real-world graphs may consist of multiple groups of nodes that differ in how often they connect to similar neighbors. For example, Section 3.6.1 shows several graphs with varying distributions of local smoothness (as measured by label homophily): in short, not all nodes are highly label-homophilic, and some nodes have considerably "noisier" neighborhoods than others. Moreover, as suggested by [90, 35], adversarial attacks on graphs tend to promote such label noise by connecting nodes from different classes and disconnecting nodes from the same class, leaving the resulting graphs with varying local smoothness across nodes. In these scenarios, a constant C may not be optimal, and adaptive (i.e., non-constant) smoothness for different nodes is desired. As shown in Section 3.3.3, by viewing GAT's aggregation as a solution to regularized graph signal denoising, GAT can be regarded as adopting an adaptive C for different nodes, which facilitates adaptive local smoothness. However, in GAT, the graph denoising problem is solved by a single step of gradient descent, which may still be suboptimal. Furthermore, when modeling the local smoothness factor c_i in Eq. (3.18), GAT only uses the features of node i as input, which may not be ideal: if we understand c_i as local smoothness, it should be intrinsically related to the neighborhood of node i. In this section, we adapt this notion directly into the Ugnn framework by introducing a new regularization term, and develop a resulting GNN model (Ada-Ugnn) that aims to enforce adaptive local smoothness on nodes in a different manner from GAT. We then utilize an iterative gradient descent method to approximate the optimal solution of Problem 2 with the following regularization term:

    r(C, F, G) = (1/2) Σ_{i∈V} C_i Σ_{j∈Ñ(i)} ‖ F[i,:]/√d_i − F[j,:]/√d_j ‖_2^2,    (3.26)

where d_i and d_j denote the degrees of nodes i and j, respectively, and C_i indicates the smoothness factor of node i, which is assumed to be a fixed scalar. Note that the above regularization term can be regarded as a generalized version of the regularization term used in PPNP, APPNP, and GCN. Similar to PPNP and APPNP, Ada-Ugnn consists of only a single GNN layer; unlike them, Ada-Ugnn assumes adaptive local smoothness. We next describe the feature transformation and aggregation operations of Ada-Ugnn, and show how to derive the model via Ugnn.

3.5.1 Feature Transformation

Similar to PPNP layers and APPNP layers, we adopt an MLP for the feature transformation. Specifically, for a node classification task, the output dimension of the feature transformation X'_in is the number of classes in the graph.

3.5.2 Feature Aggregation

We use iterative gradient descent to solve Problem 2 with the regularization term in Eq. (3.26). The iterative gradient descent steps are stated in the following theorem.
Theorem 5. With adaptive stepsize b_i = 1 / (2 + Σ_{j∈Ñ(i)} (C_i + C_j)/d_i) for each node i, the iterative gradient descent steps to solve Problem 2 with the regularization term in Eq. (3.26) are as follows:

    F^{(k)}[i,:] ← 2 b_i X[i,:] + b_i Σ_{j∈Ñ(i)} (C_i + C_j) F^{(k−1)}[j,:] / √(d_i d_j),   k = 1, …,    (3.27)

where F^{(0)}[i,:] = X[i,:].

Proof. The gradient of the optimization problem 2 with the regularization term in Eq. (3.26) with respect to F, focusing on node i, is as follows:

    ∂L/∂F[i,:] = 2(F[i,:] − X[i,:]) + Σ_{j∈Ñ(i)} ((C_i + C_j)/√d_i) (F[i,:]/√d_i − F[j,:]/√d_j),    (3.28)

where C_j appears in the second term since node i is also in the neighborhood of node j. The iterative gradient descent steps with adaptive stepsize b_i can be formulated as:

    F^{(k)}[i,:] ← F^{(k−1)}[i,:] − b_i · ∂L/∂F[i,:] |_{F[i,:]=F^{(k−1)}[i,:]},   k = 1, …    (3.29)

With the gradient in Eq. (3.28), the iterative steps in Eq. (3.29) can be rewritten as:

    F^{(k)}[i,:] ← (1 − 2b_i − b_i Σ_{j∈Ñ(i)} (C_i + C_j)/d_i) F^{(k−1)}[i,:] + 2b_i X[i,:]
                   + b_i Σ_{j∈Ñ(i)} (C_i + C_j) F^{(k−1)}[j,:] / √(d_i d_j),   k = 1, …    (3.30)

Given b_i = 1 / (2 + Σ_{j∈Ñ(i)} (C_i + C_j)/d_i), the iterative steps in Eq. (3.30) can be rewritten as:

    F^{(k)}[i,:] ← 2 b_i X[i,:] + b_i Σ_{j∈Ñ(i)} (C_i + C_j) F^{(k−1)}[j,:] / √(d_i d_j),   k = 1, …,    (3.31)

with F^{(0)}[i,:] = X[i,:], which completes the proof.

Following the iterative solution in Eq. (3.27), we model the aggregation operation (for node i) of Ada-Ugnn as follows:

    X_out^{(k)}[i,:] ← 2 b_i X'_in[i,:] + b_i Σ_{j∈Ñ(i)} (C_i + C_j) X_out^{(k−1)}[j,:] / √(d_i d_j),   k = 1, …, K,    (3.32)

where K is the number of gradient descent iterations, C_i can be considered a positive scalar controlling the level of "local smoothness" for node i, and b_i can be calculated from {C_j | j ∈ Ñ(i)} as b_i = 1 / (2 + Σ_{j∈Ñ(i)} (C_i + C_j)/d_i). In practice, however, C_i is usually unknown. One possible solution is to treat the C_i as hyper-parameters. But treating C_i as a hyper-parameter for every node is impractical, since there are N of them in total and we do not have prior knowledge of them. Thus, we instead parameterize C_i as a function of the neighborhood information of node i as follows:

    C_i = s · σ( h_1( h_2({X'_in[j,:] | j ∈ Ñ(i)}) ) ),    (3.33)

where h_2(·) is a function that transforms the neighborhood information of node i into a vector, while h_1(·) further transforms it into a scalar. σ(·) denotes the sigmoid function, which maps the scalar output of h_1(·) to (0, 1), and s can be treated as a hyper-parameter controlling the upper bound of C_i. h_1(·) can be modeled as a single-layer fully-connected neural network. There are different possible designs for h_2(·), such as the channel-wise mean or variance [13]. In this chapter, we adopt the channel-wise variance as h_2(·) (as a measure of diversity). An APPNP layer can be regarded as a special case of Ada-Ugnn in which h_2(·) is modeled as a constant function that outputs 1 for all nodes. For the node classification task, the representation X_out^{(K)}, obtained after K iterations as in Eq. (3.32), is directly softmax-normalized row-wise, and its i-th row indicates the discrete class distribution of node i.
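The sketch below illustrates the Ada-Ugnn aggregation of Eqs. (3.32)–(3.33) on a toy graph. It assumes self-inclusive neighborhoods with degrees counted on A + I, uses a hypothetical random linear map for h_1, and uses the channel-wise variance for h_2 as described above; it is a minimal illustration, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
n, num_classes, K, s = 6, 3, 10, 9.0
A = rng.integers(0, 2, size=(n, n))
A = ((np.triu(A, 1) + np.triu(A, 1).T) > 0).astype(float)
A_hat = A + np.eye(n)                                # self-inclusive neighborhoods
deg = A_hat.sum(axis=1)

X_in_prime = rng.normal(size=(n, num_classes))       # output of the MLP transformation

# Eq. (3.33): C_i = s * sigmoid(h1(h2({X'_in[j] | j in N~(i)})))
w_h1 = rng.normal(size=num_classes)                  # single-layer h1 (no bias here)
C = np.empty(n)
for i in range(n):
    nbrs = np.flatnonzero(A_hat[i])
    h2 = X_in_prime[nbrs].var(axis=0)                # channel-wise variance
    C[i] = s / (1.0 + np.exp(-h2 @ w_h1))            # s * sigmoid(h1(h2))

# Eq. (3.32): K iterative aggregation steps with adaptive stepsize b_i.
out = X_in_prime.copy()
for _ in range(K):
    new = np.empty_like(out)
    for i in range(n):
        nbrs = np.flatnonzero(A_hat[i])
        b_i = 1.0 / (2.0 + ((C[i] + C[nbrs]) / deg[i]).sum())
        msg = ((C[i] + C[nbrs])[:, None] * out[nbrs] /
               np.sqrt(deg[i] * deg[nbrs])[:, None]).sum(axis=0)
        new[i] = 2.0 * b_i * X_in_prime[i] + b_i * msg
    out = new

probs = np.exp(out) / np.exp(out).sum(axis=1, keepdims=True)   # row-wise softmax
print(probs.round(3))
```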
3.6 Experiments

In this section, we evaluate how the proposed Ada-Ugnn handles graphs with varying local smoothness. We conduct node classification experiments on natural graphs, and also evaluate the model's robustness under adversarial attacks. We note that our main goal in proposing and evaluating Ada-Ugnn is to demonstrate the promise of deriving new aggregations as solutions of denoising problems, rather than to achieve state-of-the-art performance.

3.6.1 Node Classification

In this subsection, we conduct the node classification task. We first introduce the datasets and the experimental settings in Section 3.6.1.1 and then present the results in Section 3.6.3.1.

3.6.1.1 Datasets and Experimental Settings

We conduct the node classification task on 8 datasets from various domains, including citation, social, co-authorship, and co-purchase networks. Specifically, we use three citation networks, Cora, Citeseer, and Pubmed [74]; one social network, BlogCatalog [33]; two co-authorship networks, Coauthor-CS and Coauthor-PH [76]; and two co-purchase networks, Amazon-Comp and Amazon-Photo [76]. Descriptions and detailed statistics of these datasets can be found below.

3.6.2 Datasets

Table 3.1: Dataset summary statistics.

    Dataset        #Nodes   #Edges   #Labels   #Features
    Cora             2708    13264         7        1433
    Citeseer         3327    12431         6        3703
    Pubmed          19717   108365         3         500
    BlogCatalog      5196   348682         6        8189
    Amazon-Comp     13381   504937        10         767
    Amazon-Photo     7487   245573         8         745
    Coauthor-CS     18333   182121        15        6805
    Coauthor-PH     34493   530417         5        8415

• Citation Networks: Cora, Citeseer, and Pubmed are widely adopted benchmarks for GNN models. In these graphs, nodes represent documents and edges denote citation links between them. Each node is associated with bag-of-words features of its corresponding document and a label indicating the research field of the document.

• BlogCatalog: BlogCatalog is an online blogging community where bloggers can follow each other. The BlogCatalog graph consists of bloggers as nodes and their social relations as edges. Each blogger is associated with features generated from the keywords of his/her blogs. The bloggers are labeled according to their interests.

• Co-purchase Graphs: Amazon-Comp and Amazon-Photo are co-purchase graphs, where nodes represent items and edges indicate that two items are frequently bought together. Each item is associated with bag-of-words features extracted from its corresponding reviews. The label of an item is given by its category.

• Co-authorship Graphs: Coauthor-CS and Coauthor-PH are co-authorship graphs, where nodes are authors and edges indicate co-authorship between authors. Each author is associated with features representing the keywords of his/her papers. The label of an author indicates his/her most active research field.

Some statistics of these graphs are shown in Table 3.1. To provide a sense of the local smoothness properties of these datasets, in addition to the summary statistics, we further present the distribution of local label smoothness in these datasets. For a node i, we formally define the local label smoothness as:

    ls(i) = ( Σ_{j∈N(i)} 1{l(i) = l(j)} ) / |N(i)|,    (3.34)

where l(i) denotes the label of node i and 1{a} is an indicator function, which outputs 1 only when a is true and 0 otherwise. The distributions of local label smoothness for all 8 datasets are presented in Figure 3.1.

[Figure 3.1: Distribution of local label smoothness (homophily) on the eight graph datasets (Cora, Citeseer, Pubmed, BlogCatalog, Amazon-Comp, Amazon-Photo, Coauthor-CS, Coauthor-PH): note the non-homogeneity of smoothness values.]
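As a brief illustration, the following sketch computes the local label smoothness of Eq. (3.34) for every node of a toy graph; the adjacency matrix and labels are chosen arbitrarily.

```python
import numpy as np

A = np.array([[0,1,1,0,0],
              [1,0,1,1,0],
              [1,1,0,0,1],
              [0,1,0,0,1],
              [0,0,1,1,0]], dtype=float)
labels = np.array([0, 0, 1, 1, 1])

def local_label_smoothness(A, labels):
    """Fraction of each node's neighbors sharing its label, as in Eq. (3.34)."""
    ls = np.zeros(len(labels))
    for i in range(len(labels)):
        nbrs = np.flatnonzero(A[i])                  # N(i), excluding node i itself
        ls[i] = np.mean(labels[nbrs] == labels[i]) if len(nbrs) else 0.0
    return ls

print(local_label_smoothness(A, labels))   # one value in [0, 1] per node
```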
3.6.3 Node Classification Accuracy for Nodes with Low-level and High-level Local Label Smoothness

The performance on nodes with low local label smoothness and high local label smoothness in Citeseer, Pubmed, Amazon-Photo, and Coauthor-PH is presented in Figure 3.3. Notably, the variety in local label smoothness within several real-world datasets, also observed in [75], clearly motivates the importance of the adaptive smoothness assumption in Ada-Ugnn.

Table 3.2: Node classification accuracy on various datasets.

    Dataset        GCN          GAT          APPNP        Ada-Ugnn
    Cora           81.75±0.8    82.56±0.8    84.49±0.6    84.59±0.8*
    Citeseer       70.13±1.0    70.77±0.8    71.97±0.6    72.05±0.5
    Pubmed         78.56±0.5    78.88±0.5    79.92±0.5    79.70±0.4
    BlogCatalog    71.38±2.7    72.90±1.2    92.43±0.9    93.33±0.3***
    Amazon-Comp    82.79±1.3    83.01±1.5    82.99±1.6    83.40±1.3***
    Amazon-Photo   89.60±1.5    90.33±1.2    91.38±1.2    91.44±1.2
    Coauthor-CS    91.55±0.6    90.95±0.7    91.69±0.4    92.33±0.5***
    Coauthor-PH    93.23±0.7    92.86±0.7    93.84±0.5    93.92±0.6**

    *, **, *** indicate that the improvement over APPNP is significant at p < 0.1, 0.05, and 0.005, respectively.

For the citation networks, we use the standard splits provided in [40, 96]. For BlogCatalog, we adopt the split provided in [102]. For both the citation networks and BlogCatalog, the experiments are run with 30 random seeds and the average results are reported. For the co-authorship and co-purchase networks, we utilize 20 labeled nodes per class for training, 30 nodes per class for validation, and the remaining nodes for testing. This process is repeated 20 times, which results in 20 different training/validation/test splits. For each split, the experiment is repeated 20 times with different initializations. The average results over the 20 × 20 experiments are reported. We compare our method with the methods introduced in Section 3.2, including GCN, GAT, and APPNP. Note that we do not include PPNP as it is difficult to scale to most of the datasets due to the matrix inverse in Eq. (3.7). For all methods, we tune the hyperparameters from the following options: 1) learning rate: {0.005, 0.01, 0.05}; 2) weight decay: {5e−04, 5e−05, 5e−06, 5e−07, 5e−08}; and 3) dropout rate: {0.2, 0.5, 0.8}. For APPNP and our method, we further tune the number of iterations K and the upper bound s for C_i in Eq. (3.33) from the following ranges: 1) K: {5, 10}; and 2) s: {1, 9, 19}. Note that we treat APPNP as a special case of our proposed method with h_2(·) = 1.

3.6.3.1 Performance Comparison

The performance comparison is shown in Table 3.2, where a t-test is used to assess significance. First, GAT outperforms GCN on most datasets, which indicates that modeling adaptive local smoothness is helpful. Second, APPNP and Ada-Ugnn outperform GCN and GAT in most settings, suggesting that iterative gradient descent may offer advantages over single-step gradients, due to its ability to reach a solution closer to the optimum.

[Figure 3.2: Accuracy for nodes with low and high local label smoothness (APPNP vs. Ada-Ugnn) on Cora, BlogCatalog, Amazon-Comp, and Coauthor-CS.]
Third, and most notably, the proposed Ada-Ugnn achieves consistently better performance than GCN and GAT, and outperforms or matches the state-of-the-art APPNP across datasets. Notice that on some datasets, such as Cora, Citeseer, and Coauthor-PH, the improvements of the proposed model over APPNP are not very significant. Figure 3.1 shows that these datasets have extremely skewed local label smoothness distributions, with the majority of nodes having perfect (1.0) label homophily: they are only connected to other nodes of the same label. APPNP shines in such cases, since its assumption of h_2(·) = 1 is ideal for these nodes (designating maximal local smoothness). Conversely, our model has the more challenging task of learning h_2(·); in such skewed cases, learning h_2(·) may be quite difficult and unfruitful. On the other hand, for datasets with higher diversity in local label smoothness across nodes, such as BlogCatalog and Amazon-Comp, the proposed Ada-Ugnn achieves more significant improvements. To further validate this, we partition the nodes in the test set of each dataset into two groups: (1) high smoothness, those with local label smoothness > 0.5, and (2) low smoothness, those with local label smoothness ≤ 0.5, and evaluate the accuracy of APPNP and the proposed Ada-Ugnn on each group. The results for Cora, BlogCatalog, Amazon-Comp, and Coauthor-CS are presented in Figure 3.2, while the results for the remaining datasets can be found in Figure 3.3. Figure 3.2 clearly shows that Ada-Ugnn consistently improves performance for low-smoothness nodes in most datasets, while keeping comparable (or marginally worse) performance for high-smoothness nodes. In cases where many nodes have low-level smoothness (such as BlogCatalog or Amazon-Comp), our method can notably improve overall performance.

[Figure 3.3: Accuracy for nodes with low and high local label smoothness on Citeseer, Pubmed, Amazon-Photo, and Coauthor-PH. Note the consistent improvement in low-smoothness cases, enabled by adaptive local smoothing.]

[Figure 3.4: Distribution of local label smoothness on Cora with various attack perturbation rates (0%, 5%, 15%, 25%).]

[Figure 3.5: Distribution of local label smoothness on Citeseer with various attack perturbation rates (0%, 5%, 15%, 25%).]

3.6.4 Robustness Under Adversarial Attacks

Graph adversarial attacks tend to connect nodes from different classes while disconnecting nodes from the same class, which typically leads to more diverse distributions of local smoothness.
We present the local label smoothness distributions of the graphs generated by Mettack [108] with different perturbation rates for Cora, Citeseer, and Pubmed in Figure 3.4, Figure 3.5, and Figure 3.6, respectively.

[Figure 3.6: Distribution of local label smoothness on Pubmed with various attack perturbation rates (0%, 5%, 15%, 25%).]

To further demonstrate that Ada-Ugnn can handle graphs with varying local label smoothness better than alternatives, we conduct experiments to show its robustness under adversarial attacks. Specifically, we adopt Mettack [108] to perform the attacks. Mettack produces non-targeted attacks which aim to impair test-set node classification performance by strategically adding or removing edges from the victim graph. We utilize the attacked graphs (5%–25% perturbation rate) from [35] and follow the same setting, i.e., each method is run with 10 random seeds and the average performance is reported. These attacked graphs are generated from Cora, Citeseer, and Pubmed, respectively, and only the largest connected component is retained in each graph. Furthermore, the training/validation/test split ratio is 10%/10%/80%, which is different from the standard splits we use in Section 3.6.1. Thus, the performance reported in this section is not directly comparable with that in the previous section. We compare our method not only with the standard GNNs discussed in Section 3.2 (GCN, GAT, APPNP), but also with recent state-of-the-art defense techniques against adversarial attacks, including GCN-Jaccard [90], GCN-SVD [19], Pro-GNN-fs, and Pro-GNN [35]. These methods are described in detail as follows:

• GCN-Jaccard [90]: GCN-Jaccard aims to pre-process a given attacked graph by removing the edges added by the attacker. Specifically, Jaccard similarity is utilized to measure the feature similarity between connected pairs of nodes. Edges between node pairs with low similarity are removed by the algorithm. The pre-processed graph is then utilized for the node classification task.

• GCN-SVD [19]: GCN-SVD is also a pre-processing method. It uses SVD to decompose the adjacency matrix of a given perturbed graph and then obtains its low-rank approximation. The low-rank approximation is believed to be cleaner, as graph adversarial attacks are observed to be high-rank in [19].

• Pro-GNN [35]: Pro-GNN tries to learn a cleaner graph while training the node classification model at the same time. Specifically, it treats the adjacency matrix as parameters, which are optimized during the training stage.
Several constraints are enforced on this learnable adjacency matrix, including: 1) the learned adjacency matrix should be close to the original adjacency matrix; 2) the learned adjacency matrix should be low-rank; and 3) the learned adjacency matrix should ensure feature smoothness. Pro-GNN-fs is a variant of Pro-GNN in which the third constraint, i.e., feature smoothness, is not enforced.

[Figure 3.7: Robustness under adversarial attacks (node classification accuracy) on Cora, Citeseer, and Pubmed under perturbation rates from 0% to 25%, comparing APPNP, Ada-Ugnn, GCN, GAT, GCN-Jaccard, GCN-SVD, Pro-GNN-fs, and Pro-GNN.]

Results under varying perturbation rates (attack intensities) are shown in Figure 3.7. Again, we observe that GAT outperforms GCN, suggesting the appeal of an adaptive local smoothness assumption. Here, our method substantially outperforms GCN, GAT, and APPNP by a large margin, especially in scenarios with high perturbation rates. Moreover, thanks to its adaptive smoothness assumption, the proposed Ada-Ugnn is even more robust than several specially designed adversarial defense methods, such as GCN-Jaccard and GCN-SVD, which are based on pre-processing the attacked graphs to obtain cleaner ones. Compared with Pro-GNN-fs, our method performs comparably, or even better in a few settings, especially when the perturbation rate is high. Furthermore, in these settings, the performance of our method is close to that of Pro-GNN, the current state-of-the-art adversarial defense technique. Note that Pro-GNN-fs and Pro-GNN involve learning cleaner adjacency matrices of the attacked graphs, and thus have O(M) parameters (where M denotes the number of edges in the graph), while our proposed model has far fewer parameters: O(d_in · d_out) for the feature transformation and H parameters for modeling h_1(·), with H denoting the number of labels.

3.6.5 Investigation of the Number of Gradient Descent Steps in Ada-Ugnn

In this section, we conduct experiments to check how the performance of Ada-Ugnn is affected by K. For each K, we run the experiments on the standard splits of Cora, Citeseer, and Pubmed with 30 random seeds (i.e., the same setting as in Section 3.6). The average performance is reported. As shown in Figure 3.8, the performance increases quickly as K gets larger when K is relatively small. After K becomes large, the performance either grows slowly or fluctuates slightly as K further increases.

[Figure 3.8: Ada-Ugnn performance (test accuracy) on Cora, Citeseer, and Pubmed under different numbers of gradient descent steps K (from 1 to 10).]

3.7 Related Works

There are two main streams of work in developing GNN models: spectral-based and spatial-based. When designing spectral-based GNNs, graph convolution [80], defined based on spectral theory, is utilized to design graph neural network layers together with feature transformation and non-linearity [6, 29, 15]. These designs of spectral-based graph convolution are tightly related to graph signal processing, and they can be regarded as graph filters. Low-pass graph filters can usually be adopted to denoise graph signals [9]. In fact, most algorithms discussed in our work can be regarded as low-pass graph filters. With the emergence of GCN [40], which can be regarded as a simplified spectral-based and also a spatial-based graph convolution operator, numerous spatial-based GNN models have since been developed [28, 84, 58, 24, 26].
Graph signal denoising aims to infer a cleaner graph signal from a given noisy signal, and can usually be formulated as a graph-regularized optimization problem [9]. Recently, several works have connected GCN with graph signal denoising under Laplacian regularization [64, 101], finding that the aggregation process in GCN models can be regarded as a first-order approximation of the optimal solution of the denoising problem. On the other hand, GNNs have also been utilized to develop novel algorithms for graph denoising [8]. Unlike these works, our work details how a family of GNN models can be unified under a graph signal denoising perspective, and demonstrates the promise of this perspective for new architecture design.

CHAPTER 4

IS HOMOPHILY A NECESSITY FOR GRAPH NEURAL NETWORKS?

4.1 Introduction

As introduced in Section 3.2, a GNN model learns node representations through a recursive neighborhood aggregation process (stacking multiple graph filtering operations), where each node aggregates and transforms features from its neighbors. The node representations can then be utilized for downstream node classification or regression tasks. Due to this neighborhood aggregation mechanism, existing literature posits that strong homophily of the underlying graph is a necessity for GNNs¹ to achieve good performance on semi-supervised node classification (SSNC) tasks (see Section 2.2 for details about this task) [1, 106, 105, 10, 56, 27, 31, 92]. In general, homophily describes the phenomenon that nodes tend to connect with "similar" or "alike" others. Homophily is observed in a wide range of real-world graphs, including friendship networks [57], political networks [25, 62], citation networks [12], and more. Under the homophily assumption, through the aggregation process, a node's representation is "smoothed" via its neighbors' representations, since each node receives additional information from neighboring nodes, which are likely to share the same label. Several recent works [106, 105] claim that GNNs are implicitly (or explicitly) designed with homophily in mind and are not suitable for graphs exhibiting heterophily, where connected nodes are prone to have different properties or labels, e.g., dating networks or molecular networks [106]. To address this limitation, these works design and modify neural architectures accordingly and demonstrate outperformance over other GNN models on several heterophilous graphs. However, we empirically find that the standard graph convolutional network (GCN) [40], a fundamental, representative GNN model which we focus on in this work, is actually able to outperform these heterophily-specific models on several heterophilous graphs after careful hyperparameter tuning. This motivates us to reconsider the popular notion in the literature that GNNs exhibit a homophilous inductive bias, and more specifically that strong homophily is crucial to strong GNN performance. Counter to this idea, we find that the standard GCN model has the potential to work well on heterophilous graphs under suitable conditions. We demonstrate simple intuition with the following toy example.

¹ In this chapter, we deal with node-focus tasks. Hence, when we mention GNNs, we refer to GNN models that follow the general framework for node-focus tasks described in Section 2.3.3.1.

[Figure 4.1: A perfectly heterophilous graph on which GCN achieves perfect class separation; nodes with labels 0 and 1 alternate, and every edge connects nodes of different labels.]

Consider the perfectly heterophilous graph (all inter-class edges) shown in Figure 4.1, where the color indicates the node label.
Blue-labeled and orange-labeled nodes are associated with the scalar features 0 and 1, respectively. If we consider a single-layer GCN that performs an averaging feature aggregation over all neighboring nodes (without self-connections), it is clear that all blue nodes will obtain a representation of 1, while the orange nodes will obtain a representation of 0. Additional layers/aggregations will continue to alternate the features between the two types of nodes. Regardless of the number of layers, the two classes can still be perfectly separated. In this toy example, each blue (orange) node only connects to orange (blue) nodes, and all blue (orange) nodes share similar neighborhood patterns in terms of their neighbors' label/feature distributions. Naturally, real graphs are rather more complicated. In this work, we elucidate this intuition and extend it to a more general case: put simply, given a (homophilous or heterophilous) graph, GCN has the potential to achieve good performance if nodes with the same label share similar neighborhood patterns. We theoretically support this argument by investigating the node embeddings learned by the GCN model. We further show that GCN performance is only loosely linked to homophily or heterophily in the underlying graph, and demonstrate how different notions of heterophily can yield better or worse discriminative performance.

4.2 Preliminaries

4.2.1 Homophily in Graphs

In this work, we investigate performance in the context of graph homophily and heterophily properties. Homophily in graphs is typically defined based on the similarity between connected node pairs, where two nodes are considered similar if they share the same node label. The homophily ratio is defined based on this intuition.

Definition 1 (Homophily). Given a graph G = {V, E} and node label vector y, the edge homophily ratio is defined as the fraction of edges that connect nodes with the same label. Formally, we have:

    h(G, {y_i ; i ∈ V}) = ( Σ_{(j,k)∈E} 1(y_j = y_k) ) / |E|,    (4.1)

where |E| is the number of edges in the graph and 1(·) is the indicator function.

A graph is typically considered highly homophilous when h(·) is large (typically 0.5 ≤ h(·) ≤ 1), given suitable label context. In such cases, edges predominantly connect nodes with the same label. On the other hand, a graph with a low edge homophily ratio is considered heterophilous. In future discourse, we write h(·) as h when discussing a fixed graph and label context. Note that homophily is a distinct property from heterogeneity; the former refers to the class labels of connected nodes, while the latter refers to their types.
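A short sketch of the edge homophily ratio in Eq. (4.1) on a toy edge list follows; the labels and edge sets are chosen arbitrarily for illustration.

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges connecting nodes with the same label, as in Eq. (4.1)."""
    same = sum(labels[j] == labels[k] for j, k in edges)
    return same / len(edges)

labels = np.array([0, 1, 0, 1, 0, 1])
hetero_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]   # every edge crosses labels
homo_edges = [(0, 2), (2, 4), (1, 3), (3, 5), (0, 4)]     # every edge stays within a label

print(edge_homophily(hetero_edges, labels))   # 0.0 -> perfectly heterophilous
print(edge_homophily(homo_edges, labels))     # 1.0 -> perfectly homophilous
```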
4.2.2 Graph Convolutional Networks

As discussed in Chapter 3, graph neural networks learn node representations by aggregating and transforming information over the graph structure. Different designs of the aggregation and transformation process lead to different graph neural network models [72, 40, 28, 84, 26, 103]. In this chapter, we mainly focus on the graph convolutional network (GCN). We described the GCN model in Section 3.2; however, our definition of the GCN model here differs slightly from that one, so we re-describe it for convenience. The GCN model consists of several GCN graph filtering layers (GCN layers). The k-th GCN layer takes the following form:

    H^{(k)} = σ(Â H^{(k−1)} W^{(k)}),

where σ(·) is a non-linear activation function, W^{(k)} ∈ R^{l×l} is a parameter matrix that transforms the features, and Â is a normalized adjacency matrix that performs the aggregation. Typically, we use the associated node features as the input to the GCN model, i.e., H^{(0)} = X. Different variants of the normalization can be adopted, including the row-normalized adjacency matrix and the symmetric adjacency matrix:

    Row-normalized adjacency matrix: Â = D^{−1}(A + I),    (4.2)
    Symmetric adjacency matrix: Â = D^{−1/2}(A + I)D^{−1/2},    (4.3)

where D is a diagonal matrix with D[i, i] = deg(i) + 1, with deg(i) denoting the degree of node i, and I ∈ {0, 1}^{|V|×|V|} is the identity matrix. Hence, adding I to the adjacency matrix A can be regarded as adding self-connections for the nodes.

4.3 Graph Convolutional Networks under Heterophily

Table 4.1: SSNC accuracy on two heterophilous datasets: standard GCN outperforms all heterophily-specific models on these graphs.

    Method         Chameleon (h = 0.23)   Squirrel (h = 0.22)
    GCN            67.96 ± 1.82           54.47 ± 1.17
    H2GCN-1        57.11 ± 1.58           36.42 ± 1.89
    H2GCN-2        59.39 ± 1.98           37.90 ± 2.02
    CPGNN-MLP      54.53 ± 2.37           29.13 ± 1.57
    CPGNN-Cheby    65.17 ± 3.17           29.25 ± 4.17
    GPRGNN         66.31 ± 2.05           50.56 ± 1.51

Considerable prior literature posits that graph neural networks (such as GCN) work by assuming and exploiting homophily in the underlying graph [56, 27, 92]. To this end, researchers have deemed such models ill-suited for heterophilous graphs, where the homophily ratio is low [106, 105, 10]. To deal with this limitation, researchers have proposed several methods, including H2GCN [106], CPGNN [105], and GPRGNN [10], which are explicitly designed to handle heterophilous graphs via architectural choices (e.g., adding skip-connections, carefully choosing aggregators, etc.). To demonstrate their assumptions, these works cited the poor SSNC performance of GCN on several heterophilous graphs, and to demonstrate their proposed solutions, they cited their designs' relative outperformance of GCN on these tasks. In this section, we revisit the claim that GCNs have fundamental homophily assumptions and are not suited for heterophilous graphs. To this end, we first observe empirically that the GCN model achieves fairly good performance on some of the commonly used heterophilous graphs; specifically, we present SSNC performance on two commonly used heterophilous graph datasets, Chameleon and Squirrel, in Table 4.1 (see Appendix A.2 for further details about the datasets and models). Both Chameleon and Squirrel are highly heterophilous (h ≈ 0.2). We find that with some hyperparameter tuning, GCN can outperform alternative methods uniquely designed to operate on heterophilous graphs by a sizeable margin. This observation suggests that GCN does not always "underperform" on heterophilous graphs, and it leads us to reconsider the prevalent assumption in the literature. Hence, we next examine how GCNs learn representations, and how this information is used in downstream SSNC tasks, homophily and heterophily assumptions aside.

4.3.1 When does GCN learn similar embeddings?

GCNs have been shown to be able to capture local graph topological and structural information [94, 60]. Specifically, the aggregation step in the GCN model is able to capture and discriminate neighborhood distribution information (more precisely, the mean of the neighborhood features) [94].
Let us consider the two nodes a and b shown in Figure 4.2, where color indicates the label of each node. If we further assume that all nodes sharing the same label are associated with exactly the same features, then clearly, after one step of aggregation, the GCN model with the row-normalized adjacency matrix (see Eq. (4.2); ignoring the self-connection here) will output exactly the same embedding for nodes a and b (the average of the "orange", "yellow", and "green" node features). Accordingly, [94] reasons that the GCN model lacks expressiveness, since it cannot differentiate nodes a and b in the embedding space: although they share the same neighborhood distribution, they differ concretely in neighborhood structure (here, with respect to neighbor count).

[Figure 4.2: Two nodes a and b share the same neighborhood distribution; GCN learns equivalent embeddings for a and b.]

However, in this example SSNC task, mapping nodes a and b to the same location in the embedding space is not an adverse effect, but explicitly desirable, since they share the same label and we are not concerned with telling them apart. Intuitively, if all nodes with the same label are mapped to the same embedding and the embeddings for different labels are distinct, then nodes are perfectly separable in the embedding space and can thus be effortlessly classified (one instance in which expressive power and task performance do not coincide). One instance in which such separability arises is the case of the "ideal graph" mentioned in [102], which considers a perfectly homophilous graph of |C| (fully) connected components, where all nodes in a component share the same label. Of course, in reality, such idealized graphs which meet these precise assumptions are rare to nonexistent. Thus, to consider a more practical scenario, we assume that both the features and the neighborhood patterns of nodes with a certain label are sampled from some fixed distributions. Under these conditions, nodes with the same label may not be mapped to exactly the same embedding. Instead, we aim to characterize how close the learned embeddings of same-label nodes are. Intuitively, if the learned embeddings for nodes with the same label are very close, we expect SSNC performance to be good (given that the embeddings of nodes of other classes are far away), as class separability is high (low intra-class variance and high inter-class variance) [22]. We prove that, for graphs meeting suitable conditions (features and neighborhood patterns for nodes with a certain label are sampled from fixed distributions), the distance between the embeddings learned by a single GCN layer for any node pair that shares the same label is bounded by a small quantity with high probability. We elaborate below.

Assumptions on Graphs. We consider a graph G, where each node i has features x_i ∈ R^l and label y_i. We assume that: (1) G is d-regular; (2) the features of node i are sampled from the feature distribution F_{y_i}, i.e., x_i ∼ F_{y_i}, with μ(F_{y_i}) denoting its mean; (3) the dimensions of x_i are independent of each other; (4) the features in X are bounded by a positive scalar B, i.e., max_{i,j} |X[i, j]| ≤ B; and (5) for node i, its neighbors' labels are independently sampled from the neighbor distribution D_{y_i}, with the sampling repeated d times to obtain the labels of its d neighbors. We denote a graph following Assumptions (1)–(5) as G = {V, E, {F_c, c ∈ C}, {D_c, c ∈ C}, d}.
Note that we use the subscripts in F_{y_i} and D_{y_i} to indicate that these two distributions are shared by all nodes with the same label as node i. For regular graphs, the two kinds of normalized adjacency matrices introduced in Section 4.2.2 lead to the same GCN model. Specifically, for a single-layer GCN model, the process can be written in the following form for node i:

    h_i = Σ_{j∈{i}∪N(i)} (1/(d+1)) W x_j,    (4.4)

where W ∈ R^{l×l} is the parameter matrix and x_j is the input feature vector of node j. h_i is the pre-activation output of the GCN model, and an element-wise non-linear activation function can be applied to h_i to obtain the final output.

Theorem 6. Consider a graph G = {V, E, {F_c, c ∈ C}, {D_c, c ∈ C}, d} that follows Assumptions (1)–(5). For any node i ∈ V, the expectation of the pre-activation output of a single-layer GCN model from Eq. (4.4) is given by

    E[h_i] = W ( (1/(d+1)) μ(F_{y_i}) + (d/(d+1)) E_{c∼D_{y_i}, x∼F_c}[x] ),    (4.5)

and for any t > 0, the probability that the distance between the observation h_i and its expectation is larger than t is bounded by

    P(‖h_i − E[h_i]‖_2 ≥ t) ≤ 2 · l · exp( − (d+1)t^2 / (2ρ^2(W)B^2 l) ),    (4.6)

where l is the feature dimensionality and ρ(W) denotes the largest singular value of W.

To prove Theorem 6, we first introduce the celebrated Hoeffding inequality below.

Lemma 7 (Hoeffding's Inequality). Let Z_1, …, Z_n be independent bounded random variables with Z_i ∈ [a, b] for all i, where −∞ < a ≤ b < ∞. Then

    P( (1/n) Σ_{i=1}^{n} (Z_i − E[Z_i]) ≥ t ) ≤ exp( − 2nt^2 / (b − a)^2 )

and

    P( (1/n) Σ_{i=1}^{n} (Z_i − E[Z_i]) ≤ −t ) ≤ exp( − 2nt^2 / (b − a)^2 )

for all t ≥ 0.

Proof of Theorem 6.

Proof. The expectation of h_i can be derived as follows:

    E[h_i] = E[ Σ_{j∈{i}∪N(i)} (1/(d+1)) W x_j ]
           = (1/(d+1)) W E[x_i] + (1/(d+1)) Σ_{j∈N(i)} W E[x_j]
           = (1/(d+1)) W μ(F_{y_i}) + (1/(d+1)) Σ_{j∈N(i)} W E_{c∼D_{y_i}, x∼F_c}[x]
           = W ( (1/(d+1)) μ(F_{y_i}) + (d/(d+1)) E_{c∼D_{y_i}, x∼F_c}[x] ).

We utilize Hoeffding's inequality to prove the bound in Eq. (4.6). Let x_j[k], k = 1, …, l, denote the k-th element of x_j. Then, for any dimension k, {x_j[k], j ∈ {i} ∪ N(i)} is a set of d + 1 independent random variables bounded in [−B, B]. Hence, directly applying Hoeffding's inequality to their average, for any t_1 ≥ 0 we have:

    P( | (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j[k] − E[x_j[k]]) | ≥ t_1 ) ≤ 2 exp( − (d+1)t_1^2 / (2B^2) ).

If ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2 ≥ √l · t_1, then the inequality | (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j[k] − E[x_j[k]]) | ≥ t_1 holds for at least one k ∈ {1, …, l}. Hence, by the union bound,

    P( ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2 ≥ √l · t_1 )
      ≤ Σ_{k=1}^{l} P( | (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j[k] − E[x_j[k]]) | ≥ t_1 )
      ≤ 2 · l · exp( − (d+1)t_1^2 / (2B^2) ).

Setting t_1 = t_2/√l gives

    P( ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2 ≥ t_2 ) ≤ 2 · l · exp( − (d+1)t_2^2 / (2B^2 l) ).

Furthermore, we have

    ‖h_i − E[h_i]‖_2 = ‖ W ( (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ) ‖_2
                     ≤ ‖W‖_2 ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2
                     = ρ(W) ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2,

where ‖W‖_2 is the matrix 2-norm of W; the last line uses the identity ‖W‖_2 = ρ(W). Then, for any t > 0,

    P(‖h_i − E[h_i]‖_2 ≥ t) ≤ P( ρ(W) ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2 ≥ t )
                            = P( ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2 ≥ t/ρ(W) )
                            ≤ 2 · l · exp( − (d+1)t^2 / (2ρ^2(W)B^2 l) ),

which completes the proof.

Theorem 8. Consider a graph G = {V, E, {F_c, c ∈ C}, {D_c, c ∈ C}, d} that follows Assumptions (1)–(5).
For any two nodes i, j ∈ V that share the same label, i.e., y_i = y_j, the probability that the distance between their pre-activation output embeddings (from Eq. (4.4)) is larger than t (t > 0) is bounded by

    P(‖h_i − h_j‖_2 ≥ t) ≤ 4 · l · exp( − (d+1)t^2 / (8ρ^2(W)B^2 l) ).    (4.7)

Proof. If ‖h_i − h_j‖_2 ≥ t, then at least one of the following two inequalities holds:

    ‖h_i − E[h_i]‖_2 ≥ t/2    or    ‖h_j − E[h_j]‖_2 ≥ t/2,

since E[h_i] = E[h_j] when y_i = y_j (by Eq. (4.5)). Hence, we have

    P(‖h_i − h_j‖_2 ≥ t) ≤ P( {‖h_i − E[h_i]‖_2 ≥ t/2} ∪ {‖h_j − E[h_j]‖_2 ≥ t/2} )
                         ≤ P(‖h_i − E[h_i]‖_2 ≥ t/2) + P(‖h_j − E[h_j]‖_2 ≥ t/2)
                         ≤ 4 · l · exp( − (d+1)t^2 / (8ρ^2(W)B^2 l) ).

Remark 1. Theorem 8 can be naturally extended to include non-linear activation functions. Specifically, many commonly used activation functions σ(·) (e.g., ReLU and sigmoid) are 1-Lipschitz continuous [71], i.e., |σ(x) − σ(y)| ≤ |x − y|. Hence, we have P(‖σ(h_i) − σ(h_j)‖_2 ≥ t) ≤ P(‖h_i − h_j‖_2 ≥ t).

Theorem 6 demonstrates two key ideas. First, in expectation, all nodes with the same label have the same embedding (Eq. (4.5)). Second, the distance between the output embedding of a node and its expectation is small with high probability. Building upon Theorem 6, Theorem 8 further demonstrates that two nodes with the same label are thus close to each other with high probability, anchored by their shared expected value. With Remark 1, we further have that the post-activation outputs (e.g., the final embeddings from a complete GCN layer with transformation, aggregation, and nonlinearity) are likewise bounded under common, suitable choices. Together, these results show that, under the given assumptions, the GCN model maps nodes with the same label close to each other in the embedding space, facilitating the SSNC task discussed earlier. Based on these understandings, we make the following key (informal) observations.

Observation 1 (GCN under Homophily). In homophilous graphs, the neighborhood distribution of nodes with the same label (w.l.o.g., c) can be approximately regarded as a highly skewed discrete distribution D_c, with most of the mass concentrated on category c. Thus, different labels clearly have distinct distributions. Hence, the GCN model typically excels at SSNC on such graphs.

Observation 2 (GCN under Heterophily). In heterophilous graphs, if the neighborhood distribution of nodes with the same label (w.l.o.g., c) is (approximately) sampled from a fixed distribution D_c, and different labels have distinguishable distributions, then the GCN model can excel at the SSNC task.

Note that the second observation follows from considering a "worst-case" scenario in which D_c = D_{c′} for two distinct classes c, c′, implying that they would frequently be misclassified. Notably, our findings illustrate that disruptions of certain conditions inhibit GCN performance on heterophilous graphs, but heterophily is neither a sufficient nor a necessary condition for poor GCN performance. Section 4.3.2 presents experiments bolstering this argument. We note that the "idealized" conditions for GCN SSNC performance discussed here offer a more general perspective than those in prior literature; [102] only discusses perfectly homophilous (h = 1) graphs as ideal for SSNC, and several works [27, 92] aim to learn graph structure over independent samples for downstream graph learning tasks using the same perspective. Our findings show that, under suitable conditions, heterophilous graphs can also be ideal in terms of effortless node classification. We note that although our derivations are for GCN, a similar line of analysis can be used for more general message-passing neural networks.
We note that although our derivations are for GCN, a similar line of analysis can be used for more general 46 message-passing neural networks. Algorithm 2: Heterophilous Edge Addition |C|−1 |C|−1 Input: G = {V, E}, 𝐾, {D𝑐 } 𝑐=0 and {V𝑐 } 𝑐=0 Output: G 0 = {V, E 0 } 1 Initialize G 0 = {V, E}, 𝑘 = 1 ; 2 while 1 ≤ 𝑘 ≤ 𝐾 do 3 Sample node 𝑖 ∼ Uniform (V); 4 Obtain the label, 𝑦𝑖 of node 𝑖; 5 Sample a label 𝑐 ∼ D𝑦𝑖 ; 6 Sample node 𝑗 ∼ Uniform (V𝑐 ); 7 Update edge set E 0 = E 0 ∪ {(𝑖, 𝑗)}; 8 𝑘 ← 𝑘 + 1; 9 return G 0 = {V, E 0 } 4.3.2 How does classification performance change over the homophily-heterophily spec- trum? We next evaluate how downstream performance changes as we make a homophilous graph more and more heterophilous. Prima facie wisdom from prior literature would indicate that performance degrades markedly as heterophily increases, when other factors (e.g. labels, features) are kept fixed. We next conduct experiments to substantiate our claims in Observations 1 and 2. Specifically, we consider two settings: (1) evaluating classification performance in heterophilous graphs under optimistic assumptions, where different labels have distinct distributions, and (2) evaluating classification performance under more pessimistic cases, where different labels’ distributions are muddled. 47 0.9 0.8 0.8 Test Accuracy(%) Test Accuracy(%) 0.7 0.7 0.6 0.6 =1 =1 =0.8 =0.8 =0.6 0.5 =0.6 0.5 =0.4 =0.4 = 0.2 0.4 = 0.2 =0 =0 0.4 0.3 0.84 0.79 0.74 0.69 0.65 0.58 0.53 0.48 0.44 0.38 0.34 0.3 0.81 0.74 0.68 0.62 0.58 0.51 0.45 0.41 0.37 0.32 0.27 0.24 Homophily Ratio Homophily Ratio (a) Synthetic graphs generated from Cora (b) Synthetic graphs generated from Citeseer Figure 4.3: SSNC accuracy of GCN on synthetic graphs with various homophily ratios, generated by adding heterophilous edges according to pre-defined target distributions on Cora (a) and Citeseer (b): performance first decreases, then increases, forming a 𝑉-shape (see the black lines). 4.3.2.1 Targeted Heterophilous Edge Addition We start with common, real-world benchmark graphs, and modify their topology by adding synthetic, cross-label edges that connect nodes with different labels. Note that we only add cross-label edges to ensure that as more edges are added, the graphs become more and more heterophilous. Following our discussion in Observation 2, we try to construct synthetic graphs that have similar neighborhood distributions for nodes with the same label. Specifically, given a real-world graph G, we first define a discrete neighborhood target distribution D𝑐 for each label 𝑐 ∈ C, as per Assumption 5 in Section 4.3.1. We then follow these target distributions to add cross-label edges. The process of generating new graphs by adding edges to G is shown in Algorithm 2. As shown in Algorithm 2, we add a total 𝐾 edges to the given graph G: to add each edge, we first uniformly sample a node 𝑖 from V with label 𝑦𝑖 , then we sample a label 𝑐 from C according to D𝑦𝑖 , and finally, we uniformly sample a node 𝑗 from V𝑐 and add the edge (𝑖, 𝑗) to the graph. In the limit (as 𝐾 → ∞), the neighborhood distributions in the generated graph converge to D𝑦𝑖 regardless of initial topology, and homophily decreases (ℎ → 0). We generate synthetic graphs based on several real-world graphs. We present the results based on Cora and Citeseer [74]. The results for other datasets can be found in Appendix A.1. Both Cora and Citeseer exhibit strong homophily. The Cora graph consists of 2,708 nodes and 5,278 edges and it has 7 labels. 
Both Cora and Citeseer exhibit strong homophily. The Cora graph consists of 2,708 nodes and 5,278 edges and has 7 labels. The Citeseer graph has 6 labels and consists of 3,327 nodes and 4,676 edges. For both datasets, we fix D_c for all labels. Although many suitable choices of D_c could be specified in line with Observation 2, we fix one set for illustration and brevity. For Cora, we specify D_0 as Categorical([0, 1/2, 0, 0, 0, 0, 1/2]), in which the (c+1)-th element corresponds to the probability mass for sampling a node with label c as a neighbor; all D_c are distinct. For both datasets, we vary K over 11 values and thus generate 11 graphs. Notably, as K increases, the homophily ratio h decreases. More detailed information about {D_c, c ∈ C} and K for both datasets is included in Appendix A.1.

Figures 4.3(a) and (b) show the SSNC results (accuracy) on graphs generated from Cora and Citeseer, respectively. The black line in both figures shows results for the presented setting (we introduce γ in the following subsection). Without loss of generality, we use Cora (Figure 4.3(a)) to discuss our findings, since the observations are similar across these datasets. Each point on the black line in Figure 4.3(a) represents the performance of the GCN model on a certain generated graph, and the corresponding value on the x-axis denotes the homophily ratio of this graph. The point with homophily ratio h = 0.85 denotes the original Cora graph, i.e., K = 0. We observe that as K increases, h decreases, and while the classification performance first decreases, it eventually begins to increase, showing a V-shape pattern. For instance, when h = 0.3 (a rather heterophilous graph), the GCN model achieves an impressive 86% accuracy, even higher than that achieved on the original Cora graph. We note that performance continues to increase as K increases further to the right (we censor due to space limitations). This clearly demonstrates that the GCN model can work well on heterophilous graphs under certain conditions. Intuitively, the V-shape arises due to a "phase transition", where the initial topology is overridden by edges added according to the associated target neighbor distributions D_c. In the original graph, the homophily ratio is quite high (h = 0.85), and classification behavior is akin to that discussed in Observation 1, where nodes with the same label have quite similar neighborhood patterns. As we add edges to the graph, the originally evident neighborhood patterns which benefited classification performance are perturbed by the added edges and gradually become less informative, which leads to the performance decrease in the decreasing segment of the V-shape in Figure 4.3(a). Then, as we keep adding more edges, the neighborhood patterns gradually approach D_c for all c, corresponding to the increasing segment of the V-shape.
As 𝐾 → ∞, neighborhood distributions will converge to the target D𝑐, and classification accuracy will reach 100% (see Appendix A.1).

(a) 𝛾 = 0, Acc: 86% (b) 𝛾 = 0.4, Acc: 66% (c) 𝛾 = 0.8, Acc: 44% (d) 𝛾 = 1, Acc: 41% [heatmaps of pairwise neighborhood similarity over the 7 Cora classes]
Figure 4.4: Cross-class neighborhood similarity on synthetic graphs generated from Cora; all graphs have ℎ = 0.3, but with varying neighborhood distributions as per the noise parameter 𝛾: as 𝛾 increases, the intra-class similarity and the inter-class similarity become closer to each other, indicating that the distributions for different classes become more indistinguishable.

4.3.2.2 Introducing noise to neighborhood distributions

In Section 4.3.2.1, we showed that the GCN model can achieve reasonable performance on heterophilous graphs constructed following distinct, pre-defined neighborhood patterns. As per Observation 2, our theoretical understanding suggests that performance should degrade under heterophily if the distributions of different labels get more and more indistinguishable. Hence, we next demonstrate this empirically by introducing controllable noise levels into our edge addition strategy. We adopt a strategy similar to that described in Algorithm 2, but with the key difference that we introduce an additional parameter 𝛾, which controls the probability that we add cross-label edges randomly rather than following the pre-defined distributions {D𝑐, 𝑐 ∈ C}. The pseudo code describing the process to generate graphs with a noise level 𝛾 is shown in Algorithm 3. The only difference from Algorithm 2 is in Lines 6-7: we randomly add edges if the generated random number 𝑟 is smaller than the pre-defined 𝛾 (i.e., with probability 𝛾). For nodes of a given class 𝑐 (w.l.o.g.), compared to the edges added according to D𝑐, the randomly added edges can be regarded as noise. Specifically, by increasing the noise parameter 𝛾, we increase the similarity between D𝑐 and D𝑐′ for any pair of labels 𝑐, 𝑐′. If 𝛾 = 1, then all neighborhood distributions will be indistinguishable (they will all be approximately uniform over C). By fixing 𝐾 and varying 𝛾, we can generate graph variants with exactly the same homophily ratio but with differing similarities between D𝑐 and D𝑐′. As in Section 4.3.2.1, we create graphs by adding edges at various 𝐾, but also vary 𝛾 ∈ [0, 1] in increments of 0.2 on both Cora and Citeseer.
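To make the generation procedure concrete, the following is a minimal Python sketch of the edge-addition process: setting gamma = 0 recovers Algorithm 2, while gamma > 0 corresponds to the noisy variant of Algorithm 3. The target distributions are passed as one categorical probability vector per label; variable names are illustrative, not part of the original text.

import numpy as np

def add_heterophilous_edges(edges, labels, K, target_dists, gamma=0.0, seed=0):
    """Add K cross-label edges following per-label target distributions.

    edges: list of (i, j) tuples (the original edge set E)
    labels: 1-D array, labels[i] is the class of node i (classes 0..|C|-1)
    target_dists: array of shape (|C|, |C|); row y is D_y, a categorical
                  distribution over candidate neighbor labels
    gamma: probability of adding a random cross-label edge instead of
           sampling the label from D_y (noise level)
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    nodes_by_class = {c: np.flatnonzero(labels == c) for c in classes}
    new_edges = list(edges)
    for _ in range(K):
        i = rng.integers(len(labels))            # sample node i uniformly from V
        y_i = labels[i]
        if rng.random() <= gamma:                # noisy branch (Algorithm 3, Lines 6-7)
            c = rng.choice(classes[classes != y_i])
        else:                                    # follow the target distribution D_{y_i}
            c = rng.choice(classes, p=target_dists[y_i])
        j = rng.choice(nodes_by_class[c])        # sample node j uniformly from V_c
        new_edges.append((int(i), int(j)))
    return new_edges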
We report the SSNC performance on these graphs in Figure 4.3.

Algorithm 3: Heterophilous Edge Addition with Noise
Input: G = {V, E}, 𝐾, 𝛾, {D𝑐}_{𝑐=0}^{|C|−1} and {V𝑐}_{𝑐=0}^{|C|−1}
Output: G′ = {V, E′}
1 Initialize G′ = {V, E}, 𝑘 = 1;
2 while 1 ≤ 𝑘 ≤ 𝐾 do
3   Sample node 𝑖 ∼ Uniform(V);
4   Obtain the label 𝑦𝑖 of node 𝑖;
5   Sample a number 𝑟 ∼ Uniform(0, 1); // Uniform(0, 1) denotes the continuous standard uniform distribution
6   if 𝑟 ≤ 𝛾 then
7     Sample a label 𝑐 ∼ Uniform(C \ {𝑦𝑖});
8   else
9     Sample a label 𝑐 ∼ D_{𝑦𝑖};
10  Sample node 𝑗 ∼ Uniform(V𝑐);
11  Update edge set E′ = E′ ∪ {(𝑖, 𝑗)};
12  𝑘 ← 𝑘 + 1;
13 return G′ = {V, E′}

We make several observations from the results. The noise level affects the performance significantly when the homophily ratio is low. For example, observing Figure 4.3(a) vertically at homophily ratio ℎ = 0.3, higher 𝛾 clearly results in worse performance. This indicates that not only the fixed-ness of the neighborhood distributions, but also their similarities, are important for the SSNC task. It also indicates that there are "good" and "bad" kinds of heterophily, adding yet another supporting claim that heterophily in and of itself is not a problem for GCNs. On the other hand, high 𝛾 does not have too negative an impact when 𝐾 is small, since noise is minimal and the original graph topology is still largely homophilous (when ℎ is large, towards the left of the plot). At this stage, both "good" (fixed and disparate patterns) and "bad" (randomly added edges) heterophilous edges introduce noise to the dominant homophilous patterns which allow GCNs to separate classes well. When the noise level 𝛾 is not too large, we can still observe the 𝑉-shape, e.g., 𝛾 = 0.4 in Figure 4.3(a) and 𝛾 = 0.2 in Figure 4.3(b); this is because the designed pattern is not totally dominated by the noise. However, when 𝛾 is too high, adding edges constantly decreases the performance, as nodes of different classes have indistinguishably similar neighborhoods.

To further demonstrate how 𝛾 affects the neighborhood distributions in the generated graph, we examine the cross-class neighborhood similarity, which we define as follows:

Definition 2 (Cross-Class Neighborhood Similarity). Given a graph G = {V, E} and node labels y for all nodes, the cross-class neighborhood similarity between classes 𝑐, 𝑐′ ∈ C is given by

𝑠(𝑐, 𝑐′) = (1 / (|V𝑐| |V𝑐′|)) Σ_{𝑖 ∈ V𝑐, 𝑗 ∈ V𝑐′} cos(𝑑(𝑖), 𝑑(𝑗)),    (4.8)

where V𝑐 (w.l.o.g.) indicates the set of nodes in class 𝑐, 𝑑(𝑖) denotes the empirical histogram (over |C| classes) of node 𝑖's neighbors' labels, and the function cos(·, ·) measures the cosine similarity.

When 𝑐 = 𝑐′, 𝑠(𝑐, 𝑐′) calculates the intra-class similarity; otherwise, it calculates the inter-class similarity from a neighborhood label distribution perspective. Intuitively, if nodes with the same label share the same neighborhood distributions, the intra-class similarity should be high. Likewise, to ensure the neighborhood patterns for nodes with different labels are distinguishable, the inter-class similarity should be low. To illustrate how various 𝛾 values affect the neighborhood patterns, we show the intra-class and inter-class similarities in Figure 4.4 for 𝛾 = 0, 0.4, 0.8, 1 on graphs generated from Cora with homophily ratio ℎ = 0.3. The diagonal cells in each heatmap indicate the intra-class similarity, while off-diagonal cells indicate inter-class similarity.
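A minimal sketch of how the cross-class neighborhood similarity of Eq. (4.8) (and hence heatmaps like Figure 4.4) can be computed is given below; it assumes dense numpy inputs, and the function names are illustrative.

import numpy as np

def neighbor_label_histograms(adj, labels, num_classes):
    """d(i): empirical histogram (over classes) of node i's neighbors' labels."""
    one_hot = np.eye(num_classes)[labels]          # (N, |C|)
    return adj @ one_hot                           # row i counts the neighbor labels of node i

def cross_class_similarity(adj, labels, num_classes):
    """s(c, c') from Eq. (4.8): mean pairwise cosine similarity of d(i), d(j)."""
    d = neighbor_label_histograms(adj, labels, num_classes)
    norms = np.linalg.norm(d, axis=1, keepdims=True)
    d_normed = d / np.clip(norms, 1e-12, None)     # unit-length rows (isolated nodes stay ~0)
    s = np.zeros((num_classes, num_classes))
    for c in range(num_classes):
        for c2 in range(num_classes):
            members_c = d_normed[labels == c]       # nodes in V_c
            members_c2 = d_normed[labels == c2]     # nodes in V_{c'}
            # average cos(d(i), d(j)) over all pairs (i, j) in V_c x V_{c'}
            s[c, c2] = (members_c @ members_c2.T).mean()
    return s

# Toy usage on a 4-cycle with two classes.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])
print(cross_class_similarity(adj, labels, num_classes=2))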
Clearly, when 𝛾 is small, the intra-class similarity is high while the inter-class similarity is low, which demonstrates the existence of strongly discriminative neighborhood patterns in the graph. As 𝛾 increases, the 52 intra-class and inter-class similarity get closer, becoming more and more indistinguishable, leading to bad performance due to indistinguishable distributions as referenced in Observation 2. 4.4 Revisiting GCN’s Performance on Real-world Graphs How can we reconcile our findings with existing literature on GCN’s performance under heterophily? As noted in Section 4.3, we observed that on some publicly available heterophilous graph benchmark datasets, the GCN model surprisingly outperforms methods specifically designed to deal with heterophilous graphs. In this section, we first give more details on these experiments. We next investigate why the GCN model does or does not work well on certain datasets utilizing the understanding developed in earlier sections. 4.4.1 Evaluating Node Classification on Homophilous and Heterophilous Benchmarks Following previous work [65, 106], we evaluate the performance of the GCN model on several real- world graphs with different levels of homophily. We include the citation networks Cora, Citeseer and Pubmed [40], which are highly homophilous. We also adopt several heterophilous benchmark datasets including Chameleon, Squirrel, Actor, Cornell, Wisconsin and Texas [65]. Ap- pendix A.2.1 gives descriptions and summary statistics of these datasets. For all datasets, we follow the experimental setting provided in [65], which consists of 10 random splits with proportions 48/32/20% corresponding to training/validation/test for each graph. For each split, we use 10 random seeds, and report the average performance and standard deviation across 100 runs. We compare the standard GCN model [40] with several recently proposed methods specifically designed for heterophilous graphs including H2GCN [106], GPR-GNN [10], and CPGNN [105]. A brief introduction of these methods can be found in Appendix A.2.2. We carefully tuned the parameters for all models. The detailed parameter tuning settings can be found in Appendix A.2.4. The experimental setting in H2GCN [106] is the same as [65], hence we include the results for their method reported in the original paper. The node classification performance (accuracy) of these models is reported in Table 4.2. Notably, 53 all models achieve comparable performance on graphs with high homophily (Cora, Citeseer, and Pubmed), as expected. For the heterophilous graphs, results are comparatively mixed. The GCN model outperforms all other methods on Squirrel and Chameleon, while underperforming on the other datasets (Actor, Cornell, Wisconsin, and Texas). However, on these datasets, a two-layer feedforward model (denoted as MLP) always achieves comparable or even better performance than even the methods specifically designed for these graphs. Hence, we suspect that in these datasets, the graph offers un-useful (and sometimes actively damaging) information for node classification. Note that, both H2GCN and GPRGNN have skip connection (or similar architecture) to directly connect the input features to the output. Their good performance on Actor, Cornell, Wisconsin, and Texas is likely attributed to this residual design. 
To verify this idea, we implement a simple method, which linearly combines the features output from GCN model and MLP model (denoted as MLP+GCN in Table 4.2; further details in Appendix A.2.3) and its performance is comparable or even better than heterophily-specific methods on all heterophilous graphs. 4.4.2 Investigating GCN performance on Homophilous and Heterophilous Benchmarks Our work so far illustrates that the popular notion of GCNs not being suitable for heterophily, or heterophily being a mandate for good GCN performance are at worst myths, and at best misleading. We believe this is due to prior evaluations having been conducted on a limited number of publicly available heterophilous graph benchmarks with relatively poor performance, which led to a (prima facie) conclusion that heterophily is the responsible for these issues rather than any of the other intrinsic properties of these datasets. In this subsection, we aim to use the understanding we developed in Section 4.3 to explain why GCN does (not) work well on real-world graphs. As in Section 4.3.2.2, we inspect cross-class neighborhood similarity (Eq.(4.8)) for each dataset; due to the space limit, we only include representative ones here (Cora, Chameleon, Actor and Cornell); see Figure 4.5). Heatmaps for the other datasets can be found in Appendix A.3. From Figure 4.5(a), it is clear that the intra-class similarity is much higher than the inter-similarity ones, hence Cora contains distinct neighborhood patterns, consistent with Observation 1. In Figure 4.5(b), we can 54 Table 4.2: GNN node classification performance (accuracy) on homophilous and heterophilous graphs. Standard GCN or MLP+GCN methods (top) outperform or achieve comparable performance to heterophily-specific methods (bottom). 
              Cora          Citeseer      Pubmed        Chameleon     Squirrel      Actor         Cornell       Wisconsin     Texas
GCN           87.12 ± 1.38  76.50 ± 1.61  88.52 ± 0.41  67.96 ± 1.82  54.47 ± 1.17  30.31 ± 0.98  59.35 ± 4.19  61.76 ± 6.15  63.81 ± 5.27
MLP           75.04 ± 1.97  72.40 ± 1.97  87.84 ± 0.30  48.11 ± 2.23  31.68 ± 1.90  36.17 ± 1.09  84.86 ± 6.04  86.29 ± 4.50  83.30 ± 4.54
MLP + GCN     87.01 ± 1.35  76.35 ± 1.85  89.77 ± 0.39  68.04 ± 1.86  54.48 ± 1.11  36.24 ± 1.09  84.82 ± 4.87  86.43 ± 4.00  83.60 ± 6.04
H2GCN-1       86.92 ± 1.37  77.07 ± 1.64  89.40 ± 0.34  57.11 ± 1.58  36.42 ± 1.89  35.86 ± 1.03  82.16 ± 6.00  86.67 ± 4.69  84.86 ± 6.77
H2GCN-2       87.81 ± 1.35  76.88 ± 1.77  89.59 ± 0.33  59.39 ± 1.98  37.90 ± 2.02  35.62 ± 1.30  82.16 ± 6.00  85.88 ± 4.22  82.16 ± 5.28
CPGNN-MLP     85.84 ± 1.20  74.80 ± 0.92  86.58 ± 0.37  54.53 ± 2.37  29.13 ± 1.57  35.76 ± 0.92  79.93 ± 6.12  84.58 ± 2.72  82.62 ± 6.88
CPGNN-Cheby   87.23 ± 1.31  76.64 ± 1.43  88.41 ± 0.33  65.17 ± 3.17  29.25 ± 4.17  34.28 ± 0.77  75.08 ± 7.51  79.19 ± 2.80  75.96 ± 5.66
GPR-GNN       86.79 ± 1.27  75.55 ± 1.56  86.79 ± 0.55  66.31 ± 2.05  50.56 ± 1.51  33.94 ± 0.95  79.27 ± 6.03  83.73 ± 4.02  84.43 ± 4.10

(a) Cora (b) Chameleon (c) Actor (d) Cornell [heatmaps of cross-class neighborhood similarity for each dataset]
Figure 4.5: Cross-class neighborhood similarity on homophilous graphs (a) and heterophilous graphs (b-d). The deviations between inter- and intra-class similarities substantiate GCN's performance on these datasets.

observe that in Chameleon, intra-class similarity is generally higher than inter-class similarity, though not as strong as in Figure 4.5(a). Additionally, there is an apparent gap between labels 0, 1 and 2, 3, 4, which helps separate nodes of the former two classes from the latter three, but potentially increases misclassification within each of the two groupings. These observations also help substantiate why GCN can achieve reasonable performance (much higher than MLP) on Chameleon. The GCN model underperforms MLP on Actor, and we suspect that the graph does not provide useful information. The heatmap for Actor in Figure 4.5 shows that the intra-class and inter-class similarities are almost equivalent, making the neighborhood distributions for different classes hard to distinguish and leading to bad GCN performance. Similar observations can be made for Cornell. Note that Cornell only consists of 183 nodes and 280 edges; hence, the similarities shown in Figure 4.5 are impacted significantly by few samples (e.g., there is a single node with label 1, leading to perfect intra-class similarity for label 1).

4.5 Related Work

Graph neural networks (GNNs) are powerful models for graph representation learning. They have been widely adopted to tackle numerous applications from various domains [40, 20, 2, 79].
[72] proposed the first GNN model, to tackle both node and graph level tasks. Subsequently, [6] and [15] generalized convolutional neural networks to graphs from the graph spectral perspective. [40] simplified the spectral GNN model and proposed graph convolutional networks (GCNs). Since then, numerous GNN variants, which follow specific forms of feature transformation (linear layers) and aggregation have been proposed [84, 28, 26, 43]. The aggregation process can be usually understood as feature smoothing [47, 52, 34, 107]. Hence, several recent works claim [106, 105, 10], assume [27, 92, 102] or remark upon [1, 56, 31] GNN models homophily-reliance or unsuitability in capturing heterophily. Several recent works specifically develop GNN models choices to tackle heterophilous graphs by carefully designing or modifying model architectures such as Geom-GCN [65], H2GCN [106], GPR-GNN [10], and CPGNN [105]. 56 CHAPTER 5 EIGENPOOLING: A GRAPH POOLING OPERATION FROM THE SPECTRAL GRAPH THEORY PERSPECTIVE 5.1 Introduction In this chapter, we work on graph level representation learning with a focus on the task of graph classification. The task of graph classification is to predict the label of a given graph utilizing its associated features and graph structure. GNNs1 can extract graph representation while using all associated information. Majority of existing graph neural networks [14, 18, 26, 49] have been designed to generate good node representations, and then globally summarize the node representations as the graph representation. These methods are inherently “flat” since they treat all the nodes equivalently when generating graph representation using the node representations. In other words, the entire graph structure information is totally neglected during this process. However, nodes are naturally of different statuses and roles in a graph, and they should contribute differently to the graph level representation. Furthermore, graphs often have different local structures (or subgraphs), which contain vital graph characteristics. For instance, in a graph of a protein, atoms (nodes) are connected via bonds (edges); some local structures, which consist of groups of atoms and their direct bonds, can represent some specific functional units, which, in turn, are important to tell the functionality of the entire protein [77, 18, 5]. These local structures are also not captured during the global summarizing process. To generate the graph representation which preserves the local and global graph structures, a hierarchical pooling process, analogous to the pooling process in conventional convolutional neural (CNN) networks [44], is needed. There are very recent works investigating the pooling procedure for graph neural networks [97, 15, 21, 81]. These methods group nodes into subgraphs (supernodes), coarsen the graph based on these subgraphs and then the entire graph information is reduced to the coarsened graph by 1 In this chapter, we are dealing with a graph-focus tasks. Hence, when we mention GNNs, we refer to the GNN models that follow the general framework for graph-focus task described in Section 2.3.3.2 57 generating features of supernodes from their corresponding nodes in subgraphs. However, when pooling the features for supernodes, average pooling or max pooling have been usually adopted where the structures of these group nodes (the local structures) are still neglected. 
With the local structures, the nodes in the subgraphs are of different statuses and roles when they contribute to the supernode representations. It is challenging to design a general graph pooling operation while incorporating the local structure information as 1) the subgraphs may contain different numbers of nodes, thus a fixed size graph pooling operation cannot work for all subgraphs; and 2) the subgraphs could have very different structures, which may require different approaches to summarize the information for the supernode representation. To address the aforementioned challenges, we design a novel graph pooling operation EigenPooling based on the eigenvectors of the subgraphs, which naturally have the same size of each subgraph and can effectively capture the local structures when summarizing node features for supernodes. EigenPooling can be used as pooling layers to stack with any graph neural network layers to form a novel framework EigenGCN for graph classification. 5.2 The Proposed Framework – EigenGCN In this chapter, we aim to develop a Graph Neural Networks (GNN) model, which consists of convolutional layers and pooling layers, to learn graph representations such that graph level classification can be applied. Before going to the details, we first introduced some notations and the problem setting. Problem Setting: A graph can be represented as G = {E, V}, where V = {𝑣 1 , . . . , 𝑣 𝑁 } is the set of 𝑁 nodes and E is the set of edges. The graph structure information can also be represented by an adjacency matrix A ∈ R𝑁×𝑁 . Furthermore, each node in the graph is associated with node features and we use X ∈ R𝑁×𝑑 to denote the node feature matrix, where 𝑑 is the dimension of features. Note that this node feature matrix can also be viewed as a 𝑑-dimensional graph signal [80] defined on the graph G. In the graph classification setting, we have a set of graphs {G𝑖 }𝑖 , each graph G𝑖 is associated with a label 𝑦𝑖 . The task of the graph classification is to take the graph (structure information and node features) as input and predict its corresponding label. To make the prediction, 58 it is important to extract useful information from both graph structure and node features. We aim to design graph convolution layers and EigenPooling to hierarchically extract graph features, which finally learns a vector representation of the input graph for graph classification. 5.2.1 An Overview of EigenGCN In this work, we build our model based on Graph Convolutional Networks (GCN) layers [40], which has been demonstrated to be effective in node-level representation learning. While the GCN model is originally designed for semi-supervised node classification, we only discuss the part for node representation learning but ignoring the classification part. The GCN model is stacked by several GCN graph filtering layers The GCN model has been designed for learning node representations. In the end, the output of the GCN model is a matrix instead of a vector. The procedure of the GCN model is rather “flat”, as it can only “pass message” between nodes through edges but cannot summarize the node information into the higher level graph representation. A simple way to summarize the node information to generate graph level representation is global pooling. For example, we could use the average of the node representations as the graph representation. However, in this way, a lot of key information is ignored and the graph structure is also totally overlooked during the pooling process. 
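As a point of reference for the "flat" global pooling just described, the following small sketch (numpy only, illustrative names, no training) stacks a single GCN-style graph filtering layer and then simply averages the node representations to obtain one graph-level vector; the hierarchical pooling developed in this chapter replaces exactly this averaging step.

import numpy as np

def gcn_layer(adj, X, W):
    """One GCN-style filtering layer: ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_tilde = adj + np.eye(adj.shape[0])                 # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    return np.maximum(A_hat @ X @ W, 0.0)                # ReLU

def average_readout(H):
    """'Flat' global pooling: mean of node representations as the graph representation."""
    return H.mean(axis=0)

# Toy usage: a 4-node path graph with 3-dimensional features and random weights.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 8))
graph_repr = average_readout(gcn_layer(adj, X, W))       # a single 8-dimensional vector
print(graph_repr.shape)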
To address this challenge, we propose eigenvector based pooling layers EigenPooling to hierar- chically summarize node information and generate graph representation. An illustrative example is demonstrated in Figure 5.1. In particular, several graph pooling layers are added between GCN layers. Each of the pooling layers pools the graph signal defined on a graph into a graph signal defined on a coarsened version of the input graph, which consists of fewer nodes. Thus, the design of the graph pooling layers consists of two components: 1) graph coarsening, which divides the graph into a set of subgraphs and form a coarsened graph by treating subgraphs as supernodes; and 2) transform the original graph signal information into the graph signal defined on the coarsened graph with EigenPooling. We coarsen the graph based on a subgraph partition. Given a subgraph partition with no overlaps between subgraphs, we treat each of the subgraphs as a supernode. To form a 59 𝑐𝑜𝑛𝑣% 𝑐𝑜𝑛𝑣& 𝑐𝑜𝑛𝑣' 𝑐𝑜𝑛𝑣( 𝑝𝑜𝑜𝑙% 𝑐𝑜𝑛𝑣) 𝑐𝑜𝑛𝑣* 𝑝𝑜𝑜𝑙& 𝑝𝑜𝑜𝑙' 𝑙𝑎𝑏𝑒𝑙 Figure 5.1: An illustrative example of the general framework coarsened graph of the supernodes, we determine the connectivity between the supernodes by the edges across the subgraphs. During the pooling process, for each of the subgraphs, we summarize the information of the graph signal on the subgraph to the supernode. With graph coarsening, we utilize the graph structure information to form coarsened graphs, which makes it possible to learn representations level by level in a hierarchical way. With EigenPooling, we can learn node features of the coarsened graph that exploits the subgraph structure as well as the node features of the input graph. Figure 5.1 shows an illustrative example, where a binary graph classification is performed. In this illustrative example, the graph is coarsened three times and finally becomes a single supernode. The input is a graph signal (the node features), which can be multi-dimensional. For the ease of illustration, we do not show the node features on the graph. Two convolutional layers are applied to the graph signal. Then, the graph signal is pooled to a signal defined on the coarsened graph. This procedure (two convolution layers and one pooling layer) is repeated two more times and the graph signal is finally pooled to a signal on a single node. This pooled signal on the single node, which is a vector, can be viewed as the graph representation. The graph representation then goes through several fully connected layers and the prediction is made upon the output of the last layer. Next, we introduce details of graph coarsening and EigenPooling of EigenGCN. 60 5.2.2 Graph Coarsening In this subsection, we introduce how we perform the graph coarsening. As we mentioned in the previous subsection, the coarsening process is based on subgraph partition. There are different ways to separate a given graph to a set of subgraphs with no overlapping nodes. In this paper, we adopt spectral clustering to obtain the subgraphs, so that we can control the number of the subgraphs, which, in turn, determines the pooling ratio. We leave other options as future work. Given a set of subgraphs, we treat them as supernodes and build the connections between them as similar in [83]. An example of the graph coarsening and supernodes is shown in Figure 5.1, where a subgraph and its supernodes are denoted using the same color. Next, we introduce how to mathematically describe the subgraphs, supernodes, and their relations. 
Let c be a partition of a graph G, which consists of 𝐾 connected subgraphs {G^(k)}_{k=1}^{K}. For the graph G, we have the adjacency matrix A ∈ R^{N×N} and the feature matrix X ∈ R^{N×d}. Let N_k denote the number of nodes in the subgraph G^(k), and let Γ^(k) be the list of nodes in subgraph G^(k). Note that each of the subgraphs can also be viewed as a supernode. For each subgraph G^(k), we can define a sampling operator C^(k) ∈ R^{N×N_k} as follows:

C^(k)[i, j] = 1 if and only if Γ^(k)(j) = v_i,    (5.1)

where C^(k)[i, j] denotes the element in the (i, j)-th position of C^(k) and Γ^(k)(j) is the j-th element in the node list Γ^(k). This operator provides a relation between the nodes in the subgraph G^(k) and the nodes in the original graph. Given a single-dimensional graph signal x ∈ R^{N×1} defined on the original entire graph, the induced signal that is only defined on the subgraph G^(k) can be written as

x^(k) = (C^(k))^T x.    (5.2)

On the other hand, we can also use C^(k) to up-sample a graph signal x^(k) defined only on the subgraph G^(k) to the entire graph G by

x̄ = C^(k) x^(k).    (5.3)

It keeps the values of the nodes in the subgraph untouched while setting the values of all the other nodes that do not belong to the subgraph to 0. The operator can be applied to a multi-dimensional signal X ∈ R^{N×d} in a similar way. The induced adjacency matrix A^(k) ∈ R^{N_k×N_k} of the subgraph G^(k), which only describes the connections within the subgraph G^(k), can be obtained as

A^(k) = (C^(k))^T A C^(k).    (5.4)

The intra-subgraph adjacency matrix of the graph G, which only consists of the edges inside each subgraph, can be represented as

A_int = Σ_{k=1}^{K} C^(k) A^(k) (C^(k))^T.    (5.5)

Then the inter-subgraph adjacency matrix of graph G, which only consists of the edges between subgraphs, can be represented as A_ext = A − A_int. Let G_coar denote the coarsened graph, which consists of the supernodes and their connections. We define the assignment matrix S ∈ R^{N×K}, which indicates whether a node belongs to a specific subgraph, as S[i, j] = 1 if and only if v_i ∈ Γ^(j). Then, the adjacency matrix of the coarsened graph is given as

A_coar = S^T A_ext S.    (5.6)

With graph coarsening, we can obtain the connectivity of G_coar, i.e., A_coar. Obviously, A_coar encodes the network structure information of G. Next, we describe how to obtain the node features X_coar of G_coar using EigenPooling. With A_coar and X_coar, we can stack more layers of GCN-GraphCoarsening-EigenPooling to learn higher level representations of the graph for classification.

5.2.3 Eigenvector-Based Pooling – EigenPooling

In this subsection, we introduce EigenPooling, aiming to obtain X_coar that encodes the network structure information and node features of G. Globally, the pooling operation transforms a graph signal defined on a given graph into a corresponding graph signal defined on the coarsened version of this graph. It is expected that the important information of the original graph signal can be largely preserved in the transformed graph signal. Locally, for each subgraph, we summarize the features of the nodes in this subgraph into a single representation of the supernode. It is necessary to consider the structure of the subgraph when we perform this summarization, as the subgraph structure also encodes important information. However, commonly adopted pooling methods such as max pooling [97, 15] or average pooling [18] ignore the graph structure.
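To make the coarsening quantities of Section 5.2.2 concrete, the following sketch builds the sampling operators C^(k), the intra- and inter-subgraph adjacency matrices A_int and A_ext, the assignment matrix S, and the coarsened adjacency A_coar from a given node partition. The partition is assumed to be provided (e.g., by spectral clustering), and all names are illustrative, not part of the original text.

import numpy as np

def coarsen(adj, partition):
    """partition: list of K node-index lists (the Gamma^(k) of Section 5.2.2)."""
    N, K = adj.shape[0], len(partition)
    C_ops, A_subs = [], []
    S = np.zeros((N, K))                                  # assignment matrix
    A_int = np.zeros_like(adj)
    for k, nodes in enumerate(partition):
        C_k = np.zeros((N, len(nodes)))                   # sampling operator, Eq. (5.1)
        C_k[nodes, np.arange(len(nodes))] = 1.0
        A_k = C_k.T @ adj @ C_k                           # induced subgraph adjacency, Eq. (5.4)
        A_int += C_k @ A_k @ C_k.T                        # Eq. (5.5)
        S[nodes, k] = 1.0
        C_ops.append(C_k)
        A_subs.append(A_k)
    A_ext = adj - A_int                                   # inter-subgraph edges only
    A_coar = S.T @ A_ext @ S                              # coarsened adjacency, Eq. (5.6)
    return C_ops, A_subs, A_int, A_ext, S, A_coar

# Toy usage: 6 nodes, two subgraphs {0, 1, 2} and {3, 4, 5}.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
C_ops, A_subs, A_int, A_ext, S, A_coar = coarsen(adj, [[0, 1, 2], [3, 4, 5]])
print(A_coar)   # 2x2 matrix; the off-diagonal entry counts the single edge (2, 3) between subgraphs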
In some works [63], the subgraph structure is used to find a canonical ordering of the nodes, which is, however, very difficult and expensive. In this work, to use the structure of the subgraphs, we design the pooling operator based on graph spectral theory by utilizing the eigenvectors of the Laplacian matrix of the subgraph. Next, we first briefly review the graph Fourier transform and then introduce the design of EigenPooling based on it.

5.2.3.1 Graph Fourier Transform

Consider a graph G = {E, V} with A ∈ R^{N×N} being the adjacency matrix and X ∈ R^{N×d} being the node feature matrix. Without loss of generality, for the following description, we consider d = 1, i.e., x ∈ R^{N×1}, which can be viewed as a single-dimensional graph signal defined on the graph G [70]. This is the spatial view of a graph signal, which maps each node in the graph to a scalar value (or a vector if the graph signal is multi-dimensional). Analogous to classical signal processing, we can define the graph Fourier transform [80] and the spectral representation of the graph signal in the spectral domain. To define the graph signal in the spectral domain, we need the Laplacian matrix [11] L = D − A, where D is the diagonal degree matrix with D[i, i] = Σ_{j=1}^{N} A[i, j]. The Laplacian matrix L can be used to define the "smoothness" of a graph signal [80] as follows:

s(x) = x^T L x = (1/2) Σ_{i,j} A[i, j] (x[i] − x[j])^2.    (5.7)

s(x) measures the smoothness of the graph signal x. The smoothness of a graph signal depends on how dramatically the values of connected nodes can change. The smaller s(x), the smoother the signal is. For example, for a connected graph, a graph signal with the same value on all the nodes has a smoothness of 0, which means "extremely smooth" with no change. As L is a real symmetric positive semi-definite matrix, it has a complete set of orthonormal eigenvectors {u_l}_{l=1}^{N}. These eigenvectors are also known as the graph Fourier modes [80], which are associated with the ordered real non-negative eigenvalues {λ_l}_{l=1}^{N}. Given a graph signal x, the graph Fourier transform can be obtained as

x̂ = U^T x,    (5.8)

where U = [u_1, . . . , u_N] ∈ R^{N×N} is the matrix consisting of the eigenvectors of L. The vector x̂ obtained after the transform is the representation of the graph signal in the spectral domain. Correspondingly, the inverse graph Fourier transform, which transfers the spectral representation back to the spatial representation, can be denoted as

x = U x̂.    (5.9)

Note that we can also view each eigenvector u_l of the Laplacian matrix L as a graph signal, and its corresponding eigenvalue λ_l measures its "smoothness". For any eigenvector u_l, we have

s(u_l) = u_l^T L u_l = u_l^T (λ_l u_l) = λ_l.    (5.10)

The eigenvectors (or Fourier modes) are a set of base signals with different "smoothness" defined on the graph G. Thus, the graph Fourier transform of a graph signal x can also be viewed as linearly decomposing the signal x into this set of base signals, and x̂ can be viewed as the coefficients of the linear combination of the base signals that recovers the original signal x.

5.2.3.2 The Design of Pooling Operations

Since the graph Fourier transform maps a graph signal to the spectral domain while taking into consideration both the graph structure and the graph signal information, we adopt it to design pooling operators, which pool the graph signal defined on a given graph G to a signal defined on its coarsened version G_coar.
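The graph Fourier transform reviewed above can be sketched in a few lines of numpy; the snippet below computes the Laplacian, its eigenvectors (the Fourier modes), the forward and inverse transforms of Eqs. (5.8)-(5.9), and the smoothness of Eq. (5.7). Function names and the toy graph are illustrative.

import numpy as np

def graph_fourier(adj, x):
    """Return (eigenvalues, Fourier modes U, Fourier coefficients U^T x)."""
    L = np.diag(adj.sum(axis=1)) - adj          # L = D - A
    lam, U = np.linalg.eigh(L)                  # ordered non-negative eigenvalues, orthonormal U
    return lam, U, U.T @ x                      # Eq. (5.8): x_hat = U^T x

def smoothness(adj, x):
    """s(x) = x^T L x, Eq. (5.7)."""
    L = np.diag(adj.sum(axis=1)) - adj
    return float(x @ L @ x)

# Toy usage: a 4-node path graph.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 0.0, -1.0])
lam, U, x_hat = graph_fourier(adj, x)
print(np.allclose(U @ x_hat, x))                  # inverse transform (Eq. 5.9) recovers x
print(smoothness(adj, x), x_hat @ (lam * x_hat))  # both equal s(x); a constant signal gives 0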
The design of the pooling operator is based on the graph Fourier transform of the subgraphs {G^(k)}_{k=1}^{K}. Let L^(k) denote the Laplacian matrix of the subgraph G^(k). We denote the eigenvectors of the Laplacian matrix L^(k) as u_1^(k), . . . , u_{N_k}^(k). We then use the up-sampling operator C^(k) to up-sample these eigenvectors (base signals on this subgraph) to the entire graph and get the up-sampled versions as

ū_l^(k) = C^(k) u_l^(k), l = 1, . . . , N_k.    (5.11)

With the up-sampled eigenvectors, we organize them into matrices to form pooling operators. Let Θ_l ∈ R^{N×K} denote the pooling operator consisting of all the l-th eigenvectors from all the subgraphs:

Θ_l = [ū_l^(1), . . . , ū_l^(K)].    (5.12)

Note that the subgraphs are not necessarily all of the same size, which means that the number of eigenvectors can differ across subgraphs. Let N_max = max_{k=1,...,K} N_k be the largest number of nodes among all the subgraphs. Then, for a subgraph G^(k) with N_k nodes, we set u_l^(k) for N_k < l ≤ N_max to 0 ∈ R^{N_k×1}. The pooling process with the l-th pooling operator Θ_l can be described as

X_l = Θ_l^T X,    (5.13)

where X_l ∈ R^{K×d} is the pooled result using the l-th pooling operator. The k-th row of X_l contains the information pooled from the k-th subgraph, which is the representation of the k-th supernode. Following this construction, we build a set of N_max pooling operators. To combine the information pooled by different pooling operators, we can concatenate them as follows:

X_pooled = [X_1, . . . , X_{N_max}],    (5.14)

where X_pooled ∈ R^{K×d·N_max} is the final pooled result. For efficient computation, instead of using the results pooled by all the pooling operators, we can choose to use only the first H of them:

X_coar = [X_1, . . . , X_H].    (5.15)

In Section 5.3.1 and Section 5.3.2, we will show that with H ≪ N_max, we can still preserve most of the information. We will further empirically investigate the effect of the choice of H in Section 5.4.

5.3 Theoretical Analysis of EigenPooling

In this section, we provide a theoretical analysis of EigenPooling by understanding it from local and global perspectives. We prove that the pooling operation can preserve useful information to be processed by the following GCN layers. We also verify that EigenGCN is permutation invariant, which lays the foundation for graph classification with EigenGCN.

5.3.1 A Local View of EigenPooling

In this subsection, we analyze the pooling operator from a local perspective, focusing on a specific subgraph G^(k). For the subgraph G^(k), the pooling operator tries to summarize the nodes' features and form a representation for the corresponding supernode of the subgraph. For a pooling operator Θ_l, the part that is effective on the subgraph G^(k) is only the up-sampled eigenvector ū_l^(k), as the other eigenvectors have 0 values on the subgraph G^(k). Without loss of generality, consider a single-dimensional graph signal x ∈ R^{N×1} defined on G; the pooling operation on G^(k) can be represented as

(ū_l^(k))^T x = (u_l^(k))^T x^(k),    (5.16)

which is the Fourier coefficient of the Fourier mode u_l^(k) of the subgraph G^(k). Thus, from a local perspective, the pooling process is a graph Fourier transform of the graph signal defined on the subgraph. As introduced in Section 5.2.3.1, each of the Fourier modes (eigenvectors) is associated with an eigenvalue, which measures its smoothness.
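Putting the pieces together, a compact sketch of the EigenPooling operation of Eqs. (5.11)-(5.15) is given below: for each subgraph it computes the Laplacian eigenvectors, up-samples them to the full graph, stacks the l-th eigenvectors of all subgraphs into Θ_l, and concatenates the first H pooled results. Names are illustrative, and the partition is assumed to be given.

import numpy as np

def eigen_pooling(adj, X, partition, H):
    """Pool node features X (N x d) onto K supernodes using the first H operators."""
    N, d = X.shape
    K = len(partition)
    N_max = max(len(nodes) for nodes in partition)
    H = min(H, N_max)
    # Up-sampled Laplacian eigenvectors of each subgraph (zero-padded beyond N_k), Eq. (5.11).
    u_bar = np.zeros((K, N_max, N))                       # u_bar[k, l] is bar{u}_{l+1}^{(k)}
    for k, nodes in enumerate(partition):
        nodes = np.asarray(nodes)
        A_k = adj[np.ix_(nodes, nodes)]                   # induced subgraph adjacency
        L_k = np.diag(A_k.sum(axis=1)) - A_k
        _, U_k = np.linalg.eigh(L_k)                      # eigenvectors sorted by eigenvalue
        u_bar[k, :len(nodes), nodes] = U_k.T              # up-sample into the full graph
    pooled = []
    for l in range(H):                                    # Theta_l = [u_bar_l^(1), ..., u_bar_l^(K)]
        Theta_l = u_bar[:, l, :].T                        # shape (N, K), Eq. (5.12)
        pooled.append(Theta_l.T @ X)                      # X_l = Theta_l^T X, Eq. (5.13)
    return np.concatenate(pooled, axis=1)                 # X_coar = [X_1, ..., X_H], Eq. (5.15)

# Toy usage: 6 nodes, 3-dimensional features, two subgraphs, keep H = 2 operators.
rng = np.random.default_rng(0)
adj = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
X = rng.normal(size=(6, 3))
print(eigen_pooling(adj, X, [[0, 1, 2], [3, 4, 5]], H=2).shape)   # (2 supernodes, 3*2 features)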
The Fourier coefficient of the corresponding Fourier mode indicates the importance of this Fourier mode to the signal. The coefficient summarizes the graph signal information utilizing both the node features and the subgraph structure, as the smoothness is related to both of them. Each of the coefficients characterizes a different property (smoothness) of the graph signal. Using the first H coefficients while discarding the others means that we focus more on the "smoother" part of the graph signal, which is common in many applications such as signal denoising and compression [83, 9, 61]. Therefore, we can use the squared summation of the coefficients to measure how much information can be preserved, as shown in the following theorem.

Theorem 9. Let x be a graph signal defined on the graph G and U = [u_1, . . . , u_N] be the Fourier modes of this graph. Let x̂ be the corresponding Fourier coefficients, i.e., x̂ = U^T x. Let x′ = Σ_{l=1}^{H} x̂[l] · u_l be the signal reconstructed using only the first H Fourier modes. Then ‖x̂[1 : H]‖_2^2 / ‖x̂‖_2^2 measures the information being preserved by this reconstruction. Here x̂[1 : H] denotes the vector consisting of the first H elements of x̂.

Proof. According to Eq. (5.9), x can be written as x = Σ_{l=1}^{N} x̂[l] · u_l. Since U is orthogonal, we have

‖x′‖_2^2 / ‖x‖_2^2 = ‖Σ_{l=1}^{H} x̂[l] · u_l‖_2^2 / ‖Σ_{l=1}^{N} x̂[l] · u_l‖_2^2 = ‖x̂[1 : H]‖_2^2 / ‖x̂‖_2^2,    (5.17)

which completes the proof.

It is common for natural graph signals that the magnitude of the spectral form of the signal is concentrated on the first few coefficients [70, 80], which means that ‖x̂[1 : H]‖_2^2 / ‖x̂‖_2^2 ≈ 1 for H ≪ N_k. In other words, by using the first H coefficients, we can preserve the majority of the information while reducing the computational cost. We will empirically verify this in the experiment section.

5.3.2 A Global View of EigenPooling

In this subsection, we analyze the pooling operators from a global perspective, focusing on the entire graph G. The pooling operators we constructed can be viewed as a filterbank [83]. Each of the filters in the filterbank filters the given graph signal and obtains a new graph signal. In our case, the filtered signal is defined on the coarsened graph G_coar. Without loss of generality, we consider a single-dimensional signal x ∈ R^{N×1} of G; then the filtered signals are {x_l}_{l=1}^{N_max}. Next, we describe some key properties of these pooling operators.

Property 1: Perfect Reconstruction. The first property is that when N_max filters are used, the input graph signal can be perfectly reconstructed from the filtered signals.

Lemma 10. The graph signal x can be perfectly reconstructed from its filtered signals {x_l}_{l=1}^{N_max} together with the pooling operators {Θ_l}_{l=1}^{N_max} as x = Σ_{l=1}^{N_max} Θ_l x_l.

Proof. With the definition of Θ_l given in Eq. (5.12), we have

Σ_{l=1}^{N_max} Θ_l x_l = Σ_{l=1}^{N_max} Σ_{k=1}^{K} ū_l^(k) · x_l[k] = Σ_{k=1}^{K} Σ_{l=1}^{N_max} ū_l^(k) · x_l[k].    (5.18)

From Eq. (5.13), we know that x_l[k] = (ū_l^(k))^T x. Together with the fact that ū_l^(k) = C^(k) u_l^(k) in Eq. (5.11), we can rewrite Σ_{l=1}^{N_max} ū_l^(k) · x_l[k] as

Σ_{l=1}^{N_max} ū_l^(k) · x_l[k] = C^(k) (Σ_{l=1}^{N_max} u_l^(k) (u_l^(k))^T) x^(k).    (5.19)

Obviously, Σ_{l=1}^{N_max} u_l^(k) (u_l^(k))^T = I, since {u_l^(k)}_{l=1}^{N_k} are orthonormal and {u_l^(k)}_{l=N_k+1}^{N_max} are all 0 vectors. Thus, Σ_{l=1}^{N_max} ū_l^(k) · x_l[k] = C^(k) x^(k). Substituting this into Eq. (5.18), we arrive at

Σ_{l=1}^{N_max} Θ_l x_l = Σ_{k=1}^{K} C^(k) x^(k) = x,    (5.20)

which completes the proof.
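Lemma 10 (and the energy identity of the next subsection) can be checked numerically. The short self-contained script below builds the full set of N_max pooling operators on a toy graph and verifies that Σ_l Θ_l x_l recovers x and that Σ_l ‖x_l‖^2 = ‖x‖^2; it is a sanity-check sketch, not part of the original text.

import numpy as np

rng = np.random.default_rng(0)
# Toy graph: 6 nodes, partitioned into subgraphs {0, 1, 2} and {3, 4, 5}.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
partition = [np.array([0, 1, 2]), np.array([3, 4, 5])]
N, K = 6, 2
N_max = max(len(p) for p in partition)

# Build all N_max pooling operators Theta_l (Eqs. 5.11-5.12).
Theta = np.zeros((N_max, N, K))
for k, nodes in enumerate(partition):
    A_k = adj[np.ix_(nodes, nodes)]
    L_k = np.diag(A_k.sum(axis=1)) - A_k
    _, U_k = np.linalg.eigh(L_k)
    for l in range(len(nodes)):
        Theta[l, nodes, k] = U_k[:, l]          # up-sampled eigenvector bar{u}_l^{(k)}

x = rng.normal(size=N)                           # a random single-dimensional graph signal
x_l = [Theta[l].T @ x for l in range(N_max)]     # pooled signals, Eq. (5.13)

x_rec = sum(Theta[l] @ x_l[l] for l in range(N_max))
print(np.allclose(x_rec, x))                                         # Lemma 10: perfect reconstruction
print(np.isclose(sum(np.sum(v ** 2) for v in x_l), np.sum(x ** 2)))  # Lemma 12: energy preserved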
From Lemma 10, we know that if N_max filters are chosen, the filtered signals {x_l}_{l=1}^{N_max} can preserve all the information from x. Thus, together with graph coarsening, eigenvector pooling can preserve the signal information of the input graph and can enlarge the receptive field, which allows us to finally learn a vector representation for graph classification.

Property 2: Energy/Information Preserving. The second property is that the filtered signals preserve all the energy when N_max filters are chosen. To show this, we first give the following lemma, which serves as a tool for demonstrating Property 2.

Lemma 11. All the columns in the operators {Θ_l}_{l=1}^{N_max} are orthogonal to each other.

Proof. By definition, we know that, for the same k, i.e., the same subgraph, u_l^(k), l = 1, . . . , N_max are orthogonal to each other, which means ū_l^(k), l = 1, . . . , N_max are also orthogonal to each other. In addition, all the ū_l^(k) with different k are also orthogonal to each other, as they only have non-zero values on different subgraphs.

With the above lemma, we can further conclude that the squared ℓ2 norm of the graph signal x is equal to the summation of the squared ℓ2 norms of the pooled signals {x_l}_{l=1}^{N_max}. The proof is given as follows:

Lemma 12. The squared ℓ2 norm of the graph signal x is equal to the summation of the squared ℓ2 norms of the pooled signals {x_l}_{l=1}^{N_max}:

‖x‖_2^2 = Σ_{l=1}^{N_max} ‖x_l‖_2^2.    (5.21)

Proof. From Lemmas 10 and 11, we have

‖x‖_2^2 = ‖Σ_{l=1}^{N_max} Θ_l x_l‖_2^2 = ‖Σ_{k=1}^{K} Σ_{l=1}^{N_max} ū_l^(k) · x_l[k]‖_2^2 = Σ_{k=1}^{K} Σ_{l=1}^{N_max} x_l[k]^2 = Σ_{l=1}^{N_max} ‖x_l‖_2^2,

which completes the proof.

Property 3: Approximate Energy Preserving. Lemma 12 describes the energy preservation when N_max filters are chosen. In practice, we only need H ≪ N_max filters for efficient computation. Next we show that even with H filters, the filtered signals preserve most of the energy/information.

Theorem 13. Let x′ = Σ_{l=1}^{H} Θ_l x_l be the graph signal reconstructed using only the first H pooling operators and pooled signals {x_l}_{l=1}^{H}. Then the ratio Σ_{l=1}^{H} ‖x_l‖_2^2 / Σ_{l=1}^{N_max} ‖x_l‖_2^2 measures the portion of information being preserved by x′.

Proof. As shown in Lemma 12, ‖x‖_2^2 = Σ_{l=1}^{N_max} ‖x_l‖_2^2. Similarly, we can show that ‖x′‖_2^2 = Σ_{l=1}^{H} ‖x_l‖_2^2. The portion of the information being preserved can thus be represented as

‖x′‖_2^2 / ‖x‖_2^2 = Σ_{l=1}^{H} ‖x_l‖_2^2 / Σ_{l=1}^{N_max} ‖x_l‖_2^2,    (5.22)

which completes the proof.

Since for natural graph signals the magnitude of the spectral form of the signal is concentrated in the first few coefficients [70], even with H filters EigenPooling can preserve the majority of the information/energy.

5.3.3 Permutation Invariance of EigenGCN

EigenGCN takes the adjacency matrix A and the node feature matrix X as input and aims to learn a vector representation of the graph. The nodes in a graph do not have a specific ordering, i.e., A and X can be permuted. Obviously, for the same graph, we want EigenGCN to extract the same representation no matter which permutation of A and X is used as input. Thus, in this subsection, we prove that EigenGCN is permutation invariant, which lays the foundation for using EigenGCN for graph classification.

Theorem 14. Let P ∈ {0, 1}^{n×n} be any permutation matrix; then EigenGCN(A, X) = EigenGCN(PAP^T, PX), i.e., EigenGCN is permutation invariant.

Proof.
In order to prove that EigenGCN is permutation invariant, we only need to show that its key components, GCN, graph coarsening, and EigenPooling, are permutation invariant. For GCN, before permutation, the output is F = ReLU(D̃^{−1/2} Ã D̃^{−1/2} X W). With permutation, the output becomes

ReLU((P D̃^{−1/2} Ã D̃^{−1/2} P^T)(P X) W) = ReLU(P D̃^{−1/2} Ã D̃^{−1/2} X W) = P F,

where we have used P^T P = I. This shows that the effect of permutation on GCN only permutes the order of the node representations but does not change the values of the representations. Second, the graph coarsening is done by spectral clustering with A. No matter which permutation we have, the detected subgraphs will not change. Finally, EigenPooling summarizes the information within each subgraph. Since the subgraph structures are not affected by the permutation and the representation of each node in the subgraphs is also not affected by the permutation, we can see that the supernodes' representations after EigenPooling are not affected by the permutation. In addition, the inter-connectivity of supernodes is not affected, since it is determined by spectral clustering. Thus, we can say that one step of GCN-Graph Coarsening-EigenPooling is permutation invariant. Since finally EigenGCN learns one vector representation of the input graph, we can conclude that the vector representation is the same under any permutation P.

5.4 Experiment

In this section, we conduct experiments to evaluate the effectiveness of the proposed framework EigenGCN. Specifically, we aim to answer two questions:
• Can EigenGCN improve the graph classification performance by the design of EigenPooling?
• How reliable is it to use H filters for pooling?
We begin by introducing datasets and experimental settings. We then compare EigenGCN with representative and state-of-the-art baselines for graph classification to answer the first question. We further conduct analysis on graph signals to verify the reasonableness of using H filters, which answers the second question.

5.4.1 Datasets

To verify whether the proposed framework can hierarchically learn good graph representations for classification, we evaluate EigenGCN on 6 widely used benchmark data sets for graph classification [38], which include three protein graph data sets, i.e., ENZYMES [5, 73], PROTEINS [5, 17], and D&D [17, 77]; one mutagen data set, Mutagenicity [67, 37] (denoted as MUTAG in Tables 5.1 and 5.2); and two data sets that consist of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, NCI1 and NCI109 [88]. Some statistics of the data sets can be found in Table 5.1. From the table, we can see that the used data sets contain varied numbers of graphs and have different graph sizes. We include data sets of different domains, sample sizes, and graph sizes to give a comprehensive understanding of how EigenGCN performs with data sets under various conditions.

Table 5.1: Statistics of datasets
            ENZYMES  PROTEINS  D&D     NCI1   NCI109  MUTAG
# graphs    600      1,113     1,178   4,110  4,127   4,337
mean |V|    32.63    39.06     284.32  29.87  29.68   30.32
# classes   6        2         2       2      2       2

5.4.2 Baselines and Experimental Settings

To compare the performance of graph classification, we consider some representative and state-of-the-art graph neural network models with various pooling layers. Next, we briefly introduce these baseline approaches as well as the experimental settings for them.
• GCN [40] is a graph neural network framework proposed for semi-supervised node classification.
It learns node representations by aggregating information from neighbors. As, the GCN model does not consist of a pooling layer, to obtain the graph representation, we use the average of the learned node representations as the graph representation. We use it as a baseline to compare whether a hierarchical pooling layer is necessary. • GraphSage [28] is similar as the GCN and provides various aggregation method. Here, we adopt the mean aggregator for node representation learning. The average of the learned node representations is served as the graph representation. • SET2SET. This baseline is also built upon GCN, it is also “flat” but uses set2set architecture introduced in [86] instead of averaging over all the nodes. We select this method to further 72 show whether a hierarchical pooling layer is necessary no matter average or other pooling methods are used. • DGCNN [98] is built upon the GCN layer. The features of nodes are sorted before feeding them into traditional 1-D convolutional and dense layers [98]. This method is also “flat” without a hierarchical pooling procedure. • Diff-pool [97] is a graph neural network model designed for graph level representation learning with differential pooling layers. It uses node representations learned by an additional convolutional layer to learn the subgraphs (supernodes) and coarsen the graph based on it. We select this model as it achieves state-of-art performance on the graph classification task. • EigenGCN-H represents various variants of the proposed framework EigenGCN, where 𝐻 denotes the number of pooling operators we use for EigenPooling. In this evaluation, we choose 𝐻 = 1, 2, 3. For each of the data sets, we randomly split it to 3 parts, i.e., 80% as training set, 10% as validation set and 10% as testing set. We repeat the randomly splitting process 10 times, and the average performance of the 10 different splits are reported. The parameters of baselines are chosen based on their performance on the validate set. For the proposed framework, we use the 9 splits of the training set and validation set to tune the structure of the graph neural network as well as the learning rate. The same structure and learning rate are then used for all 9 splits. Following previous work [97], we adopt the widely used evaluation metric, i.e., Accuracy, for graph classification to evaluate the performance. 5.4.3 Performance on Graph Classification Each experiment is run 10 times and the average graph classification performance in terms of accuracy is reported in Table 5.2. From the table, We make the following observations: 73 • Diff-pool and the EigenGCN framework perform better than those methods without a hierarchical pooling procedure in most of the cases. Aggregating the node information hierarchically can help learn better graph representations. • The proposed framework EigenGCN shares the same convolutional layer with GCN, GraphSage, and SET2SET. However, the proposed framework (with different 𝐻) outperforms them with a large margin in most of the data sets. This further indicates the necessity of the hierarchical pooling procedure. In other words, the proposed EigenPooling can indeed help the graph classification performance. • In most of the data sets, we can observe that the variants of the EigenGCN with more eigenvectors achieve better performance than those with fewer eigenvectors. Including more eigenvectors, which suggests that we can preserve more information during pooling, can help learn better graph representations in most of the cases. 
In some of the data sets, including more eigenvectors does not bring any improvement in performance or even makes the performance worse. Theoretically, we are able to preserve more information by using more eigenvectors. However, noise signals may also be preserved, which can be filtered out when using fewer eigenvectors.
• The proposed EigenGCN achieves state-of-the-art or at least comparable performance on all the data sets, which shows the effectiveness of the proposed framework EigenGCN.
To sum up, EigenPooling can help learn better graph representations, and the proposed framework EigenGCN with EigenPooling can achieve state-of-the-art performance on the graph classification task.

Table 5.2: Performance comparison.
             ENZYMES  PROTEINS  D&D    NCI1   NCI109  MUTAG
GCN          0.440    0.740     0.759  0.725  0.707   0.780
GraphSage    0.554    0.746     0.766  0.732  0.703   0.785
SET2SET      0.380    0.727     0.745  0.715  0.686   0.764
DGCNN        0.410    0.732     0.778  0.729  0.723   0.788
Diff-pool    0.636    0.759     0.780  0.760  0.741   0.806
EigenGCN-1   0.650    0.751     0.775  0.760  0.746   0.801
EigenGCN-2   0.645    0.754     0.770  0.767  0.748   0.789
EigenGCN-3   0.645    0.766     0.786  0.770  0.749   0.795

5.4.4 Understanding Graph Signals

In this subsection, we investigate the distribution of the Fourier coefficients of signals in real data. We aim to show that for natural graph signals, most of the information/energy concentrates on the first few Fourier modes (or eigenvectors). This paves the way to use only H filters in EigenPooling. Specifically, given one data set, for each graph G_i with N_i nodes and its associated signal X_i ∈ R^{N_i×d}, we first calculate the graph Fourier transform and obtain the coefficients X̂_i ∈ R^{N_i×d}. We then calculate the following ratio for various values of H:

r_i^H = ‖X̂_i[1 : H, :]‖_2^2 / ‖X̂_i‖_2^2,

where X̂_i[1 : H, :] denotes the first H rows of the matrix X̂_i. According to Theorem 13, this ratio measures how much information can be preserved by the first H coefficients. We then average the ratio over the entire data set and obtain

r^H = (1/M) Σ_i r_i^H,    (5.23)

where M denotes the number of graphs in the data set. Note that if H > N_i, we set r_i^H = 1. We visualize the ratio for each of the data sets up to H = 40 in Figure 5.2.

(a) ENZYMES (b) PROTEINS (c) NCI1 (d) NCI109 (e) Mutagenicity (f) DD [curves of the ratio r^H (from 0 to 1) versus H (from 0 to 40) for each data set]
Figure 5.2: Understanding graph signals

As shown in Figure 5.2, for most of the data sets, the magnitude of the coefficients is concentrated in the first few coefficients, which demonstrates the reasonableness of using only H ≪ N_max filters in EigenPooling. In addition, using H filters can save computational cost.

5.5 Related Work

In recent years, graph neural network models, which try to extend deep neural network models to graph-structured data, have attracted increasing interest. These graph neural network models have been applied to various applications in many different areas. In [40], a graph neural network model that learns node representations by aggregating the node features from neighbors is applied to perform semi-supervised node classification. Similar methods were later proposed to further enhance the performance by including an attention mechanism [84]. GraphSage [97], which allows
5.5 Related Work

In recent years, graph neural network models, which try to extend deep neural network models to graph-structured data, have attracted increasing interest. These graph neural network models have been applied to various applications in many different areas. In [40], a graph neural network model that tries to learn node representations by aggregating the node features from their neighbors is applied to perform semi-supervised node classification. Similar methods were later proposed to further enhance the performance by including an attention mechanism [84]. GraphSage [97], which allows a more flexible aggregation procedure, was designed for the same task. There are some graph neural network models designed to reason about the dynamics of physical systems, where the model is applied to predict future states of nodes given their previous states [3, 69]. Most of the aforementioned methods fit into the framework of “message passing” neural networks [26], which mainly involves transforming, propagating and aggregating node features across the graph through edges. Another stream of graph neural networks was developed based on the graph Fourier transform [15, 6, 29, 45]. The features are first transformed into the spectral domain, then filtered with learnable filters, and finally transformed back to the spatial domain. The connection between these two streams of work is shown in [15, 40]. Comprehensive surveys on graph neural networks can be found in [104, 93, 100, 4].

However, the design of the graph neural network layers is inherently “flat”, which means the output of pure graph neural network layers consists of node representations for all the nodes in the graph. To apply graph neural networks to the graph classification task, an approach to summarize the learned node representations and generate the graph representation is needed. A simple way to generate the graph representation is to globally combine the node representations. Different combination approaches have been investigated, which include averaging over all node representations as the graph representation [18], adding a “virtual node” connected to all the nodes in the graph and using its node representation as the graph representation [49], and using conventional fully connected layers or convolutional layers after arranging the graphs to a fixed size [98, 26]. However, these global pooling methods cannot hierarchically learn graph representations, thus ignoring important information in the graph structure. There are a few recent works [15, 97, 81, 21] investigating learning graph representations with a hierarchical pooling procedure. These methods usually involve two steps: 1) coarsening a graph by grouping nodes into supernodes to form a hierarchical structure; and 2) learning supernode representations level by level to finally obtain the graph representation. These methods use mean-pooling or max-pooling when they generate the supernode representations, which neglects the important structure information in the subgraphs. In this chapter, we propose a pooling operator based on the local graph Fourier transform, which utilizes the subgraph structure as well as the node features for generating the supernode representations.

CHAPTER 6
REPPOOL: A GRAPH POOLING OPERATION WITH REPRESENTATIVENESS

6.1 Introduction

In this chapter, we continue considering the hierarchical pooling procedure. In the previous chapter, we grouped nodes into “supernodes” to coarsen graphs during the graph pooling process. There is another stream of works on graph pooling operations, which are based on downsampling. These graph pooling operations identify and sample important nodes to build coarsened graphs. This perspective enables GNNs1 to iteratively coarsen the input graph and gradually summarize the information into the graph representation level by level. However, when performing downsampling (or node selection), most of these existing pooling methods [23, 32, 99] only focus on the importance of the nodes. In Figure 6.1, we illustrate one downsampling step by such a pooling operation, gPool [23], where the selected nodes are highlighted in red.

1 In this chapter, we deal with graph-focus tasks. Hence, when we mention GNNs, we refer to the GNN models that follow the general framework for graph-focus tasks described in Section 2.3.3.2.

Figure 6.1: An illustration of node selection based on importance by gPool [23] at the first layer. Red nodes are the selected ones. gPool might select nodes that are redundantly concentrated on some substructures while totally neglecting information of other substructures.

As shown in Figure 6.1, if only node importance is considered in the pooling operator, several limitations are naturally introduced. First, the selected nodes tend to be redundantly concentrated on some substructures while completely overlooking information in other substructures (e.g., the substructures around nodes 𝑣22 and 𝑣23). One possible reason is that gPool selects nodes according to a learnable importance score, which takes node features as input. Hence, connected nodes may have similar scores, as their node features tend to be similar. Furthermore, important nodes tend to connect with each other in many real-world graphs. For instance, in social networks, celebrities tend to make friends with other celebrities, and famous researchers usually cooperate with others who are also influential in the same area. The importance score can be viewed as a global measure, which ignores the local representativeness of nodes. The representativeness indicates the importance of a node for its local substructure. Specifically, the selection of one node will reduce the representativeness of other nodes in the same local substructure, though they may also have high global importance scores. Thus, we would rather choose those nodes that have relatively smaller importance scores but are more representative of other substructures, with the purpose of covering more substructures and preserving more information from the original graph. For example, in Figure 6.1, nodes 𝑣22 and 𝑣23 are preferred over node 𝑣36 or 𝑣10. Second, generating the coarsened graph and its corresponding node features only based on these selected nodes could result in an unrepresentative structure and lead to huge information loss. For example, gPool utilizes the original connections between the selected nodes to build the connections for the coarsened graph, which might render the resulting coarsened graph unconnected; furthermore, to generate the node features for the coarsened graph, gPool solely utilizes the node features of the selected nodes while discarding features on all the non-selected ones. Therefore, strategies are needed to better preserve the information of the original graph during the pooling process in both the stages of node selection and coarsened graph generation. In this chapter, we propose a novel graph pooling operator, RepPool. It can select nodes that are both globally important and can represent more substructures. Moreover, it utilizes both selected and non-selected nodes when generating the coarsened graph and its corresponding node features.

6.2 Related Work

This section briefly reviews related work, including general graph neural network models and various pooling methods. More details on GNNs and their applications can be found in the book Deep Learning on Graphs [53].

6.2.1 Graph Neural Networks

GNNs can be roughly divided into two categories – spectral [7, 15, 29, 48, 41] and spatial models [24, 28, 63, 85]. Spectral methods generally define convolutional operations in the Fourier domain based on graph spectral theory. Bruna et al.
[7] first proposed a variant of graph convolutional operation by utilizing spectral filters. This method is difficult to apply on large-scale graphs due to its high computational cost and non-locality property. To address this challenge, Defferrard et al. [15] tried to improve its efficiency by approximating K-polynomial filters. Kipf and Welling [41] proposed a simplified model [15] by approximating the first-order Chebyshev expansion. The goal of the spatial approaches is to define graph convolutional operations that can work on graphs directly. It generally passes information to the next layer via propagating and aggregating information from local neighborhood. Among them, Hamilton et al. [28] aimed to learn node representations by sampling and fixing the size of the neighborhood. Velickovi et al. [85] utilized the attention mechanism to operate in the entire neighborhood. GNNs have also been extended to various types of graphs, including signed graphs [16], dynamic graphs [50], and multi-dimensional graphs [54]. However, conventional GNNs mentioned above are flat: learning node representations for all nodes, and cannot work on the graph classification task directly. 6.2.2 Graph Pooling To apply GNNs to the graph classification task, a graph level representation is needed. Early common methods performed global pooling to combine node representations into one. For instance, 80 Duvenaud et al. [18] averaged all node representations as the graph representation. Vinyals et al. [87] proposed to learn graph representation by utilizing LSTMs [30]. Zhang et al. [98] sorted node features in a consistent order to learn graph representation with a fixed size. Nevertheless, these methods cannot generate hierarchical graph representations, leading to the loss of important information. There have been some recent works that tried to design hierarchical pooling procedures. DiffPool [97] introduced an assignment matrix to group nodes into a set of clusters. gPool [23] and AttPool [32] selected top-K nodes according to a proposed score to form a coarser subgraph for the next layer. EigenPool [55] introduced a pooling operator based on graph Fourier transform. ASAP [66] devised a cluster scoring procedure to select nodes by using a graph convolution method which can capture local extremum information. Besides sampling nodes to form a new subgraph, HGP-SL [99] also focused on constructing a refined graph structure for the subgraph. However, most of these methods ignore representativeness in the node selection procedure. In addition, when forming a new subgraph, some methods [23, 99, 66] only use the original data from selected nodes while discarding non-selected nodes, e.g., node features of selected nodes or connections between selected nodes, which might cause information loss. 6.3 The Proposed Model The proposed model mostly follows the general framework that we introduce in Section 2.3.3.2 with some tiny discrepancies. More specifically, instead of only using the output from the last pooling layer, we also combine representations learned from the intermediate layers for graph classification. In this Section, we first describe an overview of the framework and then introduce the proposed graph pooling operation in detail. 6.3.1 An Overview An overview of the proposed framework is shown in Figure 6.2a, which mainly consists of three types of modules, i.e., the GCN module, the pooling module and the prediction module. 
The GCN modules and the pooling modules are utilized during the process of learning the graph representations, while the prediction module takes the learned graph representation as input and makes the prediction. Specifically, during the graph representation learning process, one pooling module is stacked over one GCN module and this procedure repeats several times. In particular, the GCN module is used to refine the node representations and the pooling module is utilized to generate a coarsened graph and its corresponding node features, which are then used as the input to another GCN module. We call the combination of a GCN module and a pooling module a learning block of the proposed framework. At each learning block, a graph-level representation is generated by applying node-wise max-pooling to the outputs of the pooling module. Then a readout function is applied to these graph-level representations from different layers to generate a final graph representation, which goes through Multi-Layer Perceptron (MLP) layers to perform the graph classification task in the prediction module. In this way, the whole architecture can be optimized in an end-to-end manner.

Figure 6.2: An illustration of the proposed hierarchical graph classification architecture and the proposed pooling module. Panel (a) gives an overview of the graph classification architecture: stacked GCN modules and pooling modules, followed by a readout and an MLP that outputs the label. Panel (b) illustrates the pooling operator: the input graph, the node selection step, and the pooled graph. In the node selection process in (b), 4 nodes are selected step by step; the red color indicates that a node has been selected.

More details about each module are below:
• The GCN Module: It consists of 𝐾𝑔 subsequent GCN graph filtering layers (GCN layers), each taking the output of its previous GCN layer as input. Specifically, the input of the first GCN layer in a GCN module is either the input features (for the first learning block) or the output of the previous learning block.
• The Pooling Module: It contains two major components: 1) a component to select nodes from the input graph; and 2) a component to build a coarser graph by updating the feature matrix and the adjacency matrix utilizing the selected nodes.
• The Prediction Module: As shown in Figure 6.2, the proposed hierarchical architecture repeats the GCN module and the pooling module several times; thus we obtain a graph representation from each learning block. Following a common setting [97, 32], we employ a readout function to concatenate the graph representations from all learning blocks to form the finalized one. At last, the finalized graph representation is fed into MLP layers with a softmax classifier. The loss is defined as the cross-entropy loss over the graph labels. Through training the whole architecture, a predicted label can be obtained for each input graph.

Next, we give details about the components of the proposed pooling module. Without loss of generality, we use A ∈ R^{|V|×|V|} and X ∈ R^{|V|×𝑑} to denote the input graph for the pooling module, and A_new ∈ R^{𝑘×𝑘} and X_new ∈ R^{𝑘×𝑑} to denote the output coarser graph, which is also the input graph of the next GCN module, where 𝑘 is the number of selected nodes and 𝑘 < |V|.
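To make the data flow just described easier to follow, here is a minimal NumPy sketch of one forward pass through this architecture. It is an illustration under simplifying assumptions, not the implementation used in this chapter: the names gcn_layer, keep_first_nodes and hierarchical_forward are hypothetical, the MLP is collapsed to a single linear map, and the pooling step is a trivial stand-in (keeping the first nodes) for the RepPool operator defined in Sections 6.3.2 and 6.3.3.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN graph filtering layer: renormalized propagation followed by ReLU."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(d, d))       # D^{-1/2} (A + I) D^{-1/2}
    return np.maximum(A_norm @ X @ W, 0.0)

def keep_first_nodes(A, X, ratio=0.5):
    """Trivial placeholder for the pooling module: keeps the first k nodes so
    that the surrounding pipeline is runnable; not the RepPool operator."""
    k = max(1, int(ratio * A.shape[0]))
    return A[:k, :k], X[:k]

def hierarchical_forward(A, X, gcn_weights_per_block, mlp_W, pooling=keep_first_nodes):
    """Forward pass of the architecture in Figure 6.2(a): K_p learning blocks
    (GCN module + pooling module), node-wise max-pooling per block, a
    concatenation readout, and a final linear classifier."""
    block_reprs = []
    for block_weights in gcn_weights_per_block:    # one entry per learning block
        for W in block_weights:                    # the K_g GCN layers of the block
            X = gcn_layer(A, X, W)
        A, X = pooling(A, X)                       # coarsen the graph
        block_reprs.append(X.max(axis=0))          # node-wise max-pooling
    graph_repr = np.concatenate(block_reprs)       # readout: concatenation over blocks
    return graph_repr @ mlp_W                      # class scores (softmax/loss omitted)

# Toy run: 2 learning blocks with 2 GCN layers each on a random 10-node graph.
rng = np.random.default_rng(0)
A = (rng.random((10, 10)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                     # symmetric adjacency, no self-loops
X = rng.normal(size=(10, 8))
weights = [[rng.normal(size=(8, 16)), rng.normal(size=(16, 16))],
           [rng.normal(size=(16, 16)), rng.normal(size=(16, 16))]]
scores = hierarchical_forward(A, X, weights, mlp_W=rng.normal(size=(32, 2)))
```

Any pooling operator with the same (A, X) → (A_new, X_new) interface, including the one developed in the remainder of this section, could be dropped into the pooling argument.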
6.3.2 Node Selection

During the pooling process, to preserve the graph structure and feature information as much as possible, we need to select nodes with significant importance in the graph. However, if we only focus on selecting important nodes, we may end up selecting redundant nodes from some substructures while ignoring nodes from other substructures, which leads to a huge information loss. In fact, it is evident that nodes in the same substructure may have a mutual exclusion effect on their importance [91]. In other words, highly important nodes in one substructure could render others less important. To alleviate this issue, we introduce node representativeness in addition to importance, with the purpose of including more diverse important nodes and covering as many substructures as possible. Specifically, we propose an importance score and a representativeness score, which are combined to generate an overall score that serves as the criterion to perform the node selection. Next, we first introduce these two scores and then describe how we can utilize the combination of them to select nodes.

6.3.2.1 Node Importance

For the node importance, we introduce a global score to evaluate how much information a node contains. In general, if a node has a large degree and its neighbors contain rich information, namely its neighbors are relatively important, then this node is probably also important and contains rich information. Hence, we define the importance score of a node utilizing its own information as well as its neighbors’ information. In particular, the importance score 𝑠𝑖 for the node 𝑣𝑖 can be defined as follows:

    s_i = \mathrm{sigmoid}\left( \sum_{v_j \in \mathcal{N}(v_i)} \frac{1}{deg(v_j)} \, p(v_j) \right),        (6.1)

where N(𝑣𝑖) denotes the set of neighbors of the node 𝑣𝑖, deg(𝑣𝑗) is the degree of the node 𝑣𝑗, sigmoid() is the sigmoid function, and 𝑝(𝑣𝑗) is defined as follows:

    p(v_j) = \frac{\mathbf{x}_j \mathbf{p}}{\| \mathbf{p} \|},        (6.2)

with x𝑗 the input feature of node 𝑣𝑗 (i.e., x𝑗 is the 𝑗-th row of X) and p ∈ R^𝑑 a learnable vector to project x𝑗 to a scalar 𝑝(𝑣𝑗).

6.3.2.2 Node Representativeness

On the one hand, since only selecting nodes based on importance is likely to choose nodes that cover some substructures while overlooking other substructures, we aim to select various nodes from more substructures. On the other hand, highly important nodes in one substructure could render others less important. Both intuitions above suggest that, in addition to importance, we should select nodes representing more substructures (or representativeness). Correspondingly, we introduce the representativeness score. To cover more substructures and mitigate the influence of existing important nodes, we should select nodes distant from these existing ones. This paves the way for defining the representativeness score. Specifically, we select the nodes one by one, and the representativeness score needs to be generated during each selection step. Assume that we have already selected a set of nodes and now proceed to select the next node. The set of the indices of the selected nodes is denoted as 𝑖𝑑𝑥. Then, the representativeness score 𝛿𝑖 of a remaining candidate node 𝑣𝑖 is defined as follows:

    \delta_i = f\left( h(v_i, v_j), \; j \in idx \right),        (6.3)

where ℎ is a function to measure the distance between nodes 𝑣𝑖 and 𝑣𝑗, and 𝑓 is a function to define the distance between node 𝑣𝑖 and the nodes in 𝑖𝑑𝑥. In this chapter, we empirically find that the following choices of ℎ and 𝑓 work very well: (1) ℎ defines the distance between 𝑣𝑖 and 𝑣𝑗 as the length of the shortest path between them; and (2) 𝑓 defines the distance between node 𝑣𝑖 and the nodes in 𝑖𝑑𝑥 as the minimal pair-wise distance of 𝑣𝑖 to the nodes in 𝑖𝑑𝑥.
With these choices, 𝛿𝑖 can be rewritten as:

    \delta_i = \min_{j \in idx} h(v_i, v_j),        (6.4)

where, if a candidate node is close to a selected node, its representativeness score will be small; otherwise it will be relatively large. This helps to select nodes that are more likely to be far from the already selected nodes.

6.3.2.3 Node Selection Algorithm

We then obtain an overall selection score 𝛾𝑖 for 𝑣𝑖 by combining these two scores as follows:

    \gamma_i = g(s_i, \delta_i),        (6.5)

where 𝑔 is a function to combine these two sources. There are many possible settings for 𝑔, such as a linear combination, or we can even use a feedforward neural network to make 𝑔 learnable. However, in this work, we empirically set 𝛾𝑖 = 𝑠𝑖 ∗ 𝛿𝑖 and leave other settings for future work. The algorithm for the node selection process is given in Algorithm 4, where 𝑘 = 𝑛 ∗ 𝑟 is the number of nodes selected in the new subgraph with the predefined pooling ratio 𝑟. In line 3, we first choose the node with the maximal importance score as the first selected node. For the remaining nodes, we compute the current representativeness score, which is shown in line 5. In line 6, we combine both the global importance score and the representativeness score into an overall score. Then the next node is selected as the one with the maximal overall score, which balances node importance and representativeness. This process is repeated 𝑘 − 1 times. In this way, nodes are selected one by one in a greedy fashion. Meanwhile, by incorporating the influence of representativeness, the selected nodes tend to be located in various substructures of the graph.

Algorithm 4: Node Selection
Input: Score set s ∈ R^{𝑛×1}, sample number 𝑘
Output: Indexes of selected nodes 𝑖𝑑𝑥
1  Initialize 𝑖𝑑𝑥 ← set(), I ← indexes of all nodes;
2  𝑗 ← 1;
3  𝑖𝑑𝑥.add(arg max_{𝑖∈I}(s𝑖));
4  while 𝑗 ≤ 𝑘 − 1 do
5      𝛿𝑖 ← Eq. (6.4);
6      𝛾𝑖 ← Eq. (6.5);
7      𝑖𝑑𝑥.add(arg max_{𝑖∈I\𝑖𝑑𝑥}(𝛾𝑖));
8      𝑗 ← 𝑗 + 1;
9  return 𝑖𝑑𝑥

With the selected nodes, we will generate a new coarser graph; more details will be discussed in the following subsection. The new coarser graph may lose some information from the original graph, so if we calculate the node distance ℎ(𝑣𝑖, 𝑣𝑗) based on the new coarser graph, it may be inaccurate. Thus, to preserve this information and distinguish nodes from different substructures, we fix the distance ℎ(𝑣𝑖, 𝑣𝑗) to be the distance in the original graph during the pooling process. The benefits of this setting are two-fold. First, we preserve the original distance information during the pooling process. Second, we only need to compute the distances among nodes in the original graph once and can then reuse them at each learning block instead of computing the distances in each new coarser graph.
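The selection procedure above can be summarized in a short, self-contained sketch. The following NumPy code is only an illustration of Eqs. (6.1)–(6.5) and Algorithm 4 under simplifying assumptions: the projection vector p is treated as a given input (in the model it is learned end-to-end), unreachable node pairs are assigned a large finite distance, and the names shortest_path_lengths, importance_scores and select_nodes are hypothetical.

```python
import numpy as np
from collections import deque

def shortest_path_lengths(adj):
    """All-pairs shortest-path lengths by BFS on an unweighted graph.
    Unreachable pairs get the value n, so nodes in other components simply
    look maximally far away (one possible convention)."""
    n = adj.shape[0]
    dist = np.full((n, n), float(n))
    for s in range(n):
        dist[s, s] = 0.0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in np.nonzero(adj[u])[0]:
                if dist[s, v] > dist[s, u] + 1:
                    dist[s, v] = dist[s, u] + 1
                    queue.append(v)
    return dist

def importance_scores(adj, X, p):
    """Eqs. (6.1)-(6.2): s_i = sigmoid( sum_{v_j in N(v_i)} p(v_j) / deg(v_j) )."""
    proj = X @ p / np.linalg.norm(p)          # p(v_j) for every node
    deg = adj.sum(axis=1).clip(min=1)         # guard isolated nodes
    agg = adj @ (proj / deg)                  # sum over neighbors of p(v_j)/deg(v_j)
    return 1.0 / (1.0 + np.exp(-agg))         # sigmoid

def select_nodes(adj, X, p, k):
    """Greedy selection of Algorithm 4 with gamma_i = s_i * delta_i."""
    s = importance_scores(adj, X, p)
    dist = shortest_path_lengths(adj)         # distances on the original graph
    idx = [int(np.argmax(s))]                 # line 3: most important node first
    remaining = set(range(adj.shape[0])) - set(idx)
    for _ in range(k - 1):                    # repeated k - 1 times
        best, best_gamma = None, -np.inf
        for i in remaining:
            delta = dist[i, idx].min()        # Eq. (6.4): distance to closest selected node
            gamma = s[i] * delta              # Eq. (6.5) with g(s, delta) = s * delta
            if gamma > best_gamma:
                best, best_gamma = i, gamma
        idx.append(best)
        remaining.remove(best)
    return idx
```

Because 𝛾𝑖 = 𝑠𝑖 · 𝛿𝑖, a candidate adjacent to an already selected node (𝛿𝑖 = 1) needs a much larger importance score to be chosen than a candidate several hops away, which is exactly the behavior motivated at the beginning of this section.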
6.3.3 Coarser Graph Generation

With the selected nodes, a coarser graph with significant information extracted from the original graph can be built by learning the feature matrix and the adjacency matrix. However, if we only utilize the selected nodes to generate the adjacency matrix and the feature matrix, we may lose too much structural and feature information about the original graph, which is contained in the non-selected nodes. Hence, in this work, to better preserve such information, we propose to learn the adjacency matrix and the feature matrix utilizing both the selected nodes and the non-selected nodes. Next, we describe the two processes in detail.

6.3.3.1 Generating the Feature Matrix

To integrate node features from non-selected nodes, the new feature matrix is generated by aggregating information from the original graph. In particular, we learn an assignment matrix to assign each node to the selected ones. Formally, the assignment matrix B ∈ R^{𝑛×𝑘} is defined as follows:

    B = X W_b \hat{X}^T,        (6.6)
    \hat{X} = X(idx, :),        (6.7)

where W_b ∈ R^{𝑑×𝑑} is a learnable parameter and X̂ ∈ R^{𝑘×𝑑} consists of the 𝑘 row vectors extracted from X. B𝑖𝑗 indicates how likely 𝑣𝑖 is to be associated with the selected node 𝑣𝑗. With the soft-assignment matrix B, we can aggregate the information of the non-selected nodes to the selected nodes. Additionally, we consider that nodes with higher global scores should contribute more to the aggregation; namely, information from more important nodes is more worth preserving. Hence, we utilize the importance score as a gating system to control the information flow. The new features can be generated as follows:

    X_{new} = \left( \mathrm{softmax}(B \odot M) \right)^T \left( X \odot (s \mathbf{1}_d^T) \right),        (6.8)

where s ∈ R^𝑛 is a vector with elements defined in Eq. (6.1) (i.e., s𝑖 is the importance score for 𝑣𝑖), 1_𝑑 ∈ R^𝑑 is a vector of size 𝑑 with all elements being 1, softmax() is a row-wise softmax function to normalize the assignment scores for each node, ⊙ denotes the element-wise product, and M ∈ R^{𝑛×𝑘} is a masking matrix with M𝑖𝑗 = 1 if A(:, 𝑖𝑑𝑥)𝑖𝑗 > 0 and M𝑖𝑗 = 0 otherwise. We mask the soft-assignment matrix B so that the selected nodes can only aggregate information from those non-selected nodes that are connected to them in the original graph.

6.3.3.2 Generating the Adjacency Matrix

Similar to [97], the new adjacency matrix is generated based on the soft-assignment matrix B and the original adjacency matrix A to maintain the graph connectivity as:

    A_{new} = \left( \mathrm{softmax}(B \odot M) \right)^T A \left( \mathrm{softmax}(B \odot M) \right),        (6.9)

where we only aggregate information for the selected nodes from those non-selected nodes that are connected to them in the original graph; thus we mask the soft-assignment matrix B as in the generation of the feature matrix in the previous subsection. With X_new and A_new, a new coarser graph is obtained, which is fed into another GCN module in the next learning block. In this way, the original graph can be coarsened repeatedly.

6.3.4 Time Complexity Analysis

The computation of the proposed model depends on the GCN module and the pooling module. For a learning block, we use A ∈ R^{𝑛×𝑛} and X ∈ R^{𝑛×𝑑} to denote the input graph. The computation of a single GCN layer in the GCN module is O(|E|𝑑 + 𝑛𝑑²), where |E| is the number of edges. Then the time complexity of a GCN module is O(𝐾𝑔(|E|𝑑 + 𝑛𝑑²)), where 𝐾𝑔 is the number of GCN layers. In the pooling module, the computations of Eq. (6.1), (6.2), (6.4), (6.5), (6.6), (6.8) and (6.9) are O(|E|), O(𝑛𝑑), O(𝑛𝑘²), O(𝑛𝑘), O(𝑛𝑑𝑘 + 𝑘𝑑²), O(𝑛𝑘𝑑) and O(𝑘|E| + 𝑛𝑘²), respectively. Thus, the total time complexity of the pooling module is O(𝑛𝑑𝑘 + 𝑘𝑑² + 𝑘|E| + 𝑛𝑘²). The overall time complexity of the proposed model is O(𝐾𝑝(𝑛𝑑𝑘 + 𝑘𝑑² + 𝑘|E| + 𝑛𝑘² + 𝐾𝑔(|E|𝑑 + 𝑛𝑑²))), where 𝐾𝑝 is the number of learning blocks. Note that it costs O(𝑛|E|) to calculate the distances between all pairs of nodes in the graph. However, we do not include this in the time complexity of the proposed algorithm, as it can be pre-calculated and reused for all learning blocks and training epochs. Furthermore, the process of calculating the pair-wise distances in a graph can be easily parallelized.
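Before turning to the experiments, here is a minimal NumPy sketch of the coarsening step in Eqs. (6.6)–(6.9). It is illustrative rather than a reference implementation: W_b is passed in as a fixed array even though it is a learnable parameter in the model, the softmax is applied to B ⊙ M exactly as written (some implementations instead set masked entries to −∞ before the softmax), and the names row_softmax and coarsen are hypothetical.

```python
import numpy as np

def row_softmax(Z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def coarsen(A, X, idx, s, Wb):
    """Masked soft-assignment coarsening sketch for Eqs. (6.6)-(6.9).

    A:   (n, n) adjacency matrix of the input graph.
    X:   (n, d) node feature matrix.
    idx: list of the k selected node indices from Algorithm 4.
    s:   (n,) importance scores from Eq. (6.1), used as a gate.
    Wb:  (d, d) parameter of Eq. (6.6) (learnable in the model, fixed here).
    """
    X_hat = X[idx, :]                       # Eq. (6.7): features of the selected nodes
    B = X @ Wb @ X_hat.T                    # Eq. (6.6): soft assignment, shape (n, k)
    M = (A[:, idx] > 0).astype(float)       # mask: 1 only for original-graph edges to selected nodes
    S = row_softmax(B * M)                  # row-wise softmax of B ⊙ M, as in Eqs. (6.8)-(6.9)
    X_gated = X * s[:, None]                # gate the features: X ⊙ (s 1_d^T)
    X_new = S.T @ X_gated                   # Eq. (6.8): (k, d) features of the coarser graph
    A_new = S.T @ A @ S                     # Eq. (6.9): (k, k) adjacency of the coarser graph
    return A_new, X_new
```

The coarsened pair (A_new, X_new) then plays the role of the input graph for the next learning block.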
6.4 Experiment

In this section, we conduct experiments to demonstrate the effectiveness of the proposed framework. We first introduce the datasets we use and the state-of-the-art baselines we compare with. Then, we present the graph classification results together with some analysis. We further conduct an ablation study to demonstrate the importance of some of the major components in the proposed framework. Finally, to better understand the proposed framework, we perform a case study and a parameter analysis.

6.4.1 Datasets

In order to evaluate the performance of the proposed model, we conducted experiments on eight publicly available datasets. Statistics of the eight datasets are listed in Table 6.1, and next we give more details about them:

Table 6.1: Statistics of the eight datasets. Graphs: total number of graphs. Nodes (avg): the average number of nodes. Edges (avg): the average number of edges. #Class: the total number of classes.

Datasets       Graphs  Nodes (avg)  Edges (avg)  #Class
NCI1           4,110   29.87        32.30        2
NCI109         4,127   29.68        32.13        2
PROTEINS       1,113   39.06        72.82        2
D&D            1,178   284.32       715.66       2
MUTAG          188     17.93        19.79        2
Synthie        400     95.00        172.93       4
PTC            344     25.56        25.96        2
IMDB-BINARY    1,000   19.77        96.53        2

• NCI1 and NCI109 [88] are datasets for anti-cancer activity, where each graph is a chemical compound with nodes representing atoms and edges representing chemical bonds.
• PROTEINS and D&D [17, 5] are two protein graph datasets where nodes represent the amino acids and two nodes are connected if they are less than 6 Angstroms apart [99].
• MUTAG [77] is a dataset of mutagenic aromatic and heteroaromatic nitro compounds. The label indicates whether or not a compound has a mutagenic effect on a specific bacterium.
• Synthie [59] is a synthetic dataset which is subdivided into four classes.
• PTC [78, 36] is a dataset of molecules dealing with carcinogenicity.
• IMDB-BINARY [95] is a movie collaboration dataset which contains actor/actress and genre information of different movies on IMDB.

6.4.2 Baselines

We compare the proposed model with six state-of-the-art graph pooling methods with hierarchical architectures:
• DiffPool [97]: It groups nodes to generate supernodes in a differentiable way. These supernodes are then used to generate the coarsened graphs.
• gPool [23]: It selects nodes by utilizing a scalar projection operator to measure how much information nodes contain. These selected nodes are then utilized to build the coarsened graph.
• AttPool [32]: Improving upon gPool, it utilizes an attention mechanism to select significant nodes for the graph representation adaptively. Specifically, a local attention and a global attention mechanism are proposed, which leads to two variants of AttPool, i.e., AttPool-L and AttPool-G.
• HGP-SL [99]: It incorporates graph pooling and structure learning into a unified model to learn hierarchical graph representations. To preserve the integrity of the topological information, it introduces a structure learning mechanism to learn a refined graph structure.
• ASAP [66]: It is based on the idea of selecting important nodes to coarsen the graph, as proposed in gPool.
To measure the importance of each node, it utilizes a self-attention network along with a modified GNN formulation. It also learns a sparse soft cluster assignment for nodes to form new graphs.
• EigenPool [55]: It learns hierarchical graph representations based on the graph Fourier transform, which utilizes node features and local structures during the pooling process.

6.4.3 Experimental Settings

Following the setting of previous works [32, 23], each dataset is divided into 10 splits. We perform a 10-fold cross validation to evaluate the performance of the framework, where each time we use 9 splits as the training set and the remaining split as the test set. The average performance over the 10 runs is reported. For the baseline algorithms, we use the source code released by the authors and report their best performance. For the proposed model, we tune parameters by varying their values over different scales. For instance, the pooling ratio 𝑟 is set to be in the range of [0.1, 0.5], the learning rate 𝜇 is searched in {0.0005, 0.001, 0.002, 0.003, 0.004, 0.005, 0.01}, and the number of pooling layers (i.e., the number of learning blocks) 𝐾𝑝 and the number of convolutional layers in the GCN module 𝐾𝑔 are both selected in the range of [1, 4]. A two-layer MLP with 64 hidden units is adopted for all datasets, followed by a softmax layer. We use the cross-entropy loss for training and optimize the model utilizing the Adam optimizer [39]. More details about the parameter selection of the proposed model will be discussed later.

6.4.4 Graph Classification Performance

We compare the performance of RepPool with the baseline algorithms on the 8 datasets. The graph classification results are reported in Table 6.2.

Table 6.2: Comparisons of graph classification performance in terms of accuracy. Note that we highlight the best accuracy result in bold.

Methods     NCI1   NCI109  PROTEINS  D&D    MUTAG  Synthie  PTC    IMDB-BINARY
AttPool-L   80.05  80.58   79.01     79.96  90.55  63.75    75.62  69.00
AttPool-G   80.22  82.14   79.73     81.20  89.44  64.00    75.88  68.80
gPool       71.58  70.90   78.08     81.58  88.84  55.01    67.18  75.60
DiffPool    82.43  81.31   79.37     82.96  93.89  62.75    73.82  75.30
HGP-SL      78.07  76.13   78.83     79.45  90.00  54.50    70.88  67.70
EigenPool   82.92  81.43   80.09     81.04  91.11  64.50    72.94  56.20
ASAP        73.95  72.62   78.17     81.28  88.78  29.25    69.93  68.75
RepPool     82.00  82.28   80.72     82.06  95.56  64.50    78.24  75.70

We can make the following observations from the table:
• Overall, the proposed model obtains the highest accuracy on most of the datasets compared with the baseline algorithms. In particular, the proposed model achieves approximately 3.11% higher accuracy than the best baseline on the PTC dataset and 1.78% higher on the Synthie dataset.
• The proposed model consistently outperforms gPool and AttPool-G, which demonstrates the effectiveness of selecting nodes that are both important and representative.
• AttPool-L considers the local importance using a local attention mechanism. The proposed model achieves better performance than AttPool-L consistently, which indicates that the proposed framework captures the representativeness of nodes.
• Although EigenPool and DiffPool exhibit competitive performance on some datasets, RepPool achieves a 34.70% improvement over EigenPool on the IMDB-BINARY dataset and a 5.99% improvement over DiffPool on the PTC dataset. This verifies the effectiveness and robustness of RepPool in generating hierarchical graph representations across different datasets.

Table 6.3: Graph classification performance for RepPool and its variants.

Module            Synthie  PTC    IMDB-BINARY
RepPool_W/O_RA    56.50    73.82  66.90
RepPool_W/O_Rep   64.00    75.00  75.10
RepPool_W/O_Ass   61.50    74.71  67.50
RepPool           64.50    78.24  75.70

Figure 6.3: Visualization of node selection results at the first layer in RepPool and gPool. Panels (a) and (b) are outputs on the NCI1 dataset ((a) NCI1: RepPool, (b) NCI1: gPool) and panels (c) and (d) are outputs on the PTC dataset ((c) PTC: RepPool, (d) PTC: gPool). Each node is attached with a number indicating its index in the graph. 25% of the nodes are selected and highlighted in red.

6.4.5 Ablation Study

In this subsection, we validate the effectiveness of the proposed components in the proposed model.
Specifically, we investigate whether the inclusion of the node representativeness score plays an important role in the node selection process. We also investigate whether aggregating 93 information from non-selected nodes via the learned soft-assignment matrix is effective for the graph classification task. To achieve the goal, we define the following variants of the proposed framework: • RepPool_W/O_Rep: It denotes RepPool without the representativeness score combined to select nodes, i.e., only the node importance is considered. • RepPool_W/O_Ass: It is a variant of RepPool by excluding the soft-assignment matrix when updating the feature matrix, i.e., excluding contributions from non-selected nodes. • RepPool_W/O_RA: it denotes RepPool without representativeness and contributions from non-selected nodes. We conduct the ablation study on the following three datasets: Synthie, PTC, and IMDB- BINARY. The results are shown in Table 6.3. We can make the following observations from the table: • RepPool consistently outperforms RepPool_W/O_Rep on three datasets, which demonstrates the effectiveness of modeling representiveness in our framework. • RepPool_W/O_Ass performs not as good as RepPool, which indicates the necessity of including the soft-assignment to aggregate information from the non-selected nodes when generating the new feature matrix. • RepPool_W/O_RA obtains worse performance than both RepPool_W/O_Rep and Rep- Pool_W/O_Ass. Both components significantly contribute to the proposed framework. 6.4.6 Case Study During the node selection process, the proposed framework considers both importance and representativeness of nodes. In this subsection, we illustrate that the nodes selected by RepPool do not concentrate together but are distributed across different structures over the graph. Specifically, we also show the nodes selected by gPool to make a comparison. We randomly sample a graph from 94 the NCI1 dataset and another graph from the PTC dataset. For each graph, 25% nodes are selected from the first layer in the corresponding pooling operator. The node selection results are shown in Figure 6.3, where the selected nodes are highlighted in red. As shown in Figure 6.3, we can find that the proposed model RepPool potentially selects nodes distributing in different substructures that cover the whole graph. On the contrary, gPool is likely to select nodes that are concentrated in the same area and thus important information of other parts might be neglected. For example, in Figure 6.3b, most of selected nodes are in the substructure around node 𝑣 19 , but only one node 𝑣 3 is selected for the substructure around node 𝑣 24 . Moreover, node 𝑣 19 tends to share similar properties with other nodes around it. Therefore, redundant information might be provided by selecting them all. Similar situation occurs in Figure 6.3d. In addition, gPool tends to select margin nodes, e.g., nodes 𝑣 3 , 𝑣 6 , 𝑣 34 in Figure 6.3b and node 𝑣 19 in Figure 6.3d, that might contain less valuable information than representative nodes. Compared with gPool, RepPool tries to select the representative nodes. For instance in Figure 6.3a, RepPool selects the hub node 𝑣 24 that integrates information from other two hub nodes 𝑣 22 and 𝑣 23 , and thus could provide much valuable information. 
6.4.7 Parameter Analysis

We further study the sensitivity of several key hyper-parameters: the pooling ratio 𝑟, the learning rate 𝜇, the number of pooling layers (i.e., the number of learning blocks) 𝐾𝑝, and the number of convolutional layers in the GCN module 𝐾𝑔. We investigate how these parameters affect the graph classification task by varying their values over different scales on three datasets: Synthie, PTC, and IMDB-BINARY. Results are presented in Figure 6.4. We find similar trends across the different datasets. For instance, as shown in Figure 6.4a, the performance of RepPool tends to decrease when 𝑟 increases, which suggests that redundant information might be introduced when 𝑟 becomes larger. In most cases, RepPool exhibits the best performance when 𝐾𝑝 and 𝐾𝑔 are set to relatively small values, as shown in Figures 6.4c and 6.4d.

Figure 6.4: Graph classification results by varying four parameters on three datasets (Synthie, PTC, and IMDB-BINARY): (a) pooling ratio 𝑟, (b) learning rate 𝜇, (c) pooling layers 𝐾𝑝, and (d) GCN layers 𝐾𝑔. Each panel plots accuracy (%) against the corresponding parameter.

CHAPTER 7
CONCLUSION

Graph Neural Networks have shown their great capability in graph representation learning, thus facilitating numerous graph-related applications, including node classification, link prediction, and graph classification. In this dissertation, we study two essential operations for graph neural networks. For graph filtering operations, we mainly focus on investigating the existing architecture designs and providing insightful understandings of them from new perspectives. Compared with the graph filtering operations, the graph pooling operations are under-researched, and we propose two novel graph pooling operations from new perspectives, which broadens the scope of research on graph pooling operations. In this chapter, we summarize our research outcomes and identify some potential future research directions.

7.1 Summary

We first show how various graph filtering operations in representative GNN models including GCN, PPNP, APPNP and GAT can be unified mathematically as natural instances of graph denoising problems. Specifically, the aggregation operations in these models can be regarded as exactly or approximately addressing such denoising problems subject to Laplacian regularization. With these observations, we propose a general framework, Ugnn, which enables the design of new GNN models from the denoising perspective via regularizer design. As an example demonstrating the promise of this paradigm, we instantiate the Ugnn framework with a regularizer addressing adaptive local smoothness across nodes, a property prevalent in several real-world graphs, and propose and evaluate a suitable new GNN model, Ada-Ugnn. Due to this intrinsic connection to the graph signal denoising problem, most existing graph filtering operations inherently have “feature smoothing” effects. Hence, it is widely believed that, when applied to node classification tasks, GNN models inherently assume strong homophily and thus fail to generalize to graphs with heterophily. We revisit this popular notion in the literature, and show that it is not quite true.
We investigate one representative model, GCN, and show empirically that it can achieve good performance on heterophilous graphs under certain conditions, sometimes even better than more complex architectures specifically designed for such graphs (and indeed, we correct the record for some common benchmark datasets). We next analyze theoretically what conditions are required for GCNs to learn similar embeddings for nodes of the same class, facilitating the semi-supervised node classification task; put simply, when nodes with the same label share similar neighborhood patterns, and different classes have distinguishable patterns, GCN can achieve strong class separation, regardless of homophily or heterophily properties. Empirical analysis supports our theoretical findings. Finally, we revisit several existing homophilous and heterophilous semi-supervised node classification benchmark graphs, and investigate GCN’s empirical performance in light of our understanding.

When GNNs are applied to graph-level tasks, in addition to the graph filtering operation, a graph pooling operation is required to summarize the node representations into a graph representation. To preserve the structure of graphs, hierarchical graph pooling operations have been developed. They coarsen the graph step by step until a graph representation is obtained. There are mainly two streams of work in designing hierarchical graph pooling operations, depending on how they coarsen the graph. One of them coarsens a graph by downsampling nodes, while the other coarsens a graph by clustering similar nodes to form supernodes. These downsampled nodes (or clustered supernodes) serve as the nodes of the coarsened graph, and their representations and the connections between them are then generated. We investigate both types of graph pooling operations. We first design EigenPooling, a supernode-based pooling operation based on the local graph Fourier transform, which can extract subgraph information utilizing both the node features and the structure of the subgraph. We provide a theoretical analysis of this pooling operation from both local and global perspectives. The proposed graph pooling operation, together with a subgraph-based graph coarsening method, forms a pooling layer that can be incorporated into any graph neural network to hierarchically learn graph-level representations. We further propose a graph neural network framework, EigenGCN, by combining the proposed pooling layers with GCN graph filtering layers. Comprehensive graph classification experiments were conducted on 6 commonly used graph classification benchmarks. Our proposed framework achieves state-of-the-art performance on most of the data sets, which demonstrates its effectiveness.

We then investigate the downsampling-based graph pooling operations. We introduce RepPool, which focuses on preserving information about both node importance and node representativeness. Specifically, the devised pooling module consists of two components, i.e., node selection and coarser subgraph generation. In the first component, we propose an overall score, which combines a global score measuring node importance with a score accounting for representativeness. Based on the overall score, a greedy algorithm is designed to select important and representative nodes one by one. In the second component, a new coarser subgraph is formed by utilizing information of non-selected nodes to build the connections and node features.
The proposed RepPool module stacked over GCN convolutional layers forms a hierarchical architecture to conduct the graph classification task. Comprehensive experiments on various widely used benchmark datasets demonstrate the superiority of RepPool compared to a variety of state-of-the-art methods. 7.2 Future Directions Though GNNs work well in graph representation learning and have advanced many applications, there are a lot of fundamental questions remaining unanswered, which we briefly summarized as follows: “Why do GNNs work?”; “How do GNNs work?”; and “Do GNNs work properly?” Answering these questions are important for understanding GNNs better and applying them to applications comfortably. “Why do GNNs work?” To answer the first question, we plan to investigate the expressiveness of GNNs. We have connected graph filtering operations with a graph denoising problem and thus provide some insights for GNNs in learning node representations. It is promising to further expand this work and also investigate the expressiveness of GNNs for learning graph representations, especially for the graph pooling operations. Our investigation on how GNNs works on heterophilous graphs also provides some understandings on what GNNs can capture. However, it still has some 99 limitations. Firstly, our current theoretical analysis majorly focuses on the GCN model; we hope to extend this analysis in the future to more general models with various graph filtering operations. Secondly, we plan to continue characterization of sufficient distinguishability of neighborhood and feature distributions to facilitate best class separation for future investigation. Also, the notion of homophily in our current work is defined based on node labels. As a future direction, we would like to extend the analysis to more broad and general definition of homophily. “How do GNNs work?” Just like traditional deep models, GNNs lack interpretability. Without understanding their decision making mechanism, GNNs can not be fully trusted and applied to fields like healthcare, where understanding the prediction is as important as achieving high prediction performance. Hence, it is important to answer the second question by investigating the interpretability of GNN models. More specifically, the interpretability of both graph filtering operations and graph pooling operations should be investigated. “Do GNNs work properly?” Finally, the third question involves many properties of GNNs including robustness and fairness. Recent studies show that GNNs can be easily fooled by deliberately-crafted perturbations, called adversarial attacks. The vulnerability to adversarial attacks preevent GNNs from safety-critical applications. Therefore, it is of great significance to develop robust GNN algorithms to defend adversarial attacks. Specifically, it is important to study the robustness of both graph filtering operations and graph pooling operations. Furthermore, GNNs likely exhibit algorithmic unfairness with respect to certain groups as traditional deep models. Besides these potential unfairness issues inherited from traditional deep models, they also have specific issues induced by graph structures. For example, the extremely skewed degree distribution in scale-free networks may cause the algorithms to bias towards nodes with large degrees (e.g., celebrities in online social networks), thus being unfair to the low-degree nodes. More effort is required to develop GNNs (both graph filtering operations and graph pooling operations) with fairness. 
100 APPENDIX 101 APPENDIX A IS HOMOPHILY A NECESSITY FOR GRAPH NEURAL NETWORKS? A.1 Additional Details and Results for Section 4.3.2.1 A.1.1 Details on the generated graphs In this subsection, we present the details of the graphs that we generate in Section 4.3.2.1. Specifically, we detail the distributions {D𝑐 , 𝑐 ∈ C} used in the examples, the number of added edges 𝐾, and the homophily ratio ℎ. We provide the details for Cora and Citeseer in the following subsections. Note that the choices of distributions shown here are for illustrative purposes, to coincide with Observations 1 and 2. We adapted circulant matrix-like designs due to their simplicity. A.1.1.1 Cora There are 7 labels, which we denote as {0, 1, 2, 3, 4, 5, 6}. The distributions {D𝑐 , 𝑐 ∈ C} are listed as follows. The values of 𝐾 and the homophily ratio of their corresponding generated graphs are shown in Table A.1. D0 : Categorical ([0, 0.5, 0, 0, 0, 0, 0.5]), D1 : Categorical ([0.5, 0, 0.5, 0, 0, 0, 0]), D2 : Categorical ([0, 0.5, 0, 0.5, 0, 0, 0]), D3 : Categorical ([0, 0, 0.5, 0, 0.5, 0, 0]), D4 : Categorical ([0, 0, 0, 0.5, 0, 0.5, 0]), D5 : Categorical ([0, 0, 0, 0, 0.5, 0, 0.5]), D6 : Categorical ([0.5, 0, 0, 0, 0, 0.5, 0]). 102 Table A.1: 𝐾 and ℎ values for generated graphs based on Cora 𝐾 1003 2006 3009 4012 6018 8024 10030 12036 16048 20060 24072 ℎ 0.789 0.737 0.692 0.652 0.584 0.529 0.483 0.445 0.384 0.338 0.302 A.1.1.2 Citeseer There are 6 labels, which we denote as {0, 1, 2, 3, 4, 5}. The distributions {D𝑐 , 𝑐 ∈ C} are listed as follows. The values of 𝐾 and the homophily ratio of their corresponding generated graphs are shown in Table A.2. D0 : Categorical ([0, 0.5, 0, 0, 0, 0.5]), D1 : Categorical ([0.5, 0, 0.5, 0, 0, 0]), D2 : Categorical ([0, 0.5, 0, 0.5, 0, 0]), D3 : Categorical ([0, 0, 0.5, 0, 0.5, 0]), D4 : Categorical ([0, 0, 0, 0.5, 0, 0.5]), D5 : Categorical ([0.5, 0, 0, 0, 0.5, 0]). Table A.2: 𝐾 and ℎ values for generated graphs based on Citeseer 𝐾 1204 2408 3612 4816 7224 9632 12040 14448 19264 24080 28896 ℎ 0.735 0.675 0.625 0.581 0.510 0.454 0.410 0.373 0.316 0.275 0.243 A.1.2 Results on more datasets: Chameleon and Squirrel We conduct similar experiments as those in Section 4.3.2.1 based on Chameleon and Squirrel. Note that both Squirrel and Chameleon have 5 labels, which we denote as {0, 1, 2, 3, 4}. We pre-define the same distributions for them as listed as follows. The values of 𝐾 and the homophily ratio of their corresponding generated graphs based on Squirrel and Chameleon are shown in Table A.3 and Table A.4, respectively. 103 D0 : Categorical ([0, 0.5, 0, 0, 0.5]), D1 : Categorical ([0.5, 0, 0.5, 0, 0]), D2 : Categorical ([0, 0.5, 0, 0.5, 0]), D3 : Categorical ([0, 0, 0.5, 0, 0.5]), D4 : Categorical ([0.5, 0, 0, 0.5, 0]). (A.1) Table A.3: 𝐾 and ℎ values for generated graphs based on Squirrel 𝐾 12343 24686 37030 49374 61716 74060 86404 98746 111090 12434 135776 ℎ 0.232 0.225 0.219 0.207 0.201 0.196 0.191 0.186 0.181 0.178 0.174 Table A.4: 𝐾 and ℎ values for generated graphs based on Chameleon K 1932 3866 5798 7730 9964 11596 13528 15462 17394 19326 21260 Homo. 
Ratio 0.249 0.242 0.236 0.230 0.224 0.218 0.213 0.208 0.203 0.198 0.194

Figure A.1: Performance of GCN on synthetic graphs with various homophily ratios: (a) synthetic graphs generated from Squirrel; (b) synthetic graphs generated from Chameleon. Each panel plots test accuracy against the homophily ratio.

The performance of the GCN model on these two sets of graphs (generated from Squirrel and Chameleon) is shown in Figure A.1. The observations are similar to what we found for the graphs generated based on Cora and Citeseer in Section 4.3.2.1. Note that the original Squirrel and Chameleon graphs already have very low homophily, but we still observe a 𝑉-shape in the figures. This is because there are some neighborhood patterns in the original graphs which are distinct from those that we designed for addition. Hence, when we add edges in the early stage, the performance decreases. As we add more edges, the designed pattern starts to mask the original patterns and the performance starts to increase.

A.1.3 GCN’s Performance in the Limit (as 𝐾 → ∞)

In this subsection, we illustrate that as 𝐾 → ∞, the accuracy of the GCN model approaches 100%. Specifically, we set 𝐾 to a set of larger numbers, as listed in Table A.5. Ideally, when 𝐾 → ∞, the homophily ratio will approach 0 and the model performance will approach 100% (for diverse-enough D𝑐). The performance of the GCN model on the graphs described in Table A.1 and Table A.5 is shown in Figure A.2. Clearly, the performance of the GCN model approaches the maximum as we continue to increase 𝐾.

Figure A.2: The GCN’s performance approaches 100% as we keep adding edges following the specific patterns (test accuracy plotted against the homophily ratio).

Table A.5: Extended 𝐾 and ℎ values for generated graphs based on Cora
𝐾   28084  32096  36108  40120  44132  48144  52156  56168  80240
ℎ   0.272  0.248  0.228  0.211  0.196  0.183  0.172  0.162  0.120

A.2 Experimental Settings: Datasets and Models

A.2.1 Datasets

We give the number of nodes, edges, homophily ratios and distinct classes of the datasets used in this work in Table A.6.

Table A.6: Benchmark dataset summary statistics.
                        Cora   Citeseer  Pubmed  Chameleon  Squirrel  Actor  Cornell  Wisconsin  Texas
# Nodes (|V|)           2708   3327      19717   2277       5201      7600   183      251        183
# Edges (|E|)           5278   4676      44327   31421      198493    26752  280      466        295
Homophily Ratio (ℎ)     0.81   0.74      0.80    0.23       0.22      0.22   0.3      0.21       0.11
# Classes (|C|)         7      6         3       5          5         5      5        5          5

A.2.2 Models

• H2GCN [106] specifically designs several architectural components to deal with heterophilous graphs, including ego- and neighbor-embedding separation (skip connections), aggregation from higher-order neighborhoods, and combination of intermediate representations. We include two variants, H2GCN-1 and H2GCN-2, with 1 and 2 steps of aggregation, respectively. We adopt the code published by the authors at https://github.com/GemsLab/H2GCN.
• GPR-GNN [10] performs feature aggregation for multiple steps and then linearly combines the features aggregated with different numbers of steps. The weights of the linear combination are learned during the model training. Note that it also includes the original features before aggregation in the combination.
We adopt the code published by the authors at https://github.com/jianhao2016/GPRGNN.
• CPGNN [105] incorporates the label compatibility matrix to capture the connection information between classes. We adopt two variants of CPGNN that utilize MLP and ChebyNet [15] as base models to pre-calculate the compatibility matrix, respectively. We use two aggregation layers for both variants. We adopt the code published by the authors at https://github.com/GemsLab/CPGNN.

A.2.3 MLP+GCN

We implement a simple method to linearly combine the learned features from a GCN model and an MLP model. Let H^{(2)}_{GCN} ∈ R^{|V|×|C|} denote the output features from a 2-layer GCN model, where |V| and |C| denote the number of nodes and the number of classes, respectively. Similarly, we use H^{(2)}_{MLP} ∈ R^{|V|×|C|} to denote the output features from a 2-layer MLP model. We then combine them for classification. The process can be described as follows:

    H = \alpha \cdot H^{(2)}_{GCN} + (1 - \alpha) \cdot H^{(2)}_{MLP},        (A.2)

where 𝛼 is a hyperparameter balancing the two components. We then apply a row-wise softmax to each row of H to perform the classification.

A.2.4 Parameter Tuning and Resources Used

We tune parameters for GCN, GPR-GNN, CPGNN, and MLP+GCN from the following options:
• learning rate: {0.002, 0.005, 0.01, 0.05}
• weight decay: {5e−04, 5e−05, 5e−06, 5e−07, 5e−08, 1e−05, 0}
• dropout rate: {0, 0.2, 0.5, 0.8}.
For GPR-GNN, we use “PPR” as the initialization for the coefficients. For MLP+GCN, we tune 𝛼 from {0.2, 0.4, 0.6, 0.8, 1}. Note that the parameter search range encompasses the range adopted in the original papers to avoid unfairness issues. All experiments are run on a cluster equipped with Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz CPUs and NVIDIA Tesla K80 GPUs.

A.3 Heatmaps for Other Benchmarks

We provide the heatmaps for Citeseer and Pubmed in Figure A.3 and those for Squirrel, Texas, and Wisconsin in Figure A.4. For Citeseer and Pubmed, which have high homophily, the observations are similar to those of Cora as described in Section 4.4.2. For Squirrel, there are some patterns; the intra-class similarity is generally higher than the inter-class similarities. However, these patterns are not very strong, i.e., the differences between them are not very large, which means that the neighborhood patterns of different labels are not very distinguishable from each other. This substantiates the middling performance of GCN on Squirrel. Both Texas and Wisconsin are very small, with 183 nodes, 295 edges and 251 nodes, 466 edges, respectively. The average degree is extremely small (< 2). Hence, the similarities presented in the heatmaps may exhibit strong bias. In particular, in Texas, there is only 1 node with label 1. In Wisconsin, there are only 10 nodes with label 0.

Figure A.3: Neighborhood Similarity On Citeseer and Pubmed. On both graphs, the intra-class similarity is clearly higher than the inter-class ones.
108 1.0 1.0 1.0 0 0.61 0.63 0.64 0.65 0.66 0 0.65 0.03 0.61 0.17 0.21 0 0.75 0.25 0.27 0.36 0.51 0.8 0.8 0.8 1 0.63 0.7 0.7 0.71 0.72 1 0.03 1.0 0.1 0.33 0.31 1 0.25 0.75 0.71 0.64 0.35 0.6 0.6 0.6 2 0.64 0.7 0.75 0.74 0.75 2 0.61 0.1 0.68 0.31 0.41 2 0.27 0.71 0.72 0.63 0.45 0.4 0.4 0.4 3 0.65 0.71 0.74 0.78 0.77 3 0.17 0.33 0.31 0.58 0.58 3 0.36 0.64 0.63 0.69 0.51 0.2 0.2 0.2 4 0.66 0.72 0.75 0.77 0.8 4 0.21 0.31 0.41 0.58 0.64 4 0.51 0.35 0.45 0.51 0.72 0 1 2 3 4 0.0 0 1 2 3 4 0.0 0 1 2 3 4 0.0 (a) Squirrel (b) Texas (c) Wisconsin Figure A.4: Neighborhood Similarity On Squirrel, Texas and Wisconsin. The inter-class similarity on Squirrel is slightly higher than intra-class similarity for most classes, which substantiates the middling performance of GCN. Both Texas and Wisconsin are quite small, hence the cross-class similarity in these two graphs present severe bias and may not provide precise information about these graphs. 109 BIBLIOGRAPHY 110 BIBLIOGRAPHY [1] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 21–29. PMLR, 09–15 Jun 2019. [2] Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an. Graph convolutional encoders for syntax-aware neural machine translation. arXiv preprint arXiv:1704.04675, 2017. [3] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pages 4502–4510, 2016. [4] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018. [5] Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005. [6] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013. [7] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. [8] Siheng Chen, Yonina C Eldar, and Lingxiao Zhao. Graph unrolling networks: Interpretable neural networks for graph signal denoising. arXiv preprint arXiv:2006.01301, 2020. [9] Siheng Chen, Aliaksei Sandryhaila, José MF Moura, and Jelena Kovacevic. Signal denoising on graphs via graph filtering. In 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 872–876. IEEE, 2014. [10] Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations, 2021. [11] Fan RK Chung and Fan Chung Graham. Spectral graph theory. Number 92. American Mathematical Soc., 1997. 