ADVANCED OPERATORS FOR GRAPH NEURAL NETWORKS By Yao Ma A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy 2021

ABSTRACT
ADVANCED OPERATORS FOR GRAPH NEURAL NETWORKS By Yao Ma
Graphs, which encode pairwise relations between entities, are a kind of universal data structure for many real-world data, including social networks, transportation networks, and chemical molecules. Many important applications on these data can be treated as computational tasks on graphs. For example, friend recommendation in social networks can be regarded as a link prediction task, and predicting properties of chemical compounds can be treated as a graph classification task. An essential step to facilitate these tasks is to learn vector representations either for nodes or for entire graphs. Given its great success in representation learning for images and text, deep learning offers great promise for graphs. However, compared to images and text, deep learning on graphs faces immense challenges. Graphs are irregular: nodes are unordered, and each node can have a distinct number of neighbors. Thus, traditional deep learning models cannot be directly applied to graphs, which calls for dedicated efforts to design novel deep graph models. To help meet this pressing demand, we developed and investigated novel GNN algorithms that generalize deep learning techniques to graph-structured data. Two key operations in GNNs are the graph filtering operation, which aims to refine node representations, and the graph pooling operation, which aims to summarize node representations to obtain a graph representation. In this thesis, we provide deep understandings of, or develop novel algorithms for, these two operations from new perspectives. For graph filtering operations, we propose a unified framework from the perspective of graph signal denoising, which demonstrates that most existing graph filtering operations are conducting feature smoothing. Then, we further investigate what information typical graph filtering operations can capture and how they can be understood beyond feature smoothing. For graph pooling operations, we study the procedure of pooling from the perspective of graph spectral theory and present a novel graph pooling operation. We also propose a technique to downsample nodes considering both node importance and representativeness, which leads to another novel graph pooling operation.

To my parents, siblings, and entire family for their love and support.

ACKNOWLEDGEMENTS
First and foremost, I would like to thank my supervisor, Dr. Jiliang Tang, for his invaluable advice, inspiration, encouragement, and support during the course of my Ph.D. degree. I have learned so much from him throughout these years, ranging from finding and addressing challenging research problems, writing insightful papers, and giving attractive presentations, to efficiently mentoring students and managing a research lab. Dr. Tang is not only a supervisor in academia but also a sincere friend who guided me with his knowledge and experience in many aspects of life. Dr. Tang, I cannot thank you enough. I would like to express my deepest appreciation to my dissertation committee members, Dr. Anil K. Jain, Dr. Charu C. Aggarwal, Dr. Pang-Ning Tan, Dr. Kenneth A. Frank, and Dr. Nitesh V. Chawla, for their insightful comments and helpful suggestions.
I was fortunate to work as interns at JD.com and Snap Research with amazing colleagues and mentors: Zhaochun Ren, Ziheng Jiang, Ziyi Guo, Yuan He, and Dawei Yin from JD.com; and Neil Shah, Yozen Liu, Tong Zhao, and Leonardo Neves from Snap Research. I enjoyed three wonderful and productive summers with you. I am grateful to have had the pleasure and fortune of having supportive and encouraging friends and colleagues during my Ph.D. I am thankful to all my colleagues from the Data Science and Engineering Lab: Ibrahim Ahmed, Dr. Meznah Almutairy, Aaron Brookhouse, Jamell Dacon, Dr. Wenqi Fan, Bryan Hendryx, Dr. Jiangtao Huang, Wei Jin, Cassidy Johnson, Dr. Hamid Karimi, Dr. Tyler Derr, Juanhui Li, Yaxin Li, Haochen Liu, Hua Liu, Xiaorui Liu, Daniel K. Ofori-Dankwa, Namratha Shah, Hannah Striebel, Pegah Varghaei, Chenxing Wang, Wentao Wang, Xiaoyang Wang, Dr. Xin Wang, Yiqi Wang, Dr. Zhiwei Wang, Han Xu, Hansheng Zhao, Yongqi Han, Charles L. Green, and Dr. Xiangyu Zhao. In particular, thanks to my DSE collaborators: Ibrahim Ahmed, Dr. Wenqi Fan, Dr. Jiangtao Huang, Dr. Tyler Derr, Wei Jin, Juanhui Li, Haochen Liu, Hua Liu, Xiaorui Liu, Daniel K. Ofori-Dankwa, Wentao Wang, Xiaoyang Wang, Dr. Xin Wang, Yiqi Wang, Dr. Zhiwei Wang, Han Xu, Hansheng Zhao, Yongqi Han, Charles L. Green, and Dr. Xiangyu Zhao. v I would like to extend my sincere thanks to all my collaborators from outside the DSE Lab: Dr. Anil Jain, Dr. Charu K. Aggarwal, Dr. Joseph Thekinen, Hasan Bayhan, Dr. Sinem Mollaoglu, Eric Zhao, Dr. Lingfei Wu, Dr. Tengfei Ma, Dr. Yu Rong, Dr. Tingyang Xu, Dr. Junzhou Huang, Dr. Wenbing Huang, Dr. Hong Cheng, Dr. Zitao Liu, Dr. Hui Liu, Dr. Jianping Wang, Dr. Qing Li, Dr. Xianfeng Tang, Dr. Yuan He, Dr. Caiyan Jia, Dr. Jian Yu, Debayan Deb, Tong Zhao, Yozen Liu, and Dr. Neil Shah. I look forward to continued collaboration with you. Finally, I would like to thank my family, for their unconditional love and support. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Dissertation Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Computational Tasks on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 Semi-supervised Node Classification Task . . . . . . . . . . . . . . . . . . 7 2.2.2 Graph Classification Task . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.3 An Overview of Machine Learning Approaches for Tasks on Graphs . . . . 8 2.3 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.1 Graph Filtering Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.2 Graph Pooling Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.3 General Frameworks of Graph Neural Networks . . . . . . . . . . . . . . . 11 2.3.3.1 General GNN Framework for Node-focus Tasks . . . . . . . . . . 11 2.3.3.2 General GNN Framework for Graph-focus Tasks . . . . . . . . . 
12 CHAPTER 3 A UNIFIED VIEW ON GRAPH FILTERING OPERATIONS AS GRAPH SIGNAL DENOISING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2 Representative Graph Neural Networks Models . . . . . . . . . . . . . . . . . . . 14 3.2.1 Graph Convolutional Networks (GCN) . . . . . . . . . . . . . . . . . . . . 14 3.2.2 Graph Attention Networks (GAT) . . . . . . . . . . . . . . . . . . . . . . 15 3.2.3 Personalized Propagation of Neural Predictions (PPNP) . . . . . . . . . . . 15 3.3 Graph filtering operations as graph signal denoising . . . . . . . . . . . . . . . . . 16 3.3.1 Connection to PPNP layers and APPNP layers . . . . . . . . . . . . . . . . 17 3.3.2 Connection to GCN layers . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3.3 Connection to GAT layers . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.4 Ugnn: A Unified Framework for Graph Filtering Operations via Graph Signal Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5 Ada-Ugnn: Adaptive Local Smoothing with Ugnn . . . . . . . . . . . . . . . . . 23 3.5.1 Feature Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.5.2 Feature Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.6 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6.1 Node Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6.1.1 Datasets and Experimental Settings . . . . . . . . . . . . . . . . 26 vii 3.6.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.6.3 Node Classification Accuracy For Nodes with Low-level and High-level Local Label Smoothness . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.6.3.1 Performance Comparison . . . . . . . . . . . . . . . . . . . . . 29 3.6.4 Robustness Under Adversarial Attacks . . . . . . . . . . . . . . . . . . . . 31 3.6.5 Investigation on Number of Gradient Descent Steps in ADA-UGNN . . . . 34 3.7 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 CHAPTER 4 IS HOMOPHILY A NECESSITY FOR GRAPH NEURAL NETWORKS? . 36 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.1 Homophily in Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.2 Graph Convolutional Networks . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3 Graph Convolutional Networks under Heterophily . . . . . . . . . . . . . . . . . . 39 4.3.1 When does GCN learn similar embeddings? . . . . . . . . . . . . . . . . . 40 4.3.2 How does classification performance change over the homophily-heterophily spectrum? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.3.2.1 Targeted Heterophilous Edge Addition . . . . . . . . . . . . . . 48 4.3.2.2 Introducing noise to neighborhood distributions . . . . . . . . . . 50 4.4 Revisiting GCN’s Performance on Real-world Graphs . . . . . . . . . . . . . . . . 53 4.4.1 Evaluating Node Classification on Homophilous and Heterophilous Bench- marks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.4.2 Investigating GCN performance on Homophilous and Heterophilous Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.5 Related Work . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 CHAPTER 5 EIGENPOOLING: A GRAPH POOLING OPERATION FROM THE SPECTRAL GRAPH THEORY PERSPECTIVE . . . . . . . . . . . . . . . 57 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.2 The Proposed Framework – EigenGCN . . . . . . . . . . . . . . . . . . . . . . . . 58 5.2.1 An Overview of EigenGCN . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.2.2 Graph Coarsening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.3 Eigenvector-Based Pooling – EigenPooling . . . . . . . . . . . . . . . . . . 62 5.2.3.1 Graph Fourier Transform . . . . . . . . . . . . . . . . . . . . . . 63 5.2.3.2 The Design of Pooling Operations . . . . . . . . . . . . . . . . . 64 5.3 Theoretical Analysis of EigenPooling . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.3.1 A Local View of EigenPooling . . . . . . . . . . . . . . . . . . . . . . . . 66 5.3.2 A Global View of EigenPooling . . . . . . . . . . . . . . . . . . . . . . . 67 5.3.3 Permutation Invariance of EigenGCN . . . . . . . . . . . . . . . . . . . . 70 5.4 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4.2 Baselines and Experimental Settings . . . . . . . . . . . . . . . . . . . . . 72 5.4.3 Performance on Graph Classification . . . . . . . . . . . . . . . . . . . . . 73 5.4.4 Understanding Graph Signals . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 viii CHAPTER 6 REPPOOL: A GRAPH POOLING OPERATION WITH REPRESENTA- TIVENESS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.2.1 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.2.2 Graph Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.3 The Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.3.1 An Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.3.2 Node Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 6.3.2.1 Node Importance . . . . . . . . . . . . . . . . . . . . . . . . . . 84 6.3.2.2 Node Representativeness . . . . . . . . . . . . . . . . . . . . . . 85 6.3.2.3 Node Selection Algorithm . . . . . . . . . . . . . . . . . . . . . 86 6.3.3 Coarser Graph Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.3.3.1 Generating the Feature Matrix . . . . . . . . . . . . . . . . . . . 87 6.3.3.2 Generating The Adjacency Matrix . . . . . . . . . . . . . . . . . 88 6.3.4 Time Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.4 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.4.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 6.4.3 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.4.4 Graph Classification Performance . . . . . . . . . . . . . . . . . . . . . . 92 6.4.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . 93 6.4.6 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 6.4.7 Parameter Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 CHAPTER 7 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 ix LIST OF TABLES Table 3.1: Dataset summary statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Table 3.2: Node Classification Accuracy on Various Datasets . . . . . . . . . . . . . . . . 29 Table 4.1: SSNC accuracy on two heterophilous datasets: standard GCN outperforms all heterophily-specific models on these graphs. . . . . . . . . . . . . . . . . . . . . 39 Table 4.2: GNN node classification performance (accuracy) on homophilous and het- erophilous graphs. Standard GCN or MLP+GCN methods (top) outperform or achieve comparable performance to heterophily-specific methods (bottom). . . . 55 Table 5.1: Statistics of datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Table 5.2: Performance comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Table 6.1: Statistics of the eight datasets. Graphs: total number of graphs. Nodes (avg): the average number of nodes. Edges (avg): the average number of edges. #Class: the total number of classes. . . . . . . . . . . . . . . . . . . . . . . . . 90 Table 6.2: Comparisons of graph classification performance in terms of accuracy. Note that we highlight the best accuracy result in bold. . . . . . . . . . . . . . . . . . 91 Table 6.3: Graph classification performance for RepPool and its variants. . . . . . . . . . . 93 Table A.1: 𝐾 and ℎ values for generated graphs based on Cora . . . . . . . . . . . . . . . 103 Table A.2: 𝐾 and ℎ values for generated graphs based on Citeseer . . . . . . . . . . . . . 103 Table A.3: 𝐾 and ℎ values for generated graphs based on Squirrel . . . . . . . . . . . . . 104 Table A.4: 𝐾 and ℎ values for generated graphs based on Chameleon . . . . . . . . . . . . 104 Table A.5: Extended 𝐾 and ℎ values for generated graphs based on Cora . . . . . . . . . . 105 Table A.6: Benchmark dataset summary statistics. . . . . . . . . . . . . . . . . . . . . . . . 106 x LIST OF FIGURES Figure 2.1: An overview of machine learning approaches for node-focus tasks . . . . . . . . 7 Figure 2.2: An overview of machine learning approaches for graph-focus tasks . . . . . . . 8 Figure 2.3: An overview of graph filtering operation . . . . . . . . . . . . . . . . . . . . . 9 Figure 2.4: An overview of graph pooling operation . . . . . . . . . . . . . . . . . . . . . 10 Figure 2.5: A general GNN framework for node-focus tasks . . . . . . . . . . . . . . . . . 11 Figure 2.6: A general GNN framework for graph-focus tasks . . . . . . . . . . . . . . . . . 12 Figure 3.1: Distribution of local label smoothness (homophily) on different graph datasets: note the non-homogeneity of smoothness values. . . . . . . . . . . . . . . . . . 28 Figure 3.2: Accuracy for nodes with low and high local label smoothness. . . . . . . . . . . 30 Figure 3.3: Accuracy with low label smoothness and high label smoothness nodes. 
Note the consistent improvement in low smoothness cases, enabled by adaptive local smoothing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 3.4: Distribution of local label smoothness on Cora with various attack perturbation rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 3.5: Distribution of local label smoothness on Citeseer with various attack perturbation rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 3.6: Distribution of local label smoothness on Pubmed with various attack pertur- bation rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Figure 3.7: Robustness under adversarial attacks (node classification accuracy). . . . . . . . 33 Figure 3.8: ADA-UGNN performance (test accuracy) under different numbers of gradient steps (𝐾). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Figure 4.1: A perfectly heterophilous graph on which GCN achieves perfect class separation. . . . 37 Figure 4.2: Two nodes share the same neighborhood distribution; GCN learns equivalent embeddings for 𝑎 and 𝑏. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 xi Figure 4.3: SSNC accuracy of GCN on synthetic graphs with various homophily ratios, generated by adding heterophilous edges according to pre-defined target distributions on Cora (a) and Citeseer (b): performance first decreases, then increases, forming a 𝑉-shape (see the black lines). . . . . . . . . . . . . . . . . 48 Figure 4.4: Cross-class neighborhood similarity on synthetic graphs generated from Cora; all graphs have ℎ = 0.3, but with varying neighborhood distributions as per the noise parameter 𝛾: as 𝛾 increases, the intra-class similarity and the inter-class similarity become closer to each other, indicating that the distributions for different classes become more indistinguishable. . . . . . . . . . . . . . . . . . 50 Figure 4.5: Cross-class neighborhood similarity on homophilous graphs (a) and het- erophilous graphs (b-d). The deviations between inter and intra-class similari- ties substantiate GCN’s performance on these datasets. . . . . . . . . . . . . . 55 Figure 5.1: An illustrative example of the general framework . . . . . . . . . . . . . . . . . 60 Figure 5.2: Understanding graph signals . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Figure 6.1: An illustration of node selection based on the importance by gPool [23] at the first layer. Red nodes are the selected ones. gPool might select nodes that are redundantly concentrated on some substructures while totally neglecting information of other substructures. . . . . . . . . . . . . . . . . . . . . . . . . 78 Figure 6.2: An Illustration of the proposed hierarchical graph classification architecture and the proposed pooling module. In the node selection process in (b), 4 nodes are selected step by step. Note that the red color indicates that the node has been selected. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Figure 6.3: Visualization of node selection results at first layer in RepPool and gPool. Figures (a) and (b) are outputs on the NCI1 dataset and figures (c), (d) are outputs on the PTC dataset. Each node is attached with a number indicting its index in the graph. 25% nodes are selected and highlighted in red. . . . . . . . 
93 Figure 6.4: Graph classification results by varying four parameters: pooling ratio 𝑟, learning rate 𝜇, pooling layers 𝐾 𝑝 , and GCN layers 𝐾𝑔 on three datasets: Synthie, PTC, and IMDB-BINARY. . . . . . . . . . . . . . . . . . . . . . . . 96 Figure A.1: Performance of GCN on synthetic graphs with various homophily ratio. . . . . 104 Figure A.2: The GCN’s performance approaches 100% as we keep adding edges following the specific patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Figure A.3: Neighborhood Similarity On Citeseer and Pubmed. On both graphs, the intra-class similarity is clearly higher than the inter-class ones. . . . . . . . . . 108 xii Figure A.4: Neighborhood Similarity On Squirrel, Texas and Wisconsin. The inter- class similarity on Squirrel is slightly higher than intra-class similarity for most classes, which substantiates the middling performance of GCN. Both Texas and Wisconsin are quite small, hence the cross-class similarity in these two graphs present severe bias and may not provide precise information about these graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 xiii LIST OF ALGORITHMS 1 𝐾-layer GCN As Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2 Heterophilous Edge Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3 Heterophilous Edge Addition with Noise . . . . . . . . . . . . . . . . . . . . . . . 51 4 Node Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 xiv CHAPTER 1 INTRODUCTION Recent years have witnessed a rapid growth in our ability to generate and gather data from numerous platforms in the online world and various sensors in the physical world. Graphs provide a universal representation for most of these data such as online social networks, transportation networks, and chemical molecules, where entities are denoted as nodes while their relations are denoted as edges. Hence, many real-world applications can be regarded as computational tasks on graphs. For example, for online social network platform such as Facebook, it is important to always suggest potential friends for the users to connect with. This friend recommendation process is helpful to keep the users and the entire platform active. This task of friend recommendation can be treated as a link prediction task on a graph (the social network can be regarded as a graph). In the drug discovery industry, to discover effective drugs,the researchers need to evaluate millions of different structures of chemical compounds, which is prohibitively expensive in time and resources. Predicting the properties of the chemical compounds accurately may largely reduce the search space and thus help speed up the process. The task of predicting properties of chemical compounds can be regarded as a graph classification (or regression) task, where each chemical compound can be modeled as a graph. To facilitate these computational tasks on graphs, it is essential to learn proper representations for graphs. Deep learning techniques have shown their great capability in learning representations for images and text and hence offer great promise for graphs. However, generalizing deep learning to graphs faces tremendous challenges. Unlike images and text, which are usually well-structured as grids, graphs are irregular from many perspectives. Specifically, nodes in a graph do not have meaningful orderings and they can also have distinct number of nodes in their neighborhoods. 
Thus, traditional deep learning techniques designed for regular grid data are not directly applicable to graphs, which calls for dedicated efforts to develop algorithms for graph-structured data. Recently, graph neural networks (GNNs), which generalize deep learning techniques to graph-structured data, have been attracting increasing attention. Graph neural networks have shown great capability in learning both node representations and graph representations and have thus facilitated many of the tasks on graphs mentioned above. The essential building components of GNNs include the graph filtering operations and the graph pooling operations. The graph filtering operation aims to learn good node representations, and the graph pooling operation is typically required to learn a representation for the entire graph. In this thesis, we conduct research on both graph filtering operations and graph pooling operations from new perspectives. More specifically, we provide deep understandings of and insights into existing graph filtering operations and propose novel graph pooling operations to facilitate graph classification tasks.

1.1 Dissertation Contributions
The major contributions of this thesis are summarized as follows.
• Numerous recent works have proposed various GNN models with different designs in the graph filtering operations. We raise a natural question – is there an intrinsic connection among these feature filtering operations and their assumptions? The significance of a positive answer to this question is two-fold. Firstly, it offers a new perspective to create a uniform understanding of representative graph filtering operations. Secondly, it enables us to develop a general framework that not only provides a unified view on multiple existing representative graph filtering operations, but also has the potential to inspire new ones. Our first contribution is to propose such a unified framework for graph filtering operations from the perspective of graph signal denoising. In particular, we mathematically establish that many existing graph filtering operations can be unified as the process of exactly, and sometimes approximately, addressing a graph signal denoising problem. This connection suggests that these graph filtering operations share a unified goal: to ensure feature smoothness of connected nodes.
• Much of the recent literature posits that general GNNs work well on the semi-supervised node classification task (details about this task will be introduced later in Section 2.2) because of the homophily assumption ("like attracts like"), and fail to generalize to heterophilous graphs where dissimilar nodes connect. This also coincides with our unified understanding of graph filtering operations, namely that they ensure feature smoothness of connected nodes. Hence, recent works design new architectures to overcome such heterophily-related limitations, citing poor baseline performance and new architecture improvements on a few heterophilous graph benchmark datasets as evidence for this notion. However, we empirically find that standard graph convolutional networks (GCNs, a widely adopted kind of GNN model) can actually achieve better performance than such carefully designed methods on some commonly used heterophilous graphs. This motivates us to reconsider whether homophily is truly necessary for good GNN performance. We find that this claim is not quite true, and in fact, GCNs can achieve strong performance on heterophilous graphs under certain conditions.
We carefully characterize these conditions and provide supporting theoretical understanding and empirical observations. Finally, we examine existing heterophilous graph benchmarks and reconcile how the GCN (under)performs on them based on this understanding.
• To apply graph neural networks to the graph classification task, graph pooling operations that generate the graph representation from node representations are needed. A commonly adopted graph pooling operation is to globally combine the node representations. However, rich structural information is then overlooked. Thus, a hierarchical pooling procedure with multiple pooling layers is desired to preserve the graph structure during graph representation learning. There are some recent works on hierarchically learning graph representations analogous to the pooling step in conventional convolutional neural networks (CNNs). However, the local structural information is still largely neglected during the pooling process. We introduce a graph pooling operation based on spectral graph theory, which utilizes both the node features and the local structures during the pooling process. The method not only naturally incorporates node representations and graph structure, which helps achieve state-of-the-art performance, but also has nice theoretical properties such as perfect reconstruction and guaranteed information preservation.
• The spectral-based graph pooling operation proposed above groups nodes into "supernodes" to perform the graph coarsening process. Another stream of graph pooling operations coarsens a graph by downsampling nodes. Existing downsampling-based methods typically select nodes according to learnable importance scores, i.e., nodes with higher importance scores are more likely to be sampled. However, connected nodes tend to share similar importance scores, since their features are similar (especially after graph filtering operations), which results in suboptimal sampling of nodes. The sampled nodes cannot preserve the original graph information well, which causes huge information loss during graph pooling. To remedy this issue, we propose a graph pooling operation that considers not only node importance but also node "representativeness". Specifically, a node is considered less representative if a node in its local neighborhood has already been sampled. We design a representativeness score, which, together with the importance score, is utilized to sample the nodes. Equipped with this novel sampling approach, our proposed graph pooling operation has helped GNNs achieve state-of-the-art performance in graph classification.

1.2 Dissertation Structure
The remainder of this dissertation is organized as follows. In Chapter 2, we introduce some background on graphs and graph neural networks. In Chapter 3, we propose a unified framework for graph filtering operations from the perspective of graph signal denoising. It covers many existing graph filtering operations as special cases. The contents of this chapter were previously published as "A Unified View on Graph Neural Networks as Graph Signal Denoising" [52]. In Chapter 4, we reveal that homophily is not a necessity for graph neural networks (especially the graph filtering operation) to work well on the semi-supervised node classification task. We provide empirical observations and theoretical understandings to support our findings. This chapter was previously published as "Is Homophily a Necessity for Graph Neural Networks?" [51].
In Chapter 5, we propose a novel graph pooling operation based on spectral graph theory, which nicely incorporates graph structures and node features during the pooling process. This chapter was previously published as "Graph Convolutional Networks with EigenPooling" [55]. In Chapter 6, we propose a novel node downsampling technique, which considers both node importance and representativeness. The nodes sampled by this technique are spread across the entire graph and thus capture more comprehensive information about the original graph. We further propose a downsampling-based graph pooling operation built on this technique. The materials presented in this chapter were previously published as "Graph Pooling with Representativeness" [46]. Finally, we conclude the dissertation in Chapter 7.

CHAPTER 2
BACKGROUND
In this chapter, we briefly introduce the definition of graphs and the computational tasks on graphs. We also describe the general framework of graph neural networks, including a high-level discussion of the graph filtering operation and the graph pooling operation; we introduce the general GNN frameworks after discussing these two operations. This chapter provides the groundwork for introducing our deep insights into graph filtering operations and our novel methodologies for graph pooling operations.

2.1 Graphs
Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ denote a graph, where $\mathcal{V}$ and $\mathcal{E}$ are the sets of nodes and edges, respectively. The graph connection information can also be represented as an adjacency matrix $A \in \{0, 1\}^{|\mathcal{V}| \times |\mathcal{V}|}$, where $|\mathcal{V}|$ is the number of nodes in the graph. The $(i, j)$-th element of the adjacency matrix, $A[i, j]$, is equal to $1$ if and only if nodes $i$ and $j$ are adjacent to each other; otherwise $A[i, j] = 0$. Each node $i$ is associated with a $d$-dimensional vector of node features $x_i \in \mathbb{R}^d$; the features for all nodes can be summarized as a matrix $X \in \mathbb{R}^{|\mathcal{V}| \times d}$.

2.2 Computational Tasks on Graphs
There are numerous computational tasks on graphs, which can be broadly divided into two categories: node-focus tasks and graph-focus tasks. For node-focus tasks, we are usually given a single graph, and we care about the properties of each individual node in this graph. Link prediction and node classification (or regression) are two examples of node-focus tasks. In the link prediction task, we are interested in predicting how nodes will connect with other nodes in the future. In the node classification/regression task, we aim to infer certain properties of the nodes. For graph-focus tasks, we typically deal with a set of graphs, and our goal is to infer certain properties of each individual graph. In this thesis, we mainly deal with two representative tasks: 1) the semi-supervised node classification task, which is a node-focus task; and 2) the graph classification task, which is a graph-focus task. Next, we formally describe these two tasks.

2.2.1 Semi-supervised Node Classification Task
In the semi-supervised node classification (SSNC) task, we assume that each node $i$ is associated with a label $y_i \in \mathcal{C}$, where $\mathcal{C}$ denotes the set of labels. We also denote the set of nodes with a given label $c \in \mathcal{C}$ as $\mathcal{V}_c$. We assume that labels are only given for a subset of nodes $\mathcal{V}_{label} \subset \mathcal{V}$. The goal of semi-supervised node classification is to learn a mapping $f : \mathcal{V} \to \mathcal{C}$ utilizing the graph $\mathcal{G}$, the node features $X$, and the labels for nodes in $\mathcal{V}_{label}$. Note that the task is semi-supervised since all nodes (labeled or not) are involved in the training process.
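To make this notation concrete, the following minimal NumPy sketch (purely illustrative; the variable names are ours, not from this dissertation) builds the adjacency matrix, the feature matrix, and a partially labeled node set for a toy graph, mirroring the SSNC setting described above.

```python
import numpy as np

# A toy undirected graph with |V| = 4 nodes and edges (0,1), (1,2), (2,3).
num_nodes = 4
edges = [(0, 1), (1, 2), (2, 3)]

# Adjacency matrix A in {0,1}^{|V| x |V|}: A[i, j] = 1 iff i and j are adjacent.
A = np.zeros((num_nodes, num_nodes), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1  # undirected graph, so A is symmetric

# Node feature matrix X in R^{|V| x d} (random features for illustration, d = 3).
rng = np.random.default_rng(0)
X = rng.normal(size=(num_nodes, 3))

# Semi-supervised setting: labels are known only for the subset V_label.
labels = {0: "class_a", 3: "class_b"}                      # labeled nodes
unlabeled = [i for i in range(num_nodes) if i not in labels]
print(A)
print("labeled:", labels, "unlabeled:", unlabeled)
```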
2.2.2 Graph Classification Task
In the graph classification task, we are given a set of labeled graphs $\{(\mathcal{G}_i, y_i)\}$, where each graph $\mathcal{G}_i$ is associated with a label $y_i$, and we assume each graph is sampled from a distribution denoted as $\mathcal{D}$. The task of graph classification is to take a graph (its structure information and node features) as input and predict its corresponding label. More formally, we need to learn a mapping $f : \mathcal{D} \to \mathcal{C}$, where we use $\mathcal{C}$ to denote the set of labels. To achieve good prediction performance, it is important to extract useful information from both the graph structure and the node features.
Figure 2.1: An overview of machine learning approaches for node-focus tasks
Figure 2.2: An overview of machine learning approaches for graph-focus tasks
2.2.3 An Overview of Machine Learning Approaches for Tasks on Graphs
When utilizing machine learning approaches to tackle these computational tasks on graphs, a crucial step is to learn proper representations (vectors or matrices) for either nodes or graphs. Once we obtain these representations, we can apply suitable existing machine learning algorithms to them according to the task at hand. We illustrate an overview of machine learning approaches for node-focus and graph-focus tasks in Figure 2.1 and Figure 2.2, respectively. The representation learning stage in these two figures is the key to good performance on the downstream tasks. In this thesis, we study graph neural networks, a kind of deep learning method for graphs, for the purpose of representation learning.
2.3 Graph Neural Networks
Graph neural networks (GNNs) are a prominent approach for learning representations for graph-structured data. They can be adopted to learn both node representations and graph representations and are thus suitable for both node-focus and graph-focus tasks. There are two major operations in graph neural networks: the graph filtering operation and the graph pooling operation. The goal of the graph filtering operation is to generate node features that potentially benefit the downstream tasks, utilizing both the graph structure and the input node features. The graph pooling operation, on the other hand, is utilized to generate a single graph representation by summarizing information from the node representations. Next, we briefly introduce these two operations and then describe the general frameworks of GNNs built upon these two operations. When introducing these two operations, we consider a graph $\mathcal{G}$ with adjacency matrix $A$ and feature matrix $X$ as input.
Figure 2.3: An overview of graph filtering operation
2.3.1 Graph Filtering Operation
The graph filtering operation aims to refine node features to facilitate the downstream tasks. Specifically, as shown in Figure 2.3, a graph filtering operation takes the adjacency matrix $A \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{V}|}$ and the feature matrix $X \in \mathbb{R}^{|\mathcal{V}| \times d}$ as input. After the graph filtering process, the graph structure is the same as the input, i.e., the adjacency matrix is still $A$. On the other hand, new features $X_f \in \mathbb{R}^{|\mathcal{V}| \times d_f}$, with $d_f$ denoting the dimension of the output features, are produced by the filtering operation. Typically, when generating the new features, the graph filtering operation utilizes both the structure information and the input node features. Note that the dimension of the output features $d_f$ can be different from the dimension of the input features. When we adopt GNNs for learning node representations, we usually stack several graph filtering operations. We will introduce more details on the general frameworks in Section 2.3.3.
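As a concrete (illustrative) instance of this abstraction, the NumPy sketch below implements a simple GCN-style graph filter: it keeps the graph structure fixed and produces new node features by averaging linearly transformed features over each node's self-inclusive neighborhood. The function and variable names are ours, not the dissertation's.

```python
import numpy as np

def graph_filter(A, X, W):
    """A simple GCN-style graph filtering operation: the structure A is unchanged,
    and new features X_f = A_tilde @ (X @ W) are produced, where A_tilde is the
    degree-normalized adjacency matrix with self-loops."""
    A_hat = A + np.eye(A.shape[0])                                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_tilde = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]    # D^-1/2 A_hat D^-1/2
    return A_tilde @ (X @ W)                                       # aggregate transformed features

# Example: map 3-dimensional input features to 2-dimensional output features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # a path graph on 3 nodes
X = rng.normal(size=(3, 3))                                    # d = 3
W = rng.normal(size=(3, 2))                                    # d_f = 2
X_f = graph_filter(A, X, W)
print(X_f.shape)   # (3, 2): same nodes, new feature dimension d_f
```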
2.3.2 Graph Pooling Operation
Figure 2.4: An overview of graph pooling operation
The graph filtering operation is able to learn node representations. Hence, for node-focus tasks, the graph filtering operation is sufficient. However, for graph-focus tasks, we need to learn a single representation for the entire graph. Thus, the graph pooling operation, which can summarize node representations to produce a graph representation, is necessary. Typically, the graph pooling operation takes a graph $\mathcal{G}$ as input and generates a smaller graph as output. We use Figure 2.4 as a demonstration. As shown in Figure 2.4, the output graph is a coarsened version of the input one. Specifically, a new adjacency matrix $A_p \in \{0,1\}^{N_p \times N_p}$ and a new feature matrix $X_p \in \mathbb{R}^{N_p \times d_p}$ are generated by the graph pooling operation, where $N_p < |\mathcal{V}|$ denotes the number of nodes in the coarsened graph. Every time we apply a pooling operation to a graph, we obtain a coarsened graph in return. We can further apply graph filtering and graph pooling operations to the coarsened graph until we finally obtain a single graph representation for the entire graph. We will introduce more details on the general frameworks of graph neural networks in Section 2.3.3.
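One simple way to realize such a coarsening (again only an illustrative sketch, not one of the pooling operations proposed in this dissertation, which are developed in Chapters 5 and 6) is to group nodes into supernodes via a hard assignment and derive $A_p$ and $X_p$ from the assignment matrix:

```python
import numpy as np

def assignment_pooling(A, X, assign):
    """A minimal graph pooling sketch: nodes are grouped into supernodes by a hard
    assignment vector; the coarsened graph has adjacency A_p (supernodes are connected
    if any of their members are) and features X_p (mean of member features)."""
    n, n_p = A.shape[0], assign.max() + 1
    S = np.zeros((n, n_p))                        # assignment matrix S in {0,1}^{|V| x N_p}
    S[np.arange(n), assign] = 1.0
    A_p = (S.T @ A @ S > 0).astype(int)           # inter-supernode connectivity
    np.fill_diagonal(A_p, 0)                      # drop supernode self-connections
    X_p = (S.T @ X) / S.sum(axis=0)[:, None]      # average member features per supernode
    return A_p, X_p

# Example: coarsen a 4-node path graph into N_p = 2 supernodes {0,1} and {2,3}.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
X = np.arange(8, dtype=float).reshape(4, 2)
A_p, X_p = assignment_pooling(A, X, assign=np.array([0, 0, 1, 1]))
print(A_p)   # 2 x 2 coarsened adjacency
print(X_p)   # 2 x 2 coarsened feature matrix
```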
2.3.3 General Frameworks of Graph Neural Networks
In this section, we describe the general GNN frameworks for learning representations for graphs. As we discussed earlier in Section 2.2, we need to learn node representations for node-focus tasks and graph representations for graph-focus tasks. The processes for learning these two types of representations are different from each other. Specifically, to learn node representations, graph filtering operations are sufficient, while both graph filtering operations and graph pooling operations are usually required for learning graph representations. Next, we separately introduce the GNN frameworks for learning these two types of representations.
2.3.3.1 General GNN Framework for Node-focus Tasks
Figure 2.5: A general GNN framework for node-focus tasks
A general framework for learning node representations is shown in Figure 2.5. It consists of two types of layers: the graph filtering layer and the activation layer. The graph filtering layer is a learnable layer built around a graph filtering operation. The activation layer is simply a non-linear activation function, such as ReLU or sigmoid, as commonly adopted in deep learning architectures. As shown in Figure 2.5, the framework stacks several graph filtering layers and activation layers. Typically, a graph filtering layer is directly followed by an activation layer.
2.3.3.2 General GNN Framework for Graph-focus Tasks
Figure 2.6: A general GNN framework for graph-focus tasks
Besides graph filtering layers and activation layers, the general GNN framework for graph-focus tasks also involves graph pooling layers. As shown in Figure 2.6, the entire GNN framework can be divided into several learning blocks. Each block contains a single graph pooling layer, which follows a series of graph filtering layers and activation layers. Hence, after each block, a coarsened graph is generated. This coarsened graph serves as the input to the next learning block. A single graph representation is produced as the output of this framework.

CHAPTER 3
A UNIFIED VIEW ON GRAPH FILTERING OPERATIONS AS GRAPH SIGNAL DENOISING
3.1 Introduction
In this chapter, we study the graph filtering operations proposed in recent literature. Specifically, we provide a unified view on graph filtering operations from the perspective of graph signal denoising. As we discussed in Section 2.3.3, a GNN model for node-focus tasks is usually composed of several stacked graph filtering layers. Given a graph $\mathcal{G}$ with $|\mathcal{V}|$ nodes, a graph filtering layer typically contains a feature transformation and a feature aggregation operation:
Feature Transformation: $X'_{in} = f_{trans}(X_{in})$;  (3.1)
Feature Aggregation: $X_{out} = f_{agg}(X'_{in}; \mathcal{G})$,  (3.2)
where $X_{in} \in \mathbb{R}^{|\mathcal{V}| \times d_{in}}$ and $X_{out} \in \mathbb{R}^{|\mathcal{V}| \times d_{out}}$ denote the input and output features of the GNN layer, with $d_{in}$ and $d_{out}$ as the corresponding dimensions. Note that the non-linear activation is not included in Eq. (3.2) to ease the discussion. The feature transformation operation $f_{trans}(\cdot)$ transforms the input $X_{in}$ into $X'_{in} \in \mathbb{R}^{|\mathcal{V}| \times d_{out}}$; the feature aggregation operation $f_{agg}(\cdot; \mathcal{G})$ updates the node features by aggregating the transformed node features via the graph $\mathcal{G}$. In general, different graph filtering operations share similar feature transformations (often, a single feed-forward layer), while adopting different designs for the aggregation operation. We raise a natural question – is there an intrinsic connection among these feature aggregation operations and their assumptions? In this chapter, we aim to build the connection among the feature aggregation operations in representative GNN models, including GCN [40], GAT [84], PPNP, and APPNP [42].
In particular, we mathematically establish that the aggregation operations in these models can be unified as the process of exactly, and sometimes approximately, addressing a graph signal denoising problem with Laplacian regularization [80]. This connection suggests that these aggregation operations share a unified goal: to ensure feature smoothness of connected nodes. With this understanding, we propose a general framework, Ugnn, which not only provides a straightforward, unified view of many existing aggregation operations, but also suggests various promising directions to build new aggregation operations suitable for distinct applications. To demonstrate its potential, we build an instance of Ugnn called Ada-Ugnn, which is suited for handling varying smoothness properties across nodes, and conduct experiments to show its effectiveness.
3.2 Representative Graph Neural Network Models
In this section, we introduce notations for graphs and briefly summarize several representative GNN models; in particular, we introduce their graph filtering operations. A graph can be denoted as $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ and $\mathcal{E}$ are its corresponding node and edge sets. The connections in $\mathcal{G}$ can be represented as an adjacency matrix $A \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$, with $|\mathcal{V}|$ the number of nodes in the graph. The Laplacian matrix of the graph $\mathcal{G}$ is denoted as $L$ and is defined as $L = D - A$, where $D$ is the diagonal degree matrix corresponding to $A$. There are also normalized versions of the Laplacian matrix, such as $L = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ or $L = I - D^{-1} A$. In this work, we sometimes adopt different Laplacians to establish connections between different GNNs and the graph denoising problem, clarifying the choice in the text. In this section, we generally use $X_{in} \in \mathbb{R}^{|\mathcal{V}| \times d_{in}}$ and $X_{out} \in \mathbb{R}^{|\mathcal{V}| \times d_{out}}$ to denote the input and output features of graph filtering layers. Next, we describe a few representative GNN models.
3.2.1 Graph Convolutional Networks (GCN)
Following Eq. (3.2), a single graph filtering layer in GCN (we also refer to it as a GCN layer hereafter for convenience) [40] can be written as follows:
Feature Transformation: $X'_{in} = X_{in} W$; Feature Aggregation: $X_{out} = \tilde{A} X'_{in}$,  (3.3)
where $W \in \mathbb{R}^{d_{in} \times d_{out}}$ is a feature transformation matrix, and $\tilde{A}$ is a normalized adjacency matrix which includes self-loops, defined as follows:
$\tilde{A} = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$, with $\hat{A} = A + I$ and $\hat{D} = \mathrm{diag}\big(\sum_j \hat{A}_{1,j}, \ldots, \sum_j \hat{A}_{N,j}\big)$.  (3.4)
In practice, multiple GCN layers can be stacked, where each layer takes the output of its previous layer as input. Non-linear activation functions are included between consecutive layers.
3.2.2 Graph Attention Networks (GAT)
The graph filtering layer in graph attention networks (GAT) [84] adopts the same feature transformation operation as GCN in Eq. (3.3). For convenience, hereafter, we also refer to this graph filtering layer as a GAT layer. The feature aggregation operation in a GAT layer, written node-wise for a node $i$, is:
$X_{out}[i,:] = \sum_{j \in \tilde{\mathcal{N}}(i)} \alpha_{ij} X'_{in}[j,:]$, with $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \tilde{\mathcal{N}}(i)} \exp(e_{ik})}$,  (3.5)
where $\tilde{\mathcal{N}}(i) = \mathcal{N}(i) \cup \{i\}$ denotes the neighbors (self-inclusive) of node $i$, and $X_{out}[i,:]$ is the $i$-th row of the matrix $X_{out}$, i.e., the output node features of node $i$. In this aggregation operation, $\alpha_{ij}$ is a learnable attention score that differentiates the importance of distinct nodes in the neighborhood. Specifically, $\alpha_{ij}$ is a normalized form of $e_{ij}$, which is modeled as:
$e_{ij} = \mathrm{LeakyReLU}\big(\big[X'_{in}[i,:] \,\|\, X'_{in}[j,:]\big]\, a\big),$  (3.6)
where $[\cdot \| \cdot]$ denotes the concatenation operation and $a \in \mathbb{R}^{2d}$ is a learnable vector. Similar to GCN, a GAT model usually consists of multiple stacked GAT layers.
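For concreteness, here is a small, single-head NumPy sketch of the GAT aggregation in Eqs. (3.5)–(3.6), using the split form $a = [a_1; a_2]$ that also appears in Eq. (3.22) later in this chapter; the weights are random and untrained, and all names are illustrative.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_aggregate(A, X_in_prime, a1, a2):
    """Single-head GAT aggregation (Eqs. 3.5-3.6): pre-normalized scores e_ij from the
    transformed features, softmax-normalized over each self-inclusive neighborhood."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                                                    # self-inclusive neighborhoods
    e = leaky_relu((X_in_prime @ a1)[:, None] + (X_in_prime @ a2)[None, :])  # e_ij for all pairs
    e = np.where(A_hat > 0, e, -np.inf)                                      # keep only neighbors
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)                 # attention scores per node
    return alpha @ X_in_prime                                                # weighted neighborhood sum

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X_in_prime = rng.normal(size=(3, 4))            # transformed features X'_in (d = 4)
a1, a2 = rng.normal(size=4), rng.normal(size=4)
X_out = gat_aggregate(A, X_in_prime, a1, a2)
print(X_out.shape)                              # (3, 4)
```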
3.2.3 Personalized Propagation of Neural Predictions (PPNP)
Personalized Propagation of Neural Predictions (PPNP) [42] introduces an aggregation operation based on Personalized PageRank (PPR). Specifically, the PPR matrix is defined as $\alpha(I - (1-\alpha)\tilde{A})^{-1}$, where $\alpha \in (0,1)$ is a hyper-parameter. The $ij$-th element of the PPR matrix specifies the influence of node $i$ on node $j$. The feature transformation operation is modeled as a multi-layer perceptron (MLP). The PPNP model can be written in the form of Eq. (3.2) as follows:
Feature Transformation: $X'_{in} = \mathrm{MLP}(X_{in})$; Feature Aggregation: $X_{out} = \alpha(I - (1-\alpha)\tilde{A})^{-1} X'_{in}$.  (3.7)
Unlike GCN and GAT, PPNP consists of only a single feature aggregation layer, but with a potentially deep feature transformation. Since the matrix inverse in Eq. (3.7) is costly, [42] also introduces a practical, approximate version of PPNP, called APPNP, where the aggregation operation is performed iteratively as:
$X^{(k)}_{out} = (1-\alpha)\tilde{A} X^{(k-1)}_{out} + \alpha X'_{in}, \quad k = 1, \ldots, K,$  (3.8)
where $X^{(0)}_{out} = X'_{in}$ and $X^{(K)}_{out}$ is the output of the feature aggregation operation. As proved in [42], $X^{(K)}_{out}$ converges to the solution obtained by PPNP, i.e., $X_{out}$ in Eq. (3.7). For convenience, we also call the graph filtering layers in the PPNP and APPNP models the PPNP layer and the APPNP layer, respectively.
3.3 Graph Filtering Operations as Graph Signal Denoising
In this section, we aim to establish the connections between the introduced graph filtering layers and a graph signal denoising problem with Laplacian regularization. We first introduce the problem.
Problem 1 (Graph Signal Denoising with Laplacian Regularization). Suppose that we are given a noisy signal $X \in \mathbb{R}^{|\mathcal{V}| \times d}$ on a graph $\mathcal{G}$. The goal is to recover a clean signal $F \in \mathbb{R}^{|\mathcal{V}| \times d}$, assumed to be smooth over $\mathcal{G}$, by solving the following optimization problem:
$\arg\min_{F} \; \mathcal{L} = \|F - X\|_F^2 + c \cdot \mathrm{tr}(F^\top L F).$  (3.9)
The first term guides $F$ to be close to $X$, while the second term $\mathrm{tr}(F^\top L F)$ is the Laplacian regularization that guides the smoothness of $F$ over the graph; $c > 0$ is a balancing constant. Assuming we adopt the unnormalized version of the Laplacian matrix, $L = D - A$ (with the adjacency matrix $A$ assumed to be binary), the second term in Eq. (3.9) can be written in an edge-centric or a node-centric way as:
edge-centric: $c \sum_{(i,j) \in \mathcal{E}} \|F[i,:] - F[j,:]\|_2^2$;  node-centric: $\frac{c}{2} \sum_{i \in \mathcal{V}} \sum_{j \in \tilde{\mathcal{N}}(i)} \|F[i,:] - F[j,:]\|_2^2$.  (3.10)
Clearly, from the edge-centric view, the regularization term measures the global smoothness of $F$, which is small when connected nodes share similar features. On the other hand, we can view the term $\sum_{j \in \tilde{\mathcal{N}}(i)} \|F[i,:] - F[j,:]\|_2^2$ as a local smoothness measure for node $i$, as it measures the difference between node $i$ and all its neighbors. The regularization term can then be regarded as a summation of local smoothness over all nodes. Note that the adjacency matrix $A$ is assumed to be binary when deriving Eq. (3.10). Similar formulations can also be derived for other types of Laplacian matrices. In the following subsections, we demonstrate the connections between the aggregation operations in various GNN layers and the graph signal denoising problem.
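As a quick numerical sanity check on Eq. (3.10) (purely illustrative, with made-up data), the sketch below evaluates the objective of Problem 1 with the unnormalized Laplacian and confirms that the trace regularizer equals the edge-centric sum of squared feature differences.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])   # binary adjacency
L = np.diag(A.sum(axis=1)) - A                    # unnormalized Laplacian L = D - A
X = rng.normal(size=(4, 2))                       # noisy signal
F = rng.normal(size=(4, 2))                       # candidate clean signal
c = 0.5

# Objective of Problem 1: fidelity term + Laplacian regularization.
objective = np.sum((F - X) ** 2) + c * np.trace(F.T @ L @ F)

# Edge-centric form of the regularizer (each undirected edge counted once).
edges = [(i, j) for i in range(4) for j in range(i + 1, 4) if A[i, j]]
edge_sum = c * sum(np.sum((F[i] - F[j]) ** 2) for i, j in edges)

print(np.isclose(c * np.trace(F.T @ L @ F), edge_sum))   # True: the two forms agree
print(objective)
```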
3.3.1 Connection to PPNP Layers and APPNP Layers
In this subsection, we establish the connection between the graph signal denoising problem (3.9) and the aggregation operations in the PPNP layer and the APPNP layer in Theorem 1 and Theorem 2, respectively.
Theorem 1. When we adopt the normalized Laplacian matrix $L = I - \tilde{A}$, with $\tilde{A}$ defined in Eq. (3.4), the feature aggregation operation in a PPNP layer (Eq. (3.7)) can be regarded as exactly solving the graph signal denoising problem (3.9) with $X'_{in}$ as the input noisy signal and $c = \frac{1}{\alpha} - 1$.
Proof. Note that the objective in Eq. (3.9) is convex. Hence, the closed-form solution $F^*$ that exactly solves the graph signal denoising problem can be obtained by setting its derivative to $0$:
$\frac{\partial \mathcal{L}}{\partial F} = 2(F - X) + 2cLF = 0 \;\Rightarrow\; F^* = (I + cL)^{-1} X.$  (3.11)
Given $L = I - \tilde{A}$, $F^*$ can be reformulated as:
$F^* = (I + cL)^{-1} X = \big(I + c(I - \tilde{A})\big)^{-1} X = \frac{1}{1+c}\Big(I - \frac{c}{1+c}\tilde{A}\Big)^{-1} X.$  (3.12)
The feature aggregation operation in Eq. (3.7) is equivalent to the closed-form solution in Eq. (3.12) when we set $\alpha = 1/(1+c)$ and $X = X'_{in}$. This completes the proof.
Theorem 2. When we adopt the normalized Laplacian matrix $L = I - \tilde{A}$, the feature aggregation operation in an APPNP layer (Eq. (3.8)) approximately solves the graph signal denoising problem (3.9) by iterative gradient descent with $X'_{in}$ as the input noisy signal, $c = \frac{1}{\alpha} - 1$, and stepsize $b = \frac{1}{2+2c}$.
Proof. To solve the graph signal denoising problem (3.9), we take the iterative gradient method with stepsize $b$. Specifically, the $k$-th gradient descent step on problem (3.9) is:
$F^{(k)} \leftarrow F^{(k-1)} - b \cdot \frac{\partial \mathcal{L}}{\partial F}\Big|_{F = F^{(k-1)}} = (1 - 2b - 2bc) F^{(k-1)} + 2b X + 2bc \tilde{A} F^{(k-1)},$  (3.13)
where $F^{(0)} = X$. When we set the stepsize $b = \frac{1}{2+2c}$, we have the following iterative steps:
$F^{(k)} \leftarrow \frac{1}{1+c} X + \frac{c}{1+c} \tilde{A} F^{(k-1)}, \quad k = 1, \ldots, K,$  (3.14)
which is equivalent to the iterative aggregation operation of the APPNP model in Eq. (3.8) with $X = X'_{in}$ and $\alpha = \frac{1}{1+c}$. This completes the proof.
These two connections provide a new explanation of the hyper-parameter $\alpha$ in PPNP and APPNP from the graph signal denoising perspective. Specifically, a smaller $\alpha$ indicates a larger $c$, which means the obtained $X_{out}$ is enforced to be smoother over the graph.
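Theorems 1 and 2 are easy to verify numerically. The following sketch (illustrative data and names) checks that the closed-form denoising solution of Eq. (3.11), the PPNP aggregation of Eq. (3.7), and sufficiently many APPNP iterations of Eq. (3.8) all coincide when $\alpha = 1/(1+c)$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)

# Normalized adjacency with self-loops, A_tilde = D_hat^{-1/2} A_hat D_hat^{-1/2}.
A_hat = A + np.eye(4)
d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_tilde = d_inv_sqrt @ A_hat @ d_inv_sqrt
L = np.eye(4) - A_tilde                           # normalized Laplacian L = I - A_tilde

X = rng.normal(size=(4, 3))                       # transformed features X'_in (noisy signal)
c = 3.0
alpha = 1.0 / (1.0 + c)                           # Theorem 1: alpha = 1 / (1 + c)

denoised = np.linalg.inv(np.eye(4) + c * L) @ X                       # closed form, Eq. (3.11)
ppnp = alpha * np.linalg.inv(np.eye(4) - (1 - alpha) * A_tilde) @ X   # PPNP, Eq. (3.7)

F = X.copy()                                      # APPNP iterations, Eq. (3.8)
for _ in range(200):
    F = (1 - alpha) * A_tilde @ F + alpha * X

print(np.allclose(denoised, ppnp))                # True (Theorem 1)
print(np.allclose(F, ppnp, atol=1e-6))            # True (Theorem 2, after convergence)
```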
3.3.3 Connection to GAT layers

To establish the connection between graph signal denoising and GAT layers [84], in this subsection, we adopt an unnormalized version of the Laplacian. It is defined based on the adjacency matrix with self-loops Â, i.e., L = D̂ − Â, with D̂ denoting the diagonal degree matrix of Â. Then, the denoising problem in Eq. (3.9) can be rewritten from a node-centric view as:

    arg min_F L = Σ_{i∈V} ‖F[i,:] − X[i,:]‖_2^2 + (c/2) Σ_{i∈V} Σ_{j∈Ñ(i)} ‖F[i,:] − F[j,:]‖_2^2,    (3.16)

where Ñ(i) = N(i) ∪ {i} denotes the neighbors (self-inclusive) of node i. In Eq. (3.16), the constant c is shared by all nodes, which indicates that the same level of local smoothness is enforced on all nodes. However, nodes in a real-world graph can have varied local smoothness. For nodes with low local smoothness, we should impose a relatively smaller c, while for nodes with higher local smoothness, we need a larger c. Hence, instead of a single c as in Eq. (3.16), we can consider a node-dependent c_i for each node i. The optimization problem in Eq. (3.16) can then be adjusted as:

    arg min_F L = Σ_{i∈V} ‖F[i,:] − X[i,:]‖_2^2 + (1/2) Σ_{i∈V} c_i Σ_{j∈Ñ(i)} ‖F[i,:] − F[j,:]‖_2^2.    (3.17)

We next show that the aggregation operation in GAT layers is closely connected to an approximate solution of problem (3.17) with the help of the following theorem.

Theorem 4. With adaptive stepsize b_i = 1 / Σ_{j∈Ñ(i)} (c_i + c_j) for each node i, the process of taking one step of gradient descent from X to solve problem (3.17) can be described as follows:

    F[i,:] ← b_i Σ_{j∈Ñ(i)} (c_i + c_j) X[j,:].    (3.18)

Proof. The gradient of the optimization problem in Eq. (3.17) with respect to F, focusing on a node i, can be formulated as:

    ∂L/∂F[i,:] = 2(F[i,:] − X[i,:]) + Σ_{j∈Ñ(i)} (c_i + c_j)(F[i,:] − F[j,:]),    (3.19)

where c_j appears in the second term since i is also in the neighborhood of j. Then, the gradient at X is

    ∂L/∂F[i,:] |_{F[i,:]=X[i,:]} = Σ_{j∈Ñ(i)} (c_i + c_j)(X[i,:] − X[j,:]).    (3.20)

Thus, taking one step of gradient descent starting from X with stepsize b can be described as:

    F[i,:] ← X[i,:] − b · ∂L/∂F[i,:] |_{F[i,:]=X[i,:]}
           = (1 − b Σ_{j∈Ñ(i)} (c_i + c_j)) X[i,:] + b Σ_{j∈Ñ(i)} (c_i + c_j) X[j,:].    (3.21)

Given b = b_i = 1 / Σ_{j∈Ñ(i)} (c_i + c_j), Eq. (3.21) can be rewritten as F[i,:] ← b_i Σ_{j∈Ñ(i)} (c_i + c_j) X[j,:], which completes the proof.

Eq. (3.18) resembles the aggregation operation of GAT layers in Eq. (3.5) if we treat b_i (c_i + c_j) as the attention score α_ij. Note that Σ_{j∈Ñ(i)} (c_i + c_j) = 1/b_i for all i ∈ V. Hence, (c_i + c_j) can be regarded as the pre-normalized attention score and 1/b_i as the normalization constant. We further compare b_i (c_i + c_j) with α_ij by investigating the formulation of e_ij in Eq. (3.6). Eq. (3.6) can be rewritten as:

    e_ij = LeakyReLU(X'_in[i,:] a_1 + X'_in[j,:] a_2),    (3.22)

where a_1 ∈ R^d and a_2 ∈ R^d are learnable column vectors, which can be concatenated to form a in Eq. (3.6). Comparing e_ij with (c_i + c_j), we find that they take a similar form. Specifically, X'_in[i,:] a_1 and X'_in[j,:] a_2 can be regarded as approximations of c_i and c_j, respectively. The difference between b_i (c_i + c_j) and α_ij is that the normalization in Eq. (3.18) for b_i (c_i + c_j) is achieved via summation rather than the softmax used in Eq. (3.5) for α_ij.
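To make Eq. (3.18) concrete, here is a short sketch of the adaptive-stepsize gradient step on a toy graph; the per-node smoothness constants c_i are chosen arbitrarily here, whereas in a GAT layer they would effectively come from the learned attention scores.

```python
import numpy as np

# One gradient step of problem (3.17) with adaptive stepsize b_i, i.e., Eq. (3.18).
rng = np.random.default_rng(2)
n, d = 5, 3
A = np.array([[0,1,1,0,0],[1,0,1,1,0],[1,1,0,0,1],[0,1,0,0,1],[0,0,1,1,0]], float)
A_hat = A + np.eye(n)                 # self-inclusive neighborhoods
X = rng.normal(size=(n, d))
c = rng.uniform(0.5, 2.0, size=n)     # node-dependent smoothness constants

F = np.zeros_like(X)
for i in range(n):
    nbrs = np.flatnonzero(A_hat[i])
    w = c[i] + c[nbrs]                # pre-normalized "attention" scores c_i + c_j
    b_i = 1.0 / w.sum()               # adaptive stepsize, also the normalizer
    F[i] = b_i * (w[:, None] * X[nbrs]).sum(axis=0)

print(F.round(3))                     # each row is a convex combination of neighbor features
```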
Note that since GAT layers make c_i and c_j learnable, they also include a non-linear activation in calculating e_ij. By viewing the attention mechanism in GAT layers from the perspective of Eq. (3.18), namely that c_i indicates a notion of local smoothness for node i, we can develop other ways to parameterize c_i. For example, instead of directly using the features of node i as an indicator of local smoothness as GAT layers do, we can consider the neighborhood information. In fact, we adopt this idea to design a new aggregation operation in Section 3.5.

3.4 Ugnn: A Unified Framework for Graph Filtering Operations via Graph Signal Denoising

In the previous section, we established that the aggregation operations in PPNP layers, APPNP layers, GCN layers, and GAT layers are intimately connected to the graph signal denoising problem with (generalized) Laplacian regularization. In particular, from this perspective, all of their aggregation operations aim to ensure feature smoothness: either global smoothness over the graph, as in PPNP, APPNP, and GCN layers, or local smoothness for each node, as in GAT layers. This understanding allows us to develop a unified feature aggregation operation by posing the following, more general graph signal denoising problem:

Problem 2 (Generalized Ugnn Graph Signal Denoising Problem).

    arg min_F L = ‖F − X‖_F^2 + r(C, F, G),    (3.23)

where r(C, F, G) denotes a flexible regularization term that enforces some prior over F. Note that we overload the notation C here: it can be a scalar (a global constant, as in GCN layers), a vector (node-wise constants, as in GAT layers), or even a matrix (edge-wise constants) if we want to give flexibility to each node pair.

Different choices of r(·) imply different feature aggregation operations. Besides PPNP, APPNP, GCN, and GAT, the aggregation operations of other GNN models can also be associated with Problem 2 through different regularization terms, such as PairNorm [101] and DropEdge [68], as detailed below. PairNorm and DropEdge, two recently proposed GNN enhancements for developing deeper GNN models, correspond to the following regularization terms:

    PairNorm: C_p · Σ_{(i,j)∈E} ‖F[i,:] − F[j,:]‖_2^2 − C_n · Σ_{(i,j)∉E} ‖F[i,:] − F[j,:]‖_2^2,    (3.24)

    DropEdge: Σ_{(i,j)∈E} C_ij · ‖F[i,:] − F[j,:]‖_2^2, where C_ij ∈ {0, 1}.    (3.25)

For PairNorm, C consists of C_p, C_n > 0, and the regularization term encourages connected nodes to be similar while encouraging disconnected nodes to be dissimilar. For DropEdge, C is a sparse matrix with the same shape as the adjacency matrix. For each edge (i, j), its corresponding C_ij is sampled from a Bernoulli distribution with mean 1 − q, where q is a pre-defined dropout rate. The above regularization terms are all related to Laplacian regularization. Other regularization terms can also be adopted, which may lead to novel designs of GNN layers. For example, if we aim to enforce that the clean signal is piecewise linear, we can adopt r(C, F, G) = C · ‖LF‖_1, as designed for trend filtering [82, 89].

With these discussions, we propose a unified framework (Ugnn) to design GNN layers from the graph signal denoising perspective: (1) design a graph regularization term r(C, F, G) in Problem 2 according to the specific application; (2) Feature Transformation: X'_in = f_trans(X_in); and (3) Feature Aggregation: solve Problem 2 with X = X'_in and the designed r(C, F, G).
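The following is an illustrative sketch of this three-step recipe, not the thesis implementation: the helper names are hypothetical, f_trans is stood in for by a random linear map, and the aggregation step approximately solves Problem 2 by plain gradient descent with a pluggable regularizer gradient (here the Laplacian regularizer as one example).

```python
import numpy as np

def laplacian_regularizer_grad(F, L, C):
    """Gradient of r(C, F, G) = C * tr(F^T L F) with respect to F."""
    return 2.0 * C * (L @ F)

def ugnn_layer(X_in, L, C=1.0, steps=10, lr=0.1, reg_grad=laplacian_regularizer_grad):
    # (2) Feature transformation: stands in for f_trans (an MLP in PPNP/APPNP).
    W = np.random.default_rng(3).normal(scale=0.1, size=(X_in.shape[1], X_in.shape[1]))
    X = X_in @ W
    # (3) Feature aggregation: gradient descent on ||F - X||_F^2 + r(C, F, G).
    F = X.copy()
    for _ in range(steps):
        grad = 2.0 * (F - X) + reg_grad(F, L, C)
        F = F - lr * grad
    return F

# Example: a path graph on 4 nodes with 2-dimensional features.
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
L = np.diag(A.sum(1)) - A
X_in = np.arange(8, dtype=float).reshape(4, 2)
print(ugnn_layer(X_in, L).round(3))
```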
To demonstrate the potential of Ugnn, we next introduce a new GNN model, Ada-Ugnn, by instantiating Ugnn with an r(C, F, G) that enforces adaptive local smoothness across nodes. Note that we introduce Ada-Ugnn with node classification as the downstream task.

3.5 Ada-Ugnn: Adaptive Local Smoothing with Ugnn

From the graph signal denoising perspective, PPNP, APPNP, and GCN enforce global smoothness by penalizing feature differences with a constant C for all nodes. However, real-world graphs may consist of multiple groups of nodes that differ in how often they connect to similar neighbors. For example, Section 3.6.1 shows several graphs with varying distributions of local smoothness (as measured by label homophily): in short, not all nodes are highly label-homophilic, and some nodes have considerably "noisier" neighborhoods than others. Moreover, as suggested by [90, 35], adversarial attacks on graphs tend to promote such label noise by connecting nodes from different classes and disconnecting nodes from the same class, leaving the resulting graphs with varying local smoothness across nodes. In these scenarios, a constant C may not be optimal, and adaptive (i.e., non-constant) smoothness for different nodes is desired. As shown in Section 3.3.3, by viewing GAT's aggregation as a solution to regularized graph signal denoising, GAT can be regarded as adopting an adaptive C for different nodes, which facilitates adaptive local smoothness. However, in GAT, the graph denoising problem is solved by a single step of gradient descent, which may still be suboptimal. Furthermore, when modeling the local smoothness factor c_i in Eq. (3.18), GAT only uses the features of node i as input, which may not be ideal: if we understand c_i as local smoothness, it should be intrinsically related to the neighborhood of node i. In this section, we adapt this notion directly into the Ugnn framework by introducing a new regularization term, and develop a resulting GNN model (Ada-Ugnn) that aims to enforce adaptive local smoothness on nodes in a different manner from GAT. We then utilize an iterative gradient descent method to approximate the optimal solution of Problem 2 with the following regularization term:

    r(C, F, G) = (1/2) Σ_{i∈V} C_i Σ_{j∈Ñ(i)} ‖ F[i,:]/√d_i − F[j,:]/√d_j ‖_2^2,    (3.26)

where d_i and d_j denote the degrees of nodes i and j, respectively, and C_i indicates the smoothness factor of node i, which is assumed to be a fixed scalar. Note that the above regularization term can be regarded as a generalized version of the regularization term used in PPNP, APPNP, and GCN. Similar to PPNP and APPNP, Ada-Ugnn consists of only a single GNN layer; unlike them, Ada-Ugnn assumes adaptive local smoothness. We next describe the feature transformation and aggregation operations of Ada-Ugnn, and show how to derive the model via Ugnn.

3.5.1 Feature Transformation

Similar to PPNP layers and APPNP layers, we adopt an MLP for the feature transformation. Specifically, for a node classification task, the output dimension of the feature transformation X'_in is the number of classes in the graph.

3.5.2 Feature Aggregation

We use iterative gradient descent to solve Problem 2 with the regularization term in Eq. (3.26). The iterative gradient descent steps are stated in the following theorem.
Theorem 5. With adaptive stepsize b_i = 1 / (2 + Σ_{j∈Ñ(i)} (C_i + C_j)/d_i) for each node i, the iterative gradient descent steps to solve Problem 2 with the regularization term in Eq. (3.26) are as follows:

    F^{(k)}[i,:] ← 2 b_i X[i,:] + b_i Σ_{j∈Ñ(i)} (C_i + C_j) F^{(k−1)}[j,:] / √(d_i d_j),   k = 1, …,    (3.27)

where F^{(0)}[i,:] = X[i,:].

Proof. The gradient of the optimization problem 2 with the regularization term in Eq. (3.26) with respect to F, focusing on node i, is as follows:

    ∂L/∂F[i,:] = 2(F[i,:] − X[i,:]) + Σ_{j∈Ñ(i)} ((C_i + C_j)/√d_i) (F[i,:]/√d_i − F[j,:]/√d_j),    (3.28)

where C_j appears in the second term since node i is also in the neighborhood of node j. The iterative gradient descent steps with adaptive stepsize b_i can be formulated as:

    F^{(k)}[i,:] ← F^{(k−1)}[i,:] − b_i · ∂L/∂F[i,:] |_{F[i,:]=F^{(k−1)}[i,:]},   k = 1, …    (3.29)

With the gradient in Eq. (3.28), the iterative steps in Eq. (3.29) can be rewritten as:

    F^{(k)}[i,:] ← (1 − 2b_i − b_i Σ_{j∈Ñ(i)} (C_i + C_j)/d_i) F^{(k−1)}[i,:] + 2b_i X[i,:]
                   + b_i Σ_{j∈Ñ(i)} (C_i + C_j) F^{(k−1)}[j,:] / √(d_i d_j),   k = 1, …    (3.30)

Given b_i = 1 / (2 + Σ_{j∈Ñ(i)} (C_i + C_j)/d_i), the iterative steps in Eq. (3.30) can be rewritten as:

    F^{(k)}[i,:] ← 2 b_i X[i,:] + b_i Σ_{j∈Ñ(i)} (C_i + C_j) F^{(k−1)}[j,:] / √(d_i d_j),   k = 1, …,    (3.31)

with F^{(0)}[i,:] = X[i,:], which completes the proof.

Following the iterative solution in Eq. (3.27), we model the aggregation operation (for node i) of Ada-Ugnn as follows:

    X_out^{(k)}[i,:] ← 2 b_i X'_in[i,:] + b_i Σ_{j∈Ñ(i)} (C_i + C_j) X_out^{(k−1)}[j,:] / √(d_i d_j),   k = 1, …, K,    (3.32)

where K is the number of gradient descent iterations, C_i can be considered a positive scalar controlling the level of "local smoothness" for node i, and b_i can be calculated from {C_j | j ∈ Ñ(i)} as b_i = 1 / (2 + Σ_{j∈Ñ(i)} (C_i + C_j)/d_i). In practice, however, C_i is usually unknown. One possible solution is to treat the C_i as hyper-parameters. But treating C_i as a hyper-parameter for every node is impractical, since there are N of them in total and we do not have prior knowledge of them. Thus, we instead parameterize C_i as a function of the neighborhood information of node i as follows:

    C_i = s · σ( h_1( h_2({X'_in[j,:] | j ∈ Ñ(i)}) ) ),    (3.33)

where h_2(·) is a function that transforms the neighborhood information of node i into a vector, while h_1(·) further transforms it into a scalar. σ(·) denotes the sigmoid function, which maps the scalar output of h_1(·) to (0, 1), and s can be treated as a hyper-parameter controlling the upper bound of C_i. h_1(·) can be modeled as a single-layer fully-connected neural network. There are different possible designs for h_2(·), such as the channel-wise mean or variance [13]. In this chapter, we adopt the channel-wise variance as h_2(·) (as a measure of diversity). An APPNP layer can be regarded as a special case of Ada-Ugnn in which h_2(·) is modeled as a constant function that outputs 1 for all nodes. For the node classification task, the representation X_out^{(K)}, obtained after K iterations as in Eq. (3.32), is directly softmax-normalized row-wise, and its i-th row indicates the discrete class distribution of node i.
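The sketch below illustrates the Ada-Ugnn aggregation of Eqs. (3.32)–(3.33) on a toy graph. It assumes self-inclusive neighborhoods with degrees counted on A + I, uses a hypothetical random linear map for h_1, and uses the channel-wise variance for h_2 as described above; it is a minimal illustration, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
n, num_classes, K, s = 6, 3, 10, 9.0
A = rng.integers(0, 2, size=(n, n))
A = ((np.triu(A, 1) + np.triu(A, 1).T) > 0).astype(float)
A_hat = A + np.eye(n)                                # self-inclusive neighborhoods
deg = A_hat.sum(axis=1)

X_in_prime = rng.normal(size=(n, num_classes))       # output of the MLP transformation

# Eq. (3.33): C_i = s * sigmoid(h1(h2({X'_in[j] | j in N~(i)})))
w_h1 = rng.normal(size=num_classes)                  # single-layer h1 (no bias here)
C = np.empty(n)
for i in range(n):
    nbrs = np.flatnonzero(A_hat[i])
    h2 = X_in_prime[nbrs].var(axis=0)                # channel-wise variance
    C[i] = s / (1.0 + np.exp(-h2 @ w_h1))            # s * sigmoid(h1(h2))

# Eq. (3.32): K iterative aggregation steps with adaptive stepsize b_i.
out = X_in_prime.copy()
for _ in range(K):
    new = np.empty_like(out)
    for i in range(n):
        nbrs = np.flatnonzero(A_hat[i])
        b_i = 1.0 / (2.0 + ((C[i] + C[nbrs]) / deg[i]).sum())
        msg = ((C[i] + C[nbrs])[:, None] * out[nbrs] /
               np.sqrt(deg[i] * deg[nbrs])[:, None]).sum(axis=0)
        new[i] = 2.0 * b_i * X_in_prime[i] + b_i * msg
    out = new

probs = np.exp(out) / np.exp(out).sum(axis=1, keepdims=True)   # row-wise softmax
print(probs.round(3))
```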
3.6 Experiments

In this section, we evaluate how the proposed Ada-Ugnn handles graphs with varying local smoothness. We conduct node classification experiments on natural graphs, and also evaluate the model's robustness under adversarial attacks. We note that our main goal in proposing and evaluating Ada-Ugnn is to demonstrate the promise of deriving new aggregations as solutions of denoising problems, rather than to achieve state-of-the-art performance.

3.6.1 Node Classification

In this subsection, we conduct the node classification task. We first introduce the datasets and the experimental settings in Section 3.6.1.1 and then present the results in Section 3.6.3.1.

3.6.1.1 Datasets and Experimental Settings

We conduct the node classification task on 8 datasets from various domains, including citation, social, co-authorship, and co-purchase networks. Specifically, we use three citation networks, Cora, Citeseer, and Pubmed [74]; one social network, BlogCatalog [33]; two co-authorship networks, Coauthor-CS and Coauthor-PH [76]; and two co-purchase networks, Amazon-Comp and Amazon-Photo [76]. Descriptions and detailed statistics of these datasets can be found below.

3.6.2 Datasets

Table 3.1: Dataset summary statistics.

    Dataset        #Nodes   #Edges   #Labels   #Features
    Cora             2708    13264         7        1433
    Citeseer         3327    12431         6        3703
    Pubmed          19717   108365         3         500
    BlogCatalog      5196   348682         6        8189
    Amazon-Comp     13381   504937        10         767
    Amazon-Photo     7487   245573         8         745
    Coauthor-CS     18333   182121        15        6805
    Coauthor-PH     34493   530417         5        8415

• Citation Networks: Cora, Citeseer, and Pubmed are widely adopted benchmarks for GNN models. In these graphs, nodes represent documents and edges denote citation links between them. Each node is associated with bag-of-words features of its corresponding document and a label indicating the research field of the document.

• BlogCatalog: BlogCatalog is an online blogging community where bloggers can follow each other. The BlogCatalog graph consists of bloggers as nodes and their social relations as edges. Each blogger is associated with features generated from the keywords of his/her blogs. The bloggers are labeled according to their interests.

• Co-purchase Graphs: Amazon-Comp and Amazon-Photo are co-purchase graphs, where nodes represent items and edges indicate that two items are frequently bought together. Each item is associated with bag-of-words features extracted from its corresponding reviews. The label of an item is given by its category.

• Co-authorship Graphs: Coauthor-CS and Coauthor-PH are co-authorship graphs, where nodes are authors and edges indicate co-authorship between authors. Each author is associated with features representing the keywords of his/her papers. The label of an author indicates his/her most active research field.

Some statistics of these graphs are shown in Table 3.1. To provide a sense of the local smoothness properties of these datasets, in addition to the summary statistics, we further present the distribution of local label smoothness in these datasets. For a node i, we formally define the local label smoothness as:

    ls(i) = ( Σ_{j∈N(i)} 1{l(i) = l(j)} ) / |N(i)|,    (3.34)

where l(i) denotes the label of node i and 1{a} is an indicator function, which outputs 1 only when a is true and 0 otherwise. The distributions of local label smoothness for all 8 datasets are presented in Figure 3.1.

[Figure 3.1: Distribution of local label smoothness (homophily) on the eight graph datasets (Cora, Citeseer, Pubmed, BlogCatalog, Amazon-Comp, Amazon-Photo, Coauthor-CS, Coauthor-PH): note the non-homogeneity of smoothness values.]
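As a brief illustration, the following sketch computes the local label smoothness of Eq. (3.34) for every node of a toy graph; the adjacency matrix and labels are chosen arbitrarily.

```python
import numpy as np

A = np.array([[0,1,1,0,0],
              [1,0,1,1,0],
              [1,1,0,0,1],
              [0,1,0,0,1],
              [0,0,1,1,0]], dtype=float)
labels = np.array([0, 0, 1, 1, 1])

def local_label_smoothness(A, labels):
    """Fraction of each node's neighbors sharing its label, as in Eq. (3.34)."""
    ls = np.zeros(len(labels))
    for i in range(len(labels)):
        nbrs = np.flatnonzero(A[i])                  # N(i), excluding node i itself
        ls[i] = np.mean(labels[nbrs] == labels[i]) if len(nbrs) else 0.0
    return ls

print(local_label_smoothness(A, labels))   # one value in [0, 1] per node
```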
3.6.3 Node Classification Accuracy for Nodes with Low-level and High-level Local Label Smoothness

The performance on nodes with low local label smoothness and high local label smoothness in Citeseer, Pubmed, Amazon-Photo, and Coauthor-PH is presented in Figure 3.3. Notably, the variety in local label smoothness within several real-world datasets, also observed in [75], clearly motivates the importance of the adaptive smoothness assumption in Ada-Ugnn.

Table 3.2: Node classification accuracy on various datasets.

    Dataset        GCN          GAT          APPNP        Ada-Ugnn
    Cora           81.75±0.8    82.56±0.8    84.49±0.6    84.59±0.8*
    Citeseer       70.13±1.0    70.77±0.8    71.97±0.6    72.05±0.5
    Pubmed         78.56±0.5    78.88±0.5    79.92±0.5    79.70±0.4
    BlogCatalog    71.38±2.7    72.90±1.2    92.43±0.9    93.33±0.3***
    Amazon-Comp    82.79±1.3    83.01±1.5    82.99±1.6    83.40±1.3***
    Amazon-Photo   89.60±1.5    90.33±1.2    91.38±1.2    91.44±1.2
    Coauthor-CS    91.55±0.6    90.95±0.7    91.69±0.4    92.33±0.5***
    Coauthor-PH    93.23±0.7    92.86±0.7    93.84±0.5    93.92±0.6**

    *, **, *** indicate that the improvement over APPNP is significant at p < 0.1, 0.05, and 0.005, respectively.

For the citation networks, we use the standard splits provided in [40, 96]. For BlogCatalog, we adopt the split provided in [102]. For both the citation networks and BlogCatalog, the experiments are run with 30 random seeds and the average results are reported. For the co-authorship and co-purchase networks, we utilize 20 labeled nodes per class for training, 30 nodes per class for validation, and the remaining nodes for testing. This process is repeated 20 times, which results in 20 different training/validation/test splits. For each split, the experiment is repeated 20 times with different initializations. The average results over the 20 × 20 experiments are reported. We compare our method with the methods introduced in Section 3.2, including GCN, GAT, and APPNP. Note that we do not include PPNP as it is difficult to scale to most of the datasets due to the matrix inverse in Eq. (3.7). For all methods, we tune the hyperparameters from the following options: 1) learning rate: {0.005, 0.01, 0.05}; 2) weight decay: {5e−04, 5e−05, 5e−06, 5e−07, 5e−08}; and 3) dropout rate: {0.2, 0.5, 0.8}. For APPNP and our method, we further tune the number of iterations K and the upper bound s for C_i in Eq. (3.33) from the following ranges: 1) K: {5, 10}; and 2) s: {1, 9, 19}. Note that we treat APPNP as a special case of our proposed method with h_2(·) = 1.

3.6.3.1 Performance Comparison

The performance comparison is shown in Table 3.2, where a t-test is used to assess significance. First, GAT outperforms GCN on most datasets, which indicates that modeling adaptive local smoothness is helpful. Second, APPNP and Ada-Ugnn outperform GCN and GAT in most settings, suggesting that iterative gradient descent may offer advantages over single-step gradients, due to its ability to reach a solution closer to the optimum.

[Figure 3.2: Accuracy for nodes with low and high local label smoothness (APPNP vs. Ada-Ugnn) on Cora, BlogCatalog, Amazon-Comp, and Coauthor-CS.]
Third, and most notably, the proposed Ada-Ugnn achieves consistently better performance than GCN and GAT, and outperforms or matches the state-of-the-art APPNP across datasets. Notice that on some datasets, such as Cora, Citeseer, and Coauthor-PH, the improvements of the proposed model over APPNP are not very significant. Figure 3.1 shows that these datasets have extremely skewed local label smoothness distributions, with the majority of nodes having perfect (1.0) label homophily: they are only connected to other nodes of the same label. APPNP shines in such cases, since its assumption of h_2(·) = 1 is ideal for these nodes (designating maximal local smoothness). Conversely, our model has the more challenging task of learning h_2(·); in such skewed cases, learning h_2(·) may be quite difficult and unfruitful. On the other hand, for datasets with higher diversity in local label smoothness across nodes, such as BlogCatalog and Amazon-Comp, the proposed Ada-Ugnn achieves more significant improvements. To further validate this, we partition the nodes in the test set of each dataset into two groups: (1) high smoothness, those with local label smoothness > 0.5, and (2) low smoothness, those with local label smoothness ≤ 0.5, and evaluate the accuracy of APPNP and the proposed Ada-Ugnn on each group. The results for Cora, BlogCatalog, Amazon-Comp, and Coauthor-CS are presented in Figure 3.2, while the results for the remaining datasets can be found in Figure 3.3. Figure 3.2 clearly shows that Ada-Ugnn consistently improves performance for low-smoothness nodes in most datasets, while keeping comparable (or marginally worse) performance for high-smoothness nodes. In cases where many nodes have low-level smoothness (such as BlogCatalog or Amazon-Comp), our method can notably improve overall performance.

[Figure 3.3: Accuracy for nodes with low and high local label smoothness on Citeseer, Pubmed, Amazon-Photo, and Coauthor-PH. Note the consistent improvement in low-smoothness cases, enabled by adaptive local smoothing.]

[Figure 3.4: Distribution of local label smoothness on Cora with various attack perturbation rates (0%, 5%, 15%, 25%).]

[Figure 3.5: Distribution of local label smoothness on Citeseer with various attack perturbation rates (0%, 5%, 15%, 25%).]

3.6.4 Robustness Under Adversarial Attacks

Graph adversarial attacks tend to connect nodes from different classes while disconnecting nodes from the same class, which typically leads to more diverse distributions of local smoothness.
We present the local label smoothness distributions of the graphs generated by Mettack [108] with different perturbation rates for Cora, Citeseer, and Pubmed in Figure 3.4, Figure 3.5, and Figure 3.6, respectively.

[Figure 3.6: Distribution of local label smoothness on Pubmed with various attack perturbation rates (0%, 5%, 15%, 25%).]

To further demonstrate that Ada-Ugnn can handle graphs with varying local label smoothness better than alternatives, we conduct experiments to show its robustness under adversarial attacks. Specifically, we adopt Mettack [108] to perform the attacks. Mettack produces non-targeted attacks which aim to impair test-set node classification performance by strategically adding or removing edges from the victim graph. We utilize the attacked graphs (5%–25% perturbation rate) from [35] and follow the same setting, i.e., each method is run with 10 random seeds and the average performance is reported. These attacked graphs are generated from Cora, Citeseer, and Pubmed, respectively, and only the largest connected component is retained in each graph. Furthermore, the training/validation/test split ratio is 10%/10%/80%, which is different from the standard splits we use in Section 3.6.1. Thus, the performance reported in this section is not directly comparable with that in the previous section. We compare our method not only with the standard GNNs discussed in Section 3.2 (GCN, GAT, APPNP), but also with recent state-of-the-art defense techniques against adversarial attacks, including GCN-Jaccard [90], GCN-SVD [19], Pro-GNN-fs, and Pro-GNN [35]. These methods are described in detail as follows:

• GCN-Jaccard [90]: GCN-Jaccard aims to pre-process a given attacked graph by removing the edges added by the attacker. Specifically, Jaccard similarity is utilized to measure the feature similarity between connected pairs of nodes. Edges between node pairs with low similarity are removed by the algorithm. The pre-processed graph is then utilized for the node classification task.

• GCN-SVD [19]: GCN-SVD is also a pre-processing method. It uses SVD to decompose the adjacency matrix of a given perturbed graph and then obtains its low-rank approximation. The low-rank approximation is believed to be cleaner, as graph adversarial attacks are observed to be high-rank in [19].

• Pro-GNN [35]: Pro-GNN tries to learn a cleaner graph while training the node classification model at the same time. Specifically, it treats the adjacency matrix as parameters, which are optimized during the training stage.
Several constraints are enforced on this learnable adjacency matrix, including: 1) the learned adjacency matrix should be close to the original adjacency matrix; 2) the learned adjacency matrix should be low-rank; and 3) the learned adjacency matrix should ensure feature smoothness. Pro-GNN-fs is a variant of Pro-GNN in which the third constraint, i.e., feature smoothness, is not enforced.

[Figure 3.7: Robustness under adversarial attacks (node classification accuracy) on Cora, Citeseer, and Pubmed under perturbation rates from 0% to 25%, comparing APPNP, Ada-Ugnn, GCN, GAT, GCN-Jaccard, GCN-SVD, Pro-GNN-fs, and Pro-GNN.]

Results under varying perturbation rates (attack intensities) are shown in Figure 3.7. Again, we observe that GAT outperforms GCN, suggesting the appeal of an adaptive local smoothness assumption. Here, our method substantially outperforms GCN, GAT, and APPNP by a large margin, especially in scenarios with high perturbation rates. Moreover, thanks to its adaptive smoothness assumption, the proposed Ada-Ugnn is even more robust than several specially designed adversarial defense methods, such as GCN-Jaccard and GCN-SVD, which are based on pre-processing the attacked graphs to obtain cleaner ones. Compared with Pro-GNN-fs, our method performs comparably, or even better in a few settings, especially when the perturbation rate is high. Furthermore, in these settings, the performance of our method is close to that of Pro-GNN, the current state-of-the-art adversarial defense technique. Note that Pro-GNN-fs and Pro-GNN involve learning cleaner adjacency matrices of the attacked graphs, and thus have O(M) parameters (where M denotes the number of edges in the graph), while our proposed model has far fewer parameters: O(d_in · d_out) for the feature transformation and H parameters for modeling h_1(·), with H denoting the number of labels.

3.6.5 Investigation of the Number of Gradient Descent Steps in Ada-Ugnn

In this section, we conduct experiments to check how the performance of Ada-Ugnn is affected by K. For each K, we run the experiments on the standard splits of Cora, Citeseer, and Pubmed with 30 random seeds (i.e., the same setting as in Section 3.6). The average performance is reported. As shown in Figure 3.8, the performance increases quickly as K gets larger when K is relatively small. After K becomes large, the performance either grows slowly or fluctuates slightly as K further increases.

[Figure 3.8: Ada-Ugnn performance (test accuracy) on Cora, Citeseer, and Pubmed under different numbers of gradient descent steps K (from 1 to 10).]

3.7 Related Works

There are two main streams of work in developing GNN models: spectral-based and spatial-based. When designing spectral-based GNNs, graph convolution [80], defined based on spectral theory, is utilized to design graph neural network layers together with feature transformation and non-linearity [6, 29, 15]. These designs of spectral-based graph convolution are tightly related to graph signal processing, and they can be regarded as graph filters. Low-pass graph filters can usually be adopted to denoise graph signals [9]. In fact, most algorithms discussed in our work can be regarded as low-pass graph filters. With the emergence of GCN [40], which can be regarded as a simplified spectral-based and also a spatial-based graph convolution operator, numerous spatial-based GNN models have since been developed [28, 84, 58, 24, 26].
Graph signal denoising aims to infer a cleaner graph signal from a given noisy signal, and can usually be formulated as a graph-regularized optimization problem [9]. Recently, several works have connected GCN with graph signal denoising under Laplacian regularization [64, 101], finding that the aggregation process in GCN models can be regarded as a first-order approximation of the optimal solution of the denoising problem. On the other hand, GNNs have also been utilized to develop novel algorithms for graph denoising [8]. Unlike these works, our work details how a family of GNN models can be unified under a graph signal denoising perspective, and demonstrates the promise of this perspective for new architecture design.

CHAPTER 4

IS HOMOPHILY A NECESSITY FOR GRAPH NEURAL NETWORKS?

4.1 Introduction

As introduced in Section 3.2, a GNN model learns node representations through a recursive neighborhood aggregation process (stacking multiple graph filtering operations), where each node aggregates and transforms features from its neighbors. The node representations can then be utilized for downstream node classification or regression tasks. Due to this neighborhood aggregation mechanism, existing literature posits that strong homophily of the underlying graph is a necessity for GNNs¹ to achieve good performance on semi-supervised node classification (SSNC) tasks (see Section 2.2 for details about this task) [1, 106, 105, 10, 56, 27, 31, 92]. In general, homophily describes the phenomenon that nodes tend to connect with "similar" or "alike" others. Homophily is observed in a wide range of real-world graphs, including friendship networks [57], political networks [25, 62], citation networks [12], and more. Under the homophily assumption, through the aggregation process, a node's representation is "smoothed" via its neighbors' representations, since each node receives additional information from neighboring nodes, which are likely to share the same label. Several recent works [106, 105] claim that GNNs are implicitly (or explicitly) designed with homophily in mind and are not suitable for graphs exhibiting heterophily, where connected nodes are prone to have different properties or labels, e.g., dating networks or molecular networks [106]. To address this limitation, these works design and modify neural architectures accordingly and demonstrate outperformance over other GNN models on several heterophilous graphs. However, we empirically find that the standard graph convolutional network (GCN) [40], a fundamental, representative GNN model which we focus on in this work, is actually able to outperform these heterophily-specific models on several heterophilous graphs after careful hyperparameter tuning. This motivates us to reconsider the popular notion in the literature that GNNs exhibit a homophilous inductive bias, and more specifically that strong homophily is crucial to strong GNN performance. Counter to this idea, we find that the standard GCN model has the potential to work well on heterophilous graphs under suitable conditions. We demonstrate simple intuition with the following toy example.

¹ In this chapter, we deal with node-focus tasks. Hence, when we mention GNNs, we refer to GNN models that follow the general framework for node-focus tasks described in Section 2.3.3.1.

[Figure 4.1: A perfectly heterophilous graph on which GCN achieves perfect class separation; nodes with labels 0 and 1 alternate, and every edge connects nodes of different labels.]

Consider the perfectly heterophilous graph (all inter-class edges) shown in Figure 4.1, where the color indicates the node label.
Blue-labeled and orange-labeled nodes are associated with the scalar features 0 and 1, respectively. If we consider a single-layer GCN that performs an averaging feature aggregation over all neighboring nodes (without self-connections), it is clear that all blue nodes will obtain a representation of 1, while the orange nodes will obtain a representation of 0. Additional layers/aggregations will continue to alternate the features between the two types of nodes. Regardless of the number of layers, the two classes can still be perfectly separated. In this toy example, each blue (orange) node only connects to orange (blue) nodes, and all blue (orange) nodes share similar neighborhood patterns in terms of their neighbors' label/feature distributions. Naturally, real graphs are rather more complicated. In this work, we elucidate this intuition and extend it to a more general case: put simply, given a (homophilous or heterophilous) graph, GCN has the potential to achieve good performance if nodes with the same label share similar neighborhood patterns. We theoretically support this argument by investigating the node embeddings learned by the GCN model. We further show that GCN performance is only loosely linked to homophily or heterophily in the underlying graph, and demonstrate how different notions of heterophily can yield better or worse discriminative performance.

4.2 Preliminaries

4.2.1 Homophily in Graphs

In this work, we investigate performance in the context of graph homophily and heterophily properties. Homophily in graphs is typically defined based on the similarity between connected node pairs, where two nodes are considered similar if they share the same node label. The homophily ratio is defined based on this intuition.

Definition 1 (Homophily). Given a graph G = {V, E} and node label vector y, the edge homophily ratio is defined as the fraction of edges that connect nodes with the same label. Formally, we have:

    h(G, {y_i ; i ∈ V}) = ( Σ_{(j,k)∈E} 1(y_j = y_k) ) / |E|,    (4.1)

where |E| is the number of edges in the graph and 1(·) is the indicator function.

A graph is typically considered highly homophilous when h(·) is large (typically 0.5 ≤ h(·) ≤ 1), given suitable label context. In such cases, edges predominantly connect nodes with the same label. On the other hand, a graph with a low edge homophily ratio is considered heterophilous. In future discourse, we write h(·) as h when discussing a fixed graph and label context. Note that homophily is a distinct property from heterogeneity; the former refers to the class labels of connected nodes, while the latter refers to their types.
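A short sketch of the edge homophily ratio in Eq. (4.1) on a toy edge list follows; the labels and edge sets are chosen arbitrarily for illustration.

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges connecting nodes with the same label, as in Eq. (4.1)."""
    same = sum(labels[j] == labels[k] for j, k in edges)
    return same / len(edges)

labels = np.array([0, 1, 0, 1, 0, 1])
hetero_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]   # every edge crosses labels
homo_edges = [(0, 2), (2, 4), (1, 3), (3, 5), (0, 4)]     # every edge stays within a label

print(edge_homophily(hetero_edges, labels))   # 0.0 -> perfectly heterophilous
print(edge_homophily(homo_edges, labels))     # 1.0 -> perfectly homophilous
```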
4.2.2 Graph Convolutional Networks

As discussed in Chapter 3, graph neural networks learn node representations by aggregating and transforming information over the graph structure. Different designs of the aggregation and transformation process lead to different graph neural network models [72, 40, 28, 84, 26, 103]. In this chapter, we mainly focus on the graph convolutional network (GCN). We described the GCN model in Section 3.2; however, our definition of the GCN model here differs slightly from that one, so we re-describe it for convenience. The GCN model consists of several GCN graph filtering layers (GCN layers). The k-th GCN layer takes the following form:

    H^{(k)} = σ(Â H^{(k−1)} W^{(k)}),

where σ(·) is a non-linear activation function, W^{(k)} ∈ R^{l×l} is a parameter matrix that transforms the features, and Â is a normalized adjacency matrix that performs the aggregation. Typically, we use the associated node features as the input to the GCN model, i.e., H^{(0)} = X. Different variants of the normalization can be adopted, including the row-normalized adjacency matrix and the symmetric adjacency matrix:

    Row-normalized adjacency matrix: Â = D^{−1}(A + I),    (4.2)
    Symmetric adjacency matrix: Â = D^{−1/2}(A + I)D^{−1/2},    (4.3)

where D is a diagonal matrix with D[i, i] = deg(i) + 1, with deg(i) denoting the degree of node i, and I ∈ {0, 1}^{|V|×|V|} is the identity matrix. Hence, adding I to the adjacency matrix A can be regarded as adding self-connections for the nodes.

4.3 Graph Convolutional Networks under Heterophily

Table 4.1: SSNC accuracy on two heterophilous datasets: standard GCN outperforms all heterophily-specific models on these graphs.

    Method         Chameleon (h = 0.23)   Squirrel (h = 0.22)
    GCN            67.96 ± 1.82           54.47 ± 1.17
    H2GCN-1        57.11 ± 1.58           36.42 ± 1.89
    H2GCN-2        59.39 ± 1.98           37.90 ± 2.02
    CPGNN-MLP      54.53 ± 2.37           29.13 ± 1.57
    CPGNN-Cheby    65.17 ± 3.17           29.25 ± 4.17
    GPRGNN         66.31 ± 2.05           50.56 ± 1.51

Considerable prior literature posits that graph neural networks (such as GCN) work by assuming and exploiting homophily in the underlying graph [56, 27, 92]. To this end, researchers have deemed such models ill-suited for heterophilous graphs, where the homophily ratio is low [106, 105, 10]. To deal with this limitation, researchers have proposed several methods, including H2GCN [106], CPGNN [105], and GPRGNN [10], which are explicitly designed to handle heterophilous graphs via architectural choices (e.g., adding skip-connections, carefully choosing aggregators, etc.). To demonstrate their assumptions, these works cited the poor SSNC performance of GCN on several heterophilous graphs, and to demonstrate their proposed solutions, they cited their designs' relative outperformance of GCN on these tasks. In this section, we revisit the claim that GCNs have fundamental homophily assumptions and are not suited for heterophilous graphs. To this end, we first observe empirically that the GCN model achieves fairly good performance on some of the commonly used heterophilous graphs; specifically, we present SSNC performance on two commonly used heterophilous graph datasets, Chameleon and Squirrel, in Table 4.1 (see Appendix A.2 for further details about the datasets and models). Both Chameleon and Squirrel are highly heterophilous (h ≈ 0.2). We find that with some hyperparameter tuning, GCN can outperform alternative methods uniquely designed to operate on heterophilous graphs by a sizeable margin. This observation suggests that GCN does not always "underperform" on heterophilous graphs, and it leads us to reconsider the prevalent assumption in the literature. Hence, we next examine how GCNs learn representations, and how this information is used in downstream SSNC tasks, homophily and heterophily assumptions aside.

4.3.1 When does GCN learn similar embeddings?

GCNs have been shown to be able to capture local graph topological and structural information [94, 60]. Specifically, the aggregation step in the GCN model is able to capture and discriminate neighborhood distribution information (more precisely, the mean of the neighborhood features) [94].
Let us consider the two nodes a and b shown in Figure 4.2, where color indicates the label of each node. If we further assume that all nodes sharing the same label are associated with exactly the same features, then clearly, after one step of aggregation, the GCN model with the row-normalized adjacency matrix (see Eq. (4.2); ignoring the self-connection here) will output exactly the same embedding for nodes a and b (the average of the "orange", "yellow", and "green" node features). Accordingly, [94] reasons that the GCN model lacks expressiveness, since it cannot differentiate nodes a and b in the embedding space: although they share the same neighborhood distribution, they differ concretely in neighborhood structure (here, with respect to neighbor count).

[Figure 4.2: Two nodes a and b share the same neighborhood distribution; GCN learns equivalent embeddings for a and b.]

However, in this example SSNC task, mapping nodes a and b to the same location in the embedding space is not an adverse effect, but explicitly desirable, since they share the same label and we are not concerned with telling them apart. Intuitively, if all nodes with the same label are mapped to the same embedding and the embeddings for different labels are distinct, then nodes are perfectly separable in the embedding space and can thus be effortlessly classified (one instance in which expressive power and task performance do not coincide). One instance in which such separability arises is the case of the "ideal graph" mentioned in [102], which considers a perfectly homophilous graph of |C| (fully) connected components, where all nodes in a component share the same label. Of course, in reality, such idealized graphs which meet these precise assumptions are rare to nonexistent. Thus, to consider a more practical scenario, we assume that both the features and the neighborhood patterns of nodes with a certain label are sampled from some fixed distributions. Under these conditions, nodes with the same label may not be mapped to exactly the same embedding. Instead, we aim to characterize how close the learned embeddings of same-label nodes are. Intuitively, if the learned embeddings for nodes with the same label are very close, we expect SSNC performance to be good (given that the embeddings of nodes of other classes are far away), as class separability is high (low intra-class variance and high inter-class variance) [22]. We prove that, for graphs meeting suitable conditions (features and neighborhood patterns for nodes with a certain label are sampled from fixed distributions), the distance between the embeddings learned by a single GCN layer for any node pair that shares the same label is bounded by a small quantity with high probability. We elaborate below.

Assumptions on Graphs. We consider a graph G, where each node i has features x_i ∈ R^l and label y_i. We assume that: (1) G is d-regular; (2) the features of node i are sampled from the feature distribution F_{y_i}, i.e., x_i ∼ F_{y_i}, with μ(F_{y_i}) denoting its mean; (3) the dimensions of x_i are independent of each other; (4) the features in X are bounded by a positive scalar B, i.e., max_{i,j} |X[i, j]| ≤ B; and (5) for node i, its neighbors' labels are independently sampled from the neighbor distribution D_{y_i}, with the sampling repeated d times to obtain the labels of its d neighbors. We denote a graph following Assumptions (1)–(5) as G = {V, E, {F_c, c ∈ C}, {D_c, c ∈ C}, d}.
Note that we use the subscripts in F_{y_i} and D_{y_i} to indicate that these two distributions are shared by all nodes with the same label as node i. For regular graphs, the two kinds of normalized adjacency matrices introduced in Section 4.2.2 lead to the same GCN model. Specifically, for a single-layer GCN model, the process can be written in the following form for node i:

    h_i = Σ_{j∈{i}∪N(i)} (1/(d+1)) W x_j,    (4.4)

where W ∈ R^{l×l} is the parameter matrix and x_j is the input feature vector of node j. h_i is the pre-activation output of the GCN model, and an element-wise non-linear activation function can be applied to h_i to obtain the final output.

Theorem 6. Consider a graph G = {V, E, {F_c, c ∈ C}, {D_c, c ∈ C}, d} that follows Assumptions (1)–(5). For any node i ∈ V, the expectation of the pre-activation output of a single-layer GCN model from Eq. (4.4) is given by

    E[h_i] = W ( (1/(d+1)) μ(F_{y_i}) + (d/(d+1)) E_{c∼D_{y_i}, x∼F_c}[x] ),    (4.5)

and for any t > 0, the probability that the distance between the observation h_i and its expectation is larger than t is bounded by

    P(‖h_i − E[h_i]‖_2 ≥ t) ≤ 2 · l · exp( − (d+1)t^2 / (2ρ^2(W)B^2 l) ),    (4.6)

where l is the feature dimensionality and ρ(W) denotes the largest singular value of W.

To prove Theorem 6, we first introduce the celebrated Hoeffding inequality below.

Lemma 7 (Hoeffding's Inequality). Let Z_1, …, Z_n be independent bounded random variables with Z_i ∈ [a, b] for all i, where −∞ < a ≤ b < ∞. Then

    P( (1/n) Σ_{i=1}^{n} (Z_i − E[Z_i]) ≥ t ) ≤ exp( − 2nt^2 / (b − a)^2 )

and

    P( (1/n) Σ_{i=1}^{n} (Z_i − E[Z_i]) ≤ −t ) ≤ exp( − 2nt^2 / (b − a)^2 )

for all t ≥ 0.

Proof of Theorem 6.

Proof. The expectation of h_i can be derived as follows:

    E[h_i] = E[ Σ_{j∈{i}∪N(i)} (1/(d+1)) W x_j ]
           = (1/(d+1)) W E[x_i] + (1/(d+1)) Σ_{j∈N(i)} W E[x_j]
           = (1/(d+1)) W μ(F_{y_i}) + (1/(d+1)) Σ_{j∈N(i)} W E_{c∼D_{y_i}, x∼F_c}[x]
           = W ( (1/(d+1)) μ(F_{y_i}) + (d/(d+1)) E_{c∼D_{y_i}, x∼F_c}[x] ).

We utilize Hoeffding's inequality to prove the bound in Eq. (4.6). Let x_j[k], k = 1, …, l, denote the k-th element of x_j. Then, for any dimension k, {x_j[k], j ∈ {i} ∪ N(i)} is a set of d + 1 independent random variables bounded in [−B, B]. Hence, directly applying Hoeffding's inequality to their average, for any t_1 ≥ 0 we have:

    P( | (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j[k] − E[x_j[k]]) | ≥ t_1 ) ≤ 2 exp( − (d+1)t_1^2 / (2B^2) ).

If ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2 ≥ √l · t_1, then the inequality | (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j[k] − E[x_j[k]]) | ≥ t_1 holds for at least one k ∈ {1, …, l}. Hence, by the union bound,

    P( ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2 ≥ √l · t_1 )
      ≤ Σ_{k=1}^{l} P( | (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j[k] − E[x_j[k]]) | ≥ t_1 )
      ≤ 2 · l · exp( − (d+1)t_1^2 / (2B^2) ).

Setting t_1 = t_2/√l gives

    P( ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2 ≥ t_2 ) ≤ 2 · l · exp( − (d+1)t_2^2 / (2B^2 l) ).

Furthermore, we have

    ‖h_i − E[h_i]‖_2 = ‖ W ( (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ) ‖_2
                     ≤ ‖W‖_2 ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2
                     = ρ(W) ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2,

where ‖W‖_2 is the matrix 2-norm of W; the last line uses the identity ‖W‖_2 = ρ(W). Then, for any t > 0,

    P(‖h_i − E[h_i]‖_2 ≥ t) ≤ P( ρ(W) ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2 ≥ t )
                            = P( ‖ (1/(d+1)) Σ_{j∈{i}∪N(i)} (x_j − E[x_j]) ‖_2 ≥ t/ρ(W) )
                            ≤ 2 · l · exp( − (d+1)t^2 / (2ρ^2(W)B^2 l) ),

which completes the proof.

Theorem 8. Consider a graph G = {V, E, {F_c, c ∈ C}, {D_c, c ∈ C}, d} that follows Assumptions (1)–(5).
For any two nodes i, j ∈ V that share the same label, i.e., y_i = y_j, the probability that the distance between their pre-activation output embeddings (from Eq. (4.4)) is larger than t (t > 0) is bounded by

    P(‖h_i − h_j‖_2 ≥ t) ≤ 4 · l · exp( − (d+1)t^2 / (8ρ^2(W)B^2 l) ).    (4.7)

Proof. If ‖h_i − h_j‖_2 ≥ t, then at least one of the following two inequalities holds:

    ‖h_i − E[h_i]‖_2 ≥ t/2    or    ‖h_j − E[h_j]‖_2 ≥ t/2,

since E[h_i] = E[h_j] when y_i = y_j (by Eq. (4.5)). Hence, we have

    P(‖h_i − h_j‖_2 ≥ t) ≤ P( {‖h_i − E[h_i]‖_2 ≥ t/2} ∪ {‖h_j − E[h_j]‖_2 ≥ t/2} )
                         ≤ P(‖h_i − E[h_i]‖_2 ≥ t/2) + P(‖h_j − E[h_j]‖_2 ≥ t/2)
                         ≤ 4 · l · exp( − (d+1)t^2 / (8ρ^2(W)B^2 l) ).

Remark 1. Theorem 8 can be naturally extended to include non-linear activation functions. Specifically, many commonly used activation functions σ(·) (e.g., ReLU and sigmoid) are 1-Lipschitz continuous [71], i.e., |σ(x) − σ(y)| ≤ |x − y|. Hence, we have P(‖σ(h_i) − σ(h_j)‖_2 ≥ t) ≤ P(‖h_i − h_j‖_2 ≥ t).

Theorem 6 demonstrates two key ideas. First, in expectation, all nodes with the same label have the same embedding (Eq. (4.5)). Second, the distance between the output embedding of a node and its expectation is small with high probability. Building upon Theorem 6, Theorem 8 further demonstrates that two nodes with the same label are thus close to each other with high probability, anchored by their shared expected value. With Remark 1, we further have that the post-activation outputs (e.g., the final embeddings from a complete GCN layer with transformation, aggregation, and nonlinearity) are likewise bounded under common, suitable choices. Together, these results show that, under the given assumptions, the GCN model maps nodes with the same label close to each other in the embedding space, facilitating the SSNC task discussed earlier. Based on these understandings, we make the following key (informal) observations.

Observation 1 (GCN under Homophily). In homophilous graphs, the neighborhood distribution of nodes with the same label (w.l.o.g., c) can be approximately regarded as a highly skewed discrete distribution D_c, with most of the mass concentrated on category c. Thus, different labels clearly have distinct distributions. Hence, the GCN model typically excels at SSNC on such graphs.

Observation 2 (GCN under Heterophily). In heterophilous graphs, if the neighborhood distribution of nodes with the same label (w.l.o.g., c) is (approximately) sampled from a fixed distribution D_c, and different labels have distinguishable distributions, then the GCN model can excel at the SSNC task.

Note that the second observation follows from considering a "worst-case" scenario in which D_c = D_{c′} for two distinct classes c, c′, implying that they would frequently be misclassified. Notably, our findings illustrate that disruptions of certain conditions inhibit GCN performance on heterophilous graphs, but heterophily is neither a sufficient nor a necessary condition for poor GCN performance. Section 4.3.2 presents experiments bolstering this argument. We note that the "idealized" conditions for GCN SSNC performance discussed here offer a more general perspective than those in prior literature; [102] only discusses perfectly homophilous (h = 1) graphs as ideal for SSNC, and several works [27, 92] aim to learn graph structure over independent samples for downstream graph learning tasks using the same perspective. Our findings show that, under suitable conditions, heterophilous graphs can also be ideal in terms of effortless node classification. We note that although our derivations are for GCN, a similar line of analysis can be used for more general message-passing neural networks.
We note that although our derivations are for GCN, a similar line of analysis can be used for more general 46 message-passing neural networks. Algorithm 2: Heterophilous Edge Addition |C|−1 |C|−1 Input: G = {V, E}, 𝐾, {D𝑐 } 𝑐=0 and {V𝑐 } 𝑐=0 Output: G 0 = {V, E 0 } 1 Initialize G 0 = {V, E}, 𝑘 = 1 ; 2 while 1 ≤ 𝑘 ≤ 𝐾 do 3 Sample node 𝑖 ∼ Uniform (V); 4 Obtain the label, 𝑦𝑖 of node 𝑖; 5 Sample a label 𝑐 ∼ D𝑦𝑖 ; 6 Sample node 𝑗 ∼ Uniform (V𝑐 ); 7 Update edge set E 0 = E 0 ∪ {(𝑖, 𝑗)}; 8 𝑘 ← 𝑘 + 1; 9 return G 0 = {V, E 0 } 4.3.2 How does classification performance change over the homophily-heterophily spec- trum? We next evaluate how downstream performance changes as we make a homophilous graph more and more heterophilous. Prima facie wisdom from prior literature would indicate that performance degrades markedly as heterophily increases, when other factors (e.g. labels, features) are kept fixed. We next conduct experiments to substantiate our claims in Observations 1 and 2. Specifically, we consider two settings: (1) evaluating classification performance in heterophilous graphs under optimistic assumptions, where different labels have distinct distributions, and (2) evaluating classification performance under more pessimistic cases, where different labels’ distributions are muddled. 47 0.9 0.8 0.8 Test Accuracy(%) Test Accuracy(%) 0.7 0.7 0.6 0.6 =1 =1 =0.8 =0.8 =0.6 0.5 =0.6 0.5 =0.4 =0.4 = 0.2 0.4 = 0.2 =0 =0 0.4 0.3 0.84 0.79 0.74 0.69 0.65 0.58 0.53 0.48 0.44 0.38 0.34 0.3 0.81 0.74 0.68 0.62 0.58 0.51 0.45 0.41 0.37 0.32 0.27 0.24 Homophily Ratio Homophily Ratio (a) Synthetic graphs generated from Cora (b) Synthetic graphs generated from Citeseer Figure 4.3: SSNC accuracy of GCN on synthetic graphs with various homophily ratios, generated by adding heterophilous edges according to pre-defined target distributions on Cora (a) and Citeseer (b): performance first decreases, then increases, forming a 𝑉-shape (see the black lines). 4.3.2.1 Targeted Heterophilous Edge Addition We start with common, real-world benchmark graphs, and modify their topology by adding synthetic, cross-label edges that connect nodes with different labels. Note that we only add cross-label edges to ensure that as more edges are added, the graphs become more and more heterophilous. Following our discussion in Observation 2, we try to construct synthetic graphs that have similar neighborhood distributions for nodes with the same label. Specifically, given a real-world graph G, we first define a discrete neighborhood target distribution D𝑐 for each label 𝑐 ∈ C, as per Assumption 5 in Section 4.3.1. We then follow these target distributions to add cross-label edges. The process of generating new graphs by adding edges to G is shown in Algorithm 2. As shown in Algorithm 2, we add a total 𝐾 edges to the given graph G: to add each edge, we first uniformly sample a node 𝑖 from V with label 𝑦𝑖 , then we sample a label 𝑐 from C according to D𝑦𝑖 , and finally, we uniformly sample a node 𝑗 from V𝑐 and add the edge (𝑖, 𝑗) to the graph. In the limit (as 𝐾 → ∞), the neighborhood distributions in the generated graph converge to D𝑦𝑖 regardless of initial topology, and homophily decreases (ℎ → 0). We generate synthetic graphs based on several real-world graphs. We present the results based on Cora and Citeseer [74]. The results for other datasets can be found in Appendix A.1. Both Cora and Citeseer exhibit strong homophily. The Cora graph consists of 2,708 nodes and 5,278 edges and it has 7 labels. 
Both Cora and Citeseer exhibit strong homophily. The Cora graph consists of 2,708 nodes and 5,278 edges and has 7 labels. The Citeseer graph has 6 labels and consists of 3,327 nodes and 4,676 edges. For both datasets, we fix D_c for all labels. Although many suitable choices of D_c could be specified in line with Observation 2, we fix one set for illustration and brevity. For Cora, we specify D_0 as Categorical([0, 1/2, 0, 0, 0, 0, 1/2]), in which the (c+1)-th element corresponds to the probability mass for sampling a node with label c as a neighbor; all D_c are distinct. For both datasets, we vary K over 11 values and thus generate 11 graphs. Notably, as K increases, the homophily ratio h decreases. More detailed information about {D_c, c ∈ C} and K for both datasets is included in Appendix A.1.

Figures 4.3(a) and (b) show the SSNC results (accuracy) on graphs generated from Cora and Citeseer, respectively. The black line in both figures shows results for the presented setting (we introduce γ in the following subsection). Without loss of generality, we use Cora (Figure 4.3(a)) to discuss our findings, since the observations are similar across these datasets. Each point on the black line in Figure 4.3(a) represents the performance of the GCN model on a certain generated graph, and the corresponding value on the x-axis denotes the homophily ratio of this graph. The point with homophily ratio h = 0.85 denotes the original Cora graph, i.e., K = 0. We observe that as K increases, h decreases, and while the classification performance first decreases, it eventually begins to increase, showing a V-shape pattern. For instance, when h = 0.3 (a rather heterophilous graph), the GCN model achieves an impressive 86% accuracy, even higher than that achieved on the original Cora graph. We note that performance continues to increase as K increases further to the right (we censor due to space limitations). This clearly demonstrates that the GCN model can work well on heterophilous graphs under certain conditions. Intuitively, the V-shape arises due to a "phase transition", where the initial topology is overridden by edges added according to the associated target neighbor distributions D_c. In the original graph, the homophily ratio is quite high (h = 0.85), and classification behavior is akin to that discussed in Observation 1, where nodes with the same label have quite similar neighborhood patterns. As we add edges to the graph, the originally evident neighborhood patterns which benefited classification performance are perturbed by the added edges and gradually become less informative, which leads to the performance decrease in the decreasing segment of the V-shape in Figure 4.3(a). Then, as we keep adding more edges, the neighborhood patterns gradually approach D_c for all c, corresponding to the increasing segment of the V-shape.
As 𝐾 → ∞, neighborhood distributions will converge to the target D𝑐, and classification accuracy will reach 100% (see Appendix A.1).

(a) 𝛾 = 0, Acc: 86% (b) 𝛾 = 0.4, Acc: 66% (c) 𝛾 = 0.8, Acc: 44% (d) 𝛾 = 1, Acc: 41% [heatmaps of pairwise neighborhood similarity over the 7 Cora classes]
Figure 4.4: Cross-class neighborhood similarity on synthetic graphs generated from Cora; all graphs have ℎ = 0.3, but with varying neighborhood distributions as per the noise parameter 𝛾: as 𝛾 increases, the intra-class similarity and the inter-class similarity become closer to each other, indicating that the distributions for different classes become more indistinguishable.

4.3.2.2 Introducing noise to neighborhood distributions

In Section 4.3.2.1, we showed that the GCN model can achieve reasonable performance on heterophilous graphs constructed following distinct, pre-defined neighborhood patterns. As per Observation 2, our theoretical understanding suggests that performance should degrade under heterophily if the distributions of different labels get more and more indistinguishable. Hence, we next demonstrate this empirically by introducing controllable noise levels into our edge addition strategy. We adopt a strategy similar to that described in Algorithm 2, but with the key difference that we introduce an additional parameter 𝛾, which controls the probability that we add cross-label edges randomly rather than following the pre-defined distributions {D𝑐, 𝑐 ∈ C}. The pseudo code describing the process to generate graphs with a noise level 𝛾 is shown in Algorithm 3. The only difference from Algorithm 2 is in Lines 6-7: we randomly add edges if the generated random number 𝑟 is smaller than the pre-defined 𝛾 (i.e., with probability 𝛾). For nodes of a given class 𝑐 (w.l.o.g.), compared to the edges added according to D𝑐, the randomly added edges can be regarded as noise. Specifically, by increasing the noise parameter 𝛾, we increase the similarity between D𝑐 and D𝑐′ for any pair of labels 𝑐, 𝑐′. If 𝛾 = 1, then all neighborhood distributions will be indistinguishable (they will all be approximately uniform over C). By fixing 𝐾 and varying 𝛾, we can generate graph variants with exactly the same homophily ratio but with differing similarities between D𝑐 and D𝑐′. As in Section 4.3.2.1, we create graphs by adding edges at various 𝐾, but also vary 𝛾 ∈ [0, 1] in increments of 0.2 on both Cora and Citeseer.
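To make the generation procedure concrete, the following is a minimal Python sketch of the edge-addition process: setting gamma = 0 recovers Algorithm 2, while gamma > 0 corresponds to the noisy variant of Algorithm 3. The target distributions are passed as one categorical probability vector per label; variable names are illustrative, not part of the original text.

import numpy as np

def add_heterophilous_edges(edges, labels, K, target_dists, gamma=0.0, seed=0):
    """Add K cross-label edges following per-label target distributions.

    edges: list of (i, j) tuples (the original edge set E)
    labels: 1-D array, labels[i] is the class of node i (classes 0..|C|-1)
    target_dists: array of shape (|C|, |C|); row y is D_y, a categorical
                  distribution over candidate neighbor labels
    gamma: probability of adding a random cross-label edge instead of
           sampling the label from D_y (noise level)
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    nodes_by_class = {c: np.flatnonzero(labels == c) for c in classes}
    new_edges = list(edges)
    for _ in range(K):
        i = rng.integers(len(labels))            # sample node i uniformly from V
        y_i = labels[i]
        if rng.random() <= gamma:                # noisy branch (Algorithm 3, Lines 6-7)
            c = rng.choice(classes[classes != y_i])
        else:                                    # follow the target distribution D_{y_i}
            c = rng.choice(classes, p=target_dists[y_i])
        j = rng.choice(nodes_by_class[c])        # sample node j uniformly from V_c
        new_edges.append((int(i), int(j)))
    return new_edges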
We report the SSNC performance on these graphs in Figure 4.3.

Algorithm 3: Heterophilous Edge Addition with Noise
Input: G = {V, E}, 𝐾, 𝛾, {D𝑐}_{𝑐=0}^{|C|−1} and {V𝑐}_{𝑐=0}^{|C|−1}
Output: G′ = {V, E′}
1 Initialize G′ = {V, E}, 𝑘 = 1;
2 while 1 ≤ 𝑘 ≤ 𝐾 do
3   Sample node 𝑖 ∼ Uniform(V);
4   Obtain the label 𝑦𝑖 of node 𝑖;
5   Sample a number 𝑟 ∼ Uniform(0, 1); // Uniform(0, 1) denotes the continuous standard uniform distribution
6   if 𝑟 ≤ 𝛾 then
7     Sample a label 𝑐 ∼ Uniform(C \ {𝑦𝑖});
8   else
9     Sample a label 𝑐 ∼ D_{𝑦𝑖};
10  Sample node 𝑗 ∼ Uniform(V𝑐);
11  Update edge set E′ = E′ ∪ {(𝑖, 𝑗)};
12  𝑘 ← 𝑘 + 1;
13 return G′ = {V, E′}

We make several observations from the results. The noise level affects the performance significantly when the homophily ratio is low. For example, observing Figure 4.3(a) vertically at homophily ratio ℎ = 0.3, higher 𝛾 clearly results in worse performance. This indicates that not only the fixed-ness of the neighborhood distributions, but also their similarities, are important for the SSNC task. It also indicates that there are "good" and "bad" kinds of heterophily, adding yet another supporting claim that heterophily in and of itself is not a problem for GCNs. On the other hand, high 𝛾 does not have too negative an impact when 𝐾 is small, since noise is minimal and the original graph topology is still largely homophilous (when ℎ is large, towards the left of the plot). At this stage, both "good" (fixed and disparate patterns) and "bad" (randomly added edges) heterophilous edges introduce noise to the dominant homophilous patterns which allow GCNs to separate classes well. When the noise level 𝛾 is not too large, we can still observe the 𝑉-shape, e.g., 𝛾 = 0.4 in Figure 4.3(a) and 𝛾 = 0.2 in Figure 4.3(b); this is because the designed pattern is not totally dominated by the noise. However, when 𝛾 is too high, adding edges constantly decreases the performance, as nodes of different classes have indistinguishably similar neighborhoods.

To further demonstrate how 𝛾 affects the neighborhood distributions in the generated graph, we examine the cross-class neighborhood similarity, which we define as follows:

Definition 2 (Cross-Class Neighborhood Similarity). Given a graph G = {V, E} and node labels y for all nodes, the cross-class neighborhood similarity between classes 𝑐, 𝑐′ ∈ C is given by

𝑠(𝑐, 𝑐′) = (1 / (|V𝑐| |V𝑐′|)) Σ_{𝑖 ∈ V𝑐, 𝑗 ∈ V𝑐′} cos(𝑑(𝑖), 𝑑(𝑗)),    (4.8)

where V𝑐 (w.l.o.g.) indicates the set of nodes in class 𝑐, 𝑑(𝑖) denotes the empirical histogram (over |C| classes) of node 𝑖's neighbors' labels, and the function cos(·, ·) measures the cosine similarity.

When 𝑐 = 𝑐′, 𝑠(𝑐, 𝑐′) calculates the intra-class similarity; otherwise, it calculates the inter-class similarity from a neighborhood label distribution perspective. Intuitively, if nodes with the same label share the same neighborhood distributions, the intra-class similarity should be high. Likewise, to ensure the neighborhood patterns for nodes with different labels are distinguishable, the inter-class similarity should be low. To illustrate how various 𝛾 values affect the neighborhood patterns, we show the intra-class and inter-class similarities in Figure 4.4 for 𝛾 = 0, 0.4, 0.8, 1 on graphs generated from Cora with homophily ratio ℎ = 0.3. The diagonal cells in each heatmap indicate the intra-class similarity, while off-diagonal cells indicate inter-class similarity.
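A minimal sketch of how the cross-class neighborhood similarity of Eq. (4.8) (and hence heatmaps like Figure 4.4) can be computed is given below; it assumes dense numpy inputs, and the function names are illustrative.

import numpy as np

def neighbor_label_histograms(adj, labels, num_classes):
    """d(i): empirical histogram (over classes) of node i's neighbors' labels."""
    one_hot = np.eye(num_classes)[labels]          # (N, |C|)
    return adj @ one_hot                           # row i counts the neighbor labels of node i

def cross_class_similarity(adj, labels, num_classes):
    """s(c, c') from Eq. (4.8): mean pairwise cosine similarity of d(i), d(j)."""
    d = neighbor_label_histograms(adj, labels, num_classes)
    norms = np.linalg.norm(d, axis=1, keepdims=True)
    d_normed = d / np.clip(norms, 1e-12, None)     # unit-length rows (isolated nodes stay ~0)
    s = np.zeros((num_classes, num_classes))
    for c in range(num_classes):
        for c2 in range(num_classes):
            members_c = d_normed[labels == c]       # nodes in V_c
            members_c2 = d_normed[labels == c2]     # nodes in V_{c'}
            # average cos(d(i), d(j)) over all pairs (i, j) in V_c x V_{c'}
            s[c, c2] = (members_c @ members_c2.T).mean()
    return s

# Toy usage on a 4-cycle with two classes.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])
print(cross_class_similarity(adj, labels, num_classes=2))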
Clearly, when 𝛾 is small, the intra-class similarity is high while the inter-class similarity is low, which demonstrates the existence of strongly discriminative neighborhood patterns in the graph. As 𝛾 increases, the 52 intra-class and inter-class similarity get closer, becoming more and more indistinguishable, leading to bad performance due to indistinguishable distributions as referenced in Observation 2. 4.4 Revisiting GCN’s Performance on Real-world Graphs How can we reconcile our findings with existing literature on GCN’s performance under heterophily? As noted in Section 4.3, we observed that on some publicly available heterophilous graph benchmark datasets, the GCN model surprisingly outperforms methods specifically designed to deal with heterophilous graphs. In this section, we first give more details on these experiments. We next investigate why the GCN model does or does not work well on certain datasets utilizing the understanding developed in earlier sections. 4.4.1 Evaluating Node Classification on Homophilous and Heterophilous Benchmarks Following previous work [65, 106], we evaluate the performance of the GCN model on several real- world graphs with different levels of homophily. We include the citation networks Cora, Citeseer and Pubmed [40], which are highly homophilous. We also adopt several heterophilous benchmark datasets including Chameleon, Squirrel, Actor, Cornell, Wisconsin and Texas [65]. Ap- pendix A.2.1 gives descriptions and summary statistics of these datasets. For all datasets, we follow the experimental setting provided in [65], which consists of 10 random splits with proportions 48/32/20% corresponding to training/validation/test for each graph. For each split, we use 10 random seeds, and report the average performance and standard deviation across 100 runs. We compare the standard GCN model [40] with several recently proposed methods specifically designed for heterophilous graphs including H2GCN [106], GPR-GNN [10], and CPGNN [105]. A brief introduction of these methods can be found in Appendix A.2.2. We carefully tuned the parameters for all models. The detailed parameter tuning settings can be found in Appendix A.2.4. The experimental setting in H2GCN [106] is the same as [65], hence we include the results for their method reported in the original paper. The node classification performance (accuracy) of these models is reported in Table 4.2. Notably, 53 all models achieve comparable performance on graphs with high homophily (Cora, Citeseer, and Pubmed), as expected. For the heterophilous graphs, results are comparatively mixed. The GCN model outperforms all other methods on Squirrel and Chameleon, while underperforming on the other datasets (Actor, Cornell, Wisconsin, and Texas). However, on these datasets, a two-layer feedforward model (denoted as MLP) always achieves comparable or even better performance than even the methods specifically designed for these graphs. Hence, we suspect that in these datasets, the graph offers un-useful (and sometimes actively damaging) information for node classification. Note that, both H2GCN and GPRGNN have skip connection (or similar architecture) to directly connect the input features to the output. Their good performance on Actor, Cornell, Wisconsin, and Texas is likely attributed to this residual design. 
To verify this idea, we implement a simple method, which linearly combines the features output from GCN model and MLP model (denoted as MLP+GCN in Table 4.2; further details in Appendix A.2.3) and its performance is comparable or even better than heterophily-specific methods on all heterophilous graphs. 4.4.2 Investigating GCN performance on Homophilous and Heterophilous Benchmarks Our work so far illustrates that the popular notion of GCNs not being suitable for heterophily, or heterophily being a mandate for good GCN performance are at worst myths, and at best misleading. We believe this is due to prior evaluations having been conducted on a limited number of publicly available heterophilous graph benchmarks with relatively poor performance, which led to a (prima facie) conclusion that heterophily is the responsible for these issues rather than any of the other intrinsic properties of these datasets. In this subsection, we aim to use the understanding we developed in Section 4.3 to explain why GCN does (not) work well on real-world graphs. As in Section 4.3.2.2, we inspect cross-class neighborhood similarity (Eq.(4.8)) for each dataset; due to the space limit, we only include representative ones here (Cora, Chameleon, Actor and Cornell); see Figure 4.5). Heatmaps for the other datasets can be found in Appendix A.3. From Figure 4.5(a), it is clear that the intra-class similarity is much higher than the inter-similarity ones, hence Cora contains distinct neighborhood patterns, consistent with Observation 1. In Figure 4.5(b), we can 54 Table 4.2: GNN node classification performance (accuracy) on homophilous and heterophilous graphs. Standard GCN or MLP+GCN methods (top) outperform or achieve comparable performance to heterophily-specific methods (bottom). 
              Cora          Citeseer      Pubmed        Chameleon     Squirrel      Actor         Cornell       Wisconsin     Texas
GCN           87.12 ± 1.38  76.50 ± 1.61  88.52 ± 0.41  67.96 ± 1.82  54.47 ± 1.17  30.31 ± 0.98  59.35 ± 4.19  61.76 ± 6.15  63.81 ± 5.27
MLP           75.04 ± 1.97  72.40 ± 1.97  87.84 ± 0.30  48.11 ± 2.23  31.68 ± 1.90  36.17 ± 1.09  84.86 ± 6.04  86.29 ± 4.50  83.30 ± 4.54
MLP + GCN     87.01 ± 1.35  76.35 ± 1.85  89.77 ± 0.39  68.04 ± 1.86  54.48 ± 1.11  36.24 ± 1.09  84.82 ± 4.87  86.43 ± 4.00  83.60 ± 6.04
H2GCN-1       86.92 ± 1.37  77.07 ± 1.64  89.40 ± 0.34  57.11 ± 1.58  36.42 ± 1.89  35.86 ± 1.03  82.16 ± 6.00  86.67 ± 4.69  84.86 ± 6.77
H2GCN-2       87.81 ± 1.35  76.88 ± 1.77  89.59 ± 0.33  59.39 ± 1.98  37.90 ± 2.02  35.62 ± 1.30  82.16 ± 6.00  85.88 ± 4.22  82.16 ± 5.28
CPGNN-MLP     85.84 ± 1.20  74.80 ± 0.92  86.58 ± 0.37  54.53 ± 2.37  29.13 ± 1.57  35.76 ± 0.92  79.93 ± 6.12  84.58 ± 2.72  82.62 ± 6.88
CPGNN-Cheby   87.23 ± 1.31  76.64 ± 1.43  88.41 ± 0.33  65.17 ± 3.17  29.25 ± 4.17  34.28 ± 0.77  75.08 ± 7.51  79.19 ± 2.80  75.96 ± 5.66
GPR-GNN       86.79 ± 1.27  75.55 ± 1.56  86.79 ± 0.55  66.31 ± 2.05  50.56 ± 1.51  33.94 ± 0.95  79.27 ± 6.03  83.73 ± 4.02  84.43 ± 4.10

(a) Cora (b) Chameleon (c) Actor (d) Cornell [heatmaps of cross-class neighborhood similarity for each dataset]
Figure 4.5: Cross-class neighborhood similarity on homophilous graphs (a) and heterophilous graphs (b-d). The deviations between inter- and intra-class similarities substantiate GCN's performance on these datasets.

observe that in Chameleon, intra-class similarity is generally higher than inter-class similarity, though not as strong as in Figure 4.5(a). Additionally, there is an apparent gap between labels 0, 1 and 2, 3, 4, which helps separate nodes of the former two classes from the latter three, but potentially increases misclassification within each of the two groupings. These observations also help substantiate why GCN can achieve reasonable performance (much higher than MLP) on Chameleon. The GCN model underperforms MLP on Actor, and we suspect that the graph does not provide useful information. The heatmap for Actor in Figure 4.5 shows that the intra-class and inter-class similarities are almost equivalent, making the neighborhood distributions for different classes hard to distinguish and leading to bad GCN performance. Similar observations can be made for Cornell. Note that Cornell only consists of 183 nodes and 280 edges; hence, the similarities shown in Figure 4.5 are impacted significantly by few samples (e.g., there is a single node with label 1, leading to perfect intra-class similarity for label 1).

4.5 Related Work

Graph neural networks (GNNs) are powerful models for graph representation learning. They have been widely adopted to tackle numerous applications from various domains [40, 20, 2, 79].
[72] proposed the first GNN model, to tackle both node and graph level tasks. Subsequently, [6] and [15] generalized convolutional neural networks to graphs from the graph spectral perspective. [40] simplified the spectral GNN model and proposed graph convolutional networks (GCNs). Since then, numerous GNN variants, which follow specific forms of feature transformation (linear layers) and aggregation have been proposed [84, 28, 26, 43]. The aggregation process can be usually understood as feature smoothing [47, 52, 34, 107]. Hence, several recent works claim [106, 105, 10], assume [27, 92, 102] or remark upon [1, 56, 31] GNN models homophily-reliance or unsuitability in capturing heterophily. Several recent works specifically develop GNN models choices to tackle heterophilous graphs by carefully designing or modifying model architectures such as Geom-GCN [65], H2GCN [106], GPR-GNN [10], and CPGNN [105]. 56 CHAPTER 5 EIGENPOOLING: A GRAPH POOLING OPERATION FROM THE SPECTRAL GRAPH THEORY PERSPECTIVE 5.1 Introduction In this chapter, we work on graph level representation learning with a focus on the task of graph classification. The task of graph classification is to predict the label of a given graph utilizing its associated features and graph structure. GNNs1 can extract graph representation while using all associated information. Majority of existing graph neural networks [14, 18, 26, 49] have been designed to generate good node representations, and then globally summarize the node representations as the graph representation. These methods are inherently “flat” since they treat all the nodes equivalently when generating graph representation using the node representations. In other words, the entire graph structure information is totally neglected during this process. However, nodes are naturally of different statuses and roles in a graph, and they should contribute differently to the graph level representation. Furthermore, graphs often have different local structures (or subgraphs), which contain vital graph characteristics. For instance, in a graph of a protein, atoms (nodes) are connected via bonds (edges); some local structures, which consist of groups of atoms and their direct bonds, can represent some specific functional units, which, in turn, are important to tell the functionality of the entire protein [77, 18, 5]. These local structures are also not captured during the global summarizing process. To generate the graph representation which preserves the local and global graph structures, a hierarchical pooling process, analogous to the pooling process in conventional convolutional neural (CNN) networks [44], is needed. There are very recent works investigating the pooling procedure for graph neural networks [97, 15, 21, 81]. These methods group nodes into subgraphs (supernodes), coarsen the graph based on these subgraphs and then the entire graph information is reduced to the coarsened graph by 1 In this chapter, we are dealing with a graph-focus tasks. Hence, when we mention GNNs, we refer to the GNN models that follow the general framework for graph-focus task described in Section 2.3.3.2 57 generating features of supernodes from their corresponding nodes in subgraphs. However, when pooling the features for supernodes, average pooling or max pooling have been usually adopted where the structures of these group nodes (the local structures) are still neglected. 
With the local structures, the nodes in the subgraphs are of different statuses and roles when they contribute to the supernode representations. It is challenging to design a general graph pooling operation while incorporating the local structure information as 1) the subgraphs may contain different numbers of nodes, thus a fixed size graph pooling operation cannot work for all subgraphs; and 2) the subgraphs could have very different structures, which may require different approaches to summarize the information for the supernode representation. To address the aforementioned challenges, we design a novel graph pooling operation EigenPooling based on the eigenvectors of the subgraphs, which naturally have the same size of each subgraph and can effectively capture the local structures when summarizing node features for supernodes. EigenPooling can be used as pooling layers to stack with any graph neural network layers to form a novel framework EigenGCN for graph classification. 5.2 The Proposed Framework – EigenGCN In this chapter, we aim to develop a Graph Neural Networks (GNN) model, which consists of convolutional layers and pooling layers, to learn graph representations such that graph level classification can be applied. Before going to the details, we first introduced some notations and the problem setting. Problem Setting: A graph can be represented as G = {E, V}, where V = {𝑣 1 , . . . , 𝑣 𝑁 } is the set of 𝑁 nodes and E is the set of edges. The graph structure information can also be represented by an adjacency matrix A ∈ R𝑁×𝑁 . Furthermore, each node in the graph is associated with node features and we use X ∈ R𝑁×𝑑 to denote the node feature matrix, where 𝑑 is the dimension of features. Note that this node feature matrix can also be viewed as a 𝑑-dimensional graph signal [80] defined on the graph G. In the graph classification setting, we have a set of graphs {G𝑖 }𝑖 , each graph G𝑖 is associated with a label 𝑦𝑖 . The task of the graph classification is to take the graph (structure information and node features) as input and predict its corresponding label. To make the prediction, 58 it is important to extract useful information from both graph structure and node features. We aim to design graph convolution layers and EigenPooling to hierarchically extract graph features, which finally learns a vector representation of the input graph for graph classification. 5.2.1 An Overview of EigenGCN In this work, we build our model based on Graph Convolutional Networks (GCN) layers [40], which has been demonstrated to be effective in node-level representation learning. While the GCN model is originally designed for semi-supervised node classification, we only discuss the part for node representation learning but ignoring the classification part. The GCN model is stacked by several GCN graph filtering layers The GCN model has been designed for learning node representations. In the end, the output of the GCN model is a matrix instead of a vector. The procedure of the GCN model is rather “flat”, as it can only “pass message” between nodes through edges but cannot summarize the node information into the higher level graph representation. A simple way to summarize the node information to generate graph level representation is global pooling. For example, we could use the average of the node representations as the graph representation. However, in this way, a lot of key information is ignored and the graph structure is also totally overlooked during the pooling process. 
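As a point of reference for the "flat" global pooling just described, the following small sketch (numpy only, illustrative names, no training) stacks a single GCN-style graph filtering layer and then simply averages the node representations to obtain one graph-level vector; the hierarchical pooling developed in this chapter replaces exactly this averaging step.

import numpy as np

def gcn_layer(adj, X, W):
    """One GCN-style filtering layer: ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_tilde = adj + np.eye(adj.shape[0])                 # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    return np.maximum(A_hat @ X @ W, 0.0)                # ReLU

def average_readout(H):
    """'Flat' global pooling: mean of node representations as the graph representation."""
    return H.mean(axis=0)

# Toy usage: a 4-node path graph with 3-dimensional features and random weights.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 8))
graph_repr = average_readout(gcn_layer(adj, X, W))       # a single 8-dimensional vector
print(graph_repr.shape)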
To address this challenge, we propose eigenvector based pooling layers EigenPooling to hierar- chically summarize node information and generate graph representation. An illustrative example is demonstrated in Figure 5.1. In particular, several graph pooling layers are added between GCN layers. Each of the pooling layers pools the graph signal defined on a graph into a graph signal defined on a coarsened version of the input graph, which consists of fewer nodes. Thus, the design of the graph pooling layers consists of two components: 1) graph coarsening, which divides the graph into a set of subgraphs and form a coarsened graph by treating subgraphs as supernodes; and 2) transform the original graph signal information into the graph signal defined on the coarsened graph with EigenPooling. We coarsen the graph based on a subgraph partition. Given a subgraph partition with no overlaps between subgraphs, we treat each of the subgraphs as a supernode. To form a 59 𝑐𝑜𝑛𝑣% 𝑐𝑜𝑛𝑣& 𝑐𝑜𝑛𝑣' 𝑐𝑜𝑛𝑣( 𝑝𝑜𝑜𝑙% 𝑐𝑜𝑛𝑣) 𝑐𝑜𝑛𝑣* 𝑝𝑜𝑜𝑙& 𝑝𝑜𝑜𝑙' 𝑙𝑎𝑏𝑒𝑙 Figure 5.1: An illustrative example of the general framework coarsened graph of the supernodes, we determine the connectivity between the supernodes by the edges across the subgraphs. During the pooling process, for each of the subgraphs, we summarize the information of the graph signal on the subgraph to the supernode. With graph coarsening, we utilize the graph structure information to form coarsened graphs, which makes it possible to learn representations level by level in a hierarchical way. With EigenPooling, we can learn node features of the coarsened graph that exploits the subgraph structure as well as the node features of the input graph. Figure 5.1 shows an illustrative example, where a binary graph classification is performed. In this illustrative example, the graph is coarsened three times and finally becomes a single supernode. The input is a graph signal (the node features), which can be multi-dimensional. For the ease of illustration, we do not show the node features on the graph. Two convolutional layers are applied to the graph signal. Then, the graph signal is pooled to a signal defined on the coarsened graph. This procedure (two convolution layers and one pooling layer) is repeated two more times and the graph signal is finally pooled to a signal on a single node. This pooled signal on the single node, which is a vector, can be viewed as the graph representation. The graph representation then goes through several fully connected layers and the prediction is made upon the output of the last layer. Next, we introduce details of graph coarsening and EigenPooling of EigenGCN. 60 5.2.2 Graph Coarsening In this subsection, we introduce how we perform the graph coarsening. As we mentioned in the previous subsection, the coarsening process is based on subgraph partition. There are different ways to separate a given graph to a set of subgraphs with no overlapping nodes. In this paper, we adopt spectral clustering to obtain the subgraphs, so that we can control the number of the subgraphs, which, in turn, determines the pooling ratio. We leave other options as future work. Given a set of subgraphs, we treat them as supernodes and build the connections between them as similar in [83]. An example of the graph coarsening and supernodes is shown in Figure 5.1, where a subgraph and its supernodes are denoted using the same color. Next, we introduce how to mathematically describe the subgraphs, supernodes, and their relations. 
Let c be a partition of a graph G, which consists of 𝐾 connected subgraphs {G^(k)}_{k=1}^{K}. For the graph G, we have the adjacency matrix A ∈ R^{N×N} and the feature matrix X ∈ R^{N×d}. Let N_k denote the number of nodes in the subgraph G^(k), and let Γ^(k) be the list of nodes in subgraph G^(k). Note that each of the subgraphs can also be viewed as a supernode. For each subgraph G^(k), we can define a sampling operator C^(k) ∈ R^{N×N_k} as follows:

C^(k)[i, j] = 1 if and only if Γ^(k)(j) = v_i,    (5.1)

where C^(k)[i, j] denotes the element in the (i, j)-th position of C^(k) and Γ^(k)(j) is the j-th element in the node list Γ^(k). This operator provides a relation between the nodes in the subgraph G^(k) and the nodes in the original graph. Given a single-dimensional graph signal x ∈ R^{N×1} defined on the original entire graph, the induced signal that is only defined on the subgraph G^(k) can be written as

x^(k) = (C^(k))^T x.    (5.2)

On the other hand, we can also use C^(k) to up-sample a graph signal x^(k) defined only on the subgraph G^(k) to the entire graph G by

x̄ = C^(k) x^(k).    (5.3)

It keeps the values of the nodes in the subgraph untouched while setting the values of all the other nodes that do not belong to the subgraph to 0. The operator can be applied to a multi-dimensional signal X ∈ R^{N×d} in a similar way. The induced adjacency matrix A^(k) ∈ R^{N_k×N_k} of the subgraph G^(k), which only describes the connections within the subgraph G^(k), can be obtained as

A^(k) = (C^(k))^T A C^(k).    (5.4)

The intra-subgraph adjacency matrix of the graph G, which only consists of the edges inside each subgraph, can be represented as

A_int = Σ_{k=1}^{K} C^(k) A^(k) (C^(k))^T.    (5.5)

Then the inter-subgraph adjacency matrix of graph G, which only consists of the edges between subgraphs, can be represented as A_ext = A − A_int. Let G_coar denote the coarsened graph, which consists of the supernodes and their connections. We define the assignment matrix S ∈ R^{N×K}, which indicates whether a node belongs to a specific subgraph, as S[i, j] = 1 if and only if v_i ∈ Γ^(j). Then, the adjacency matrix of the coarsened graph is given as

A_coar = S^T A_ext S.    (5.6)

With graph coarsening, we can obtain the connectivity of G_coar, i.e., A_coar. Obviously, A_coar encodes the network structure information of G. Next, we describe how to obtain the node features X_coar of G_coar using EigenPooling. With A_coar and X_coar, we can stack more layers of GCN-GraphCoarsening-EigenPooling to learn higher level representations of the graph for classification.

5.2.3 Eigenvector-Based Pooling – EigenPooling

In this subsection, we introduce EigenPooling, aiming to obtain X_coar that encodes the network structure information and node features of G. Globally, the pooling operation transforms a graph signal defined on a given graph into a corresponding graph signal defined on the coarsened version of this graph. It is expected that the important information of the original graph signal can be largely preserved in the transformed graph signal. Locally, for each subgraph, we summarize the features of the nodes in this subgraph into a single representation of the supernode. It is necessary to consider the structure of the subgraph when we perform this summarization, as the subgraph structure also encodes important information. However, commonly adopted pooling methods such as max pooling [97, 15] or average pooling [18] ignore the graph structure.
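To make the coarsening quantities of Section 5.2.2 concrete, the following sketch builds the sampling operators C^(k), the intra- and inter-subgraph adjacency matrices A_int and A_ext, the assignment matrix S, and the coarsened adjacency A_coar from a given node partition. The partition is assumed to be provided (e.g., by spectral clustering), and all names are illustrative, not part of the original text.

import numpy as np

def coarsen(adj, partition):
    """partition: list of K node-index lists (the Gamma^(k) of Section 5.2.2)."""
    N, K = adj.shape[0], len(partition)
    C_ops, A_subs = [], []
    S = np.zeros((N, K))                                  # assignment matrix
    A_int = np.zeros_like(adj)
    for k, nodes in enumerate(partition):
        C_k = np.zeros((N, len(nodes)))                   # sampling operator, Eq. (5.1)
        C_k[nodes, np.arange(len(nodes))] = 1.0
        A_k = C_k.T @ adj @ C_k                           # induced subgraph adjacency, Eq. (5.4)
        A_int += C_k @ A_k @ C_k.T                        # Eq. (5.5)
        S[nodes, k] = 1.0
        C_ops.append(C_k)
        A_subs.append(A_k)
    A_ext = adj - A_int                                   # inter-subgraph edges only
    A_coar = S.T @ A_ext @ S                              # coarsened adjacency, Eq. (5.6)
    return C_ops, A_subs, A_int, A_ext, S, A_coar

# Toy usage: 6 nodes, two subgraphs {0, 1, 2} and {3, 4, 5}.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
C_ops, A_subs, A_int, A_ext, S, A_coar = coarsen(adj, [[0, 1, 2], [3, 4, 5]])
print(A_coar)   # 2x2 matrix; the off-diagonal entry counts the single edge (2, 3) between subgraphs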
In some works [63], the subgraph structure is used to find a canonical ordering of the nodes, which is, however, very difficult and expensive. In this work, to use the structure of the subgraphs, we design the pooling operator based on graph spectral theory by utilizing the eigenvectors of the Laplacian matrix of the subgraph. Next, we first briefly review the graph Fourier transform and then introduce the design of EigenPooling based on it.

5.2.3.1 Graph Fourier Transform

Consider a graph G = {E, V} with A ∈ R^{N×N} being the adjacency matrix and X ∈ R^{N×d} being the node feature matrix. Without loss of generality, for the following description, we consider d = 1, i.e., x ∈ R^{N×1}, which can be viewed as a single-dimensional graph signal defined on the graph G [70]. This is the spatial view of a graph signal, which maps each node in the graph to a scalar value (or a vector if the graph signal is multi-dimensional). Analogous to classical signal processing, we can define the graph Fourier transform [80] and the spectral representation of the graph signal in the spectral domain. To define the graph signal in the spectral domain, we need the Laplacian matrix [11] L = D − A, where D is the diagonal degree matrix with D[i, i] = Σ_{j=1}^{N} A[i, j]. The Laplacian matrix L can be used to define the "smoothness" of a graph signal [80] as follows:

s(x) = x^T L x = (1/2) Σ_{i,j} A[i, j] (x[i] − x[j])^2.    (5.7)

s(x) measures the smoothness of the graph signal x. The smoothness of a graph signal depends on how dramatically the values of connected nodes can change. The smaller s(x), the smoother the signal is. For example, for a connected graph, a graph signal with the same value on all the nodes has a smoothness of 0, which means "extremely smooth" with no change. As L is a real symmetric positive semi-definite matrix, it has a complete set of orthonormal eigenvectors {u_l}_{l=1}^{N}. These eigenvectors are also known as the graph Fourier modes [80], which are associated with the ordered real non-negative eigenvalues {λ_l}_{l=1}^{N}. Given a graph signal x, the graph Fourier transform can be obtained as

x̂ = U^T x,    (5.8)

where U = [u_1, . . . , u_N] ∈ R^{N×N} is the matrix consisting of the eigenvectors of L. The vector x̂ obtained after the transform is the representation of the graph signal in the spectral domain. Correspondingly, the inverse graph Fourier transform, which transfers the spectral representation back to the spatial representation, can be denoted as

x = U x̂.    (5.9)

Note that we can also view each eigenvector u_l of the Laplacian matrix L as a graph signal, and its corresponding eigenvalue λ_l measures its "smoothness". For any eigenvector u_l, we have

s(u_l) = u_l^T L u_l = u_l^T (λ_l u_l) = λ_l.    (5.10)

The eigenvectors (or Fourier modes) are a set of base signals with different "smoothness" defined on the graph G. Thus, the graph Fourier transform of a graph signal x can also be viewed as linearly decomposing the signal x into this set of base signals, and x̂ can be viewed as the coefficients of the linear combination of the base signals that recovers the original signal x.

5.2.3.2 The Design of Pooling Operations

Since the graph Fourier transform maps a graph signal to the spectral domain while taking into consideration both the graph structure and the graph signal information, we adopt it to design pooling operators, which pool the graph signal defined on a given graph G to a signal defined on its coarsened version G_coar.
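The graph Fourier transform reviewed above can be sketched in a few lines of numpy; the snippet below computes the Laplacian, its eigenvectors (the Fourier modes), the forward and inverse transforms of Eqs. (5.8)-(5.9), and the smoothness of Eq. (5.7). Function names and the toy graph are illustrative.

import numpy as np

def graph_fourier(adj, x):
    """Return (eigenvalues, Fourier modes U, Fourier coefficients U^T x)."""
    L = np.diag(adj.sum(axis=1)) - adj          # L = D - A
    lam, U = np.linalg.eigh(L)                  # ordered non-negative eigenvalues, orthonormal U
    return lam, U, U.T @ x                      # Eq. (5.8): x_hat = U^T x

def smoothness(adj, x):
    """s(x) = x^T L x, Eq. (5.7)."""
    L = np.diag(adj.sum(axis=1)) - adj
    return float(x @ L @ x)

# Toy usage: a 4-node path graph.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 0.0, -1.0])
lam, U, x_hat = graph_fourier(adj, x)
print(np.allclose(U @ x_hat, x))                  # inverse transform (Eq. 5.9) recovers x
print(smoothness(adj, x), x_hat @ (lam * x_hat))  # both equal s(x); a constant signal gives 0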
The design of the pooling operator is based on the graph Fourier transform of the subgraphs {G^(k)}_{k=1}^{K}. Let L^(k) denote the Laplacian matrix of the subgraph G^(k). We denote the eigenvectors of the Laplacian matrix L^(k) as u_1^(k), . . . , u_{N_k}^(k). We then use the up-sampling operator C^(k) to up-sample these eigenvectors (base signals on this subgraph) to the entire graph and get the up-sampled versions as

ū_l^(k) = C^(k) u_l^(k), l = 1, . . . , N_k.    (5.11)

With the up-sampled eigenvectors, we organize them into matrices to form pooling operators. Let Θ_l ∈ R^{N×K} denote the pooling operator consisting of all the l-th eigenvectors from all the subgraphs:

Θ_l = [ū_l^(1), . . . , ū_l^(K)].    (5.12)

Note that the subgraphs are not necessarily all of the same size, which means that the number of eigenvectors can differ across subgraphs. Let N_max = max_{k=1,...,K} N_k be the largest number of nodes among all the subgraphs. Then, for a subgraph G^(k) with N_k nodes, we set u_l^(k) for N_k < l ≤ N_max to 0 ∈ R^{N_k×1}. The pooling process with the l-th pooling operator Θ_l can be described as

X_l = Θ_l^T X,    (5.13)

where X_l ∈ R^{K×d} is the pooled result using the l-th pooling operator. The k-th row of X_l contains the information pooled from the k-th subgraph, which is the representation of the k-th supernode. Following this construction, we build a set of N_max pooling operators. To combine the information pooled by different pooling operators, we can concatenate them as follows:

X_pooled = [X_1, . . . , X_{N_max}],    (5.14)

where X_pooled ∈ R^{K×d·N_max} is the final pooled result. For efficient computation, instead of using the results pooled by all the pooling operators, we can choose to use only the first H of them:

X_coar = [X_1, . . . , X_H].    (5.15)

In Section 5.3.1 and Section 5.3.2, we will show that with H ≪ N_max, we can still preserve most of the information. We will further empirically investigate the effect of the choice of H in Section 5.4.

5.3 Theoretical Analysis of EigenPooling

In this section, we provide a theoretical analysis of EigenPooling by understanding it from local and global perspectives. We prove that the pooling operation can preserve useful information to be processed by the following GCN layers. We also verify that EigenGCN is permutation invariant, which lays the foundation for graph classification with EigenGCN.

5.3.1 A Local View of EigenPooling

In this subsection, we analyze the pooling operator from a local perspective, focusing on a specific subgraph G^(k). For the subgraph G^(k), the pooling operator tries to summarize the nodes' features and form a representation for the corresponding supernode of the subgraph. For a pooling operator Θ_l, the part that is effective on the subgraph G^(k) is only the up-sampled eigenvector ū_l^(k), as the other eigenvectors have 0 values on the subgraph G^(k). Without loss of generality, consider a single-dimensional graph signal x ∈ R^{N×1} defined on G; the pooling operation on G^(k) can be represented as

(ū_l^(k))^T x = (u_l^(k))^T x^(k),    (5.16)

which is the Fourier coefficient of the Fourier mode u_l^(k) of the subgraph G^(k). Thus, from a local perspective, the pooling process is a graph Fourier transform of the graph signal defined on the subgraph. As introduced in Section 5.2.3.1, each of the Fourier modes (eigenvectors) is associated with an eigenvalue, which measures its smoothness.
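Putting the pieces together, a compact sketch of the EigenPooling operation of Eqs. (5.11)-(5.15) is given below: for each subgraph it computes the Laplacian eigenvectors, up-samples them to the full graph, stacks the l-th eigenvectors of all subgraphs into Θ_l, and concatenates the first H pooled results. Names are illustrative, and the partition is assumed to be given.

import numpy as np

def eigen_pooling(adj, X, partition, H):
    """Pool node features X (N x d) onto K supernodes using the first H operators."""
    N, d = X.shape
    K = len(partition)
    N_max = max(len(nodes) for nodes in partition)
    H = min(H, N_max)
    # Up-sampled Laplacian eigenvectors of each subgraph (zero-padded beyond N_k), Eq. (5.11).
    u_bar = np.zeros((K, N_max, N))                       # u_bar[k, l] is bar{u}_{l+1}^{(k)}
    for k, nodes in enumerate(partition):
        nodes = np.asarray(nodes)
        A_k = adj[np.ix_(nodes, nodes)]                   # induced subgraph adjacency
        L_k = np.diag(A_k.sum(axis=1)) - A_k
        _, U_k = np.linalg.eigh(L_k)                      # eigenvectors sorted by eigenvalue
        u_bar[k, :len(nodes), nodes] = U_k.T              # up-sample into the full graph
    pooled = []
    for l in range(H):                                    # Theta_l = [u_bar_l^(1), ..., u_bar_l^(K)]
        Theta_l = u_bar[:, l, :].T                        # shape (N, K), Eq. (5.12)
        pooled.append(Theta_l.T @ X)                      # X_l = Theta_l^T X, Eq. (5.13)
    return np.concatenate(pooled, axis=1)                 # X_coar = [X_1, ..., X_H], Eq. (5.15)

# Toy usage: 6 nodes, 3-dimensional features, two subgraphs, keep H = 2 operators.
rng = np.random.default_rng(0)
adj = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
X = rng.normal(size=(6, 3))
print(eigen_pooling(adj, X, [[0, 1, 2], [3, 4, 5]], H=2).shape)   # (2 supernodes, 3*2 features)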
The Fourier coefficient of the corresponding Fourier mode indicates the importance of this Fourier mode to the signal. The coefficient summarizes the graph signal information utilizing both the node features and the subgraph structure, as the smoothness is related to both of them. Each of the coefficients characterizes a different property (smoothness) of the graph signal. Using the first H coefficients while discarding the others means that we focus more on the "smoother" part of the graph signal, which is common in many applications such as signal denoising and compression [83, 9, 61]. Therefore, we can use the squared summation of the coefficients to measure how much information can be preserved, as shown in the following theorem.

Theorem 9. Let x be a graph signal defined on the graph G and U = [u_1, . . . , u_N] be the Fourier modes of this graph. Let x̂ be the corresponding Fourier coefficients, i.e., x̂ = U^T x. Let x′ = Σ_{l=1}^{H} x̂[l] · u_l be the signal reconstructed using only the first H Fourier modes. Then ‖x̂[1 : H]‖_2^2 / ‖x̂‖_2^2 measures the information being preserved by this reconstruction. Here x̂[1 : H] denotes the vector consisting of the first H elements of x̂.

Proof. According to Eq. (5.9), x can be written as x = Σ_{l=1}^{N} x̂[l] · u_l. Since U is orthogonal, we have

‖x′‖_2^2 / ‖x‖_2^2 = ‖Σ_{l=1}^{H} x̂[l] · u_l‖_2^2 / ‖Σ_{l=1}^{N} x̂[l] · u_l‖_2^2 = ‖x̂[1 : H]‖_2^2 / ‖x̂‖_2^2,    (5.17)

which completes the proof.

It is common for natural graph signals that the magnitude of the spectral form of the signal is concentrated on the first few coefficients [70, 80], which means that ‖x̂[1 : H]‖_2^2 / ‖x̂‖_2^2 ≈ 1 for H ≪ N_k. In other words, by using the first H coefficients, we can preserve the majority of the information while reducing the computational cost. We will empirically verify this in the experiment section.

5.3.2 A Global View of EigenPooling

In this subsection, we analyze the pooling operators from a global perspective, focusing on the entire graph G. The pooling operators we constructed can be viewed as a filterbank [83]. Each of the filters in the filterbank filters the given graph signal and obtains a new graph signal. In our case, the filtered signal is defined on the coarsened graph G_coar. Without loss of generality, we consider a single-dimensional signal x ∈ R^{N×1} of G; then the filtered signals are {x_l}_{l=1}^{N_max}. Next, we describe some key properties of these pooling operators.

Property 1: Perfect Reconstruction. The first property is that when N_max filters are used, the input graph signal can be perfectly reconstructed from the filtered signals.

Lemma 10. The graph signal x can be perfectly reconstructed from its filtered signals {x_l}_{l=1}^{N_max} together with the pooling operators {Θ_l}_{l=1}^{N_max} as x = Σ_{l=1}^{N_max} Θ_l x_l.

Proof. With the definition of Θ_l given in Eq. (5.12), we have

Σ_{l=1}^{N_max} Θ_l x_l = Σ_{l=1}^{N_max} Σ_{k=1}^{K} ū_l^(k) · x_l[k] = Σ_{k=1}^{K} Σ_{l=1}^{N_max} ū_l^(k) · x_l[k].    (5.18)

From Eq. (5.13), we know that x_l[k] = (ū_l^(k))^T x. Together with the fact that ū_l^(k) = C^(k) u_l^(k) in Eq. (5.11), we can rewrite Σ_{l=1}^{N_max} ū_l^(k) · x_l[k] as

Σ_{l=1}^{N_max} ū_l^(k) · x_l[k] = C^(k) (Σ_{l=1}^{N_max} u_l^(k) (u_l^(k))^T) x^(k).    (5.19)

Obviously, Σ_{l=1}^{N_max} u_l^(k) (u_l^(k))^T = I, since {u_l^(k)}_{l=1}^{N_k} are orthonormal and {u_l^(k)}_{l=N_k+1}^{N_max} are all 0 vectors. Thus, Σ_{l=1}^{N_max} ū_l^(k) · x_l[k] = C^(k) x^(k). Substituting this into Eq. (5.18), we arrive at

Σ_{l=1}^{N_max} Θ_l x_l = Σ_{k=1}^{K} C^(k) x^(k) = x,    (5.20)

which completes the proof.
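Lemma 10 (and the energy identity of the next subsection) can be checked numerically. The short self-contained script below builds the full set of N_max pooling operators on a toy graph and verifies that Σ_l Θ_l x_l recovers x and that Σ_l ‖x_l‖^2 = ‖x‖^2; it is a sanity-check sketch, not part of the original text.

import numpy as np

rng = np.random.default_rng(0)
# Toy graph: 6 nodes, partitioned into subgraphs {0, 1, 2} and {3, 4, 5}.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
partition = [np.array([0, 1, 2]), np.array([3, 4, 5])]
N, K = 6, 2
N_max = max(len(p) for p in partition)

# Build all N_max pooling operators Theta_l (Eqs. 5.11-5.12).
Theta = np.zeros((N_max, N, K))
for k, nodes in enumerate(partition):
    A_k = adj[np.ix_(nodes, nodes)]
    L_k = np.diag(A_k.sum(axis=1)) - A_k
    _, U_k = np.linalg.eigh(L_k)
    for l in range(len(nodes)):
        Theta[l, nodes, k] = U_k[:, l]          # up-sampled eigenvector bar{u}_l^{(k)}

x = rng.normal(size=N)                           # a random single-dimensional graph signal
x_l = [Theta[l].T @ x for l in range(N_max)]     # pooled signals, Eq. (5.13)

x_rec = sum(Theta[l] @ x_l[l] for l in range(N_max))
print(np.allclose(x_rec, x))                                         # Lemma 10: perfect reconstruction
print(np.isclose(sum(np.sum(v ** 2) for v in x_l), np.sum(x ** 2)))  # Lemma 12: energy preserved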
From Lemma 10, we know that if N_max filters are chosen, the filtered signals {x_l}_{l=1}^{N_max} can preserve all the information from x. Thus, together with graph coarsening, eigenvector pooling can preserve the signal information of the input graph and can enlarge the receptive field, which allows us to finally learn a vector representation for graph classification.

Property 2: Energy/Information Preserving. The second property is that the filtered signals preserve all the energy when N_max filters are chosen. To show this, we first give the following lemma, which serves as a tool for demonstrating Property 2.

Lemma 11. All the columns in the operators {Θ_l}_{l=1}^{N_max} are orthogonal to each other.

Proof. By definition, we know that, for the same k, i.e., the same subgraph, u_l^(k), l = 1, . . . , N_max are orthogonal to each other, which means ū_l^(k), l = 1, . . . , N_max are also orthogonal to each other. In addition, all the ū_l^(k) with different k are also orthogonal to each other, as they only have non-zero values on different subgraphs.

With the above lemma, we can further conclude that the squared ℓ2 norm of the graph signal x is equal to the summation of the squared ℓ2 norms of the pooled signals {x_l}_{l=1}^{N_max}. The proof is given as follows:

Lemma 12. The squared ℓ2 norm of the graph signal x is equal to the summation of the squared ℓ2 norms of the pooled signals {x_l}_{l=1}^{N_max}:

‖x‖_2^2 = Σ_{l=1}^{N_max} ‖x_l‖_2^2.    (5.21)

Proof. From Lemmas 10 and 11, we have

‖x‖_2^2 = ‖Σ_{l=1}^{N_max} Θ_l x_l‖_2^2 = ‖Σ_{k=1}^{K} Σ_{l=1}^{N_max} ū_l^(k) · x_l[k]‖_2^2 = Σ_{k=1}^{K} Σ_{l=1}^{N_max} x_l[k]^2 = Σ_{l=1}^{N_max} ‖x_l‖_2^2,

which completes the proof.

Property 3: Approximate Energy Preserving. Lemma 12 describes the energy preservation when N_max filters are chosen. In practice, we only need H ≪ N_max filters for efficient computation. Next we show that even with H filters, the filtered signals preserve most of the energy/information.

Theorem 13. Let x′ = Σ_{l=1}^{H} Θ_l x_l be the graph signal reconstructed using only the first H pooling operators and pooled signals {x_l}_{l=1}^{H}. Then the ratio Σ_{l=1}^{H} ‖x_l‖_2^2 / Σ_{l=1}^{N_max} ‖x_l‖_2^2 measures the portion of information being preserved by x′.

Proof. As shown in Lemma 12, ‖x‖_2^2 = Σ_{l=1}^{N_max} ‖x_l‖_2^2. Similarly, we can show that ‖x′‖_2^2 = Σ_{l=1}^{H} ‖x_l‖_2^2. The portion of the information being preserved can thus be represented as

‖x′‖_2^2 / ‖x‖_2^2 = Σ_{l=1}^{H} ‖x_l‖_2^2 / Σ_{l=1}^{N_max} ‖x_l‖_2^2,    (5.22)

which completes the proof.

Since for natural graph signals the magnitude of the spectral form of the signal is concentrated in the first few coefficients [70], even with H filters EigenPooling can preserve the majority of the information/energy.

5.3.3 Permutation Invariance of EigenGCN

EigenGCN takes the adjacency matrix A and the node feature matrix X as input and aims to learn a vector representation of the graph. The nodes in a graph do not have a specific ordering, i.e., A and X can be permuted. Obviously, for the same graph, we want EigenGCN to extract the same representation no matter which permutation of A and X is used as input. Thus, in this subsection, we prove that EigenGCN is permutation invariant, which lays the foundation for using EigenGCN for graph classification.

Theorem 14. Let P ∈ {0, 1}^{n×n} be any permutation matrix; then EigenGCN(A, X) = EigenGCN(PAP^T, PX), i.e., EigenGCN is permutation invariant.

Proof.
In order to prove that EigenGCN is permutation invariant, we only need to show that its key components, GCN, graph coarsening, and EigenPooling, are permutation invariant. For GCN, before permutation, the output is F = ReLU(D̃^{−1/2} Ã D̃^{−1/2} X W). With permutation, the output becomes

ReLU((P D̃^{−1/2} Ã D̃^{−1/2} P^T)(P X) W) = ReLU(P D̃^{−1/2} Ã D̃^{−1/2} X W) = P F,

where we have used P^T P = I. This shows that the effect of permutation on GCN only permutes the order of the node representations but does not change the values of the representations. Second, the graph coarsening is done by spectral clustering with A. No matter which permutation we have, the detected subgraphs will not change. Finally, EigenPooling summarizes the information within each subgraph. Since the subgraph structures are not affected by the permutation and the representation of each node in the subgraphs is also not affected by the permutation, we can see that the supernodes' representations after EigenPooling are not affected by the permutation. In addition, the inter-connectivity of supernodes is not affected, since it is determined by spectral clustering. Thus, we can say that one step of GCN-Graph Coarsening-EigenPooling is permutation invariant. Since finally EigenGCN learns one vector representation of the input graph, we can conclude that the vector representation is the same under any permutation P.

5.4 Experiment

In this section, we conduct experiments to evaluate the effectiveness of the proposed framework EigenGCN. Specifically, we aim to answer two questions:
• Can EigenGCN improve the graph classification performance by the design of EigenPooling?
• How reliable is it to use H filters for pooling?
We begin by introducing datasets and experimental settings. We then compare EigenGCN with representative and state-of-the-art baselines for graph classification to answer the first question. We further conduct analysis on graph signals to verify the reasonableness of using H filters, which answers the second question.

5.4.1 Datasets

To verify whether the proposed framework can hierarchically learn good graph representations for classification, we evaluate EigenGCN on 6 widely used benchmark data sets for graph classification [38], which include three protein graph data sets, i.e., ENZYMES [5, 73], PROTEINS [5, 17], and D&D [17, 77]; one mutagen data set, Mutagenicity [67, 37] (denoted as MUTAG in Tables 5.1 and 5.2); and two data sets that consist of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, NCI1 and NCI109 [88]. Some statistics of the data sets can be found in Table 5.1. From the table, we can see that the used data sets contain varied numbers of graphs and have different graph sizes. We include data sets of different domains, sample sizes, and graph sizes to give a comprehensive understanding of how EigenGCN performs with data sets under various conditions.

Table 5.1: Statistics of datasets
            ENZYMES  PROTEINS  D&D     NCI1   NCI109  MUTAG
# graphs    600      1,113     1,178   4,110  4,127   4,337
mean |V|    32.63    39.06     284.32  29.87  29.68   30.32
# classes   6        2         2       2      2       2

5.4.2 Baselines and Experimental Settings

To compare the performance of graph classification, we consider some representative and state-of-the-art graph neural network models with various pooling layers. Next, we briefly introduce these baseline approaches as well as the experimental settings for them.
• GCN [40] is a graph neural network framework proposed for semi-supervised node classification.
It learns node representations by aggregating information from neighbors. As, the GCN model does not consist of a pooling layer, to obtain the graph representation, we use the average of the learned node representations as the graph representation. We use it as a baseline to compare whether a hierarchical pooling layer is necessary. • GraphSage [28] is similar as the GCN and provides various aggregation method. Here, we adopt the mean aggregator for node representation learning. The average of the learned node representations is served as the graph representation. • SET2SET. This baseline is also built upon GCN, it is also “flat” but uses set2set architecture introduced in [86] instead of averaging over all the nodes. We select this method to further 72 show whether a hierarchical pooling layer is necessary no matter average or other pooling methods are used. • DGCNN [98] is built upon the GCN layer. The features of nodes are sorted before feeding them into traditional 1-D convolutional and dense layers [98]. This method is also “flat” without a hierarchical pooling procedure. • Diff-pool [97] is a graph neural network model designed for graph level representation learning with differential pooling layers. It uses node representations learned by an additional convolutional layer to learn the subgraphs (supernodes) and coarsen the graph based on it. We select this model as it achieves state-of-art performance on the graph classification task. • EigenGCN-H represents various variants of the proposed framework EigenGCN, where 𝐻 denotes the number of pooling operators we use for EigenPooling. In this evaluation, we choose 𝐻 = 1, 2, 3. For each of the data sets, we randomly split it to 3 parts, i.e., 80% as training set, 10% as validation set and 10% as testing set. We repeat the randomly splitting process 10 times, and the average performance of the 10 different splits are reported. The parameters of baselines are chosen based on their performance on the validate set. For the proposed framework, we use the 9 splits of the training set and validation set to tune the structure of the graph neural network as well as the learning rate. The same structure and learning rate are then used for all 9 splits. Following previous work [97], we adopt the widely used evaluation metric, i.e., Accuracy, for graph classification to evaluate the performance. 5.4.3 Performance on Graph Classification Each experiment is run 10 times and the average graph classification performance in terms of accuracy is reported in Table 5.2. From the table, We make the following observations: 73 • Diff-pool and the EigenGCN framework perform better than those methods without a hierarchical pooling procedure in most of the cases. Aggregating the node information hierarchically can help learn better graph representations. • The proposed framework EigenGCN shares the same convolutional layer with GCN, GraphSage, and SET2SET. However, the proposed framework (with different 𝐻) outperforms them with a large margin in most of the data sets. This further indicates the necessity of the hierarchical pooling procedure. In other words, the proposed EigenPooling can indeed help the graph classification performance. • In most of the data sets, we can observe that the variants of the EigenGCN with more eigenvectors achieve better performance than those with fewer eigenvectors. Including more eigenvectors, which suggests that we can preserve more information during pooling, can help learn better graph representations in most of the cases. 
In some of the data sets, including more eigenvectors does not bring any improvement in performance or even makes the performance worse. Theoretically, we are able to preserve more information by using more eigenvectors. However, noise signals may also be preserved, which can be filtered out when using fewer eigenvectors.
• The proposed EigenGCN achieves state-of-the-art or at least comparable performance on all the data sets, which shows the effectiveness of the proposed framework EigenGCN.
To sum up, EigenPooling can help learn better graph representations, and the proposed framework EigenGCN with EigenPooling can achieve state-of-the-art performance on the graph classification task.

Table 5.2: Performance comparison.
             ENZYMES  PROTEINS  D&D    NCI1   NCI109  MUTAG
GCN          0.440    0.740     0.759  0.725  0.707   0.780
GraphSage    0.554    0.746     0.766  0.732  0.703   0.785
SET2SET      0.380    0.727     0.745  0.715  0.686   0.764
DGCNN        0.410    0.732     0.778  0.729  0.723   0.788
Diff-pool    0.636    0.759     0.780  0.760  0.741   0.806
EigenGCN-1   0.650    0.751     0.775  0.760  0.746   0.801
EigenGCN-2   0.645    0.754     0.770  0.767  0.748   0.789
EigenGCN-3   0.645    0.766     0.786  0.770  0.749   0.795

5.4.4 Understanding Graph Signals

In this subsection, we investigate the distribution of the Fourier coefficients of signals in real data. We aim to show that for natural graph signals, most of the information/energy concentrates on the first few Fourier modes (or eigenvectors). This paves the way to use only H filters in EigenPooling. Specifically, given one data set, for each graph G_i with N_i nodes and its associated signal X_i ∈ R^{N_i×d}, we first calculate the graph Fourier transform and obtain the coefficients X̂_i ∈ R^{N_i×d}. We then calculate the following ratio for various values of H:

r_i^H = ‖X̂_i[1 : H, :]‖_2^2 / ‖X̂_i‖_2^2,

where X̂_i[1 : H, :] denotes the first H rows of the matrix X̂_i. According to Theorem 13, this ratio measures how much information can be preserved by the first H coefficients. We then average the ratio over the entire data set and obtain

r^H = (1/M) Σ_i r_i^H,    (5.23)

where M denotes the number of graphs in the data set. Note that if H > N_i, we set r_i^H = 1. We visualize the ratio for each of the data sets up to H = 40 in Figure 5.2.

(a) ENZYMES (b) PROTEINS (c) NCI1 (d) NCI109 (e) Mutagenicity (f) DD [curves of the ratio r^H (from 0 to 1) versus H (from 0 to 40) for each data set]
Figure 5.2: Understanding graph signals

As shown in Figure 5.2, for most of the data sets, the magnitude of the coefficients is concentrated in the first few coefficients, which demonstrates the reasonableness of using only H ≪ N_max filters in EigenPooling. In addition, using H filters can save computational cost.

5.5 Related Work

In recent years, graph neural network models, which try to extend deep neural network models to graph-structured data, have attracted increasing interest. These graph neural network models have been applied to various applications in many different areas. In [40], a graph neural network model that learns node representations by aggregating the node features from neighbors is applied to perform semi-supervised node classification. Similar methods were later proposed to further enhance the performance by including an attention mechanism [84]. GraphSage [97], which allows
5.5 Related Work

In recent years, graph neural network models, which try to extend deep neural network models to graph-structured data, have attracted increasing interest. These graph neural network models have been applied to various applications in many different areas. In [40], a graph neural network model that tries to learn node representations by aggregating the node features from their neighbors is applied to perform semi-supervised node classification. Similar methods were later proposed to further enhance the performance by including an attention mechanism [84]. GraphSage [97], which allows a more flexible aggregation procedure, was designed for the same task. There are some graph neural network models designed to reason about the dynamics of physical systems, where the model is applied to predict future states of nodes given their previous states [3, 69]. Most of the aforementioned methods fit into the framework of “message passing” neural networks [26], which mainly involves transforming, propagating and aggregating node features across the graph through edges. Another stream of graph neural networks was developed based on the graph Fourier transform [15, 6, 29, 45]. The features are first transformed into the spectral domain, then filtered with learnable filters, and finally transformed back to the spatial domain. The connection between these two streams of work is shown in [15, 40]. Comprehensive surveys on graph neural networks can be found in [104, 93, 100, 4].

However, the design of the graph neural network layers is inherently “flat”, which means the output of pure graph neural network layers consists of node representations for all the nodes in the graph. To apply graph neural networks to the graph classification task, an approach to summarize the learned node representations and generate the graph representation is needed. A simple way to generate the graph representation is to globally combine the node representations. Different combination approaches have been investigated, which include averaging over all node representations as the graph representation [18], adding a “virtual node” connected to all the nodes in the graph and using its node representation as the graph representation [49], and using conventional fully connected layers or convolutional layers after arranging the graphs to a fixed size [98, 26]. However, these global pooling methods cannot hierarchically learn graph representations, thus ignoring important information in the graph structure. There are a few recent works [15, 97, 81, 21] investigating learning graph representations with a hierarchical pooling procedure. These methods usually involve two steps: 1) coarsening a graph by grouping nodes into supernodes to form a hierarchical structure; and 2) learning supernode representations level by level to finally obtain the graph representation. These methods use mean-pooling or max-pooling when they generate the supernode representations, which neglects the important structure information in the subgraphs. In this chapter, we propose a pooling operator based on the local graph Fourier transform, which utilizes the subgraph structure as well as the node features for generating the supernode representations.

CHAPTER 6
REPPOOL: A GRAPH POOLING OPERATION WITH REPRESENTATIVENESS

6.1 Introduction

In this chapter, we continue considering the hierarchical pooling procedure. In the previous chapter, we grouped nodes into “supernodes” to coarsen graphs during the graph pooling process. There is another stream of works on graph pooling operations, which are based on downsampling. These graph pooling operations identify and sample important nodes to build coarsened graphs. This perspective enables GNNs1 to iteratively coarsen the input graph and gradually summarize the information into the graph representation level by level. However, when performing downsampling (or node selection), most of these existing pooling methods [23, 32, 99] only focus on the importance of the nodes. In Figure 6.1, we illustrate one downsampling step by such a pooling operation, gPool [23], where the selected nodes are highlighted in red.

1 In this chapter, we deal with graph-focus tasks. Hence, when we mention GNNs, we refer to the GNN models that follow the general framework for graph-focus tasks described in Section 2.3.3.2.

Figure 6.1: An illustration of node selection based on importance by gPool [23] at the first layer. Red nodes are the selected ones. gPool might select nodes that are redundantly concentrated on some substructures while totally neglecting information of other substructures.

As shown in Figure 6.1, if only node importance is considered in the pooling operator, several limitations are naturally introduced. First, the selected nodes tend to be redundantly concentrated on some substructures while completely overlooking information in other substructures (e.g., the substructures around nodes 𝑣22 and 𝑣23). One possible reason is that gPool selects nodes according to a learnable importance score, which takes node features as input. Hence, connected nodes may have similar scores, as their node features tend to be similar. Furthermore, important nodes tend to connect with each other in many real-world graphs. For instance, in social networks, celebrities tend to make friends with other celebrities, and famous researchers usually cooperate with others who are also influential in the same area. The importance score can be viewed as a global measure, which ignores the local representativeness of nodes. The representativeness indicates the importance of a node for its local substructure. Specifically, the selection of one node will reduce the representativeness of other nodes in the same local substructure, though they may also have high global importance scores. Thus, we would rather choose those nodes that have relatively smaller importance scores but are more representative of other substructures, with the purpose of covering more substructures and preserving more information from the original graph. For example, in Figure 6.1, nodes 𝑣22 and 𝑣23 are preferred over node 𝑣36 or 𝑣10. Second, generating the coarsened graph and its corresponding node features only based on these selected nodes could result in an unrepresentative structure and lead to huge information loss. For example, gPool utilizes the original connections between the selected nodes to build the connections for the coarsened graph, which might render the resulting coarsened graph unconnected; furthermore, to generate the node features for the coarsened graph, gPool solely utilizes the node features of the selected nodes while discarding features on all the non-selected ones. Therefore, strategies are needed to better preserve the information of the original graph during the pooling process in both the stages of node selection and coarsened graph generation. In this chapter, we propose a novel graph pooling operator, RepPool. It can select nodes that are both globally important and can represent more substructures. Moreover, it utilizes both selected and non-selected nodes when generating the coarsened graph and its corresponding node features.

6.2 Related Work

This section briefly reviews related work, including general graph neural network models and various pooling methods. More details on GNNs and their applications can be found in the book Deep Learning on Graphs [53].

6.2.1 Graph Neural Networks

GNNs can be roughly divided into two categories – spectral [7, 15, 29, 48, 41] and spatial models [24, 28, 63, 85]. Spectral methods generally define convolutional operations in the Fourier domain based on graph spectral theory. Bruna et al.
[7] first proposed a variant of graph convolutional operation by utilizing spectral filters. This method is difficult to apply on large-scale graphs due to its high computational cost and non-locality property. To address this challenge, Defferrard et al. [15] tried to improve its efficiency by approximating K-polynomial filters. Kipf and Welling [41] proposed a simplified model [15] by approximating the first-order Chebyshev expansion. The goal of the spatial approaches is to define graph convolutional operations that can work on graphs directly. It generally passes information to the next layer via propagating and aggregating information from local neighborhood. Among them, Hamilton et al. [28] aimed to learn node representations by sampling and fixing the size of the neighborhood. Velickovi et al. [85] utilized the attention mechanism to operate in the entire neighborhood. GNNs have also been extended to various types of graphs, including signed graphs [16], dynamic graphs [50], and multi-dimensional graphs [54]. However, conventional GNNs mentioned above are flat: learning node representations for all nodes, and cannot work on the graph classification task directly. 6.2.2 Graph Pooling To apply GNNs to the graph classification task, a graph level representation is needed. Early common methods performed global pooling to combine node representations into one. For instance, 80 Duvenaud et al. [18] averaged all node representations as the graph representation. Vinyals et al. [87] proposed to learn graph representation by utilizing LSTMs [30]. Zhang et al. [98] sorted node features in a consistent order to learn graph representation with a fixed size. Nevertheless, these methods cannot generate hierarchical graph representations, leading to the loss of important information. There have been some recent works that tried to design hierarchical pooling procedures. DiffPool [97] introduced an assignment matrix to group nodes into a set of clusters. gPool [23] and AttPool [32] selected top-K nodes according to a proposed score to form a coarser subgraph for the next layer. EigenPool [55] introduced a pooling operator based on graph Fourier transform. ASAP [66] devised a cluster scoring procedure to select nodes by using a graph convolution method which can capture local extremum information. Besides sampling nodes to form a new subgraph, HGP-SL [99] also focused on constructing a refined graph structure for the subgraph. However, most of these methods ignore representativeness in the node selection procedure. In addition, when forming a new subgraph, some methods [23, 99, 66] only use the original data from selected nodes while discarding non-selected nodes, e.g., node features of selected nodes or connections between selected nodes, which might cause information loss. 6.3 The Proposed Model The proposed model mostly follows the general framework that we introduce in Section 2.3.3.2 with some tiny discrepancies. More specifically, instead of only using the output from the last pooling layer, we also combine representations learned from the intermediate layers for graph classification. In this Section, we first describe an overview of the framework and then introduce the proposed graph pooling operation in detail. 6.3.1 An Overview An overview of the proposed framework is shown in Figure 6.2a, which mainly consists of three types of modules, i.e., the GCN module, the pooling module and the prediction module. 
The GCN modules and the pooling modules are utilized during the process of learning the graph representations, while the prediction module takes the learned graph representation as input and makes the prediction. Specifically, during the graph representation learning process, one pooling module is stacked over one GCN module and this procedure repeats several times. In particular, the GCN module is used to refine the node representations and the pooling module is utilized to generate a coarsened graph and its corresponding node features, which are then used as the input to another GCN module. We call the combination of a GCN module and a pooling module a learning block of the proposed framework. At each learning block, a graph-level representation is generated by applying node-wise max-pooling to the outputs of the pooling module. Then a readout function is applied to these graph-level representations from different layers to generate a final graph representation, which goes through Multi-Layer Perceptron (MLP) layers to perform the graph classification task in the prediction module. In this way, the whole architecture can be optimized in an end-to-end manner.

Figure 6.2: An illustration of the proposed hierarchical graph classification architecture and the proposed pooling module. Panel (a) gives an overview of the graph classification architecture: stacked GCN modules and pooling modules, followed by a readout and an MLP that outputs the label. Panel (b) illustrates the pooling operator: the input graph, the node selection step, and the pooled graph. In the node selection process in (b), 4 nodes are selected step by step; the red color indicates that a node has been selected.

More details about each module are below:
• The GCN Module: It consists of 𝐾𝑔 subsequent GCN graph filtering layers (GCN layers), each taking the output of its previous GCN layer as input. Specifically, the input of the first GCN layer in a GCN module is either the input features (for the first learning block) or the output of the previous learning block.
• The Pooling Module: It contains two major components: 1) a component to select nodes from the input graph; and 2) a component to build a coarser graph by updating the feature matrix and the adjacency matrix utilizing the selected nodes.
• The Prediction Module: As shown in Figure 6.2, the proposed hierarchical architecture repeats the GCN module and the pooling module several times; thus we obtain a graph representation from each learning block. Following a common setting [97, 32], we employ a readout function to concatenate the graph representations from all learning blocks to form the finalized one. At last, the finalized graph representation is fed into MLP layers with a softmax classifier. The loss is defined as the cross-entropy loss over the graph labels. Through training the whole architecture, a predicted label can be obtained for each input graph.

Next, we give details about the components of the proposed pooling module. Without loss of generality, we use A ∈ R^{|V|×|V|} and X ∈ R^{|V|×𝑑} to denote the input graph for the pooling module, and A_new ∈ R^{𝑘×𝑘} and X_new ∈ R^{𝑘×𝑑} to denote the output coarser graph, which is also the input graph of the next GCN module, where 𝑘 is the number of selected nodes and 𝑘 < |V|.
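To make the data flow just described easier to follow, here is a minimal NumPy sketch of one forward pass through this architecture. It is an illustration under simplifying assumptions, not the implementation used in this chapter: the names gcn_layer, keep_first_nodes and hierarchical_forward are hypothetical, the MLP is collapsed to a single linear map, and the pooling step is a trivial stand-in (keeping the first nodes) for the RepPool operator defined in Sections 6.3.2 and 6.3.3.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN graph filtering layer: renormalized propagation followed by ReLU."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(d, d))       # D^{-1/2} (A + I) D^{-1/2}
    return np.maximum(A_norm @ X @ W, 0.0)

def keep_first_nodes(A, X, ratio=0.5):
    """Trivial placeholder for the pooling module: keeps the first k nodes so
    that the surrounding pipeline is runnable; not the RepPool operator."""
    k = max(1, int(ratio * A.shape[0]))
    return A[:k, :k], X[:k]

def hierarchical_forward(A, X, gcn_weights_per_block, mlp_W, pooling=keep_first_nodes):
    """Forward pass of the architecture in Figure 6.2(a): K_p learning blocks
    (GCN module + pooling module), node-wise max-pooling per block, a
    concatenation readout, and a final linear classifier."""
    block_reprs = []
    for block_weights in gcn_weights_per_block:    # one entry per learning block
        for W in block_weights:                    # the K_g GCN layers of the block
            X = gcn_layer(A, X, W)
        A, X = pooling(A, X)                       # coarsen the graph
        block_reprs.append(X.max(axis=0))          # node-wise max-pooling
    graph_repr = np.concatenate(block_reprs)       # readout: concatenation over blocks
    return graph_repr @ mlp_W                      # class scores (softmax/loss omitted)

# Toy run: 2 learning blocks with 2 GCN layers each on a random 10-node graph.
rng = np.random.default_rng(0)
A = (rng.random((10, 10)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                     # symmetric adjacency, no self-loops
X = rng.normal(size=(10, 8))
weights = [[rng.normal(size=(8, 16)), rng.normal(size=(16, 16))],
           [rng.normal(size=(16, 16)), rng.normal(size=(16, 16))]]
scores = hierarchical_forward(A, X, weights, mlp_W=rng.normal(size=(32, 2)))
```

Any pooling operator with the same (A, X) → (A_new, X_new) interface, including the one developed in the remainder of this section, could be dropped into the pooling argument.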
6.3.2 Node Selection

During the pooling process, to preserve the graph structure and feature information as much as possible, we need to select nodes with significant importance in the graph. However, if we only focus on selecting important nodes, we may end up selecting redundant nodes from some substructures while ignoring nodes from other substructures, which leads to a huge information loss. In fact, it is evident that nodes in the same substructure may have a mutual exclusion effect on their importance [91]. In other words, highly important nodes in one substructure could render others less important. To alleviate this issue, we introduce node representativeness in addition to importance, with the purpose of including more diverse important nodes and covering as many substructures as possible. Specifically, we propose an importance score and a representativeness score, which are combined to generate an overall score that serves as the criterion to perform the node selection. Next, we first introduce these two scores and then describe how we can utilize the combination of them to select nodes.

6.3.2.1 Node Importance

For the node importance, we introduce a global score to evaluate how much information a node contains. In general, if a node has a large degree and its neighbors contain rich information, namely its neighbors are relatively important, then this node is probably also important and contains rich information. Hence, we define the importance score of a node utilizing its own information as well as its neighbors’ information. In particular, the importance score 𝑠𝑖 for the node 𝑣𝑖 can be defined as follows:

    s_i = \mathrm{sigmoid}\left( \sum_{v_j \in \mathcal{N}(v_i)} \frac{1}{deg(v_j)} \, p(v_j) \right),        (6.1)

where N(𝑣𝑖) denotes the set of neighbors of the node 𝑣𝑖, deg(𝑣𝑗) is the degree of the node 𝑣𝑗, sigmoid() is the sigmoid function, and 𝑝(𝑣𝑗) is defined as follows:

    p(v_j) = \frac{\mathbf{x}_j \mathbf{p}}{\| \mathbf{p} \|},        (6.2)

with x𝑗 the input feature of node 𝑣𝑗 (i.e., x𝑗 is the 𝑗-th row of X) and p ∈ R^𝑑 a learnable vector to project x𝑗 to a scalar 𝑝(𝑣𝑗).

6.3.2.2 Node Representativeness

On the one hand, since only selecting nodes based on importance is likely to choose nodes that cover some substructures while overlooking other substructures, we aim to select various nodes from more substructures. On the other hand, highly important nodes in one substructure could render others less important. Both intuitions above suggest that, in addition to importance, we should select nodes representing more substructures (or representativeness). Correspondingly, we introduce the representativeness score. To cover more substructures and mitigate the influence of existing important nodes, we should select nodes distant from these existing ones. This paves the way for defining the representativeness score. Specifically, we select the nodes one by one, and the representativeness score needs to be generated during each selection step. Assume that we have already selected a set of nodes and now proceed to select the next node. The set of the indices of the selected nodes is denoted as 𝑖𝑑𝑥. Then, the representativeness score 𝛿𝑖 of a remaining candidate node 𝑣𝑖 is defined as follows:

    \delta_i = f\left( h(v_i, v_j), \; j \in idx \right),        (6.3)

where ℎ is a function to measure the distance between nodes 𝑣𝑖 and 𝑣𝑗, and 𝑓 is a function to define the distance between node 𝑣𝑖 and the nodes in 𝑖𝑑𝑥. In this chapter, we empirically find that the following choices of ℎ and 𝑓 work very well: (1) ℎ defines the distance between 𝑣𝑖 and 𝑣𝑗 as the length of the shortest path between them; and (2) 𝑓 defines the distance between node 𝑣𝑖 and the nodes in 𝑖𝑑𝑥 as the minimal pair-wise distance of 𝑣𝑖 to the nodes in 𝑖𝑑𝑥.
With these choices, 𝛿𝑖 can be rewritten as:

    \delta_i = \min_{j \in idx} h(v_i, v_j),        (6.4)

where, if a candidate node is close to a selected node, its representativeness score will be small; otherwise it will be relatively large. This helps to select nodes that are more likely to be far from the already selected nodes.

6.3.2.3 Node Selection Algorithm

We then obtain an overall selection score 𝛾𝑖 for 𝑣𝑖 by combining these two scores as follows:

    \gamma_i = g(s_i, \delta_i),        (6.5)

where 𝑔 is a function to combine these two sources. There are many possible settings for 𝑔, such as a linear combination, or we can even use a feedforward neural network to make 𝑔 learnable. However, in this work, we empirically set 𝛾𝑖 = 𝑠𝑖 ∗ 𝛿𝑖 and leave other settings for future work. The algorithm for the node selection process is given in Algorithm 4, where 𝑘 = 𝑛 ∗ 𝑟 is the number of nodes selected in the new subgraph with the predefined pooling ratio 𝑟. In line 3, we first choose the node with the maximal importance score as the first selected node. For the remaining nodes, we compute the current representativeness score, which is shown in line 5. In line 6, we combine both the global importance score and the representativeness score into an overall score. Then the next node is selected as the one with the maximal overall score, which balances node importance and representativeness. This process is repeated 𝑘 − 1 times. In this way, nodes are selected one by one in a greedy fashion. Meanwhile, by incorporating the influence of representativeness, the selected nodes tend to be located in various substructures of the graph.

Algorithm 4: Node Selection
Input: Score set s ∈ R^{𝑛×1}, sample number 𝑘
Output: Indexes of selected nodes 𝑖𝑑𝑥
1  Initialize 𝑖𝑑𝑥 ← set(), I ← indexes of all nodes;
2  𝑗 ← 1;
3  𝑖𝑑𝑥.add(arg max_{𝑖∈I}(s𝑖));
4  while 𝑗 ≤ 𝑘 − 1 do
5      𝛿𝑖 ← Eq. (6.4);
6      𝛾𝑖 ← Eq. (6.5);
7      𝑖𝑑𝑥.add(arg max_{𝑖∈I\𝑖𝑑𝑥}(𝛾𝑖));
8      𝑗 ← 𝑗 + 1;
9  return 𝑖𝑑𝑥

With the selected nodes, we will generate a new coarser graph; more details will be discussed in the following subsection. The new coarser graph may lose some information from the original graph, so if we calculate the node distance ℎ(𝑣𝑖, 𝑣𝑗) based on the new coarser graph, it may be inaccurate. Thus, to preserve this information and distinguish nodes from different substructures, we fix the distance ℎ(𝑣𝑖, 𝑣𝑗) to be the distance in the original graph during the pooling process. The benefits of this setting are two-fold. First, we preserve the original distance information during the pooling process. Second, we only need to compute the distances among nodes in the original graph once and can then reuse them at each learning block instead of computing the distances in each new coarser graph.
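The selection procedure above can be summarized in a short, self-contained sketch. The following NumPy code is only an illustration of Eqs. (6.1)–(6.5) and Algorithm 4 under simplifying assumptions: the projection vector p is treated as a given input (in the model it is learned end-to-end), unreachable node pairs are assigned a large finite distance, and the names shortest_path_lengths, importance_scores and select_nodes are hypothetical.

```python
import numpy as np
from collections import deque

def shortest_path_lengths(adj):
    """All-pairs shortest-path lengths by BFS on an unweighted graph.
    Unreachable pairs get the value n, so nodes in other components simply
    look maximally far away (one possible convention)."""
    n = adj.shape[0]
    dist = np.full((n, n), float(n))
    for s in range(n):
        dist[s, s] = 0.0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in np.nonzero(adj[u])[0]:
                if dist[s, v] > dist[s, u] + 1:
                    dist[s, v] = dist[s, u] + 1
                    queue.append(v)
    return dist

def importance_scores(adj, X, p):
    """Eqs. (6.1)-(6.2): s_i = sigmoid( sum_{v_j in N(v_i)} p(v_j) / deg(v_j) )."""
    proj = X @ p / np.linalg.norm(p)          # p(v_j) for every node
    deg = adj.sum(axis=1).clip(min=1)         # guard isolated nodes
    agg = adj @ (proj / deg)                  # sum over neighbors of p(v_j)/deg(v_j)
    return 1.0 / (1.0 + np.exp(-agg))         # sigmoid

def select_nodes(adj, X, p, k):
    """Greedy selection of Algorithm 4 with gamma_i = s_i * delta_i."""
    s = importance_scores(adj, X, p)
    dist = shortest_path_lengths(adj)         # distances on the original graph
    idx = [int(np.argmax(s))]                 # line 3: most important node first
    remaining = set(range(adj.shape[0])) - set(idx)
    for _ in range(k - 1):                    # repeated k - 1 times
        best, best_gamma = None, -np.inf
        for i in remaining:
            delta = dist[i, idx].min()        # Eq. (6.4): distance to closest selected node
            gamma = s[i] * delta              # Eq. (6.5) with g(s, delta) = s * delta
            if gamma > best_gamma:
                best, best_gamma = i, gamma
        idx.append(best)
        remaining.remove(best)
    return idx
```

Because 𝛾𝑖 = 𝑠𝑖 · 𝛿𝑖, a candidate adjacent to an already selected node (𝛿𝑖 = 1) needs a much larger importance score to be chosen than a candidate several hops away, which is exactly the behavior motivated at the beginning of this section.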
6.3.3 Coarser Graph Generation

With the selected nodes, a coarser graph with significant information extracted from the original graph can be built by learning the feature matrix and the adjacency matrix. However, if we only utilize the selected nodes to generate the adjacency matrix and the feature matrix, we may lose too much structural and feature information about the original graph, which is contained in the non-selected nodes. Hence, in this work, to better preserve such information, we propose to learn the adjacency matrix and the feature matrix utilizing both the selected nodes and the non-selected nodes. Next, we describe the two processes in detail.

6.3.3.1 Generating the Feature Matrix

To integrate node features from non-selected nodes, the new feature matrix is generated by aggregating information from the original graph. In particular, we learn an assignment matrix to assign each node to the selected ones. Formally, the assignment matrix B ∈ R^{𝑛×𝑘} is defined as follows:

    B = X W_b \hat{X}^T,        (6.6)
    \hat{X} = X(idx, :),        (6.7)

where W_b ∈ R^{𝑑×𝑑} is a learnable parameter and X̂ ∈ R^{𝑘×𝑑} consists of the 𝑘 row vectors extracted from X. B𝑖𝑗 indicates how likely 𝑣𝑖 is to be associated with the selected node 𝑣𝑗. With the soft-assignment matrix B, we can aggregate the information of the non-selected nodes to the selected nodes. Additionally, we consider that nodes with higher global scores should contribute more to the aggregation; namely, information from more important nodes is more worth preserving. Hence, we utilize the importance score as a gating system to control the information flow. The new features can be generated as follows:

    X_{new} = \left( \mathrm{softmax}(B \odot M) \right)^T \left( X \odot (s \mathbf{1}_d^T) \right),        (6.8)

where s ∈ R^𝑛 is a vector with elements defined in Eq. (6.1) (i.e., s𝑖 is the importance score for 𝑣𝑖), 1_𝑑 ∈ R^𝑑 is a vector of size 𝑑 with all elements being 1, softmax() is a row-wise softmax function to normalize the assignment scores for each node, ⊙ denotes the element-wise product, and M ∈ R^{𝑛×𝑘} is a masking matrix with M𝑖𝑗 = 1 if A(:, 𝑖𝑑𝑥)𝑖𝑗 > 0 and M𝑖𝑗 = 0 otherwise. We mask the soft-assignment matrix B so that the selected nodes can only aggregate information from those non-selected nodes that are connected to them in the original graph.

6.3.3.2 Generating the Adjacency Matrix

Similar to [97], the new adjacency matrix is generated based on the soft-assignment matrix B and the original adjacency matrix A to maintain the graph connectivity as:

    A_{new} = \left( \mathrm{softmax}(B \odot M) \right)^T A \left( \mathrm{softmax}(B \odot M) \right),        (6.9)

where we only aggregate information for the selected nodes from those non-selected nodes that are connected to them in the original graph; thus we mask the soft-assignment matrix B as in the generation of the feature matrix in the previous subsection. With X_new and A_new, a new coarser graph is obtained, which is fed into another GCN module in the next learning block. In this way, the original graph can be coarsened repeatedly.

6.3.4 Time Complexity Analysis

The computation of the proposed model depends on the GCN module and the pooling module. For a learning block, we use A ∈ R^{𝑛×𝑛} and X ∈ R^{𝑛×𝑑} to denote the input graph. The computation of a single GCN layer in the GCN module is O(|E|𝑑 + 𝑛𝑑²), where |E| is the number of edges. Then the time complexity of a GCN module is O(𝐾𝑔(|E|𝑑 + 𝑛𝑑²)), where 𝐾𝑔 is the number of GCN layers. In the pooling module, the computations of Eq. (6.1), (6.2), (6.4), (6.5), (6.6), (6.8) and (6.9) are O(|E|), O(𝑛𝑑), O(𝑛𝑘²), O(𝑛𝑘), O(𝑛𝑑𝑘 + 𝑘𝑑²), O(𝑛𝑘𝑑) and O(𝑘|E| + 𝑛𝑘²), respectively. Thus, the total time complexity of the pooling module is O(𝑛𝑑𝑘 + 𝑘𝑑² + 𝑘|E| + 𝑛𝑘²). The overall time complexity of the proposed model is O(𝐾𝑝(𝑛𝑑𝑘 + 𝑘𝑑² + 𝑘|E| + 𝑛𝑘² + 𝐾𝑔(|E|𝑑 + 𝑛𝑑²))), where 𝐾𝑝 is the number of learning blocks. Note that it costs O(𝑛|E|) to calculate the distances between all pairs of nodes in the graph. However, we do not include this in the time complexity of the proposed algorithm, as it can be pre-calculated and reused for all learning blocks and training epochs. Furthermore, the process of calculating the pair-wise distances in a graph can be easily parallelized.
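Before turning to the experiments, here is a minimal NumPy sketch of the coarsening step in Eqs. (6.6)–(6.9). It is illustrative rather than a reference implementation: W_b is passed in as a fixed array even though it is a learnable parameter in the model, the softmax is applied to B ⊙ M exactly as written (some implementations instead set masked entries to −∞ before the softmax), and the names row_softmax and coarsen are hypothetical.

```python
import numpy as np

def row_softmax(Z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def coarsen(A, X, idx, s, Wb):
    """Masked soft-assignment coarsening sketch for Eqs. (6.6)-(6.9).

    A:   (n, n) adjacency matrix of the input graph.
    X:   (n, d) node feature matrix.
    idx: list of the k selected node indices from Algorithm 4.
    s:   (n,) importance scores from Eq. (6.1), used as a gate.
    Wb:  (d, d) parameter of Eq. (6.6) (learnable in the model, fixed here).
    """
    X_hat = X[idx, :]                       # Eq. (6.7): features of the selected nodes
    B = X @ Wb @ X_hat.T                    # Eq. (6.6): soft assignment, shape (n, k)
    M = (A[:, idx] > 0).astype(float)       # mask: 1 only for original-graph edges to selected nodes
    S = row_softmax(B * M)                  # row-wise softmax of B ⊙ M, as in Eqs. (6.8)-(6.9)
    X_gated = X * s[:, None]                # gate the features: X ⊙ (s 1_d^T)
    X_new = S.T @ X_gated                   # Eq. (6.8): (k, d) features of the coarser graph
    A_new = S.T @ A @ S                     # Eq. (6.9): (k, k) adjacency of the coarser graph
    return A_new, X_new
```

The coarsened pair (A_new, X_new) then plays the role of the input graph for the next learning block.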
6.4 Experiment

In this section, we conduct experiments to demonstrate the effectiveness of the proposed framework. We first introduce the datasets we use and the state-of-the-art baselines we compare with. Then, we present the graph classification results together with some analysis. We further conduct an ablation study to demonstrate the importance of some of the major components in the proposed framework. Finally, to better understand the proposed framework, we perform a case study and a parameter analysis.

6.4.1 Datasets

In order to evaluate the performance of the proposed model, we conducted experiments on eight publicly available datasets. Statistics of the eight datasets are listed in Table 6.1, and next we give more details about them:

Table 6.1: Statistics of the eight datasets. Graphs: total number of graphs. Nodes (avg): the average number of nodes. Edges (avg): the average number of edges. #Class: the total number of classes.

Datasets       Graphs  Nodes (avg)  Edges (avg)  #Class
NCI1           4,110   29.87        32.30        2
NCI109         4,127   29.68        32.13        2
PROTEINS       1,113   39.06        72.82        2
D&D            1,178   284.32       715.66       2
MUTAG          188     17.93        19.79        2
Synthie        400     95.00        172.93       4
PTC            344     25.56        25.96        2
IMDB-BINARY    1,000   19.77        96.53        2

• NCI1 and NCI109 [88] are datasets for anti-cancer activity, where each graph is a chemical compound with nodes representing atoms and edges representing chemical bonds.
• PROTEINS and D&D [17, 5] are two protein graph datasets where nodes represent the amino acids and two nodes are connected if they are less than 6 Angstroms apart [99].
• MUTAG [77] is a dataset of mutagenic aromatic and heteroaromatic nitro compounds. The label indicates whether or not a compound has a mutagenic effect on a specific bacterium.
• Synthie [59] is a synthetic dataset which is subdivided into four classes.
• PTC [78, 36] is a dataset of molecules dealing with carcinogenicity.
• IMDB-BINARY [95] is a movie collaboration dataset which contains actor/actress and genre information of different movies on IMDB.

6.4.2 Baselines

We compare the proposed model with six state-of-the-art graph pooling methods with hierarchical architectures:
• DiffPool [97]: It groups nodes to generate supernodes in a differentiable way. These supernodes are then used to generate the coarsened graphs.
• gPool [23]: It selects nodes by utilizing a scalar projection operator to measure how much information nodes contain. These selected nodes are then utilized to build the coarsened graph.
• AttPool [32]: Improving upon gPool, it utilizes an attention mechanism to select significant nodes for the graph representation adaptively. Specifically, a local attention and a global attention mechanism are proposed, which leads to two variants of AttPool, i.e., AttPool-L and AttPool-G.
• HGP-SL [99]: It incorporates graph pooling and structure learning into a unified model to learn hierarchical graph representations. To preserve the integrity of the topological information, it introduces a structure learning mechanism to learn a refined graph structure.
• ASAP [66]: It is based on the idea of selecting important nodes to coarsen the graph, as proposed in gPool.
To measure the importance of each node, it utilizes a self-attention network along with a modified GNN formulation. It also learns a sparse soft cluster assignment for nodes to form new graphs.
• EigenPool [55]: It learns hierarchical graph representations based on the graph Fourier transform, which utilizes node features and local structures during the pooling process.

6.4.3 Experimental Settings

Following the setting of previous works [32, 23], each dataset is divided into 10 splits. We perform a 10-fold cross validation to evaluate the performance of the framework, where each time we use 9 splits as the training set and the remaining split as the test set. The average performance over the 10 runs is reported. For the baseline algorithms, we use the source code released by the authors and report their best performance. For the proposed model, we tune parameters by varying their values over different scales. For instance, the pooling ratio 𝑟 is set to be in the range of [0.1, 0.5], the learning rate 𝜇 is searched in {0.0005, 0.001, 0.002, 0.003, 0.004, 0.005, 0.01}, and the number of pooling layers (i.e., the number of learning blocks) 𝐾𝑝 and the number of convolutional layers in the GCN module 𝐾𝑔 are both selected in the range of [1, 4]. A two-layer MLP with 64 hidden units is adopted for all datasets, followed by a softmax layer. We use the cross-entropy loss for training and optimize the model utilizing the Adam optimizer [39]. More details about the parameter selection of the proposed model will be discussed later.

6.4.4 Graph Classification Performance

We compare the performance of RepPool with the baseline algorithms on the 8 datasets. The graph classification results are reported in Table 6.2.

Table 6.2: Comparisons of graph classification performance in terms of accuracy. Note that we highlight the best accuracy result in bold.

Methods     NCI1   NCI109  PROTEINS  D&D    MUTAG  Synthie  PTC    IMDB-BINARY
AttPool-L   80.05  80.58   79.01     79.96  90.55  63.75    75.62  69.00
AttPool-G   80.22  82.14   79.73     81.20  89.44  64.00    75.88  68.80
gPool       71.58  70.90   78.08     81.58  88.84  55.01    67.18  75.60
DiffPool    82.43  81.31   79.37     82.96  93.89  62.75    73.82  75.30
HGP-SL      78.07  76.13   78.83     79.45  90.00  54.50    70.88  67.70
EigenPool   82.92  81.43   80.09     81.04  91.11  64.50    72.94  56.20
ASAP        73.95  72.62   78.17     81.28  88.78  29.25    69.93  68.75
RepPool     82.00  82.28   80.72     82.06  95.56  64.50    78.24  75.70

We can make the following observations from the table:
• Overall, the proposed model obtains the highest accuracy on most of the datasets compared with the baseline algorithms. In particular, the proposed model achieves approximately 3.11% higher accuracy than the best baseline on the PTC dataset and 1.78% higher on the Synthie dataset.
• The proposed model consistently outperforms gPool and AttPool-G, which demonstrates the effectiveness of selecting nodes that are both important and representative.
• AttPool-L considers the local importance using a local attention mechanism. The proposed model achieves better performance than AttPool-L consistently, which indicates that the proposed framework captures the representativeness of nodes.
• Although EigenPool and DiffPool exhibit competitive performance on some datasets, RepPool achieves a 34.70% improvement over EigenPool on the IMDB-BINARY dataset and a 5.99% improvement over DiffPool on the PTC dataset. This verifies the effectiveness and robustness of RepPool in generating hierarchical graph representations across different datasets.

Table 6.3: Graph classification performance for RepPool and its variants.

Module            Synthie  PTC    IMDB-BINARY
RepPool_W/O_RA    56.50    73.82  66.90
RepPool_W/O_Rep   64.00    75.00  75.10
RepPool_W/O_Ass   61.50    74.71  67.50
RepPool           64.50    78.24  75.70

Figure 6.3: Visualization of node selection results at the first layer in RepPool and gPool. Panels (a) and (b) are outputs on the NCI1 dataset ((a) NCI1: RepPool, (b) NCI1: gPool) and panels (c) and (d) are outputs on the PTC dataset ((c) PTC: RepPool, (d) PTC: gPool). Each node is attached with a number indicating its index in the graph. 25% of the nodes are selected and highlighted in red.

6.4.5 Ablation Study

In this subsection, we validate the effectiveness of the proposed components in the proposed model.
Specifically, we investigate whether the inclusion of the node representativeness score plays an important role in the node selection process. We also investigate whether aggregating 93 information from non-selected nodes via the learned soft-assignment matrix is effective for the graph classification task. To achieve the goal, we define the following variants of the proposed framework: • RepPool_W/O_Rep: It denotes RepPool without the representativeness score combined to select nodes, i.e., only the node importance is considered. • RepPool_W/O_Ass: It is a variant of RepPool by excluding the soft-assignment matrix when updating the feature matrix, i.e., excluding contributions from non-selected nodes. • RepPool_W/O_RA: it denotes RepPool without representativeness and contributions from non-selected nodes. We conduct the ablation study on the following three datasets: Synthie, PTC, and IMDB- BINARY. The results are shown in Table 6.3. We can make the following observations from the table: • RepPool consistently outperforms RepPool_W/O_Rep on three datasets, which demonstrates the effectiveness of modeling representiveness in our framework. • RepPool_W/O_Ass performs not as good as RepPool, which indicates the necessity of including the soft-assignment to aggregate information from the non-selected nodes when generating the new feature matrix. • RepPool_W/O_RA obtains worse performance than both RepPool_W/O_Rep and Rep- Pool_W/O_Ass. Both components significantly contribute to the proposed framework. 6.4.6 Case Study During the node selection process, the proposed framework considers both importance and representativeness of nodes. In this subsection, we illustrate that the nodes selected by RepPool do not concentrate together but are distributed across different structures over the graph. Specifically, we also show the nodes selected by gPool to make a comparison. We randomly sample a graph from 94 the NCI1 dataset and another graph from the PTC dataset. For each graph, 25% nodes are selected from the first layer in the corresponding pooling operator. The node selection results are shown in Figure 6.3, where the selected nodes are highlighted in red. As shown in Figure 6.3, we can find that the proposed model RepPool potentially selects nodes distributing in different substructures that cover the whole graph. On the contrary, gPool is likely to select nodes that are concentrated in the same area and thus important information of other parts might be neglected. For example, in Figure 6.3b, most of selected nodes are in the substructure around node 𝑣 19 , but only one node 𝑣 3 is selected for the substructure around node 𝑣 24 . Moreover, node 𝑣 19 tends to share similar properties with other nodes around it. Therefore, redundant information might be provided by selecting them all. Similar situation occurs in Figure 6.3d. In addition, gPool tends to select margin nodes, e.g., nodes 𝑣 3 , 𝑣 6 , 𝑣 34 in Figure 6.3b and node 𝑣 19 in Figure 6.3d, that might contain less valuable information than representative nodes. Compared with gPool, RepPool tries to select the representative nodes. For instance in Figure 6.3a, RepPool selects the hub node 𝑣 24 that integrates information from other two hub nodes 𝑣 22 and 𝑣 23 , and thus could provide much valuable information. 
6.4.7 Parameter Analysis

We further study the sensitivity of several key hyper-parameters: the pooling ratio 𝑟, the learning rate 𝜇, the number of pooling layers (i.e., the number of learning blocks) 𝐾𝑝, and the number of convolutional layers in the GCN module 𝐾𝑔. We investigate how these parameters affect the graph classification task by varying their values over different scales on three datasets: Synthie, PTC, and IMDB-BINARY. Results are presented in Figure 6.4. We find similar trends across the different datasets. For instance, as shown in Figure 6.4a, the performance of RepPool tends to decrease when 𝑟 increases, which suggests that redundant information might be introduced when 𝑟 becomes larger. In most cases, RepPool exhibits the best performance when 𝐾𝑝 and 𝐾𝑔 are set to relatively small values, as shown in Figures 6.4c and 6.4d.

Figure 6.4: Graph classification results by varying four parameters on three datasets (Synthie, PTC, and IMDB-BINARY): (a) pooling ratio 𝑟, (b) learning rate 𝜇, (c) pooling layers 𝐾𝑝, and (d) GCN layers 𝐾𝑔. Each panel plots accuracy (%) against the corresponding parameter.

CHAPTER 7
CONCLUSION

Graph Neural Networks have shown their great capability in graph representation learning, thus facilitating numerous graph-related applications, including node classification, link prediction, and graph classification. In this dissertation, we study two essential operations for graph neural networks. For graph filtering operations, we mainly focus on investigating the existing architecture designs and providing insightful understandings of them from new perspectives. Compared with the graph filtering operations, the graph pooling operations are under-researched, and we propose two novel graph pooling operations from new perspectives, which broadens the scope of research on graph pooling operations. In this chapter, we summarize our research outcomes and identify some potential future research directions.

7.1 Summary

We first show how various graph filtering operations in representative GNN models including GCN, PPNP, APPNP and GAT can be unified mathematically as natural instances of graph denoising problems. Specifically, the aggregation operations in these models can be regarded as exactly or approximately addressing such denoising problems subject to Laplacian regularization. With these observations, we propose a general framework, Ugnn, which enables the design of new GNN models from the denoising perspective via regularizer design. As an example demonstrating the promise of this paradigm, we instantiate the Ugnn framework with a regularizer addressing adaptive local smoothness across nodes, a property prevalent in several real-world graphs, and propose and evaluate a suitable new GNN model, Ada-Ugnn. Due to this intrinsic connection to the graph signal denoising problem, most existing graph filtering operations inherently have “feature smoothing” effects. Hence, it is widely believed that, when applied to node classification tasks, GNN models inherently assume strong homophily and thus fail to generalize to graphs with heterophily. We revisit this popular notion in the literature, and show that it is not quite true.
We investigate one representative model, GCN, and show empirically that it can achieve good performance on heterophilous graphs under certain conditions, sometimes even better than more complex architectures specifically designed for such graphs (and indeed, we correct the record for some common benchmark datasets). We next analyze theoretically what conditions are required for GCNs to learn similar embeddings for nodes of the same class, facilitating the semi-supervised node classification task; put simply, when nodes with the same label share similar neighborhood patterns, and different classes have distinguishable patterns, GCN can achieve strong class separation, regardless of homophily or heterophily properties. Empirical analysis supports our theoretical findings. Finally, we revisit several existing homophilous and heterophilous semi-supervised node classification benchmark graphs, and investigate GCN’s empirical performance in light of our understanding.

When GNNs are applied to graph-level tasks, in addition to the graph filtering operation, a graph pooling operation is required to summarize the node representations into a graph representation. To preserve the structure of graphs, hierarchical graph pooling operations have been developed. They coarsen the graph step by step until a graph representation is obtained. There are mainly two streams of work in designing hierarchical graph pooling operations, depending on how they coarsen the graph. One of them coarsens a graph by downsampling nodes, while the other coarsens a graph by clustering similar nodes to form supernodes. These downsampled nodes (or clustered supernodes) serve as the nodes of the coarsened graph, and their representations and the connections between them are then generated. We investigate both types of graph pooling operations. We first design EigenPooling, a supernode-based pooling operation based on the local graph Fourier transform, which can extract subgraph information utilizing both the node features and the structure of the subgraph. We provide a theoretical analysis of this pooling operation from both local and global perspectives. The proposed graph pooling operation, together with a subgraph-based graph coarsening method, forms a pooling layer that can be incorporated into any graph neural network to hierarchically learn graph-level representations. We further propose a graph neural network framework, EigenGCN, by combining the proposed pooling layers with GCN graph filtering layers. Comprehensive graph classification experiments were conducted on 6 commonly used graph classification benchmarks. Our proposed framework achieves state-of-the-art performance on most of the data sets, which demonstrates its effectiveness.

We then investigate the downsampling-based graph pooling operations. We introduce RepPool, which focuses on preserving information about both node importance and node representativeness. Specifically, the devised pooling module consists of two components, i.e., node selection and coarser subgraph generation. In the first component, we propose an overall score, which combines a global score measuring node importance with a score accounting for representativeness. Based on the overall score, a greedy algorithm is designed to select important and representative nodes one by one. In the second component, a new coarser subgraph is formed by utilizing information of non-selected nodes to build the connections and node features.
The proposed RepPool module stacked over GCN convolutional layers forms a hierarchical architecture to conduct the graph classification task. Comprehensive experiments on various widely used benchmark datasets demonstrate the superiority of RepPool compared to a variety of state-of-the-art methods. 7.2 Future Directions Though GNNs work well in graph representation learning and have advanced many applications, there are a lot of fundamental questions remaining unanswered, which we briefly summarized as follows: “Why do GNNs work?”; “How do GNNs work?”; and “Do GNNs work properly?” Answering these questions are important for understanding GNNs better and applying them to applications comfortably. “Why do GNNs work?” To answer the first question, we plan to investigate the expressiveness of GNNs. We have connected graph filtering operations with a graph denoising problem and thus provide some insights for GNNs in learning node representations. It is promising to further expand this work and also investigate the expressiveness of GNNs for learning graph representations, especially for the graph pooling operations. Our investigation on how GNNs works on heterophilous graphs also provides some understandings on what GNNs can capture. However, it still has some 99 limitations. Firstly, our current theoretical analysis majorly focuses on the GCN model; we hope to extend this analysis in the future to more general models with various graph filtering operations. Secondly, we plan to continue characterization of sufficient distinguishability of neighborhood and feature distributions to facilitate best class separation for future investigation. Also, the notion of homophily in our current work is defined based on node labels. As a future direction, we would like to extend the analysis to more broad and general definition of homophily. “How do GNNs work?” Just like traditional deep models, GNNs lack interpretability. Without understanding their decision making mechanism, GNNs can not be fully trusted and applied to fields like healthcare, where understanding the prediction is as important as achieving high prediction performance. Hence, it is important to answer the second question by investigating the interpretability of GNN models. More specifically, the interpretability of both graph filtering operations and graph pooling operations should be investigated. “Do GNNs work properly?” Finally, the third question involves many properties of GNNs including robustness and fairness. Recent studies show that GNNs can be easily fooled by deliberately-crafted perturbations, called adversarial attacks. The vulnerability to adversarial attacks preevent GNNs from safety-critical applications. Therefore, it is of great significance to develop robust GNN algorithms to defend adversarial attacks. Specifically, it is important to study the robustness of both graph filtering operations and graph pooling operations. Furthermore, GNNs likely exhibit algorithmic unfairness with respect to certain groups as traditional deep models. Besides these potential unfairness issues inherited from traditional deep models, they also have specific issues induced by graph structures. For example, the extremely skewed degree distribution in scale-free networks may cause the algorithms to bias towards nodes with large degrees (e.g., celebrities in online social networks), thus being unfair to the low-degree nodes. More effort is required to develop GNNs (both graph filtering operations and graph pooling operations) with fairness. 
100 APPENDIX 101 APPENDIX A IS HOMOPHILY A NECESSITY FOR GRAPH NEURAL NETWORKS? A.1 Additional Details and Results for Section 4.3.2.1 A.1.1 Details on the generated graphs In this subsection, we present the details of the graphs that we generate in Section 4.3.2.1. Specifically, we detail the distributions {D𝑐 , 𝑐 ∈ C} used in the examples, the number of added edges 𝐾, and the homophily ratio ℎ. We provide the details for Cora and Citeseer in the following subsections. Note that the choices of distributions shown here are for illustrative purposes, to coincide with Observations 1 and 2. We adapted circulant matrix-like designs due to their simplicity. A.1.1.1 Cora There are 7 labels, which we denote as {0, 1, 2, 3, 4, 5, 6}. The distributions {D𝑐 , 𝑐 ∈ C} are listed as follows. The values of 𝐾 and the homophily ratio of their corresponding generated graphs are shown in Table A.1. D0 : Categorical ([0, 0.5, 0, 0, 0, 0, 0.5]), D1 : Categorical ([0.5, 0, 0.5, 0, 0, 0, 0]), D2 : Categorical ([0, 0.5, 0, 0.5, 0, 0, 0]), D3 : Categorical ([0, 0, 0.5, 0, 0.5, 0, 0]), D4 : Categorical ([0, 0, 0, 0.5, 0, 0.5, 0]), D5 : Categorical ([0, 0, 0, 0, 0.5, 0, 0.5]), D6 : Categorical ([0.5, 0, 0, 0, 0, 0.5, 0]). 102 Table A.1: 𝐾 and ℎ values for generated graphs based on Cora 𝐾 1003 2006 3009 4012 6018 8024 10030 12036 16048 20060 24072 ℎ 0.789 0.737 0.692 0.652 0.584 0.529 0.483 0.445 0.384 0.338 0.302 A.1.1.2 Citeseer There are 6 labels, which we denote as {0, 1, 2, 3, 4, 5}. The distributions {D𝑐 , 𝑐 ∈ C} are listed as follows. The values of 𝐾 and the homophily ratio of their corresponding generated graphs are shown in Table A.2. D0 : Categorical ([0, 0.5, 0, 0, 0, 0.5]), D1 : Categorical ([0.5, 0, 0.5, 0, 0, 0]), D2 : Categorical ([0, 0.5, 0, 0.5, 0, 0]), D3 : Categorical ([0, 0, 0.5, 0, 0.5, 0]), D4 : Categorical ([0, 0, 0, 0.5, 0, 0.5]), D5 : Categorical ([0.5, 0, 0, 0, 0.5, 0]). Table A.2: 𝐾 and ℎ values for generated graphs based on Citeseer 𝐾 1204 2408 3612 4816 7224 9632 12040 14448 19264 24080 28896 ℎ 0.735 0.675 0.625 0.581 0.510 0.454 0.410 0.373 0.316 0.275 0.243 A.1.2 Results on more datasets: Chameleon and Squirrel We conduct similar experiments as those in Section 4.3.2.1 based on Chameleon and Squirrel. Note that both Squirrel and Chameleon have 5 labels, which we denote as {0, 1, 2, 3, 4}. We pre-define the same distributions for them as listed as follows. The values of 𝐾 and the homophily ratio of their corresponding generated graphs based on Squirrel and Chameleon are shown in Table A.3 and Table A.4, respectively. 103 D0 : Categorical ([0, 0.5, 0, 0, 0.5]), D1 : Categorical ([0.5, 0, 0.5, 0, 0]), D2 : Categorical ([0, 0.5, 0, 0.5, 0]), D3 : Categorical ([0, 0, 0.5, 0, 0.5]), D4 : Categorical ([0.5, 0, 0, 0.5, 0]). (A.1) Table A.3: 𝐾 and ℎ values for generated graphs based on Squirrel 𝐾 12343 24686 37030 49374 61716 74060 86404 98746 111090 12434 135776 ℎ 0.232 0.225 0.219 0.207 0.201 0.196 0.191 0.186 0.181 0.178 0.174 Table A.4: 𝐾 and ℎ values for generated graphs based on Chameleon K 1932 3866 5798 7730 9964 11596 13528 15462 17394 19326 21260 Homo. 
Ratio 0.249 0.242 0.236 0.230 0.224 0.218 0.213 0.208 0.203 0.198 0.194

Figure A.1: Performance of GCN on synthetic graphs with various homophily ratios: (a) synthetic graphs generated from Squirrel; (b) synthetic graphs generated from Chameleon. Each panel plots test accuracy against the homophily ratio.

The performance of the GCN model on these two sets of graphs (generated from Squirrel and Chameleon) is shown in Figure A.1. The observations are similar to what we found for the graphs generated based on Cora and Citeseer in Section 4.3.2.1. Note that the original Squirrel and Chameleon graphs already have very low homophily, but we still observe a 𝑉-shape in the figures. This is because there are some neighborhood patterns in the original graphs which are distinct from those that we designed for addition. Hence, when we add edges in the early stage, the performance decreases. As we add more edges, the designed pattern starts to mask the original patterns and the performance starts to increase.

A.1.3 GCN’s Performance in the Limit (as 𝐾 → ∞)

In this subsection, we illustrate that as 𝐾 → ∞, the accuracy of the GCN model approaches 100%. Specifically, we set 𝐾 to a set of larger numbers, as listed in Table A.5. Ideally, when 𝐾 → ∞, the homophily ratio will approach 0 and the model performance will approach 100% (for diverse-enough D𝑐). The performance of the GCN model on the graphs described in Table A.1 and Table A.5 is shown in Figure A.2. Clearly, the performance of the GCN model approaches the maximum as we continue to increase 𝐾.

Figure A.2: The GCN’s performance approaches 100% as we keep adding edges following the specific patterns (test accuracy plotted against the homophily ratio).

Table A.5: Extended 𝐾 and ℎ values for generated graphs based on Cora
𝐾   28084  32096  36108  40120  44132  48144  52156  56168  80240
ℎ   0.272  0.248  0.228  0.211  0.196  0.183  0.172  0.162  0.120

A.2 Experimental Settings: Datasets and Models

A.2.1 Datasets

We give the number of nodes, edges, homophily ratios and distinct classes of the datasets used in this work in Table A.6.

Table A.6: Benchmark dataset summary statistics.
                        Cora   Citeseer  Pubmed  Chameleon  Squirrel  Actor  Cornell  Wisconsin  Texas
# Nodes (|V|)           2708   3327      19717   2277       5201      7600   183      251        183
# Edges (|E|)           5278   4676      44327   31421      198493    26752  280      466        295
Homophily Ratio (ℎ)     0.81   0.74      0.80    0.23       0.22      0.22   0.3      0.21       0.11
# Classes (|C|)         7      6         3       5          5         5      5        5          5

A.2.2 Models

• H2GCN [106] specifically designs several architectural components to deal with heterophilous graphs, including ego- and neighbor-embedding separation (skip connections), aggregation from higher-order neighborhoods, and combination of intermediate representations. We include two variants, H2GCN-1 and H2GCN-2, with 1 and 2 steps of aggregation, respectively. We adopt the code published by the authors at https://github.com/GemsLab/H2GCN.
• GPR-GNN [10] performs feature aggregation for multiple steps and then linearly combines the features aggregated with different numbers of steps. The weights of the linear combination are learned during the model training. Note that it also includes the original features before aggregation in the combination.
We adopt the code published by the authors at https://github.com/jianhao2016/GPRGNN.
• CPGNN [105] incorporates the label compatibility matrix to capture the connection information between classes. We adopt two variants of CPGNN that utilize MLP and ChebyNet [15] as base models to pre-calculate the compatibility matrix, respectively. We use two aggregation layers for both variants. We adopt the code published by the authors at https://github.com/GemsLab/CPGNN.

A.2.3 MLP+GCN

We implement a simple method to linearly combine the learned features from a GCN model and an MLP model. Let H^{(2)}_{GCN} ∈ R^{|V|×|C|} denote the output features from a 2-layer GCN model, where |V| and |C| denote the number of nodes and the number of classes, respectively. Similarly, we use H^{(2)}_{MLP} ∈ R^{|V|×|C|} to denote the output features from a 2-layer MLP model. We then combine them for classification. The process can be described as follows:

    H = \alpha \cdot H^{(2)}_{GCN} + (1 - \alpha) \cdot H^{(2)}_{MLP},        (A.2)

where 𝛼 is a hyperparameter balancing the two components. We then apply a row-wise softmax to each row of H to perform the classification.

A.2.4 Parameter Tuning and Resources Used

We tune parameters for GCN, GPR-GNN, CPGNN, and MLP+GCN from the following options:
• learning rate: {0.002, 0.005, 0.01, 0.05}
• weight decay: {5e−04, 5e−05, 5e−06, 5e−07, 5e−08, 1e−05, 0}
• dropout rate: {0, 0.2, 0.5, 0.8}.
For GPR-GNN, we use “PPR” as the initialization for the coefficients. For MLP+GCN, we tune 𝛼 from {0.2, 0.4, 0.6, 0.8, 1}. Note that the parameter search range encompasses the range adopted in the original papers to avoid unfairness issues. All experiments are run on a cluster equipped with Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz CPUs and NVIDIA Tesla K80 GPUs.

A.3 Heatmaps for Other Benchmarks

We provide the heatmaps for Citeseer and Pubmed in Figure A.3 and those for Squirrel, Texas, and Wisconsin in Figure A.4. For Citeseer and Pubmed, which have high homophily, the observations are similar to those of Cora as described in Section 4.4.2. For Squirrel, there are some patterns; the intra-class similarity is generally higher than the inter-class similarities. However, these patterns are not very strong, i.e., the differences between them are not very large, which means that the neighborhood patterns of different labels are not very distinguishable from each other. This substantiates the middling performance of GCN on Squirrel. Both Texas and Wisconsin are very small, with 183 nodes, 295 edges and 251 nodes, 466 edges, respectively. The average degree is extremely small (< 2). Hence, the similarities presented in the heatmaps may exhibit strong bias. In particular, in Texas, there is only 1 node with label 1. In Wisconsin, there are only 10 nodes with label 0.

Figure A.3: Neighborhood Similarity On Citeseer and Pubmed. On both graphs, the intra-class similarity is clearly higher than the inter-class ones.
108 1.0 1.0 1.0 0 0.61 0.63 0.64 0.65 0.66 0 0.65 0.03 0.61 0.17 0.21 0 0.75 0.25 0.27 0.36 0.51 0.8 0.8 0.8 1 0.63 0.7 0.7 0.71 0.72 1 0.03 1.0 0.1 0.33 0.31 1 0.25 0.75 0.71 0.64 0.35 0.6 0.6 0.6 2 0.64 0.7 0.75 0.74 0.75 2 0.61 0.1 0.68 0.31 0.41 2 0.27 0.71 0.72 0.63 0.45 0.4 0.4 0.4 3 0.65 0.71 0.74 0.78 0.77 3 0.17 0.33 0.31 0.58 0.58 3 0.36 0.64 0.63 0.69 0.51 0.2 0.2 0.2 4 0.66 0.72 0.75 0.77 0.8 4 0.21 0.31 0.41 0.58 0.64 4 0.51 0.35 0.45 0.51 0.72 0 1 2 3 4 0.0 0 1 2 3 4 0.0 0 1 2 3 4 0.0 (a) Squirrel (b) Texas (c) Wisconsin Figure A.4: Neighborhood Similarity On Squirrel, Texas and Wisconsin. The inter-class similarity on Squirrel is slightly higher than intra-class similarity for most classes, which substantiates the middling performance of GCN. Both Texas and Wisconsin are quite small, hence the cross-class similarity in these two graphs present severe bias and may not provide precise information about these graphs. 109 BIBLIOGRAPHY 110 BIBLIOGRAPHY [1] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 21–29. PMLR, 09–15 Jun 2019. [2] Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an. Graph convolutional encoders for syntax-aware neural machine translation. arXiv preprint arXiv:1704.04675, 2017. [3] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pages 4502–4510, 2016. [4] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018. [5] Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005. [6] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013. [7] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. [8] Siheng Chen, Yonina C Eldar, and Lingxiao Zhao. Graph unrolling networks: Interpretable neural networks for graph signal denoising. arXiv preprint arXiv:2006.01301, 2020. [9] Siheng Chen, Aliaksei Sandryhaila, José MF Moura, and Jelena Kovacevic. Signal denoising on graphs via graph filtering. In 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 872–876. IEEE, 2014. [10] Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations, 2021. [11] Fan RK Chung and Fan Chung Graham. Spectral graph theory. Number 92. American Mathematical Soc., 1997. 