NODE CLASSIFICATION IN SHIFTING LANDSCAPES: FAIRNESS AWARE DOMAIN ADAPTATION IN NETWORK DATA By Anna Joy Stephens A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Master of Science 2023 ABSTRACT Domain adaptation is an important task that considers how to apply a model trained on a labeled source dataset to an unlabeled target dataset. This is becoming an increasingly relevant concern as many datasets do not contain a large number of reliable labels and must consider additional training data from other sources. The domain adaptation task is chal- lenging as the distributions of the source and target datasets may not be fully aligned. Such discrepancy in data distribution is called concept drift. This thesis focuses on the problem of domain adaptation in node classification for network data, a task known as cross-network node classification (CNNC). Unlike approaches specific to independent and identically dis- tributed (i.i.d.) data, CNNC is concerned with the additional challenges introduced by the link structure of the source and target networks and differences in their node distributions. Such differences may exacerbate unfairness in node classification, and lead to one protected group receiving a disproportionately large number of positive outcomes. Additionally, ex- isting CNNC approaches are computationally expensive as they rely upon having access to both the entire source and target graphs at one time. In this thesis I first present OTGCN, a method for performing domain adaptation between multiple disconnected networks for the task of node classification. OTGCN combines the powerful graph convolutional network (GCN) architecture along with techniques from optimal transport to design source node embeddings that are more aligned with nodes in the target graph. This allows us to train a more accurate node classifier within the target domain. I then present FOCI, an improvement to OTGCN which addresses fairness by implementing a novel optimal transport approach designed to directly target harmful link bias. Lastly I introduce FastFOCI and SpFOCI, two further enhancements to FOCI, which directly address performance and statistical parity, a popular group fairness measure. I demonstrate the effectiveness of each of these methods on several real-world datasets and discuss their strengths and weaknesses. Copyright by ANNA JOY STEPHENS 2023 This thesis is dedicated to my great-grandmother, Dorothy Marusek. Thank you for always encouraging my curiosity and education. Rest in peace. iv ACKNOWLEDGEMENTS I would like to acknowledge and thank everyone who has contributed to this work and my well-being along the way. First I would like to acknowledge my advisors Dr. Pang- Ning Tan and Dr. Abdol-Hossein Esfahanian. I would not be here without their patient instruction and assistance. Thank you as well to my committee member, Dr. Jiayu Zhou. Next I would like to thank the other members of the DMiner lab at MSU. Fransisco Santos especially, as my partner on network fairness projects, has been there every step of the way to learn, share, encourage, and commiserate. Other members of the lab during my time there include Dr. Tyler Wilson, Dr. Boyang Liu, Galib Asadullah, Bidhan Bashyal, Andrew Mcdonald, and Yue Deng, who have all shared insights, code, data, and advice. I would also like to thank my personal supporters through this process. My wonderful mother has never failed to encourage me or watch my dogs on long days. 
My brothers and faithful game masters have been my best friends and most vocal supporters, even when I cancel game sessions to meet a deadline. Last but certainly not least, my incredible fianc`e, Joel, has supported me during every step of this crazy adventure from proof reading my initial applications to the lunch he’s making me as I write this (which I will also ask him to proof read). Lastly I would like to acknowledge the grant that has supported me through my time at MSU. This material is based upon work supported by the NSF Program on Fairness in AI in collaboration with Amazon under grant IIS-1939368. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or Amazon. v TABLE OF CONTENTS LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii CHAPTER 1: INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2: LITERATURE REVIEW. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 CHAPTER 3: CNNC ACROSS SAMPLE GROUPS . . . . . . . . . . . . . . . . . . . . . . . . . . 8 CHAPTER 4: FAIR CNNC VIA OPTIMAL TRANSPORT . . . . . . . . . . . . . . . . 26 CHAPTER 5: CONCLUSION & FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 vi LIST OF ABBREVIATIONS ASD Autism Spectrum Disorder. CNNC Cross-Network Node Classification. fMRI Functional Magnetic Resonance Imaging. GCN Graph Convolutional Network. GNN Graph Neural Network. MMD Maximum Mean Discrepancy. NFT Node Feature Transformation. OT Optimal Transport. PPMI positive pointwise mutual information. TSNE T-distributed Stochastic Neighbor Embedding. VAE Variational Auto-Encoders. vii CHAPTER 1: INTRODUCTION 1.1 Motivation Within the modern data landscape a typical dataset is often thought of as a single table filled with rows of independent and identically distributed (i.i.d.) data points. However, realistic datasets often take much more complex forms such as images, which must consider the spatial relationships between pixels, or time series, which considers sequence of the data points. Another complex dataset is the network, which considers both individual data points and the relationships between them. A network or graph consists of a set of nodes which are connected to each other by edges. These edges introduce relationships between data points which can be useful for modeling a wide range of important factors. Graphs can be used for modeling social networks, traffic patterns, brain structures, molecules, and many other scenarios in which structures, locations, similarity, or relationships are of importance. One of the most common tasks when working with graphs is node classification. The standard node classification problem is to classify some unlabeled nodes when given a graph with node features, edge structure, and some labeled nodes. However, while the node clas- sification task is common, real world scenarios are often much more nuanced. In particular, often a dataset of interest may not include enough labeled training data and must supple- ment from another labeled dataset. 
This can lead to complications as the datasets may not always have identical distributions. For example, social networks may represent different regions where node classification is complicated by divergence in societal norms. Population graphs used to model medical conditions in patient populations may come from large re- search institutions or small rural clinics. These scenarios motivate the task of cross-network node classification. In cross-network node classification (CNNC) you typically have two graphs: a labeled source graph and an unlabeled target graph. The goal is to correctly classify nodes in the target graph. The source and target graphs contain the same features, but have varying distributions. The difference between the two distributions is called concept shift and is a 1 major focus of this work. 1.2 Challenges In this thesis I consider three major challenges with CNNC. The primary challenge to any CNNC approach is to perform domain adaptation in a manner which respects the graph structure. In addition, fairness and performance are two very practical challenges when implementing a method for CNNC. Domain adaptation is generally the task of classifying unlabeled target data points by making use of target feature information and a labeled source dataset. As described in the CNNC problem, the source and target datasets are often separated by concept drift. Most previous CNNC methods implement some form of adversarial domain adaptation [20] [67] [75] [56]. However, adversarial domain adaptation is often difficult to train and may not be effective in cases with a large amount of concept drift [30]. In this work I instead choose to address the domain adaptation challenge by implementing an optimal transport domain adaptation method. Optimal transport (OT) is a technique for discovering the best way to map between two distinct probability distributions. This map can be used to transform data points from one domain to another and can serve as a powerful tool in domain adaptation. Importantly, OT can be used to map between datasets with a large amount of concept drift and can be implemented with very little data. Fairness in artificial intelligence can be grouped into two main categories. Individual fairness considers whether an outcome is fair to individuals, and group fairness considers if an outcome is fair to a group as a whole. The focus of this work is group fairness, and one of the most popular group fairness metrics is statistical parity. Statistical parity states that each protected group (e.g., males and females) should have an equal probability of experiencing a positive outcome. However, due to the fairness-utility trade off improving statistical parity may degrade model utility or performance [53] [15]. The challenge in this work is to incorporate statistical parity in a manner which balances this trade-off. For various reasons most CNNC approaches encounter scalability issues when tasked 2 with large graphs. While these issues manifest at various technical points they all stem back to the same root cause. These CNNC approaches assume access to the entire graph in each step. For example, ASN, a state-of-the-art CNNC approach, implements variational auto- encoders which require access to complete source and target graphs and then a loss function with aggregation steps that can quickly explode as graphs increase in size [75]. In this work I specifically design a CNNC approach which is capable of scaling to large graphs if needed. 
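As a concrete reference point for the statistical parity measure discussed above, the following is a minimal sketch (illustrative only; the function and variable names are my own and not part of any released code) of how the statistical parity gap could be computed for binary predictions and a binary protected attribute.

```python
import numpy as np

def statistical_parity_gap(y_pred, protected):
    """Absolute difference in positive-prediction rates between two protected groups.

    y_pred    : array of 0/1 predictions
    protected : array of 0/1 protected-group memberships
    Returns |P(y_pred=1 | protected=0) - P(y_pred=1 | protected=1)|.
    """
    y_pred = np.asarray(y_pred)
    protected = np.asarray(protected)
    rate_g0 = y_pred[protected == 0].mean()  # positive rate for group 0
    rate_g1 = y_pred[protected == 1].mean()  # positive rate for group 1
    return abs(rate_g0 - rate_g1)

# Toy example: group 1 receives positive outcomes far more often than group 0.
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
protected = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(statistical_parity_gap(y_pred, protected))  # 0.25 vs 0.75 -> gap of 0.5
```

A value of zero indicates that both groups receive positive outcomes at the same rate; the fairness-utility trade-off mentioned above arises because pushing this gap toward zero can reduce predictive accuracy.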
1.3 Key Contributions In this thesis I introduce several novel contributions to address the challenges described above. • I develop OTGCN, a CNNC approach that relies upon optimal transport based domain adaptation. • I introduce two new fair Sinkhorn distance measures, one supervised and one unsuper- vised, which can be used to mitigate unfairness in optimal transport. • I develop FOCI, a CNNC approach that addresses fair domain adaptation by imple- menting a fair optimal transport algorithm. • I develop FastFOCI and SpFOCI, two additional CNNC approaches that extend the work of FOCI to add more scalability and an alternative approach to fairness. • I perform experiments to demonstrate the effectiveness of these new approaches on several real-world datasets. 1.4 Thesis Organization The remainder of this thesis will be organized in the following manner. Chapter 2 contains a broad overview of various areas of relevant literature. Areas of literature presented in this chapter are relevant to the complete body of work, and are supplemented in later chapters with related work sections specific to the chapter of interest. Chapter 3 presents OTGCN, our initial CNNC framework, and then demonstrates it as an effective approach when working with fMRI data collected from several different collection sites. Chapter 4 presents two new approaches to optimal transport which allow it to consider fairness directly 3 within the transportation process. I then use these new optimal transport algorithms to develop several improvements to OTGCN which allow it to consider fairness and improve performance. All new methods are demonstrated experimentally on datasets of various sizes. 1.5 Publications Much of the work in this thesis has been adapted from the following papers: • Anna Stephens, Francisco Santos, Pang-Ning Tan, Abdol-Hossein Esfahanian. “Popu- lation Graph Cross-Network Node Classification for Autism Detection Across Sample Groups”. In Proceedings of the Data Mining in Biomedical Informatics and Healthcare workshop at IEEE International Conference on Data Mining (DMBIH@ICDM, 2023), Shanghi (2023) • Anna Stephens, Francisco Santos, Pang-Ning Tan, Abdol-Hossein Esfahanian. “FOCI: Fair Cross-Network Node Classification via Optimal Transport.” (under review), 2023 4 CHAPTER 2: LITERATURE REVIEW In this chapter I review literature related to several key areas of this project. Specifically this chapter includes literature on node or graph representation learning, cross-network node classification (CNNC), optimal transport as it has been used with graph data structures, and fairness in neural networks. Some of these areas are much too large to be fully reviewed within the confines of this chapter, and in those cases I refer the reader to survey papers for a more thorough discussion of the topic. 2.1 Node Representation Learning In current literature the most common method for constructing node representations or embeddings is via graph neural networks (GNNs). GNNs generally consider graph structure by using message passing techniques to combine a node’s features with features from other nodes in its neighborhood. Notably, node representation learning is a large field and this section will only hit some of the highlights. For more information please consider a survey paper on the topic [76] [36]. Arguably the most well known and straightforward GNN is the Graph Convolutional Network (GCN)[37]. 
GCN uses a simple message passing technique to combine node features with features from all of its immediate neighbors in one step. GCN layers can then be stacked to expand to additional neighbors beyond one hop. Most other convolutional GNNs build off the GCN in some way. JKNet [68] and MixHop [2] are approaches for expanding convolutions beyond the immediate neighborhood. FastGCN [12] and simple graph convolution (SGC) [66] are methods for simplifying and speeding up GCN. GraphSAGE [28] is also similar to GCN, but does not require the model to evaluate the whole graph at one time; instead performing strategic graph sampling before sending portions of the graph to a GCN-like model. While graph convolutional networks are perhaps the largest and well known family of GNNS, graph attention networks follow closely behind. The most well known graph attention networks is aptly named Graph Attention Network (GAT) [62]. GAT introduces 5 an attention mechanism by which the model can weight various neighboring nodes differently when combining features. GATv2 [8] re-orders the steps of GAT in order to provide a more expressive representation. Multi-hop Attention Graph Neural Network (MAGNA) [63] also operates similarly to GAT, but considers larger neighborhoods in each layer or step. 2.2 Cross Network Node Classification One of the earliest works on CNNC is the CDNE algorithm [57]. CDNE uses stacked autoencoders to learn separate embeddings for the source and target networks and tackles domain adaptation by utilizing maximum mean discrepancy (MMD) loss between the two embeddings. ACDNE[56] was a follow up work to CDNE which improved upon embedding construction by designing separate methods for the node features and graph structure. Ad- ditionally, it used an adversarial discriminator to address domain adaptation instead of the MMD loss. More recently AdaGCN[20], UDA-GCN[67], and ASN[75] have attempted to address the CNNC task. All three of these methods relied upon the GCN acchitecture [37] to generate the node embeddings. AdaGCN[20] addresses the domain adaptation issue with an adversarial discriminator. UDA-GCN[67] followed on the heels of AdaGCN by utilizing the PPMI matrix instead of an adjacency matrix along with an entropy loss to update the target GCN. Finally ASN[75] is the most recent of these works and combined concepts from both AdaGCN and UDA-GCN. ASN used variational auto-encoders (VAE) to help preserve information unique to the source and target graphs respectively. None of these methods address fairness concerns. 2.3 Optimal Transport with Graphs There is a small body of work investigating OT within the context of graph data. Some of these works [60][1] have developed OT-inspired distance measures for graph data, while others employ OT for graph comparison or alignment tasks instead of node classificaiton. For example, GOT[13] uses a combination of traditional Wasserstein distance with the Gromov- Wasserstein to find a transport plan for the purposes of graph alignment. Zhang et al. [74] 6 employed OT to fine-tune large GNNs while OT-GNN[11] used OT to develop a prototype for prototype learning on whole graph classification. Unfortunately, none of these works consider using OT for CNNC nor addressing the issue of algorithmic bias. 2.4 Fairness in Neural Networks Previously I mentioned that fairness metrics and definitions can be grouped into ones targeting group fairness and individual fairness. 
On the other hand, fairness mechanisms are typically grouped based on when they occur in the machine learning pipeline: pre-process, in-process, and post-process. This section focuses on some popular fairness mechanisms. Similar to node representation learning, fairness in neural networks is a large topic, and I recommend a survey paper for a more in-depth review of both fairness history, metrics, and mechanisms [53]. Pre-processing fairness approaches modify the training data before sending it to a model for training. One approach does this by using a k-nearest neighbor classify to strategically flip training labels so that a later model will be more generous towards specific protected groups [45]. Another approach implements representation learning techniques to design a more fair feature set [71]. In-processing approaches modify the model directly in some way during training in or- der to train a fair model more directly. This is often done by adding additional terms to the model’s loss function [32] [7] [5]. Sometimes, however, these approaches can become more creative. For example, one method uses maximum mean discrepancy (MMD) to align dis- tributions of various protected groups [54]. Another method uses variational auto-encoders and MMD to remove any dependence on the protected attribute [44]. Post-processing methods modify model output in some way to encourage fairness. A straighforward example of this is using different probability thresholds for different protected groups in a binary classification task [47]. More generally one can strategically flip the labels of some predicted outputs to meet some specific fairness criteria [29]. 7 CHAPTER 3: CNNC ACROSS SAMPLE GROUPS 3.1 Introduction The proliferation of graph-based data in various application domains has motivated the need to develop more advanced techniques based on deep learning to harness the network data from multiple sources in order to improve the performance of node classification algorithms. Current techniques would learn a feature embedding of the nodes from multiple graphs using their node features and link structure information. The learned feature representation is then presented to a fully-connected network layer to perform the final classification. However, despite the notable advances in graph neural networks, there are still numerous challenges that must be addressed in order for the techniques to be successfully deployed to solve real-world problems. First, existing techniques often assume that the graphs share similar distributional prop- erties, thus enabling us to apply a model trained on one graph, a.k.a. the source graph, to the nodes in another, a.k.a. the target graph. Unfortunately, such a scenario is too idealistic for the real world as one would likely encounter some form of distributional shift, where the training data only captures the essence of a particular graph but fails to account for some unforeseen differences in another graph. Such type of concept drift is illustrated in Fig. 3.1, Figure 3.1: An illustration of OT for domain adaptation, where the color represents class labels. The diagram on the left is the original source dataset and on the far right is the target dataset. The dotted line is a the decision boundary of logistic regression trained on the source dataset. The middle diagram shows the transported source dataset using OT, with the solid line representing the decision boundary obtained by logistic regression. 
8 where the decision boundary induced from the learned representation of a source graph (left) does not reflect the class separation of the nodes in the target graph (right). To address the concept drift problem, domain adaptation and transfer learning methods [50] have been developed. Here, the deep neural network is initially trained using data from a related source domain. The model is then fine-tuned to fit characteristics of the target domain. Transfer learning is particularly useful when there is limited training data available in the target domain. For graph data, transfer learning is typically studied under the guise of cross-network node classification (CNNC) problems [56, 57, 20, 67, 75]. In CNNC, we consider a scenario in which there is a source graph, with fully or mostly labeled nodes, and a target graph, which has either a few or no class labels. CNNC approaches are particularly well suited to address the multi-site Autism Spec- trum Disorder (ASD) detection problem. ASD refers to a condition characterised by specific communication impairments, restricted interests, and repetitive behaviours [43]. As the disorder typically presents itself early in life, early and accurate detection can help reduce the severity of many lifelong symptoms. Towards this end, automated techniques based on machine learning and deep learning have been developed for the early detection of ASD. Specifically, these techniques have been applied to a wide range of subject data including detailed subject screening data [22], videos of subject movements [77], and Magnetic Reso- nance Imaging (MRI) data [42]. Given the diverse modalities of data available, it is likely that the best detection results can be achieved by combining the data from different sources. Previous works have established an effective method to combine various modalities of data using graph structures because of their flexible yet powerful representation [61] [72]. These approaches make use of population graphs, where the nodes represent individuals and edges are generally defined with some similarity measure. In several prior ASD research, the node features are generated from image information while the edges consider a combination of subject information (sex, age, etc) in addition to image similarity [51] [52] [35] [31]. Once these graphs are constructed, detecting ASD becomes a node classification problem. 9 Similar to other diagnosis problems, one of the major challenges in applying machine learning to ASD detection is the limited amount of labeled data available. This has led to growing interest in utilizing labeled data from multiple sites to train the machine learning model. One limitation of these approaches is that they fail to consider a more realistic scenario in which there may be little or no labeled data associated with the population of interest. For example, a model may be trained on a research dataset but needs to be applied to another location with a different imaging equipment or patient demographics that were not well represented in the training data. This problem can be addressed using an approach generally known as domain adaptation or transfer learning [59]. Within the context of node classification tasks, it is also referred to as cross-network node classification (CNNC) approach. CNNC assumes the availability of a sufficiently large number of labeled nodes in a source network and an unlabeled target network, whose node labels are to be predicted accurately. 
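To make the effect of such distributional shift concrete, the following toy example mirrors the situation in Fig. 3.1: a logistic regression fit on a source distribution loses much of its accuracy once the same classes are shifted in the target domain. The data, shift direction, and numbers here are synthetic and purely illustrative; they are not drawn from ABIDE or any dataset used later in this chapter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Source domain: two well-separated Gaussian classes.
n = 500
Xs = np.vstack([rng.normal([0, 0], 1.0, size=(n, 2)),
                rng.normal([3, 3], 1.0, size=(n, 2))])
ys = np.array([0] * n + [1] * n)

# Target domain: the same two classes, shifted along the direction that matters
# for the source decision boundary (a crude stand-in for site-to-site drift).
shift = np.array([2.5, 2.5])
Xt = Xs + shift
yt = ys

clf = LogisticRegression().fit(Xs, ys)
print("accuracy on source:", clf.score(Xs, ys))  # typically around 0.98
print("accuracy on target:", clf.score(Xt, yt))  # typically drops to roughly 0.5-0.6
```

The degradation occurs because the shifted class 0 now sits on the wrong side of the boundary learned from the source data, which is exactly the failure mode that domain adaptation, and in this thesis optimal transport, is meant to correct.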
This chapter focuses on addressing the CNNC task for ASD detection. We use the pop- ular ABIDE [17] dataset, which contains both resting-state functional Magnetic Resonance Imaging (fMRI) information and phenotypic data such as age, sex, and screening results. Samples in the ABIDE dataset were collected from several different collection sites, which were then divided into 2 groups to form the source and target networks for our experimental studies. CNNC has two major challenges when applied to the ABIDE dataset. The first is extracting relevant information from the networks for ASD detection. In this work we will learn a graph embedding of the source and target networks via a graph convolutional network (GCN) [37]. A GCN layer is capable of learning a node embedding which contains information about a node and its immediate neighbors. This method relies on the homophily principle [46], which states that similar individuals tend to form neighborhoods within a graph. Therefore, in a graph with high homophily we can improve a node embedding by adding information from it’s neighborhood. 10 The second major challenge is to handle potential concept drift between the source and target networks. As the separate data collection sites may have different fMRI imaging equipment and procedural differences, this may introduce some discrepancies or ”drift” be- tween the two networks. As previously mentioned, the presence of concept drift often leads to poor results if a classifier is simply trained on the source dataset and applied as it is to the target dataset. Fig. 3.1 demonstrates the challenge of CNNC when the learned node embedding does not account for such distributional shift. A decision boundary learned from the source dataset is likely to incorrectly classify a significant portion of the target dataset. In this chapter we introduce OTGCN, a novel approach to address the CNNC task for ASD detection. Our proposed approach combines graph neural network with optimal trans- port (OT) to handle the drift between the source and target networks. OT is a method for mapping a transportation between two distributions. We will use OT to map the learned source representation to the target representation. This strategy allows us to train an ac- curate model for classifying the target nodes. Experimental results on the ABIDE dataset demonstrate the effectiveness of our approach at diagnosing ASD across different collection sites compared to state of the art CNNC methods. 3.2 Related Works In previous chapters we have discussed works relevant to node classification and cross- network node classification tasks. In this chapter, however, we specifically focus on the use of optimal transport for addressing ASD detection across various sample collection sites. To that end, we must now explore works which are specifically relevant to detecting ASD as well as addressing fMRI data collected from multiple sites. 3.2.1 Machine Learning Approaches to ASD Detection ASD detection is a task that lends itself to a wide variety of approaches. Zunino et al [77] employed computer vision approaches for automatic early detection of ASD by evaluating videos of subject movements when grasping a bottle. They then applied recurrent neural networks to distinguish subjects who have ASD from those who do not. 
Erkan and Thanh 11 [22] analyzed detailed screening data collected from mobile app surveys using traditional machine learning methods such as k-nearest neighbor, support vector machines, and random forests to diagnose subjects with ASD. Yuan et al. [70] applied natural language processing (NLP) techniques to analyze hand written medical forms of potential ASD patients while Carette et al. [10] performed ASD detection on eye tracking data using long-short term memory (LSTM) networks. The majority of the recent works in this area, however, focuses on using MRI data [42] [49] along with other subject information such as sex, age, and screening results [51] [52] [31] [35] for screening potential ASD patients. For example, Li et al. [42] presented a graph neural network approach to find biomarkers that can be used to detect ASD while Parisot et al. [51] employed a graph convolutional network (GCN) to perform the detection using both fMRI imaging and non-imaging phenotypic data. Similar to these works, the work in this chapter focuses on using a combination of fMRI and other subject data, though our approach is also applicable to blend other forms of data given the representation power of graph neural networks. 3.2.2 Machine Learning on Multi-site fMRI Data Previous works on diagnosing brain-related problems using fMRI data from multiple sites generally fall into two major categories—transfer learning and federated learning. Transfer learning [59] is a machine learning approach that enables prediction models trained from a given data domain (known as the source domain) to be applied to another domain (known as the target domain). Such a domain adaptation approach can be used even if the target domain has no labeled training data. Previous works on applying trans- fer learning to fMRI data can be found in [14] [64] [27]. However, these approaches are mostly designed for using only the image information and do not consider more complex data structures or the use of additional non-imaging information. In contrast, federated learning [73] is designed for training prediction models in a col- laborative fashion on decentralized datasets. The approach assumes restricted or indirect 12 access to the source dataset and direct access to a partially labeled target dataset. There has been a few works applying federated learning approaches to multi-site fMRI data [41] [65], but none of them consider non-imaging information. 3.3 Preliminaries 3.3.1 Problem Statement Let G(V, E, X, Y ) be an attributed graph, with a node set V , edge set E ⊆ V × V , node feature matrix X, and node label Y . Let A denote the adjacency matrix representation of E, where Aij > 0 if (vi, vj) ∈ E and 0 otherwise. In a domain adaptation setting, we assume there exists a source graph, Gs(Vs, Es, Xs, Ys), where the node labels in Ys are known, and a target graph, Gt(Vt, Et, Xt, Yt), where the node labels in Yt are unknown. The adjacency matrices corresponding to the source and target graphs are denoted as As ∈ Rns×ns and At ∈ Rnt×nt, respectively, where ns and nt are the corresponding number of source and target nodes. We further assume that both graphs have identical features, i.e., Xs ∈ Rns×m and Xt ∈ Rnt×m. Both graphs are also assumed to have the same set of class labels, i.e., Ys ∈ {0, 1, · · · , k − 1}ns and Yt ∈ {0, 1, · · · , k − 1}nt, where k is the number of classes. For ASD detection, we consider a binary node classification problem with k = 2. 
At times, we also consider the combined graph $G_c(V_c, E_c, X_c, Y_c)$, where $V_c = V_s \cup V_t$, $E_c = E_s \cup E_t$, $X_c = [X_s; X_t] \in \mathbb{R}^{n \times m}$, $A_c = [A_s \; 0; 0^T \; A_t] \in \mathbb{R}^{n \times n}$, $Y_c \in \{0, 1, \cdots, k-1\}^n$, $n = n_s + n_t$, and $0$ is an $n_s \times n_t$ null matrix.

Definition 1 (Cross Network Node Classification (CNNC)). Given a source graph, $G_s = (V_s, E_s, X_s, Y_s)$, and a target graph, $G_t = (V_t, E_t, X_t, Y_t)$, our goal is to learn a target function, $f : V \to \{0, 1, \cdots, k-1\}$, that accurately classifies the labeled nodes in $Y_s$ as well as the unlabeled nodes in $Y_t$.

3.4 Graph Convolutional Network (GCN)

Graph convolutional networks [37] employ a message passing strategy to succinctly capture both the node features and graph structure information when learning the feature embedding of a node in the graph. Specifically, each "message" corresponds to the feature embedding information of a node, which will be passed to all of its immediate neighbors. By aggregating the features gathered from the neighbors, a new embedding for the node will be generated. The message passing strategy can be formally stated as follows. Given an adjacency matrix A and node feature matrix X, the feature embedding of the nodes in layer l+1, $H^{(l+1)}$, can be computed based on the feature embedding at the previous layer, $H^{(l)}$, as follows:

$$H^{(l+1)} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right), \qquad \tilde{A} = A + I, \qquad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} \quad (1)$$

where $H^{(0)} = X$ (i.e., original node features), $\sigma(\cdot)$ is the ReLU activation function, and $\tilde{D}$ is a diagonal matrix containing the sum of the edge weights associated with each node in the graph.

3.5 Optimal Transport

Optimal Transport (OT) provides a principled approach for comparing two probability distributions by finding the least costly way to reshape one of the distributions into the other while incorporating their geometric information. The original OT problem was proposed by Monge [48], with its modern formulation being attributed to Kantorovich [33]. Given two marginal distributions $\mu^s \in \mathbb{R}^{n_s}$ and $\mu^t \in \mathbb{R}^{n_t}$, let $C \in \mathbb{R}^{n_s \times n_t}$ be a cost matrix, where $C_{ij}$ is the cost of transporting one unit from $\mu^s_i$ to $\mu^t_j$. Consider a transportation plan matrix $P \in \mathbb{R}^{n_s \times n_t}$, where $P_{ij}$ is the proportion of probability mass $\mu^s_i$ that should be moved to $\mu^t_j$.

Figure 3.2: An illustration of the cost and transport plan matrices of optimal transport.

The optimal transport (OT) problem seeks to find a transportation plan P that minimizes the total transportation cost. The minimum transportation cost $W(\mu^s, \mu^t)$ is also called the Wasserstein distance or the earth-mover distance:

$$W(\mu^s, \mu^t) := \min_{P \in U(\mu^s, \mu^t)} \langle P, C \rangle_F \quad (2)$$

$$U(\mu^s, \mu^t) := \left\{ P \in \mathbb{R}^{n_s \times n_t}_{+} \;\middle|\; P 1_{n_t} = \mu^s, \; P^T 1_{n_s} = \mu^t \right\}$$

The discrete OT formulation [33] shown above can be solved as a linear programming problem. However, it is computationally expensive, $O((n+m)\,nm\,\log(n+m))$, unstable, and is not guaranteed to have a unique solution. Fortunately, these problems can be addressed by employing the Sinkhorn distance $W_\lambda$ [18] shown below, which utilizes an entropy regularization function $H(P)$ to accelerate the OT computation. Specifically, $W_\lambda(\mu^s, \mu^t)$ and $P^\lambda$ can be solved using the well-known Sinkhorn algorithm as established by Cuturi [18].

$$W_\lambda(\mu^s, \mu^t) = \langle P^\lambda, C \rangle \quad (3)$$

$$P^\lambda = \operatorname*{argmin}_{P \in U(\mu^s, \mu^t)} \; \langle P, C \rangle - \lambda H(P), \qquad U(\mu^s, \mu^t) := \left\{ P \in \mathbb{R}^{n_s \times n_t}_{+} \;\middle|\; P 1_{n_t} = \mu^s, \; P^T 1_{n_s} = \mu^t \right\}$$

where $H(P) = -\sum_{ij} P_{ij} \log P_{ij}$ is the entropy regularization and $\lambda$ is a user-specified regularization hyperparameter.
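For readers unfamiliar with the Sinkhorn algorithm, the following is a minimal numpy sketch of the entropy-regularized OT problem in Eqn. (3) under the convention used here (λ weighting the entropy term). It illustrates the general procedure only; it is not the implementation used in this thesis, and all names are placeholders.

```python
import numpy as np

def sinkhorn(mu_s, mu_t, C, lam, n_iters=1000, tol=1e-9):
    """Entropy-regularized OT (Sinkhorn) sketch.

    mu_s : (ns,) source marginal, mu_t : (nt,) target marginal (each sums to 1)
    C    : (ns, nt) cost matrix
    lam  : weight of the entropy regularizer, as in Eqn. (3)
    Returns the transport plan P of shape (ns, nt).
    """
    K = np.exp(-C / lam)                      # Gibbs kernel
    u = np.ones_like(mu_s)
    v = np.ones_like(mu_t)
    for _ in range(n_iters):
        u_prev = u
        u = mu_s / (K @ v)                    # enforce row marginals
        v = mu_t / (K.T @ u)                  # enforce column marginals
        if np.max(np.abs(u - u_prev)) < tol:
            break
    return u[:, None] * K * v[None, :]        # P = diag(u) K diag(v)

# Tiny example with uniform marginals and a random cost matrix.
rng = np.random.default_rng(0)
ns, nt = 5, 7
C = rng.random((ns, nt))
P = sinkhorn(np.full(ns, 1 / ns), np.full(nt, 1 / nt), C, lam=0.05)
print(P.sum(axis=1))   # approximately 1/ns each: row marginals are respected
print((P * C).sum())   # the transport cost <P, C> of the regularized plan
```

Smaller values of λ in this convention push the entropic plan closer to the unregularized (sparser) optimal plan, at the cost of more iterations and potential numerical underflow in the kernel.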
3.6 Methodology We employed a combination of graph convolutional networks (GCN) with node feature transformation layers to learn the feature embedding of the nodes in the source and target graphs. We then performed optimal transport on the learned embedding of the source nodes to match the distribution of the learned embedding of the target nodes. Figure 3.3 provides a high-level illustration of our proposed deep neural network architecture along with its 15 training procedure. Details of the proposed architecture are described in the subsections below. 3.6.1 Initial Node Embedding Construction & Pretraining As previously noted, conventional GCN employs a message passing strategy to transmit information about the feature embedding of a node to its immediate neighbors. This allows each node to aggregate the feature information of its neighbors when constructing its own latent embedding. Unfortunately, in graphs with low homophily, it is possible that the neigh- borhood information obtained via message passing is of less value than the original feature information of the node itself, which was the case with the population graph constructed Figure 3.3: High level architecture of our model. Combined source and target datasets are fed to two GCN layers and two NFT layers to learn two distinct embeddings. In pretraining (left) these embeddings are concatenated and sent directly to a fully connected classifier. Model weights are then updated using just cross entropy loss. In OT training (right) con- catenated embeddings are routed to an OT layer before being sent to the classifier. The OT layer replaces the source embedding with a transported version and then sends the target embedding and the transported source embedding to the classifier. At this point the model weights are updated with a combination of cross entropy and OT losses. 16 from the ABIDE dataset. However, due to its current formulation (see Eqn. 1), conventional GCN may not be able to attenuate the influence of the graph structure in comparison to the influence of the original features. To overcome this challenge, we propose a modification to the graph convolutional net- work architecture to independently learn a nonlinear embedding of the original node features. We call this the node feature transformation (NFT) layer in Fig 3.3. The NFT layer consists of a combination of linear layer plus ReLU activation functions to transform the original node features into their corresponding nonlinear embedding. The transformed features are then concatenated with the structural embedding of the nodes obtained from GCN for sub- sequent node classification. This strategy increases the flexibility of the model to capture the relative influence of the node features and homophily (i.e., neighborhood features) on the classification task. As shown in the architecture diagram, the GCN layers are trained on combined adjacency and feature matrices of the source and target graphs, while the NFT layers are trained on the combined feature matrices. The weights of the GCN and NFT layers are jointly updated during backpropagation. The pretraining process of the deep neural network can be seen on the left side of Fig 3.3. The purpose of pretraining is to ensure that the full network with optimal transport (right side of Fig 3.3) can be seeded with a good set of initial weights. During pretraining, we train the NFT and GCN layers to each produce their own feature embeddings, H L (output of NFT layer) and H G (output of GCN layer). 
The two embeddings are then concatenated and sent to a fully connected network for node classification. The network is trained to minimize the following cross-entropy loss function:

Pretraining: $L_{CE} = -\frac{1}{n_s} \sum_{x_i \in X_s} \sum_{j=0}^{l} Y_{ij} \log(\hat{Y}_{ij})$ (4)

3.6.2 Domain Adaptation via Optimal Transport GCN

We address the domain adaptation problem for cross network node classification (CNNC) by combining the pretrained network with an optimal transport layer as shown on the right side of Fig. 3.3. The optimal transport layer utilizes the Sinkhorn algorithm [18] to learn the relevant transportation plan matrix P that will transform the learned embedding of the source nodes to match the distribution of the target node embedding. The OT layer takes the source and target node embeddings along with the entropy regularization hyperparameter λ as inputs and returns a new embedding of the source nodes, $\hat{H}_s$, which matches the distribution of the target nodes, $H_t$. This can be accomplished by solving the OT problem given in Eqn. (3) to learn the transportation plan P and using the barycentric mapping approach in [16] [23] to transform the source node features. Specifically, for each source node i, its latent features will be transported to a new embedding as follows:

$$\hat{H}_{s,i} = \operatorname*{argmin}_{H \in \mathbb{R}^d} \sum_j P^{\lambda}_{i,j} \, C(H, H_{t,j})$$

If the cost function C corresponds to the squared $l_2$ distance, then the barycentric mapping reduces to the following form:

$$\hat{H}_s = \mathrm{diag}(P^\lambda 1_{n_t})^{-1} P^\lambda H_t$$

For domain adaptation, we typically choose the marginal distributions of the source and target node embeddings, $\mu^s$ and $\mu^t$, to be uniform distributions. This allows us to further simplify the equation as follows:

$$\hat{H}_s = n_s P^\lambda H_t \quad (5)$$

where $n_s$ is the number of source nodes. Both $\hat{H}_s$ and $H_t$ are then fed to a fully connected layer to perform the node classification. By replacing $H_s$ with $\hat{H}_s$ we train the model to work on data that has been transported to the target domain. This addresses the domain adaptation problem and allows the classifier to accurately classify the target nodes. Finally, the full OTGCN network is trained to minimize the following joint objective function:

$$L = L_{CE} + \theta L_{OT} \quad (6)$$

where $L_{CE}$ is the cross entropy loss given in Eqn. (4) and $L_{OT}$ is the optimal transport loss defined as follows:

$$L_{OT} = \langle P^\lambda, C \rangle - \lambda H(P^\lambda) \quad (7)$$

with the hyperparameter θ controlling how much emphasis is placed on the optimal transport term.

3.7 Experimental Evaluation

This section presents the experiments performed to evaluate the effectiveness of our proposed OTGCN framework. The source code for the framework and other aspects of our experiments can be found at https://github.com/ajoystephens/otgcn

Site        Samples  Autism  Dataset
YALE        41       22      target
CALTECH     15       5       target
CMU         11       6       target
NYU         172      74      source
UM 1        86       34      source
USM         67       43      source
UCLA 1      64       37      source
PITT        50       24      source
MAX MUN     46       19      source
TRINITY     44       19      source
UM 2        34       13      source
KKI         33       12      source
LEUVEN 2    28       12      source
LEUVEN 1    28       14      source
OLIN        28       14      source
SDSU        27       8       source
SBL         26       12      source
STANFORD    25       12      source
OHSU        25       12      source
UCLA 2      21       11      source

Table 3.1: The data collection sites for ABIDE and its selection as source/target graph.

Dataset   Autism        Control       Total
Source    370 (46.0%)   434 (54.0%)   804
Target    33 (49.3%)    34 (50.7%)    67

Table 3.2: ABIDE Totals by Label and Dataset.

3.7.1 Data Preparation

The ABIDE [17] dataset is a combination of samples obtained from 20 different collection sites as shown in Table 3.1 for ASD classification.
We split these collection sites into source and target sites for a total of 804 source samples and 67 target samples. Following the terminology used in other previous works in this area [51] [64] [42] we refer to the two classes in the dataset as autism and control. Autism refers to a subject who developed ASD while control refers to a subject without ASD. For the ABIDE dataset, the image data was prepared following the methodology of [51]. A population graph is constructed by considering each subject as a node and computing the similarity of their fMRI images for edge construction. Specifically, an edge was placed be- tween a pair of nodes only if their image similarity exceeds certain threshold value. The edges are also weighted according to their computed similarity. The node features were derived from phenotype information. After removing all columns with a known direct relationship to the class label, the features were one-hot-encoded as applicable. They were then normalized and any features with an abnormally high correlation to the label were also removed. 3.7.2 Experimental Setup We compared the performance of OTGCN against the following baselines. Aside from GCN, the rest of the baseline methods are designed for cross network node classification problems. • GCN [37]: A graph neural network that uses a message-passing technique to learn the feature embedding of the nodes. The architecture has been incorporated into OTGCN as well as other graph neural networks [52] [35] [31]. The GCN implementation was written using code from the torch geometric library [24] and involved two GCN layers 20 followed by a fully connected classification layer. • AdaGCN [20]: A CNNC technique which uses a series of GCN layers to learn separate embeddings for the source and target networks. It then performs domain adaptation by using an adversarial discriminator to force the two embedding into a shared do- main. The AdaGCN implementation used in this chapter was taken directly from the author’s provide source code1. No significant changes were made to the author’s implementation, which consists of three GCN layers and a single layer discriminator. • ASN [75]: Another CNNC approach which consists of two separate GCN variational autoencoders (VAE) for the source and target, a shared GCN encoder which looks at both the source and target, and a adversarial discriminator. Here we used source code provide by the author2 with a small modification to correct for NaNs produced by the VAE reconstruction loss. The VAE reconstruction loss equation implemented in ASN includes a large exponential term followed by an aggregation step which can lead to NaNs if there is a large dataset or large values in VAE output. We addressed this issue by restricting values in VAE-generated embedding to between -10 and 10. Each of the three encoders in ASN were implemented with two GCN layers and the two decoders had a single layer. • ACDNE [56]: A CNNC approach which uses special feature extraction layers to learn separate embeddings for the features and structure of each network. It then concate- nates the two embeddings into a single source embedding and a single target embedding before sending both to an adversarial discriminator. We followed the authors source code3 with this method as well and made no significant changes. Each of the four feature extractors consisted of two layers. For fair comparison we implemented OTGCN with two NFT layers and two GCN layers, similar to most of the baseline methods. 
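As an aside, the similarity-thresholded population graph construction described in the data preparation above can be sketched in a few lines. The snippet below is only an illustration under simplifying assumptions (correlation as the similarity measure, an arbitrary threshold value); it is not the exact ABIDE preprocessing pipeline used in this chapter.

```python
import numpy as np

def build_population_graph(fmri_features, threshold=0.5):
    """Build a weighted adjacency matrix from per-subject fMRI feature vectors.

    An edge (i, j) is kept only if the similarity between subjects i and j
    exceeds `threshold`; kept edges are weighted by that similarity.
    Correlation is used here as a stand-in similarity measure.
    """
    sim = np.corrcoef(fmri_features)          # (n, n) pairwise similarity
    A = np.where(sim > threshold, sim, 0.0)   # drop weak connections
    np.fill_diagonal(A, 0.0)                  # no self loops
    return A

# Toy usage: 8 subjects, each described by a 20-dimensional fMRI-derived vector.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 20))
A = build_population_graph(features, threshold=0.3)
print(A.shape, int((A > 0).sum()), "edges kept")
```

In the actual experiments the node features come from the one-hot-encoded, normalized phenotype data, while the adjacency above plays the role of the population graph structure consumed by the GCN layers.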
We use the macro- and micro-F1 scores [26] as our evaluation metrics when comparing the performance of different classification methods.

1 https://github.com/daiquanyu/AdaGCN_TKDE
2 https://github.com/yuntaodu/ASN
3 https://github.com/shenxiaocam/ACDNE

We perform 10-fold cross validation to select the hyperparameters of OTGCN as well as the reported baselines. For OTGCN, we tune the hyperparameters λ and θ as well as standard hyperparameters such as the learning rate. Possible hyperparameter values were [0.01, 0.03, 0.05] for λ and [5, 10, 15] for θ. Since the target graph is completely unlabeled and has a potentially different distribution than the source graph, this poses a significant problem for hyperparameter tuning. To our knowledge there is no ideal solution to this problem. For this work, the source dataset was split into 10 folds and the target dataset was excluded from the hyperparameter tuning process. In each pass the fold selected for validation was removed from the remainder of the source dataset and treated as a distinct target dataset. We followed this process to choose the best hyperparameters for both our method and the reported baselines.

For OTGCN the first and second NFT layers produced hidden embeddings of 32 and 64 units respectively. Similarly, the two GCN embeddings were also 32 and 64. The selected hyperparameters were λ = 0.01 and θ = 10.

Each of ACDNE's feature extractors is built as a two-layer model with 64 and 32 hidden dimensions. We chose hyperparameter candidate values by referencing the paper associated with the work. Our tuning script selected 1 × 10−5 for the weight of the pairwise constraint loss and 1 × 10−5 for the weight of the l2 regularization term.

AdaGCN's three GCN layers have 64, 32 and 16 hidden dimensions and the discriminator has a single 16-unit hidden layer. We chose hyperparameter candidate values by looking at existing values within their source code for other datasets. There were three model-specific hyperparameters controlling various portions of their loss functions. Our tuning script selected 5 × 10−5 for the weight of the l2 loss, 1.0 for the weight of the Wasserstein loss, and 5.0 for the weight of the penalty loss.

ASN's encoders were all two layers with 64 and 32 hidden units. Once again we chose hyperparameter candidate values by looking at existing information within the source code. Our tuning script chose 0.0001 for the weight of the difference loss, 1.0 for the weight of the domain loss, and 0.5 for the weight of the reconstruction loss.

After hyperparameter tuning each method was trained with the chosen hyperparameters ten times with distinct random seeds. Each time the model was trained on the entire labeled source dataset and evaluated on the target dataset. The mean and standard deviation of the resulting micro and macro F1 results are recorded in Table 3.3.

Method    Micro F1              Macro F1
GCN       0.55821 +/- 0.02234   0.50265 +/- 0.03876
AdaGCN    0.37112 +/- 0.22752   0.27853 +/- 0.18966
ASN       0.52537 +/- 0.04418   0.38712 +/- 0.09156
ACDNE     0.94328 +/- 0.04049   0.94310 +/- 0.04065
OTGCN     0.97907 +/- 0.00733   0.97910 +/- 0.00731

Table 3.3: Performance results on target dataset.

Figure 3.4: TSNE plot of combined embeddings before and after transport in the first pass of our OT layer. Points represent subjects and the lines are potential decision boundaries based on each source dataset. The light colors represent the source dataset, before transport on the left and after transport on the right.
The dark colors represent the target datset, which is the same in both cases. The green circle points out a portion of the target group which will likely be incorrectly classified before transportation occurs. The green arrow points to that same group on the right. 23 3.7.3 Results Table 3.3 shows the micro- and macro-F1 results for OTGCN and all baselines on the prepared ABIDE target dataset. These results demonstrate the improved performance of OTGCN over existing CNNC baselines. The next best performing result was from ACDNE, which did not rely on GCN for it’s graph embedding, but rather extracted structural and feature information separately into separate embeddings. OTGCN and ACDNE perfomed significantly better than all other baselines, likely because these two methods do not rely strictly upon GCN and the homophily principle. Next we endeavor to establish the effectiveness of our optimal transport layer. To do this we save off a GCN embedding just before we send it to the very first OT transportation. We then transport the source embedding, use TSNE to reduce the embeddings to two dimensions and plot the results. Fig. 3.4 is an example of a plot after this process. Here the points are colored according to label and dataset. On the left we see source before transportation in lighter colors, autism is light blue and control is light orange. The target dataset is plotted with it in darker colors, autism is dark blue and control is dark orange. A dark green line illustrates a potential decision boundary according to the source dataset, and it is clear that it will likely misclassify a significant group of the target samples. On the right we see a similar plot, but in this case the source samples have been transported to the target domain via our OT layer. Once again a dark green line illustrates a potential decision boundary based off from the source samples. It is clear from this demonstration that a decision boundary derived from the transported source embedding will more accurately classify subjects in the target network. 3.8 Conclusion This chapter presents a deep learning framework called OTGCN to address the ASD detection problem using imaging and non-imaging information from multiple sites. OTGCN leverages ideas from graph neural networks to learn a feature representation of the nodes 24 and optimal transport to effectively tackle the domain adaptation problem between source and target graphs. Our framework also incorporates a nonlinear feature transformation layer to alleviate the issue of graphs with low homophily. We experimentally compared the performance of OTGCN against several state of the art CNNC baselines using the multi-site ABIDE dataset. These experiments demonstrated the superior performance of OTGCN over the baseline methods. 25 CHAPTER 4: FAIR CNNC VIA OPTIMAL TRANSPORT 4.1 Introduction The previous chapter presented a method for leveraging optimal transport to address the concept drift challenge within the task of cross-network node classification (CNNC). In this chapter we extend the previous method to to address the challenges of fairness and scalability within CNNC. Incorporating fairness consideration into node classification is a necessary challenge that must be addressed to prevent the model from generating biased prediction results. Graph neural networks (GNNs) are especially prone to fairness issues due to an artifact of the ho- mophily principle, which states that similar nodes tend to be connected to each other [46]. 
GNNs rely upon this principle as they use message-passing to incorporate neighborhood in- formation into the learning of node representations. Past research on graph fairness have, however, shown that neighborhood structure is often more dependent on the protected at- tributes than the classification labels [21][55]. As a consequence, GNNs have the potential to exacerbate unfairness as the learned node embeddings may capture more information about a node’s protected group than its class label [34], a phenomenon that is also known as link bias [19]. To alleviate such bias, the goal of many fair node classification algorithms is to learn an embedding that is less reliant upon an individual’s protected group [3][19][55]. How- ever, despite the increasing research, current fairness-aware methods are mostly designed for homogeneous networks without accounting for the effect of concept drift. Furthermore, despite their growing research, none of the existing CNNC approaches are designed to address fairness concerns. Similar to CNNC, while OT is a powerful approach for graph transfer learning [11], there has yet been any studies on incorporating fairness into the OT learning formulation. As OT maps samples in the source domain to similar samples in the target, it is also vulnerable to generating biased transportation plans. As previously noted, GNNs have an established weakness in that they often learn node representations which more closely represent protected groups than they do classification outcomes [55][34]. 26 Thus, when used in conjunction with a GNN, OT has the potential to exacerbate unfairness by mapping between members of the same protected group instead of members of the same class. One simple way to ensure fairness is by postprocessing the classifier trained on the feature embeddings generated by OT so they would not discriminate against particular subgroups of the population. However, this approach may not be as helpful since the post- processing does not alter the learned OT embedding that may still be biased. This begs the question: How to effectively impart fairness into OT for fair cross-network node classification? In this chapter, we first present a novel framework called FOCI (Fair CrOss-Network Node ClassIfication) that performs fair optimal transport while mitigating the concept drift issue in cross-network node classification. Specifically, FOCI considers the nodes’ protected attribute information when transporting the source nodes to their corresponding target nodes when learning their feature embedding. We introduce a fair Sinkhorn distance measure for OT to encourage diversity in the mapping of the source nodes to target nodes. This strategy ensures that the learned features are oblivious to the protected attribute information, which is essential for creating an unbiased node feature representation. After establishing the effectiveness of FOCI we develop and evaluate two additional variations. FastFOCI is the core FOCI approach, but with a few architectural improvements which allow it to scale to larger datasets. SpFOCI is a method which takes advantage of the performance improvements of FastFOCI to integrate statistical parity more directly into the optimal transport algorithm. In summary, the main contributions of the paper include the following: • We introduce a fair Sinkhorn distance measure to alleviate bias in the OT process by encouraging connections between members of distinct protected groups. 
• We introduce a second fair Sinkhorn distance measure which alleviates bias by directly considering statistical parity within the optimal transport loss function.

• We develop FOCI, a graph neural network framework that uses the fair Sinkhorn distance measure for cross-network node classification. To the best of our knowledge, this is the first CNNC and OT method that considers fairness in its formulation.

• We develop FastFOCI and SpFOCI, two modifications which can be made to the core FOCI approach.

• We perform extensive experiments to demonstrate the effectiveness of FOCI, FastFOCI, and SpFOCI on several real world datasets.

4.2 Related Works

Previous chapters have discussed foundational works in the areas of node classification, cross-network node classification, and fairness in neural networks. This chapter brings these concepts together in fair cross-network node classification. To the best of our knowledge this is the first work addressing fair cross-network node classification and, therefore, there are no other related works in the space at this time. There is, however, a significant body of work addressing fairness in node classification tasks.

4.2.1 Fair Node Classification

As fairness has become an important factor to consider for node classification, various methods have recently been developed. For example, the FairDrop [58] method aims to reduce the negative effect of homophily by reducing the number of edges between nodes sharing the same sensitive attribute. Dai et al. [19] proposed FairGNN, which addresses the shortage of sensitive attributes for adversarial debiasing by estimating the users' sensitive attributes. Agarwal et al. [3] proposed the NIFTY algorithm, which enforces fairness and stability conditions on both the objective function and GNN architecture. However, none of these methods are designed for cross-network node classification, where there could be distribution shift among the networks.

4.3 Preliminaries

4.3.1 Problem Statement

Consider a set of graphs, $G_1, G_2, \cdots, G_n$, where each $G_k(V_k, E_k, X_k, Y_k)$ is an attributed graph, with a node set $V_k$, edge set $E_k \subseteq V_k \times V_k$, node attribute matrix $X_k$, and node label vector $Y_k$. We further partition the columns of the node attribute matrix $X_k$ into two submatrices, $X^{(p)}_k$ and $X^{(u)}_k$, where $X^{(p)}_k$ corresponds to the submatrix associated with the set of protected attributes (e.g., gender, race, age group, etc.) while $X^{(u)}_k$ is the submatrix associated with the remaining (unprotected) attributes. Let $A^{(k)}$ denote the adjacency matrix representation of $E_k$, where $A^{(k)}_{ij} > 0$ if $(v_i, v_j) \in E_k$ and 0 otherwise.

In a domain adaptation setting, we assume there exists a partitioning of the set of graphs into two disjoint groups: a source graph $G_s(V_s, E_s, X_s, Y_s)$, where the node labels in $Y_s$ are known, and a target graph $G_t(V_t, E_t, X_t, Y_t)$, where the node labels in $Y_t$ are unknown. Assuming the first m graphs contain labeled nodes, then $G_s = \cup_{k=1}^{m} G_k$ and $G_t = \cup_{k=m+1}^{n} G_k$, where $V_s = \cup_{k=1}^{m} V_k$, $V_t = \cup_{k=m+1}^{n} V_k$, $E_s = \cup_{k=1}^{m} E_k$, and $E_t = \cup_{k=m+1}^{n} E_k$. The adjacency matrices corresponding to the source and target graphs are denoted as $A_s \in \mathbb{R}^{n_s \times n_s}$ and $A_t \in \mathbb{R}^{n_t \times n_t}$, respectively, where $n_s = \sum_{i=1}^{m} |V_i|$ and $n_t = \sum_{i=m+1}^{n} |V_i|$ are the corresponding number of source and target nodes. Assuming the graphs have identical node features and the same set of class labels, we have $X_s \in \mathbb{R}^{n_s \times d}$, $X_t \in \mathbb{R}^{n_t \times d}$, $Y_s \in \{0, 1, \cdots, k-1\}^{n_s}$, and $Y_t \in \{0, 1, \cdots, k-1\}^{n_t}$, where k is the number of classes.
At times, we also consider the combined graph $G_c(V_c, E_c, X_c, Y_c)$, where $V_c = V_s \cup V_t$, $E_c = E_s \cup E_t$, $X_c = [X_s; X_t] \in \mathbb{R}^{n \times d}$, $A_c = \begin{bmatrix} A_s & \mathbf{0} \\ \mathbf{0}^T & A_t \end{bmatrix} \in \mathbb{R}^{n \times n}$, $Y_c \in \{0, 1, \cdots, k-1\}^{n}$, $n = n_s + n_t$, and $\mathbf{0}$ is an $n_s \times n_t$ null matrix.

Given a source graph $G_s = (V_s, E_s, X_s, Y_s)$ and a target graph $G_t = (V_t, E_t, X_t, Y_t)$, the goal of fair cross-network node classification is to learn a target function, $f : V \rightarrow \{0, 1, \cdots, k-1\}$, that accurately classifies the labeled nodes in $Y_s$ as well as the unlabeled nodes in $Y_t$ while minimizing the disparity in classification performance for different groups of the protected attributes.

4.3.2 Statistical Parity

We define fairness through the lens of statistical parity, which is one of the most commonly used fairness metrics. The metric states that positive prediction outcomes should be evenly distributed between members of different protected groups. It is defined in the following equation, where statistical parity is denoted SP:

$$SP = \left| P(\hat{Y} = 1 \mid X^{(p)} = 0) - P(\hat{Y} = 1 \mid X^{(p)} = 1) \right|$$

where $\hat{Y}$ is a given prediction, $X^{(p)}$ is a protected attribute, and the goal is to have a value as close to zero as possible. Statistical parity is typically used in scenarios such as default prediction for a loan or insurance claims. In these scenarios, specific protected groups are often disproportionately associated with a given label for complicated cultural reasons, and improvements in statistical parity can be an important first step towards achieving equity.

4.4 Proposed FOCI Architecture

Fig. 4.1 shows a high-level overview of our proposed architecture. FOCI consists of 3 integrated modules for representation learning, OT for domain adaptation, and fully connected layers for node classification. Details of the architecture are given below.

Figure 4.1: A schematic illustration of the FOCI framework. The framework contains integrated modules for representation learning (using a 2-layer GCN with the PPMI matrix along with a 2-layer NFT), fair OT for domain adaptation, and a fully connected node classification layer.

4.4.1 Representation Learning & Pretraining

FOCI modifies the GCN architecture [37] to learn a joint embedding of the nodes in the source and target networks. The modifications are needed for two reasons. First, due to the oversmoothing effect [40], current GCNs cannot be easily extended beyond 2 or 3 layers, thus throttling the effective neighborhood size of their message passing. Because of this limited neighborhood size and the link bias problem noted in the introduction, the learned representation may inadvertently encode the protected attribute information [19]. To overcome this limitation, we expand the neighborhood utilized by the GCN through the use of the positive pointwise mutual information (PPMI) matrix [39]. A PPMI matrix measures the topological proximity of nodes within some K steps in the network and has been used with other GNN approaches [9, 57, 56, 75] to expand the neighborhood of interest. By using a PPMI matrix, the learned embedding is more likely to capture information from nodes beyond those in the same protected group. The representation learning module also learns a separate nonlinear embedding of the node features. This allows us to maintain a node embedding that is free of link bias for later use. We refer to this as the node feature transformation (NFT) layer, which consists of a fully-connected linear layer with a ReLU activation function.
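To make this neighborhood expansion concrete, the sketch below shows one common way a K-step PPMI matrix can be built from an adjacency matrix, following the general random-walk recipe used in works such as [9]. The function name, the averaging over 1..K steps, and the smoothing constants are illustrative assumptions, not a description of FOCI's released preprocessing code.

import numpy as np

def ppmi_matrix(A, K=3):
    """Illustrative K-step PPMI construction (cf. [9])."""
    n = A.shape[0]
    deg = A.sum(axis=1, keepdims=True)
    T = A / np.maximum(deg, 1e-12)        # one-step random-walk transition matrix
    F, Tk = np.zeros((n, n)), np.eye(n)   # K-step co-occurrence frequencies
    for _ in range(K):
        Tk = Tk @ T
        F += Tk
    F /= K
    # PPMI_ij = max(0, log p(i,j) / (p(i) p(j))), estimated from F.
    total = F.sum()
    row = F.sum(axis=1, keepdims=True)
    col = F.sum(axis=0, keepdims=True)
    pmi = np.log((F * total) / (row @ col + 1e-12) + 1e-12)
    return np.maximum(pmi, 0.0)

In FOCI, the GCN layers aggregate over this matrix rather than the raw adjacency, letting message passing reach beyond immediate neighbors.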
We pre-train the network to classify only the labeled nodes in the source network. Specifically, FOCI uses two GCN layers with the PPMI matrix and two NFT layers. The learned node embeddings are concatenated as shown in Fig. 4.1 before being sent to a fully connected classification layer, which is trained to minimize the following cross-entropy loss:

Pre-training: $\mathcal{L}_{CE} = -\frac{1}{n_s} \sum_{x_i \in X_s} \sum_{j=0}^{l} Y_{ij} \log(\hat{Y}_{ij})$ (8)

4.4.2 Fair Optimal Transport

Our strategy to incorporate fairness into OT for graph transfer learning is to encourage mappings between members of different protected groups. This heuristic helps to alleviate the link bias problem caused by the inherent homophily effect in network data. The method is integrated directly into OT, resulting in a new fair Sinkhorn algorithm. Let $X_s^{(p)}$ and $X_t^{(p)}$ be the protected attributes of the nodes in the source and target networks, respectively. We first create two sparse matrices, $R \in [0,1]^{n_s \times n_t}$ and $S \in [0,1]^{n_s \times n_t}$, where

$$R_{ij} = \begin{cases} 1 & \text{if } X^{(p)}_{s,i} = X^{(p)}_{t,j} \\ 0 & \text{otherwise} \end{cases}, \qquad S_{ij} = \begin{cases} 1 & \text{if } X^{(p)}_{s,i} \neq X^{(p)}_{t,j} \\ 0 & \text{otherwise} \end{cases}$$

Given a transportation plan matrix $P$, we compute the following fairness loss:

$$\ell_F = \frac{\langle P, R \rangle_F}{\langle R, R \rangle_F} - \frac{\langle P, S \rangle_F}{\langle S, S \rangle_F},$$

where $\langle A, B \rangle_F = \sum_{ij} A_{ij} B_{ij}$ denotes the Frobenius inner product between two matrices. The smaller the fairness loss, the greater the emphasis on transportation between samples in different protected groups. This enables us to incorporate the fairness consideration into OT by introducing the following $\gamma$-fair Sinkhorn distance:

$$W_\gamma(\mu_s, \mu_t) = \langle P^\gamma, C \rangle \qquad (9)$$

where

$$P^\gamma = \operatorname*{argmin}_{P \in U(\mu_s, \mu_t)} \mathcal{L}_{OT}, \qquad \mathcal{L}_{OT} = \langle P, C \rangle - \lambda h(P) + \gamma \left[ \frac{\langle P, R \rangle}{\langle R, R \rangle} - \frac{\langle P, S \rangle}{\langle S, S \rangle} \right]$$

$$U(\mu_s, \mu_t) := \left\{ P \in \mathbb{R}^{n_s \times n_t}_{+} \,\middle|\, P \mathbf{1}_{n_t} = \mu_s, \; P^T \mathbf{1}_{n_s} = \mu_t \right\}$$

Armed with this new formulation we introduce the following theorem.

Theorem 1. Given the $\gamma$-fair Sinkhorn distance in Eqn. 9, the solution for $P^\gamma$ can be simplified as
$$P^\gamma = \operatorname{diag}(u)\, K \operatorname{diag}(v)$$
where
$$K = e^{-\frac{1}{\lambda}\left(C + \gamma\left(\frac{R}{n_1} - \frac{S}{n_2}\right)\right)}, \qquad u = e^{-\frac{1}{2} - \frac{1}{\lambda}\alpha}, \qquad v = e^{-\frac{1}{2} - \frac{1}{\lambda}\beta}$$

The proof for Theorem 1 is as follows.

Proof. The Lagrangian of the function in Eqn. 9 is
$$L = \sum_{ij} \left[ P_{ij} C_{ij} + \lambda P_{ij} \log P_{ij} + \gamma P_{ij} \left( \frac{R_{ij}}{n_1} - \frac{S_{ij}}{n_2} \right) \right] + \alpha^T (P \mathbf{1}_{n_t} - \mu_s) + \beta^T (P^T \mathbf{1}_{n_s} - \mu_t)$$
where $n_1 = \langle R, R \rangle$ and $n_2 = \langle S, S \rangle$. The solution can be found by taking the derivative of the Lagrangian function with respect to $P_{ij}$ and setting it to zero:
$$P_{ij} = e^{-\frac{1}{2} - \frac{1}{\lambda}\alpha_i} \, e^{-\frac{1}{\lambda}\left(c_{ij} + \gamma\left(\frac{r_{ij}}{n_1} - \frac{s_{ij}}{n_2}\right)\right)} \, e^{-\frac{1}{2} - \frac{1}{\lambda}\beta_j}$$
The theorem follows by replacing the terms for $K$, $u$, and $v$, respectively, into the above equation.

Theorem 1 enables us to compute $P^\gamma$ using the modified fair Sinkhorn algorithm shown in Algorithm 1, which in turn allows us to find a fairness-aware transportation plan. These changes can be seen in Algorithm 1 and have no significant impact on the computational complexity of the existing Sinkhorn algorithm [18]. Additionally, since we still utilize a squared $\ell_2$ loss for our cost and uniform marginals, the transported source node representation produced by OT remains unchanged, as shown below:

$$\hat{H}_s = n_s P^{\lambda} H_t \qquad (10)$$

4.4.3 Cross-Network Node Classification

The fair OT module generates a transported source node embedding matrix, $\hat{H}_s$, using equation (10). Next, $\hat{H}_s$ and the un-transformed target embedding $H_t$ are passed to a fully connected network for node classification. After pre-training, the entire network is trained end-to-end to optimize the following joint objective function, which is a combination of the cross-entropy loss and the optimal transport loss:
$$\mathcal{L} = \mathcal{L}_{CE} + \theta \mathcal{L}_{OT} \qquad (11)$$

where $\theta$ is a hyperparameter that controls the tradeoff between the two losses.

Algorithm 1 Fair Sinkhorn
Require: Hs, Ht, Xs^(p), Xt^(p), λ, γ
  C = computeCost(Hs, Ht)
  µs, µt = computeUniformMarginals(Hs, Ht)
  R, S = getMasks(Xs^(p), Xt^(p))
  K = e^{-(1/λ)(C + γ(R/n1 − S/n2))}            ▷ Modification
  u = ones(length(µs)) / length(µs)
  K̃ = diag(1/µs) K
  while u changes do
    u = 1 / (K̃ (µt / (Kᵀu)))
  end while
  v = µt / (Kᵀu)
  W = sum(u ⊙ ((K ⊙ C) v))                      ▷ Distance Measure
  P = diag(u) K diag(v)                          ▷ Transport Plan
  Ĥs = ns P Ht

Algorithm 2 FOCI
Require: Xs, Xs^(p), As, Ys, Xt, Xt^(p), At, λ, γ, θ
  Ac, Xc = Combine(Xs, As, Xt, At)
  Qc = ComputePPMI(Ac)
  for epoch in max_epochs do
    HG = GCN(Qc, Xc)
    HL = NFT(Xc)
    H = HG ⊕ HL
    Hs, Ht = SeparateSourceAndTarget(H)
    if epoch > pretrain then
      Ĥs, P, R, S = FairSink(Hs, Ht, Xs^(p), Xt^(p), λ, γ)
    end if
    ŷs, ŷt = Classifier(Ĥs, Ht)
    if epoch > pretrain then
      ot_loss = ⟨P, C⟩ − λh(P) + γ(⟨P, R⟩/⟨R, R⟩ − ⟨P, S⟩/⟨S, S⟩)
      loss = L_CE(ys, ŷs) + θ · L_OT
    else
      loss = L_CE(ys, ŷs)
    end if
    update layer weights
  end for

4.5 Proposed FOCI Modifications

4.5.1 FastFOCI

Figure 4.2: A schematic illustration of the FastFOCI framework.

The first obvious hindrance to performance in FOCI is that the representation learning step must evaluate both the source and target graphs at the same time. GCN has a computational complexity of $O(|E| m h_1 h_2)$, where $|E|$ is the number of edges, $m$ is the number of input channels or features in $X$, $h_1$ is the number of hidden features in the first GCN layer, and $h_2$ is the number of features in the output layer [37]. While GCN is not horribly complex, it is linear in the number of graph edges, and therefore forcing a GCN to look at both graphs can have a significant impact. Therefore, the first major improvement implemented in FastFOCI is that the representation learning phase considers the source data only.

Previously, FOCI representation learning consisted of a GCN and an NFT which considered the source and target together. In FastFOCI the source graph is fed to source GCN and NFT layers, GCN_s and NFT_s. In pretraining, the output of these layers is sent to a classifier, and the source predictions are used to update GCN_s, NFT_s, and the classifier. In later steps, GCN_s and NFT_s first pass their current parameters to the target layers GCN_t and NFT_t. Then GCN_s and NFT_s are used to construct a source representation H_S while GCN_t and NFT_t are used to construct a target representation H_T. Later, the loss is used to update the parameters of GCN_s, NFT_s, and the classifier. In this way we reduce the number of edges required by the GCN, which no longer requires full access to both source and target graphs at the same time.

The next area of improvement is in the optimal transport layer. The Sinkhorn algorithm has been experimentally shown to be $O(n_s n_t)$ [18]. The purpose of our optimal transport layer is to transport node representations from the source domain to the target domain prior to sending them to a classifier. The classifier does not need all node representations at the same time, so we can improve the speed of the optimal transport layer by inserting a sampling step after node representations are acquired and before optimal transport occurs. Within the sampling step we randomly sample representations from both the source and target embeddings H_S and H_T described above and then pass those samples on to optimal transport. We can then transport the sampled source representations to send on to the classifier.
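To make the fair OT step concrete, the following is a minimal NumPy sketch of the fair Sinkhorn computation of Algorithm 1 and Theorem 1, assuming a squared l2 cost, uniform marginals, binary protected attributes, and a fixed number of scaling iterations; in FastFOCI it would operate on the sampled embeddings described above. Variable names, the cost rescaling, and the stopping rule are illustrative choices, not details of the released implementation.

import numpy as np

def fair_sinkhorn(Hs, Ht, ps, pt, lam=0.03, gamma=1.0, n_iter=200):
    """Sketch of the fair Sinkhorn step: Hs, Ht are source/target embeddings,
    ps, pt are binary protected-attribute vectors for source/target nodes."""
    ns, nt = Hs.shape[0], Ht.shape[0]
    C = ((Hs[:, None, :] - Ht[None, :, :]) ** 2).sum(-1)   # squared-l2 cost matrix
    C = C / C.max()                                         # rescale cost for numerical stability
    R = (ps[:, None] == pt[None, :]).astype(float)          # same-group mask
    S = 1.0 - R                                             # different-group mask
    n1, n2 = R.sum(), S.sum()                               # <R,R>_F and <S,S>_F for 0/1 masks
    K = np.exp(-(C + gamma * (R / n1 - S / n2)) / lam)      # fairness-adjusted kernel (Theorem 1)
    mu_s, mu_t = np.full(ns, 1 / ns), np.full(nt, 1 / nt)   # uniform marginals
    u = np.ones(ns) / ns
    for _ in range(n_iter):                                 # Sinkhorn scaling iterations
        v = mu_t / (K.T @ u)
        u = mu_s / (K @ v)
    P = np.diag(u) @ K @ np.diag(v)                         # fairness-aware transport plan
    H_hat = ns * P @ Ht                                     # transported source representation
    return P, H_hat

Raising γ pushes mass in P away from same-group pairs and toward different-group pairs, which is the behavior examined in the ablation study of Section 4.6.4.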
These changes are all shown in Fig. 4.2.

4.5.2 SpFOCI

Previously, the FOCI algorithm sought to improve fair treatment by intuitively encouraging transportation between members of different protected groups. Another option, however, is to incorporate statistical parity into the optimal transport loss function in a more direct manner. Recall that the formulation for statistical parity requires the following:

$$P(\hat{Y} = 1 \mid X^{(p)} = 0) = P(\hat{Y} = 1 \mid X^{(p)} = 1)$$

According to our problem statement, we have access to the source labels $y_s$ as well as a binary protected attribute from the target dataset, $X^{(p)}_t \in \mathbb{R}^{n_t}$, where $X^{(p)}_{t,i}$ can be either 0 or 1. We now define $\tilde{y}_t = P^T y_s$, an approximation of $\hat{y}_t$ in which $\tilde{y}_{t,i}$ is the proportion of target sample $i$ that is mapped to source samples with $y_s = 1$. This approximation allows us to define the statistical parity measure below:

$$SP_{FOCI} = \left| \frac{(X^{(p)}_t)^T \tilde{y}_t}{(X^{(p)}_t)^T \mathbf{1}_{n_t}} - \frac{(1 - X^{(p)}_t)^T \tilde{y}_t}{(1 - X^{(p)}_t)^T \mathbf{1}_{n_t}} \right| = \left| \left( \frac{X^{(p)}_t}{(X^{(p)}_t)^T \mathbf{1}_{n_t}} - \frac{1 - X^{(p)}_t}{(1 - X^{(p)}_t)^T \mathbf{1}_{n_t}} \right)^T P^T y_s \right|$$

Or, for simplicity,

$$SP_{FOCI} = \left| w^T P^T y_s \right| \qquad \text{where} \qquad w = \frac{X^{(p)}_t}{(X^{(p)}_t)^T \mathbf{1}_{n_t}} - \frac{1 - X^{(p)}_t}{(1 - X^{(p)}_t)^T \mathbf{1}_{n_t}}$$

We can now add this new term into the OT loss function and introduce a new SP-fair Sinkhorn distance:

$$W_{sp}(\mu_s, \mu_t) = \langle P^{sp}, C \rangle \qquad (12)$$

where

$$P^{sp} = \operatorname*{argmin}_{P \in U(\mu_s, \mu_t)} \mathcal{L}_{SP}, \qquad \mathcal{L}_{SP} = \langle P, C \rangle - \lambda h(P) + \gamma \left| w^T P^T y_s \right|$$

$$U(\mu_s, \mu_t) := \left\{ P \in \mathbb{R}^{n_s \times n_t}_{+} \,\middle|\, P \mathbf{1}_{n_t} = \mu_s, \; P^T \mathbf{1}_{n_s} = \mu_t \right\}$$

This OT formulation leads us to the following theorem.

Theorem 2. Given the SP-fair Sinkhorn distance in Eqn. 12, the solution for $P^{sp}$ can be simplified as
$$P^{sp} = \operatorname{diag}(u)\, K \operatorname{diag}(v)$$
where $u = e^{-\frac{1}{2} - \frac{1}{\lambda}\alpha}$, $v = e^{-\frac{1}{2} - \frac{1}{\lambda}\beta}$, and

$$K = \begin{cases} e^{-\frac{1}{\lambda}(C + \gamma\, y_s w^T)}, & \text{if } w^T P^T y_s > 0 \\ e^{-\frac{1}{\lambda}(C - \gamma\, y_s w^T)}, & \text{if } w^T P^T y_s < 0 \\ e^{-\frac{1}{\lambda} C}, & \text{if } w^T P^T y_s = 0 \end{cases} \qquad (13)$$

The proof for Theorem 2 is as follows.

Proof. The Lagrangian of the function in Eqn. 12 is
$$L = \sum_{ij} \left[ \lambda p_{ij} \log p_{ij} + p_{ij} c_{ij} \right] + \gamma \left| \sum_{ij} p_{ij}\, y_{s,i}\, w_j \right| + \alpha^T (P \mathbf{1}_{n_t} - \mu_s) + \beta^T (P^T \mathbf{1}_{n_s} - \mu_t)$$

We can then solve the Lagrangian function with respect to $p_{ij}$ to obtain the following result:

$$p_{ij} = \begin{cases} e^{-\frac{1}{2} - \frac{1}{\lambda}\alpha_i}\, e^{-\frac{1}{\lambda}(c_{ij} + \gamma\, y_{s,i} w_j)}\, e^{-\frac{1}{2} - \frac{1}{\lambda}\beta_j}, & \text{if } w^T P^T y_s > 0 \\ e^{-\frac{1}{2} - \frac{1}{\lambda}\alpha_i}\, e^{-\frac{1}{\lambda}(c_{ij} - \gamma\, y_{s,i} w_j)}\, e^{-\frac{1}{2} - \frac{1}{\lambda}\beta_j}, & \text{if } w^T P^T y_s < 0 \\ e^{-\frac{1}{2} - \frac{1}{\lambda}\alpha_i}\, e^{-\frac{1}{\lambda} c_{ij}}\, e^{-\frac{1}{2} - \frac{1}{\lambda}\beta_j}, & \text{if } w^T P^T y_s = 0 \end{cases} \qquad (14)$$

The theorem follows by replacing the terms for $K$, $u$, and $v$, respectively, into the above equation.

dataset     | nodes | edges  | #(Y = 1)   | #(X^(p) = 1)
Pokec-n     | 2933  | 16821  | 1722 (59%) | 2145 (73%)
Pokec-z     | 3285  | 22454  | 1844 (56%) | 2137 (65%)
Compas-0    | 2170  | 137473 | 1690 (78%) | 1042 (48%)
Compas-1    | 1434  | 148012 | 1140 (79%) | 674 (47%)
Abide-large | 804   | 93886  | 685 (85%)  | 370 (46%)
Abide-small | 67    | 587    | 42 (63%)   | 33 (49%)
Credit-0    | 10468 | 75669  | 5462 (52%) | 8029 (76%)
Credit-1    | 10169 | 29227  | 5360 (52%) | 8384 (82%)

Table 4.1: Breakdown of datasets used in experiments.

This new formulation allows us to obtain a solution by computing three different transport plans, one for each value of $K$, and then checking to see which one minimizes $\mathcal{L}_{SP}$. This approach is more computationally expensive as we must run a Sinkhorn algorithm three times to obtain the three transport plans. Therefore, this OT approach is combined with the improvements made in FastFOCI to present SpFOCI.
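Under the same assumptions as the earlier fair Sinkhorn sketch (squared l2 cost, uniform marginals, fixed iteration count), the three-plan selection behind SpFOCI could be sketched as follows. The helper names are illustrative, and the fairness direction is formed as the outer product of y_s and w so that it matches the |wᵀPᵀy_s| penalty above.

import numpy as np

def sp_fair_sinkhorn(Hs, Ht, ys, pt, lam=0.03, gamma=1.0, n_iter=200):
    """Sketch of the SP-fair Sinkhorn step (Theorem 2): build the three candidate
    kernels of Eqn. (13), run standard Sinkhorn scaling for each, and keep the
    plan with the smallest L_SP."""
    ns, nt = Hs.shape[0], Ht.shape[0]
    ys, pt = ys.astype(float), pt.astype(float)
    C = ((Hs[:, None, :] - Ht[None, :, :]) ** 2).sum(-1)    # squared-l2 cost matrix
    C = C / C.max()                                         # rescale cost for numerical stability
    w = pt / pt.sum() - (1.0 - pt) / (1.0 - pt).sum()       # statistical-parity direction over target groups
    mu_s, mu_t = np.full(ns, 1 / ns), np.full(nt, 1 / nt)   # uniform marginals

    def sinkhorn(K):
        u = np.ones(ns) / ns
        for _ in range(n_iter):
            v = mu_t / (K.T @ u)
            u = mu_s / (K @ v)
        return np.diag(u) @ K @ np.diag(v)

    def sp_loss(P):
        # L_SP = <P,C> - lam*h(P) + gamma*|w^T P^T ys|, with h(P) = -sum P log P
        ent = -(P * np.log(P + 1e-30)).sum()
        return (P * C).sum() - lam * ent + gamma * abs(w @ (P.T @ ys))

    F = np.outer(ys, w)                                     # (ns x nt) fairness term y_s w^T
    kernels = [np.exp(-(C + gamma * F) / lam),              # case w^T P^T ys > 0
               np.exp(-(C - gamma * F) / lam),              # case w^T P^T ys < 0
               np.exp(-C / lam)]                            # case w^T P^T ys = 0
    plans = [sinkhorn(K) for K in kernels]
    return min(plans, key=sp_loss)                          # plan minimizing L_SP

Running the scaling iterations three times is the extra computational cost noted above, which is why this variant is paired with the FastFOCI speed-ups.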
4.6 Experimental Evaluation

This section describes the experiments performed to evaluate the effectiveness of our proposed FOCI framework. The source code for our approach can be found at https://github.com/ajoystephens/foci.

4.6.1 Data Preparation

We consider four real-world datasets for our experiments. Pokec [38] is a popular social media platform in Slovakia. The node feature information is obtained from the user profile data while the link structure represents relationships between users. The classification task in this case is to predict whether a user smokes, while the protected attribute is the user's sex. We use data from two separate geographical regions, Pokec-n and Pokec-z, to form the source and target networks for CNNC. Compas [4] is a recidivism dataset in which each node corresponds to an incarcerated individual while an edge is formed by connecting individuals who were incarcerated during an overlapping time period. The classification task here is to predict whether an individual will re-offend in the future. We use race as the protected attribute. The source and target networks are created by applying spectral clustering to split the network into 2 subgraphs, denoted as Compas-0 and Compas-1, respectively. ABIDE [17] is a popular dataset for studying Autism Spectrum Disorder (ASD). Similar to previous works [51][52][35][31], we construct a population graph by using phenotypic subject data as node features and resting-state fMRI similarity to construct the edges. Here the presence of ASD is the node label and sex is the protected attribute. The dataset was split into two separate graphs according to their data collection sites. Since many of the collection sites provide few or no female samples, we split the dataset into two by combining collection sites with over 70% male samples into a large source dataset and the remaining collection sites into a smaller target dataset. Given the small size of the ABIDE dataset and the need to have a large enough sample to train GNN models, there is only 1 version of the source and target networks in our experiments. Credit [69] is a financial dataset where the task is to predict whether or not an individual will default on a loan. The protected attribute is the age of the individual, and edges are formed between individuals with similar spending and payment patterns. Credit-0 was selected from the initial 30,000-node dataset by first selecting the highest degree node and then repeatedly adding all immediate neighbors until Credit-0 included over 10,000 nodes. Once Credit-0 was removed from the initial graph, the process was repeated to obtain Credit-1.

4.6.2 Baseline Approaches

We compared FOCI against the following CNNC baseline methods.

• ACDNE [56]: This approach uses an adversarial deep network embedding approach for domain adaptation. It learns a separate embedding from the node features and link structure before sending them to a discriminator. We use the implementation provided by the authors (https://github.com/shenxiaocam/ACDNE).

• AdaGCN [20]: This approach learns the node embeddings using a 3-layer GCN, followed by a single-layer discriminator to address the domain drift issue. All experiments were performed using the source code made available by the authors (https://github.com/daiquanyu/AdaGCN_TKDE).

• ASN [75]: This approach uses a series of 2-layer GCNs and GCN variational autoencoders (VAE) to learn the feature embedding. It addresses the domain adaptation issue using an adversarial discriminator.
We use the authors' implementation (https://github.com/yuntaodu/ASN) with a minor change using gradient clipping to address the NaN values generated by their VAEs.

For every method, we performed 10-fold cross validation on the source dataset to select the best hyperparameters. For the baseline methods, the range of hyperparameter values includes those reported in the authors' published papers and source code. Once the best hyperparameters were chosen, we acquired the final results by training the models on the full source dataset and applying them to the full target dataset. We repeated this 10 times with different random seeds and recorded the average F1-score and statistical parity values. Recall that an ideal F1-score should be 1 while statistical parity should be as close to 0 as possible.

4.6.3 Performance Comparison

The results comparing the FOCI method to other CNNC approaches are shown in Fig. 4.3. Here the goal is to achieve high node classification results, as measured by F1-score, while simultaneously improving fairness, as measured by statistical parity. A good result moves closer to the upper left corner, achieving a high F1-score and low statistical parity.

Figure 4.3: F1-score and statistical parity for FOCI and other CNNC baselines on (a) Pokec-n to z, (b) Pokec-z to n, (c) Compas-0 to 1, (d) Compas-1 to 0, and (e) Abide. The goal is to be closest to the upper left corner by balancing a high F1-score and low statistical parity.

Both AdaGCN and ASN struggled to classify nodes in most of the datasets that we reviewed. They achieved very low F1-scores and low statistical parity by classifying the majority of nodes as the same class. We provide an example confusion matrix for one of the 10 runs of ASN in Table 4.2. Though the result suggests ASN has an excellent statistical parity, i.e., 0.0002, its F1-score is 0.1329, which is significantly lower than the average F1-score for FOCI (0.6637). This example illustrates how a model can achieve excellent statistical parity despite performing equally badly on all groups of the protected attribute. The exception to this is the Pokec datasets, where ASN manages to obtain a reasonably high F1-score (though still lower than FOCI and ACDNE) and low statistical parity on Pokec-z to Pokec-n. Unfortunately, these good results were not consistent for ASN. In each case, FOCI and ACDNE consistently achieve the two highest F1-scores, though the scores were relatively close. ACDNE had a slightly higher F1-score than FOCI in three of the five datasets. However, FOCI outperformed ACDNE in terms of statistical parity in all 5 datasets. These results show that FOCI can perform comparably to other CNNC approaches in terms of their F1-scores while achieving better fairness.

        | X^(p) = 0       | X^(p) = 1
        | Y = 0  | Y = 1  | Y = 0  | Y = 1
Ŷ = 0   | 422    | 298    | 438    | 569
Ŷ = 1   | 32     | 26     | 40     | 47

Table 4.2: Example results from ASN on a target dataset of Compas-0 after training on Compas-1 over one seed. Though its statistical parity is 0.0002, its F1-score is poor (0.1329).

4.6.4 Ablation Study

Here we investigate the impact of the introduced hyperparameter γ on the OT transport plan and model outcome. For this portion of the experiments we limit our exploration to the Pokec datasets and the FairSinkhorn OT algorithm as shown in Algorithm 1 and implemented in the FOCI method.
In the first set of experiments we saved the hidden representations just before they were sent to OT and then performed OT on them with several different values of γ, evaluating the resulting transport plan matrix $P^\gamma$. In these and all other experiments involving OT, we held λ = 0.03. Fig. 4.4 shows these results by plotting γ against the mean value of $p_{ij}$ for pairs of nodes i and j that do or do not share a protected group, respectively. Note that the mean values of $p_{ij}$ are very small since $P^\gamma \in \mathbb{R}^{n_s \times n_t}$.

Figure 4.4: Impact of γ on the transport plan matrix when mapping from Pokec-n to Pokec-z, for (a) node pairs in the same protected group and (b) node pairs in different protected groups. As γ increases, transport plan values decrease between nodes which share a protected group and increase between nodes in different protected groups.

Next we look at the impact of γ on model outcomes in terms of statistical parity and F1-score. For these experiments, γ values were varied within the range that saw an impact on $P^\gamma$ in Fig. 4.4, but all other hyperparameters were kept the same. The models were trained with 10 random seeds for each value of γ and the resulting mean statistical parity and F1-scores are plotted in Fig. 4.5.

Figure 4.5: Impact of γ on FOCI model outcomes: (a) statistical parity and (b) F1-score.

Source   | Target   | Method   | F1-Score        | SP              | Epoch Time (s)
pokec-n  | pokec-z  | FOCI     | 0.790 +/- 0.004 | 0.007 +/- 0.006 | 0.569 +/- 0.067
pokec-n  | pokec-z  | FastFOCI | 0.773 +/- 0.041 | 0.013 +/- 0.014 | 0.117 +/- 0.042
pokec-n  | pokec-z  | SpFOCI   | 0.700 +/- 0.067 | 0.045 +/- 0.026 | 1.162 +/- 1.614
pokec-z  | pokec-n  | FOCI     | 0.784 +/- 0.008 | 0.056 +/- 0.016 | 0.627 +/- 0.065
pokec-z  | pokec-n  | FastFOCI | 0.756 +/- 0.071 | 0.047 +/- 0.027 | 0.110 +/- 0.029
pokec-z  | pokec-n  | SpFOCI   | 0.672 +/- 0.072 | 0.068 +/- 0.054 | 2.481 +/- 1.672
compas-0 | compas-1 | FOCI     | 0.560 +/- 0.187 | 0.122 +/- 0.041 | 0.277 +/- 0.024
compas-0 | compas-1 | FastFOCI | 0.599 +/- 0.085 | 0.158 +/- 0.027 | 0.108 +/- 0.022
compas-0 | compas-1 | SpFOCI   | 0.578 +/- 0.025 | 0.079 +/- 0.030 | 0.111 +/- 0.027
compas-1 | compas-0 | FOCI     | 0.638 +/- 0.009 | 0.163 +/- 0.011 | 0.177 +/- 0.021
compas-1 | compas-0 | FastFOCI | 0.622 +/- 0.037 | 0.169 +/- 0.026 | 0.084 +/- 0.023
compas-1 | compas-0 | SpFOCI   | 0.627 +/- 0.018 | 0.181 +/- 0.017 | 0.086 +/- 0.023
credit-0 | credit-1 | FastFOCI | 0.883 +/- 0.001 | 0.014 +/- 0.001 | 0.123 +/- 0.089
credit-0 | credit-1 | SpFOCI   | 0.910 +/- 0.001 | 0.026 +/- 0.011 | 0.194 +/- 0.162
credit-1 | credit-0 | FastFOCI | 0.868 +/- 0.009 | 0.006 +/- 0.004 | 0.145 +/- 0.195
credit-1 | credit-0 | SpFOCI   | 0.866 +/- 0.008 | 0.020 +/- 0.018 | 0.263 +/- 0.527

Table 4.3: Comparison of FOCI, FastFOCI and SpFOCI in terms of F1-score, statistical parity (SP), and epoch time in seconds.

4.6.5 FOCI Comparison

Lastly, we compare results from FOCI, FastFOCI and SpFOCI in terms of F1-score, statistical parity, and run time. The results for this comparison can be seen in Table 4.3. FOCI results for the credit datasets are excluded from this table because the FOCI method was unable to process the credit dataset. FOCI's optimal transport layer encountered numerical errors when attempting to transport the entire source and target credit datasets. Here we quickly see that FastFOCI provides significant run time improvements over FOCI, but with a slight decline in model utility and fairness. This suggests that FastFOCI may be the best option with larger datasets, but that FOCI may be the better choice with a smaller dataset. On the other hand, SpFOCI did not see an improvement in fairness over FOCI or FastFOCI in most cases.
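For reference, statistical parity values like those reported in Table 4.3 can be computed from hard predictions and the protected attribute with a short helper such as the illustrative one below; the function is ours and not part of the released code.

import numpy as np

def statistical_parity(y_pred, protected):
    """Illustrative helper: |P(Y_hat=1 | X^(p)=1) - P(Y_hat=1 | X^(p)=0)|
    for binary predictions y_pred and a binary protected attribute."""
    y_pred, protected = np.asarray(y_pred), np.asarray(protected)
    rate_1 = y_pred[protected == 1].mean()   # positive-outcome rate, group 1
    rate_0 = y_pred[protected == 0].mean()   # positive-outcome rate, group 0
    return abs(rate_1 - rate_0)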
4.7 Conclusion

This chapter presents a framework called FOCI for fair cross-network node classification. The framework uses a novel fair Sinkhorn distance measure to encourage mapping between members of different protected groups in the source and target networks. It then presents FastFOCI, an incremental improvement to FOCI which improves performance and scalability, and SpFOCI, a combination of FastFOCI and an alternative fair OT approach. We have experimentally shown that all three proposed approaches help to mitigate unfairness while maintaining accuracy comparable to other state-of-the-art CNNC baselines. We have also presented run time values comparing our three approaches and demonstrating the performance improvements of FastFOCI.

CHAPTER 5: CONCLUSION & FUTURE WORK

In this thesis I present several approaches for using optimal transport for domain adaptation in the CNNC task. OTGCN was introduced in Chapter 3 and provides a basic CNNC approach which utilizes optimal transport for domain adaptation and a GCN for node representation learning. I then expanded upon the OTGCN framework to develop FOCI, an intuitively fair CNNC framework, in Chapter 4. FastFOCI and SpFOCI were then introduced as improvements addressing performance and offering an alternative fairness approach. OTGCN and FOCI were compared to other state-of-the-art CNNC baselines. FOCI, FastFOCI and SpFOCI were all compared on the largest of our datasets to explore scalability and performance.

Two immediate areas for further exploration are in the application of statistical parity and the sampling method. Perhaps SpFOCI could be improved by checking the transport plan within each iteration of the three Sinkhorn algorithms. Additionally, both FastFOCI and SpFOCI may see performance improvements with a more intelligent sampling approach such as active sampling [6][25]. Another area to explore is methods for considering fairness directly within alternative domain adaptation approaches. For example, instead of transporting source samples to the target domain, it may be possible to generate an adapted training set using diffusion or other generative approaches. Lastly, a large area for future work is to design methods which target different definitions of fairness. In this thesis FOCI and FastFOCI implemented an intuitive fairness approach while SpFOCI addressed statistical parity, but, as discussed in Chapter 1, statistical parity is not always an ideal or applicable metric. Other metrics such as equalized odds or disparate impact are more applicable in some scenarios and may be explored in the CNNC setting.

REFERENCES

[1] T. Abrishami, N. Guillen, P. Rule, Z. Schutzman, J. Solomon, T. Weighill, and S. Wu. Geometry of graph partitions via optimal transport. SIAM Journal on Scientific Computing, 42(5):A3340–A3366, 2020.

[2] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. Ver Steeg, and A. Galstyan. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, pages 21–29. PMLR, 2019.

[3] C. Agarwal, H. Lakkaraju, and M. Zitnik. Towards a unified framework for fair and stable graph representation learning. In Uncertainty in Artificial Intelligence, pages 2114–2124. PMLR, 2021.

[4] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias. ProPublica, May 23, 2016.

[5] Y. Bechavod and K. Ligett. Penalizing unfairness in binary classification. arXiv preprint arXiv:1707.00044, 2017.

[6] K.
Bellare, S. Iyengar, A. G. Parameswaran, and V. Rastogi. Active sampling for entity matching. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1131–1139, 2012. [7] R. Berk, H. Heidari, S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth. A convex framework for fair regression. arXiv preprint arXiv:1706.02409, 2017. [8] S. Brody, U. Alon, and E. Yahav. How attentive are graph attention networks? In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=F72ximsx7C1. [9] S. Cao, W. Lu, and Q. Xu. Deep neural networks for learning graph representations. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016. [10] R. Carette, F. Cilia, G. Dequen, J. Bosche, J.-L. Guerin, and L. Vandromme. Auto- matic autism spectrum disorder detection thanks to eye-tracking and neural network- based approach. In Internet of Things (IoT) Technologies for HealthCare: 4th Interna- tional Conference, HealthyIoT 2017, Angers, France, October 24-25, 2017, Proceedings 4, pages 75–81. Springer, 2018. [11] B. Chen, G. B´ecigneul, O.-E. Ganea, R. Barzilay, and T. Jaakkola. Optimal transport graph neural networks. arXiv preprint arXiv:2006.04804, 2020. [12] J. Chen, T. Ma, and C. Xiao. Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247, 2018. 46 [13] L. Chen, Z. Gan, Y. Cheng, L. Li, L. Carin, and J. Liu. Graph optimal transport for cross-domain alignment. In International Conference on Machine Learning, pages 1542–1553. PMLR, 2020. [14] B. Cheng, M. Liu, D. Shen, Z. Li, D. Zhang, and A. D. N. Initiative. Multi-domain transfer learning for early diagnosis of alzheimer’s disease. Neuroinformatics, 15:115– 132, 2017. [15] S. Corbett-Davies, E. Pierson, A. Feller, S. Goel, and A. Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd acm sigkdd international conference on knowledge discovery and data mining, pages 797–806, 2017. [16] N. Courty, R. Flamary, and D. Tuia. Domain adaptation with regularized optimal transport. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014. Proceedings, Part I 14, pages 274–289. Springer, 2014. [17] C. Craddock, S. Sikka, B. Cheung, R. Khanuja, S. S. Ghosh, C. Yan, Q. Li, D. Lurie, J. Vogelstein, R. Burns, et al. Towards automated analysis of connectomes: The config- urable pipeline for the analysis of connectomes (c-pac). Front Neuroinform, 42:10–3389, 2013. [18] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Cur- ran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/file/ af21d0c97db2e27e13572cbf59eb343d-Paper.pdf. [19] E. Dai and S. Wang. Say no to the discrimination: Learning fair graph neural net- In Proceedings of the 14th ACM works with limited sensitive attribute information. International Conference on Web Search and Data Mining, pages 680–688, 2021. [20] Q. Dai, X. Shen, X.-M. Wu, and D. Wang. Network transfer learning via adversarial domain adaptation with graph convolution. arXiv preprint arXiv:1909.01541, 2019. [21] Y. Dong, O. Lizardo, and N. V. Chawla. Do the young live in a “smaller world” than the old? 
age-specific degrees of separation in a large-scale mobile communication network. arXiv preprint arXiv:1606.07556, 2016. [22] U. Erkan and D. N. Thanh. Autism spectrum disorder detection with machine learn- ing methods. Current Psychiatry Research and Reviews Formerly: Current Psychiatry Reviews, 15(4):297–308, 2019. [23] S. Ferradans, N. Papadakis, G. Peyr´e, and J.-F. Aujol. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853–1882, 2014. [24] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. 47 [25] J. Gottlieb. Understanding active sampling strategies: Empirical approaches and im- plications for attention and decision research. Cortex, 102:150–160, 2018. [26] M. Grandini, E. Bagli, and G. Visani. Metrics for multi-class classification: an overview. arXiv preprint arXiv:2008.05756, 2020. [27] H. Guan, L. Wang, and M. Liu. Multi-source domain adaptation via optimal trans- port for brain dementia identification. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1514–1517. IEEE, 2021. [28] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017. [29] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29, 2016. [30] M. HassanPour Zonoozi and V. Seydi. A survey on adversarial domain adaptation. Neural Processing Letters, 55(3):2429–2469, 2023. [31] H. Jiang, P. Cao, M. Xu, J. Yang, and O. Zaiane. Hi-gcn: A hierarchical graph con- volution network for graph embedding learning of brain network and brain disorders prediction. Computers in Biology and Medicine, 127:104096, 2020. [32] T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma. Fairness-aware classifier with preju- dice remover regularizer. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Pro- ceedings, Part II 23, pages 35–50. Springer, 2012. [33] L. Kantorovitch. On the translocation of masses. Management Science, 5(1):1–4, 1958. doi: 10.1287/mnsc.5.1.1. URL https://doi.org/10.1287/mnsc.5.1.1. [34] F. Karimi, M. G´enois, C. Wagner, P. Singer, and M. Strohmaier. Homophily influences ranking of minorities in social networks. Scientific reports, 8(1):11077, 2018. [35] A. Kazi, S. Shekarforoush, S. Arvind Krishna, H. Burwinkel, G. Vivar, K. Kort¨um, S.-A. Ahmadi, S. Albarqouni, and N. Navab. Inceptiongcn: receptive field aware graph convolutional network for disease prediction. In Information Processing in Medical Imaging: 26th International Conference, IPMI 2019, Hong Kong, China, June 2–7, 2019, Proceedings 26, pages 73–85. Springer, 2019. [36] S. Khoshraftar and A. An. A survey on graph representation learning methods. arXiv preprint arXiv:2204.01855, 2022. [37] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. [38] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014. 48 [39] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. Advances in neural information processing systems, 27, 2014. [40] Q. Li, Z. Han, and X.-M. Wu. Deeper insights into graph convolutional networks for semi-supervised learning, 2018. [41] X. Li, Y. Gu, N. Dvornek, L. H. 
Staib, P. Ventola, and J. S. Duncan. Multi-site fmri analysis using privacy-preserving federated learning and domain adaptation: Abide results. Medical Image Analysis, 65:101765, 2020. [42] X. Li, Y. Zhou, N. Dvornek, M. Zhang, S. Gao, J. Zhuang, D. Scheinost, L. H. Staib, P. Ventola, and J. S. Duncan. Braingnn: Interpretable brain graph neural network for fmri analysis. Medical Image Analysis, 74:102233, 2021. [43] C. Lord, T. S. Brugha, T. Charman, J. Cusack, G. Dumas, T. Frazier, E. J. Jones, R. M. Jones, A. Pickles, M. W. State, et al. Autism spectrum disorder. Nature reviews Disease primers, 6(1):1–23, 2020. [44] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel. The variational fair autoen- coder. arXiv preprint arXiv:1511.00830, 2015. [45] B. T. Luong, S. Ruggieri, and F. Turini. k-nn as an implementation of situation testing for discrimination discovery and prevention. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 502–510, 2011. [46] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415–444, 2001. [47] A. K. Menon and R. C. Williamson. The cost of fairness in binary classification. In Conference on Fairness, accountability and transparency, pages 107–118. PMLR, 2018. [48] G. Monge. M´emoire sur la th´eorie des d´eblais et des remblais. Mem. Math. Phys. Acad. Royale Sci., pages 666–704, 1781. [49] H. S. Nogay and H. Adeli. Machine learning (ml) for the diagnosis of autism spectrum disorder (asd) using brain imaging. Reviews in the Neurosciences, 31(8):825–841, 2020. [50] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009. [51] S. Parisot, S. I. Ktena, E. Ferrante, M. Lee, R. G. Moreno, B. Glocker, and D. Rueckert. Spectral graph convolutions for population-based disease prediction. In Medical Image Computing and Computer Assisted Intervention- MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part III 20, pages 177–185. Springer, 2017. [52] S. Parisot, S. I. Ktena, E. Ferrante, M. Lee, R. Guerrero, B. Glocker, and D. Rueckert. Disease prediction using graph convolutional networks: application to autism spectrum disorder and alzheimer’s disease. Medical image analysis, 48:117–130, 2018. 49 [53] D. Pessach and E. Shmueli. A review on fairness in machine learning. ACM Computing Surveys (CSUR), 55(3):1–44, 2022. [54] N. Quadrianto and V. Sharmanska. Recycling privileged learning and distribution matching for fairness. Advances in neural information processing systems, 30, 2017. [55] T. Rahman, B. Surma, M. Backes, and Y. Zhang. Fairwalk: Towards fair graph em- bedding. 2019. [56] X. Shen, Q. Dai, F.-l. Chung, W. Lu, and K.-S. Choi. Adversarial deep network em- bedding for cross-network node classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 2991–2999, 2020. [57] X. Shen, Q. Dai, S. Mao, F.-l. Chung, and K.-S. Choi. Network together: Node clas- IEEE Transactions on Neural sification via cross-network deep network embedding. Networks and Learning Systems, 32(5):1935–1948, 2020. [58] I. Spinelli, S. Scardapane, A. Hussain, and A. Uncini. Fairdrop: Biased edge dropout for enhancing fairness in graph representation learning. IEEE Transactions on Artificial Intelligence, 3(3):344–354, 2021. [59] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu. 
A survey on deep trans- fer learning. In Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, pages 270–279. Springer, 2018. [60] V. Titouan, N. Courty, R. Tavenard, and R. Flamary. Optimal transport for structured In International Conference on Machine Learning, data with application on graphs. pages 6275–6284. PMLR, 2019. [61] T. Tong, K. Gray, Q. Gao, L. Chen, D. Rueckert, A. D. N. Initiative, et al. Multi-modal classification of alzheimer’s disease using nonlinear graph fusion. Pattern recognition, 63:171–181, 2017. [62] P. Veliˇckovi´c, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. ICLR, 2018. [63] G. Wang, R. Ying, J. Huang, and J. Leskovec. Multi-hop attention graph neural net- work. International Joint Conference on Artificial Intelligence, 2021. [64] J. Wang, L. Zhang, Q. Wang, L. Chen, J. Shi, X. Chen, Z. Li, and D. Shen. Multi-class asd classification based on functional connectivity and functional correlation tensor via multi-source domain adaptation and multi-view sparse representation. IEEE transac- tions on medical imaging, 39(10):3137–3147, 2020. [65] Z. Wang, M. Ye, X. Zhu, L. Peng, L. Tian, and Y. Zhu. Metateacher: Coordinating multi-model domain adaptation for medical image classification. Advances in Neural Information Processing Systems, 35:20823–20837, 2022. 50 [66] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger. Simplifying graph convolutional networks. In International conference on machine learning, pages 6861– 6871. PMLR, 2019. [67] M. Wu, S. Pan, C. Zhou, X. Chang, and X. Zhu. Unsupervised domain adaptive graph convolutional networks. In Proceedings of The Web Conference 2020, pages 1457–1467, 2020. [68] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka. Representation learning on graphs with jumping knowledge networks. In International conference on machine learning, pages 5453–5462. PMLR, 2018. [69] I.-C. Yeh and C.-h. Lien. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert systems with applications, 36(2):2473–2480, 2009. [70] J. Yuan, C. Holtz, T. Smith, and J. Luo. Autism spectrum disorder detection from semi-structured and unstructured medical data. EURASIP Journal on Bioinformatics and Systems Biology, 2017:1–9, 2016. [71] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International conference on machine learning, pages 325–333. PMLR, 2013. [72] Y. Zeng, D. Cao, X. Wei, M. Liu, Z. Zhao, and Z. Qin. Multi-modal relational graph for cross-modal video moment retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2215–2224, 2021. [73] C. Zhang, Y. Xie, H. Bai, B. Yu, W. Li, and Y. Gao. A survey on federated learning. Knowledge-Based Systems, 216:106775, 2021. [74] J. Zhang, X. Xiao, L.-K. Huang, Y. Rong, and Y. Bian. Fine-tuning graph neural net- works via graph topology induced optimal transport. arXiv preprint arXiv:2203.10453, 2022. [75] X. Zhang, Y. Du, R. Xie, and C. Wang. Adversarial separation network for cross- network node classification. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 2618–2626, 2021. [76] J. Zhou, L. Liu, W. Wei, and J. Fan. Network representation learning: from prepro- cessing, feature extraction to node embedding. ACM Computing Surveys (CSUR), 55 (2):1–35, 2022. [77] A. 
Zunino, P. Morerio, A. Cavallo, C. Ansuini, J. Podda, F. Battaglia, E. Veneselli, C. Becchio, and V. Murino. Video gesture analysis for autism spectrum disorder de- tection. In 2018 24th international conference on pattern recognition (ICPR), pages 3421–3426. IEEE, 2018. 51