GRAPH-BASED LEARNING FOR COMMUNITY DETECTION AND HUB NODE IDENTIFICATION

By

Meiby Ortiz-Bouza

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical and Computer Engineering—Doctor of Philosophy

2025

ABSTRACT

Many real-world systems can be represented using complex networks, where the different agents and their relations are represented as nodes and links, respectively. Traditional network models employ simple graphs, where the graph is represented by a set of vertices and the edges that connect them. With the advances in data acquisition technologies and the different types of data that are available, the simple graph model becomes insufficient to describe higher-dimensional relational datasets. For instance, in social networks, users can be defined as nodes connected by multiple types of interactions, such as friendship, collaboration, and economic exchange. Furthermore, each node is often associated with attributes such as demographics or interests. To overcome the limitations of existing simple graph models, multi-dimensional graphs such as multiplex networks have been proposed. Similarly, in order to capture the node information that is available in most real-world networks, attributed graphs have been introduced.

Given a large-scale complex network, one is usually interested in learning the underlying graph structure, such as the community structure or hub nodes. These structures uncover meaningful patterns and provide insights within complex networks. Community detection identifies groups of nodes that are more densely connected to each other than they are to the rest of the network. Hub nodes, on the other hand, correspond to nodes that are densely connected to the rest of the graph and play a critical role in information processing in the network. Although there are numerous works on community detection in single-layer networks, existing work on multiplex community detection mostly focuses on learning a common community structure across layers without taking the heterogeneity of the different layers into account. Beyond detecting communities within a single multiplex network, many applications may require comparing the community structures of two or more multiplex networks. Furthermore, most of the existing community detection methods focus solely on the graph connectivity information. In attributed graphs, where each node is associated with an attribute vector, the community detection methods that focus only on the edges and the data clustering methods that focus only on the attributes of the nodes become insufficient. Traditional hub detection methods rely mostly on graph connectivity without taking the node attributes into account.

This thesis addresses the limitations of learning these graph structures in high-dimensional and attributed networks. Novel algorithms for multiplex and attributed community detection, as well as approaches for discovering discriminative communities between two multiplex networks, are introduced using graph spectral theory and graph signal processing methods. Similarly, a graph signal processing approach that takes into account both the graph topology and node attributes is introduced for hub node identification.

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to everyone who has supported me throughout this journey.
First, I am deeply thankful to my advisor, Dr. Selin Aviyente, for her guidance, encouragement, invaluable insights, and the significant contributions she has made to my professional and personal growth throughout this process. Second, I'd like to thank my co-authors, Hanlu Yang, Duc Vu, Sema Athamnah, and Dr. Abdullah Karaaslanli, for their collaboration, valuable input, and insightful discussions. Additionally, I am grateful to my Ph.D. guidance committee members for their time, constructive feedback, and support at each milestone of this journey. A special thank you to my parents, stepparents, brother, and sister for their unconditional love and support; to my friends and extended family for being there through every step of this journey; and to my husband, whose love and encouragement have been my greatest source of strength.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Notations and Background
1.2 Graph Signal Processing
1.3 Multiplex Networks
1.4 Community Detection
1.5 Organization and Contributions

CHAPTER 2 COMMUNITY DETECTION IN MULTIPLEX NETWORKS
2.1 Introduction
2.2 Related Works
2.3 Proposed Method (MX-ONMTF)
2.4 Convergence Analysis
2.5 Recovery and Consistency Analysis
2.6 Experiments: Simulated Data and Real-world Networks
2.7 Application to fMRI Data: Subgroup Identification
2.8 Conclusions

CHAPTER 3 DISCRIMINATIVE COMMUNITY DETECTION FOR MULTIPLEX NETWORKS
3.1 Introduction
3.2 Background
3.3 Discriminative Community Detection Methods
3.4 Experiments: Multiplex Networks
3.5 Experiments: Temporal Multiplex Networks
3.6 Conclusions

CHAPTER 4 GRAPH FILTERING FOR CLUSTERING ATTRIBUTED GRAPHS
4.1 Introduction
4.2 Background
4.3 Graph Filtering for Clustering Attributed Graphs (GraFiCA)
4.4 Multi-Scale Graph Wavelets for Clustering (MSGWC)
4.5 Computational Complexity
4.6 Experimental Results on Real Networks
4.7 Conclusions
CHAPTER 5 GRAPH FILTERING LEARNING FOR STRUCTURE-FUNCTION COUPLING BASED HUB NODE IDENTIFICATION
5.1 Introduction
5.2 Related Work
5.3 Optimal Graph Filtering for Hub Node Identification
5.4 Experiments on Simulated Data
5.5 Application to Resting State fMRI Data
5.6 Discussion
5.7 Conclusions

CHAPTER 6 CONCLUSIONS
6.1 Future Work

BIBLIOGRAPHY

APPENDIX A: AUXILIARY FUNCTION PROOF
APPENDIX B: CONSISTENCY PROOF
APPENDIX C: INVERTIBILITY PROOF

CHAPTER 1
INTRODUCTION

Many real-world complex systems, spanning from social systems to biological entities, can be effectively modeled as complex networks, where entities and their interactions are represented as nodes and links, respectively [14]. These networks can be mathematically described using graphs, which consist of a set of vertices connected by edges. Modeling relational datasets as graphs provides a powerful framework for representing, analyzing, and extracting valuable insights from complex data. It enables the exploration of relationships, patterns, and structures in a wide range of applications, making it an essential tool in data analysis.

Traditional network models are based on simple graphs, where a single edge describes the relationship between the vertices. While this basic representation has been effective for many applications, it falls short when dealing with more complex and multidimensional datasets. The limitations of these single-graph models become evident in light of the increasing availability of diverse data types and the need to capture richer relationships within networks. In real-world scenarios, nodes may interact in various ways, and these interactions may carry different meanings. For example, in a transportation network, cities might be connected by different modes of transportation (buses, airplanes, trains). Similarly, the same group of people may interact in different ways on social media platforms such as Facebook, Twitter, and LinkedIn. Since single graph models can only denote one edge between vertices, they are not capable of capturing the multiple modes of interaction that can exist between nodes. To address this issue, multiplex networks have emerged as a model where the traditional graph model is extended by introducing multiple layers, each of which represents a distinct mode of interaction. Multiplex networks have found applications in modeling a broad spectrum of complex systems, including living organisms, human societies, transportation systems, and critical infrastructures [229, 7].

Another significant limitation of single-graph models is their inability to capture the rich node information often available in real-world networks.
In many applications, nodes have attributes or properties beyond their connectivity patterns [42, 207]. For instance, in protein networks, the interactions between proteins can be modeled as a graph, but there are also known properties of such proteins that are available. In social networks, in addition to the different modes of interaction between the nodes, information about each node, such as demographics, e.g., age, gender, location, and interests, may be available. To accommodate this type of information, attributed graphs have been introduced, extending traditional graph structures by allowing nodes to have attributes or properties. This extension enriches the representation of nodes and edges, enabling a more comprehensive analysis of complex systems.

Given a large complex network, one of the most important problems is dimensionality reduction and network structure discovery. Community detection and hub node identification are two particular forms of network structure discovery problems. Communities allow us to identify groups of functionally related objects (i.e., nodes) and the interactions between them. For example, in social networks, communities correspond to groups of friends who attended the same school or who come from the same hometown [151]; in protein interaction networks, communities are functional modules of interacting proteins [6]; in co-authorship networks, communities correspond to scientific disciplines [99]. Various methods have been proposed for detecting the community structure of single-layer networks [92]. Among these, the most commonly used ones are spectral clustering [155, 256], methods based on statistical inference [1], approaches based on optimization of a quality function [51], and techniques based on network dynamics [213]. Although there are numerous works on community detection in single-layer networks, existing work on multiplex community detection mostly focuses on learning a common community structure across layers without taking the heterogeneity of the different layers into account [52, 77, 245].

In the case of attributed networks, both the graph connectivity and node attributes are available. Thus, two sources of data can be used to perform the clustering task. The first is the data about the objects, i.e., node attributes. For example, known properties of proteins, users' social network profiles, or authors' publication histories may tell us which objects are similar, and to which communities they may belong. The second source of data is network connectivity, i.e., the set of edges between the nodes, such as the friendship relationships between users, the interactions between proteins, and the collaboration between authors. However, classical clustering methods typically focus only on one of these two data modalities. While data clustering methods such as k-means can be used to assign class labels based on attribute similarity [123], community detection algorithms aim to find communities based on the network structure, e.g., to find groups of nodes that are densely connected [91, 248, 182]. Employing just one of these two sources of information in isolation can result in the algorithm failing to account for important structures in the data.

In the case of hub node identification, the goal is to detect highly connected nodes, known as hubs, that play a central role in network connectivity and information processing.
Hub nodes play a critical role in various domains, such as social networks, biological systems, and brain connectomics, where they act as key influencers [118], essential proteins [252], or important brain regions [253], respectively. Traditionally, hubs have been defined based on structural properties using common measures of centrality such as degree [188], clustering coefficient [189], vulnerability [131], betweenness [282], and eigenvector centrality [165], which rely solely on connectivity information. While these measures effectively capture nodes with high influence in terms of network topology, they overlook intrinsic node attributes that may be equally important in defining hub roles. One critical area where hub identification has significant implications is brain network analysis. Although there have been some works on identifying brain hubs [129], most of the proposed methods rely only on the functional connectivity network without considering the coupling between the brain's anatomical wiring, i.e., the structural network, and its dynamic functional properties, e.g., the BOLD signal [93].

1.1 Notations and Background

Vectors and matrices are indicated by bold lowercase letters, x, and bold uppercase letters, X, respectively. Entries of a vector are denoted as x_i, and entries of a matrix are denoted as X_ij. The i-th row and column of X are indicated as X_i· and X_·i, respectively, and they are both column vectors. Superscript ⊤ indicates the transpose of vectors and matrices, the identity matrix is shown by I, and ‖·‖_F and ‖·‖_1 are the Frobenius norm and ℓ1 norm, respectively.

1.1.1 Graphs

Let G = (V, E) be a graph, where V is the vertex set with |V| = N and E ⊆ V × V is the edge set. An edge from vertex i to j is represented by e_ij ∈ E and is associated with a weight. Graphs are natural representations of networks, where vertices correspond to network nodes and edges correspond to network links. A graph can be represented algebraically by an adjacency matrix A ∈ R^{N×N}. If the graph is undirected, then the adjacency matrix is symmetric. For a weighted graph, A_ij ∈ [0, 1], and for a binary graph, A_ij ∈ {0, 1}. In this thesis, we use undirected (symmetric) weighted and binary adjacency matrices. The degree matrix, D, of an undirected graph is a diagonal matrix with elements D_ii equal to the sum of the weights of all edges connected to vertex i, that is, the sum of the elements in the i-th row of A, D_ii = ∑_j A_ij. Therefore, for an unweighted and undirected graph, the value of the element D_ii is equal to the number of edges connected to the i-th vertex.

A graph can also be represented by the graph Laplacian, which combines the adjacency matrix and the degree matrix and is given by L = D − A. The graph Laplacian matrix carries more physical interpretation than the corresponding adjacency matrix due to its intrinsic connection to the number of disjoint components or subgraphs within the graph. The Laplacian matrix is defined in such a way that the sum of the elements in each row (column) is zero. As a consequence, this enforces the inner products of every row of L with any constant vector to be zero. This means that at least one eigenvalue of the Laplacian is zero, λ_0 = 0. The multiplicity of the eigenvalue λ_0 = 0 of the Laplacian is equal to the number of connected subgraphs in the corresponding graph. This property follows from the fact that the Laplacian matrix of a disconnected graph can be written in block diagonal form, and each subgraph of a disconnected graph behaves as an independent graph with its own eigenvalue λ_0 = 0. This property does not hold for the adjacency matrix.
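To make this property concrete, the following minimal NumPy sketch (the two-triangle example graph is ours, not from the thesis) builds A, D, and L for a graph with two disconnected components and verifies that the Laplacian has exactly two zero eigenvalues:

import numpy as np

# Adjacency matrix of an undirected graph made of two disconnected
# triangles, so we expect two zero eigenvalues of the Laplacian.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))          # degree matrix, D_ii = sum_j A_ij
L = D - A                           # combinatorial graph Laplacian

eigvals = np.linalg.eigvalsh(L)     # L is symmetric positive semidefinite
n_components = int(np.sum(np.isclose(eigvals, 0.0)))
print(n_components)                 # -> 2, one zero eigenvalue per component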
The normalized Laplacian matrix L_n is defined as L_n = D^{−1/2}(D − A)D^{−1/2} = I_N − D^{−1/2}AD^{−1/2} = I_N − A_n, where I_N is the identity matrix of size N and A_n is the normalized adjacency matrix. The spectrum of L_n is composed of the diagonal matrix of eigenvalues, Λ = diag(λ_1, ..., λ_N) with λ_1 ≤ λ_2 ≤ ... ≤ λ_N, and the eigenvector matrix U = [u_1 | u_2 | ... | u_N], such that L_n = UΛU^⊤ [57].

1.2 Graph Signal Processing

Graph signal processing (GSP) extends traditional signal processing tools to deal with data defined on graph or network structures. In GSP, a graph signal is defined as a numerical or informational quantity associated with each node (or vertex), mathematically described by a function f : V → R. The graph signal can be represented as a vector f ∈ R^N, where f_i is the signal value on node i. If for each node the dimension of the signal, i.e., the attributes, is P, the graph signal can be represented as a matrix F ∈ R^{N×P}. The graph signal value can be continuous or discrete, and it can convey various types of information depending on the application. Examples include temperature readings from sensor nodes, pixel values in an image graph, or feature vectors representing individuals in a social network. Graph signal processing techniques are applied to analyze, filter, transform, or extract meaningful information from these graph signals. GSP methods leverage the graph's connectivity structure to process signals in a way that takes into account the relationships between nodes.

Akin to conventional signal processing, a graph signal can be studied using its Fourier domain representation, which can be derived using the graph shift operator (GSO). A GSO is an N × N matrix representing the structure of the graph, such as the adjacency, Laplacian, or normalized Laplacian matrices [224]. In this work, the latter is employed as the GSO, and its eigenvectors and eigenvalues provide a notion of frequency similar to that in the classical Fourier domain. For connected graphs, the Laplacian eigenvector u_0 associated with the eigenvalue λ_0 = 0 is constant and equal to 1/√N at each vertex. The graph Laplacian eigenvectors associated with low frequencies vary slowly across the graph, and the values of the eigenvector at connected locations are likely to be similar. The eigenvectors associated with larger eigenvalues oscillate more rapidly and are more likely to have dissimilar values on vertices connected by a high-weight edge. These eigenvectors and eigenvalues are used to define the Graph Fourier Transform (GFT) of f as f̂ = U^⊤f, where f̂_i is the Fourier coefficient at the i-th frequency component Λ_ii. When the dimension of the signals is P, i.e., F ∈ R^{N×P} with F_·p = f_p, the GFT can be computed as F̂ = U^⊤F.
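As an illustration of the GFT as defined above, here is a minimal NumPy sketch (the function name is ours, and a connected graph with no isolated nodes is assumed so that D^{−1/2} is well defined):

import numpy as np

def graph_fourier_transform(A, F):
    # Normalized Laplacian L_n = I - D^{-1/2} A D^{-1/2} as the GSO.
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_n = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt
    lam, U = np.linalg.eigh(L_n)     # ascending eigenvalues = graph frequencies
    F_hat = U.T @ F                  # Fourier coefficients, one row per frequency
    return lam, U, F_hat

Energy of F_hat concentrated in the first rows (small eigenvalues) indicates a smooth graph signal, which connects directly to the total variation measure introduced next.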
The GFT can be utilized to define a notion of signal variability over the graph: F is a smooth graph signal, i.e., F has low variation over the graph, if most of the energy of F̂ lies in the low-frequency components. The smoothness of F can then be calculated using the total variation of F, measured in terms of the spectral density of its Fourier transform as:

tr(F̂^⊤ Λ F̂) = tr(F^⊤ U Λ U^⊤ F) = tr(F^⊤ L_n F).   (1.1)

The quadratic term tr(F^⊤ L_n F) on the right-hand side of (1.1) is equal to

∑_{i≠j} A_ij (F_i·/√d_i − F_j·/√d_j)²,

whose smaller values indicate that the graph signal is smooth. In particular, for a smooth signal F, signal values on strongly connected nodes, i.e., large A_ij, are similar to each other.

1.2.1 Graph Filtering

A (convolutional) graph filter is an information processing unit that preserves specific properties of its input graph signal f by applying a shift-and-sum operation on f [121]. Shifting f requires propagating information in f across the graph topology, which can be done by a linear transformation of f with the GSO, e.g., L_n f, when the GSO is the normalized Laplacian. A graph filter of order T is then defined as:

H(L_n) = ∑_{t=0}^{T−1} h_t L_n^t,   (1.2)

where h = [h_0, ..., h_{T−1}]^⊤ is the vector of filter coefficients and L_n^t is the t-th power of the normalized Laplacian, representing a t-times shift operation. In the graph Fourier domain, the filtering operation can be interpreted as preserving the spectral content that is relevant to the task at hand. Namely, for a graph signal f, let f̃ = H(L_n)f. In the graph Fourier domain:

f̃ = U H(Λ) U^⊤ f = U H(Λ) f̂ = U f̂_o,   (1.3)

where f̂_o is the GFT of f after being filtered by H(Λ), and whose i-th Fourier coefficient is:

[f̂_o]_i = H(Λ_ii) f̂_i = ∑_{t=0}^{T−1} h_t Λ_ii^t f̂_i.   (1.4)

Depending on the values of h, one can attenuate or amplify specific spectral components, thus yielding an output graph signal f̃ that has a desired Fourier representation. For example, a smooth f̃ can be obtained by filtering out high-frequency components while amplifying low-frequency ones. This concept of graph filtering can be extended to multi-dimensional graph signals, F ∈ R^{N×P}, such that the filtered signal is obtained as F̃ = U H(Λ) U^⊤ F.
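A short sketch of the shift-and-sum operation in Eq. (1.2), applied directly in the vertex domain to a multi-dimensional signal; the example coefficients are arbitrary, chosen only to act as a crude low-pass response H(λ) = 1 − 0.5λ:

import numpy as np

def apply_graph_filter(L_n, F, h):
    # H(L_n) F = sum_t h[t] * L_n^t F, computed by repeated shifting.
    out = np.zeros_like(F, dtype=float)
    shifted = F.astype(float)           # L_n^0 F
    for h_t in h:
        out += h_t * shifted            # accumulate h_t * L_n^t F
        shifted = L_n @ shifted         # next power of the shift operator
    return out

# Example: F_smooth = apply_graph_filter(L_n, F, h=[1.0, -0.5])

Working in the vertex domain avoids the eigendecomposition entirely, which is why polynomial filters of modest order T are attractive on large graphs.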
1.3 Multiplex Networks

Multiplex networks are complex systems that consist of multiple layers, each representing a different type of relationship or interaction among entities. Multiplex networks are used in a broad range of applications, such as social networks, transportation networks, and biological ones [229, 7]. For example, in social media, a multiplex network can represent a user's interactions across multiple platforms, such as Facebook, Twitter, and LinkedIn. Each layer corresponds to a different social media platform, as seen in Figure 1.1, and nodes represent users. Edges within each layer represent interactions like friendships, follows, or messages. In transportation networks, a multiplex graph can model various modes of transportation, including road networks, railway systems, and air travel. Each layer represents a different mode of transport, with nodes representing locations (cities or transportation hubs) and edges denoting transportation routes.

Figure 1.1: Example of a Multiplex Social Network.

Multiplex networks can be represented using a finite sequence of graphs {G_l}, where l ∈ {1, 2, ..., L} and G_l = (V, E_l, A_l) [62]. V is the set of vertices, which is the same for all layers l, while E_l and A_l ∈ R^{N×N} are the set of edges and the adjacency matrix for layer l, respectively.

1.4 Community Detection

One of the most fundamental tasks in analyzing large-scale networks is community detection. A community is a dense subnetwork within a larger network, within which nodes are connected more densely among themselves than to those outside the community. Numerous methods have been proposed for detecting the community structure of single-layer networks [92]. Among these, the most commonly used ones are spectral clustering [155, 256], methods based on statistical inference [1], approaches based on optimization of a quality function [51], and techniques based on network dynamics [213]. In addition to these classical approaches focusing on nonoverlapping community structure, extensions for overlapping communities have also been considered [86]. In this thesis, we focus on nonoverlapping community detection, which is the partitioning of a node set V as C = {C_1, ..., C_K}, where K is the number of communities.

1.4.1 Spectral Clustering

One of the most popular algorithms for partitioning graphs is the minimum cut method, where the network is partitioned such that the number of edges between different communities is minimized. Given a single-layer graph, G = {V, E, A}, and K clusters, minimizing the cut consists of finding a partition {C_1, C_2, ..., C_K} that minimizes:

Cut(C_1, C_2, ..., C_K) = (1/2) ∑_{k=1}^{K} links(C_k, V \ C_k),   (1.5)

where links(C_k, V \ C_k) = ∑_{i∈C_k, j∉C_k} A_ij. In practice, minimizing the cut results in clusters that are unbalanced. In order to address this issue, different variations of the cut definition, e.g., Ratio Cut (RCut), Normalized Cut (NCut), and MinMax Cut [74, 258], have been proposed. The Normalized Cut is defined as:

NCut(C_1, C_2, ..., C_K) = ∑_{k=1}^{K} links(C_k, V \ C_k) / vol(C_k),   (1.6)

where vol(C_k) is the total degree of all nodes in C_k. Similar to minimizing the cut metrics, one can determine the partition by maximizing the corresponding association metrics [73]. For example, the normalized association is defined as:

NAssoc(C_1, C_2, ..., C_K) = ∑_{k=1}^{K} links(C_k, C_k) / vol(C_k).   (1.7)

Minimizing or maximizing these cost functions is NP-hard. However, it has been shown [74, 258] that spectral clustering and nonnegative matrix factorization provide solutions to relaxed versions of these problems. Introducing a community assignment matrix Z ∈ R^{N×K}, the normalized cut can be rewritten as

∑_{k=1}^{K} links(C_k, V \ C_k)/vol(C_k) = ∑_{k=1}^{K} (z_k^⊤(D − A)z_k)/(z_k^⊤ D z_k) = ∑_{k=1}^{K} z̃_k^⊤ L z̃_k,

where z̃_k = z_k / (z_k^⊤ D z_k)^{1/2}. Relaxing the problem by allowing z_k to take arbitrary real values, the corresponding optimization problem becomes:

min_{Z̄ : Z̄^⊤Z̄ = I} tr(Z̄^⊤ D^{−1/2} L D^{−1/2} Z̄),   (1.8)

where Z̄ = D^{1/2} Z̃. This is the standard form of a trace minimization problem, and the Rayleigh–Ritz theorem states that the solution is given by choosing Z̄ as the matrix containing the K eigenvectors corresponding to the smallest eigenvalues of the normalized Laplacian D^{−1/2} L D^{−1/2} as columns. Minimizing the NCut can equivalently be written in terms of maximizing the NAssoc in trace form:

max_{Z̄ : Z̄^⊤Z̄ = I} tr(Z̄^⊤ D^{−1/2} A D^{−1/2} Z̄),   (1.9)

where D^{−1/2} A D^{−1/2} is the normalized adjacency matrix, which will be denoted by A_n in the remainder of the thesis. In both of these formulations, the eigenvectors encode the low-dimensional embeddings of the graph's nodes, and the k-means algorithm is applied to determine the communities. Authors in [74] show that NMF is equivalent to Laplacian-based spectral clustering, and that Normalized Cut using the normalized adjacency matrix, A_n = D^{−1/2} A D^{−1/2}, is equivalent to the nonnegative matrix factorization problem argmin_{H≥0} ‖A_n − HH^⊤‖_F^2.
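For reference, a compact sketch of the relaxed NCut pipeline just described: the K smallest eigenvectors of the normalized Laplacian provide the node embedding, and k-means recovers the partition. The function name is ours, and scikit-learn's KMeans is assumed to be available:

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, K):
    # Relaxed NCut: embed nodes with the K eigenvectors of the normalized
    # Laplacian associated with the smallest eigenvalues, then run k-means.
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_n = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    _, U = np.linalg.eigh(L_n)           # columns sorted by eigenvalue
    Z_bar = U[:, :K]                     # solution of the trace minimization (1.8)
    return KMeans(n_clusters=K, n_init=10).fit_predict(Z_bar)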
1.4.2 Non-negative Matrix Factorization

Methods based on nonnegative matrix factorization (NMF) and its variants have been popular for community detection [111]. Compared to other community detection methods, these methods have some unique advantages, including high interpretability and applicability to general complex networks, including directed, temporal, and multilayer networks [111].

Nonnegative matrix factorization decomposes a nonnegative matrix W ∈ R^{N×M} into the product of two low-rank nonnegative matrices V ∈ R^{N×k} and U ∈ R^{M×k}, such that W ≈ VU^⊤ and k ≪ N, M. V and U are found by solving the optimization problem

argmin_{V,U ≥ 0} ‖W − VU^⊤‖_F^2.   (1.10)

Compared with other types of matrix factorization techniques, NMF is more suitable for the task of community detection because it has two unique capabilities. One is the clustering capability possessed by NMF. In [74], NMF and its extensions are proved to have equivalent relationships with some classical clustering models. For example, if U^⊤U = I, then NMF becomes equivalent to k-means clustering. If W is symmetric, then the factorization can be written as W ≈ UU^⊤, and NMF becomes equivalent to spectral clustering. The other is the generative capability of NMF, which gives a good interpretation of community structure [206]. In NMF-based community detection, V and U are the community feature matrix and the community indicator matrix, respectively. The conventional NMF model, W ≈ VU^⊤, can be directly used to detect communities. However, it cannot model the interactions among communities, which is useful to determine whether communities are overlapping. Nonnegative matrix tri-factorization (NMTF) was introduced to address this issue, where W ≈ USU^⊤, with S modeling the interactions among communities.

Another variant of NMF, orthogonal NMF (ONMF), imposes an additional orthogonality constraint on one of the factor matrices and improves the performance, as orthogonality and nonnegativity force each row of V (U) to have only one nonzero element, which implies that each node belongs to only one community. As in k-means, the orthogonally constrained factor matrix functions as an indicator matrix that shows how the data samples are assigned to different clusters [75, 201]. Therefore, the ONMF model can be regarded as a continuous relaxation of k-means [262]. ONMF has been broadly used in community detection and has been shown to outperform k-means and NMF-based clustering [264, 167]. In this thesis, we use Orthogonal Nonnegative Matrix Tri-Factorization (ONMTF), which combines the advantages of orthogonality in ONMF and the degrees of freedom provided by NMTF, to formulate the community detection problem in multiplex networks in Chapter 2.
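As a concrete baseline, a minimal sketch of the Lee–Seung multiplicative updates for the plain NMF model in Eq. (1.10); the random initialization, iteration count, and the small eps guard against division by zero are our implementation choices:

import numpy as np

def nmf(W, k, n_iter=500, eps=1e-9, seed=0):
    # Multiplicative updates for argmin_{V,U >= 0} ||W - V U^T||_F^2.
    N, M = W.shape
    rng = np.random.default_rng(seed)
    V = rng.random((N, k))
    U = rng.random((M, k))
    for _ in range(n_iter):
        V *= (W @ U) / (V @ U.T @ U + eps)    # update community features
        U *= (W.T @ V) / (U @ V.T @ V + eps)  # update community indicators
    return V, U

Because each factor is rescaled by a ratio of nonnegative terms, nonnegativity is preserved automatically at every iteration, which is the same mechanism the multiplicative updates in Chapter 2 rely on.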
1.5 Organization and Contributions

In this thesis, we developed algorithms for two important tasks in network analysis: community detection and hub node identification. Although there are numerous works on community detection in single-layer networks, existing work on multiplex community detection mostly focuses on learning a common community structure across layers without taking the heterogeneity of the different layers into account. Beyond detecting communities within a single multiplex network, in many applications one may be interested in comparing the community structure of two multiplex networks. For instance, in settings where we have multiplex networks constructed from different conditions, e.g., a treatment and a control experiment, we are interested in visualizing and exploring communities that are specific to one multiplex network or communities that discriminate between the two groups. Furthermore, most of the existing community detection methods focus solely on the graph connectivity information. In attributed graphs, where in addition to connectivity information nodes have associated features or signals, classical clustering methods focus on detecting communities using the attributes of the nodes [123], ignoring the relationships between the nodes, while community detection methods focus only on the topology of the network [248, 182]. However, these methods usually fall short in attributed graph clustering, as they do not exploit informative node features such as user profiles in social networks and document contents in citation networks. In the case of identifying hubs, traditional hub detection methods rely on graph connectivity, e.g., degree, betweenness, and eigenvector centrality, without considering the node attributes. Specifically, in brain networks, this corresponds to considering only the functional connectivity network without considering the coupling between the brain's anatomical wiring, i.e., the structural network, and its dynamic functional properties, e.g., the BOLD signal.

In Chapter 2, we introduce a new multiplex community detection method that identifies communities that are common across layers as well as those that are unique to each layer. The proposed method, Multiplex Orthogonal Nonnegative Matrix Tri-Factorization (MX-ONMTF), represents the adjacency matrix of each layer as the sum of two low-rank matrix factorizations corresponding to the common and private communities, respectively. Unlike most of the existing methods, which require the number of communities to be pre-determined, the proposed method also introduces a two-stage approach to determine the number of common and private communities. The proposed algorithm is based on multiplicative update rules, and a proof of convergence is provided. Additionally, we present an in-depth analysis of the algorithm, including studies of overfitting and ablation, recovery guarantees, and consistency. MX-ONMTF is evaluated on synthetic and real multiplex networks, as well as for multiview clustering applications, and compared to state-of-the-art techniques. In addition, MX-ONMTF is applied to functional connectivity networks that are extracted from multi-subject resting-state fMRI data to identify subgroups of subjects that exhibit significant differences in key functional areas of the brain.

In Chapter 3, we introduce a discriminative community detection approach based on spectral clustering for detecting community structures that distinguish between two multiplex networks in both static and dynamic settings. In particular, we introduce three different formulations: the first approach finds discriminative subspaces between two multiplex networks; the second method offers a more comprehensive approach where the consensus, discriminative, and individual layerwise subspaces are learned simultaneously across the two groups; and the third method learns the discriminative subgraphs between two temporal multiplex networks. These methods are evaluated on synthetic and real networks, including EEG and dynamic fMRI functional brain networks, comparing across experimental conditions and tasks.

In Chapter 4, we present two graph signal processing based frameworks for community detection in attributed networks: Graph Filtering for Clustering Attributed Graphs (GraFiCA) and Multi-Scale Graph Wavelets for Clustering (MSGWC).
A cost function quantifying the separability of the filtered attributes is proposed, along with a general framework for learning optimal graph filters. For GraFiCA, the parameters of Finite Impulse Response (FIR) and Autoregressive Moving Average (ARMA) filters are learned, while for MSGWC, we learn the optimal combination of multi-scale features from graph spectral wavelet and scaling filters. The proposed methods are evaluated on real-world attributed networks with both binary and numerical attributes and compared to state-of-the-art graph clustering algorithms. GraFiCA is also extended to multiplex networks and evaluated on an EEG brain network dataset.

In Chapter 5, we introduce a GSP-based framework for hub node identification in brain networks utilizing both the structural connectome and functional BOLD signals. The proposed approach is based on learning the optimal graph filter for detecting hub nodes under the following assumptions: (i) hub nodes are sparse and have high activation patterns simultaneously with a more diverse set of connections, i.e., their activity corresponds to the high-frequency component of the BOLD signal, and (ii) the non-hub nodes' activation patterns are low-frequency/smooth with respect to the structural connectome, and thus can be modeled as the output of a graph diffusion kernel, e.g., a polynomial graph filter. The proposed method is evaluated on both simulated data and rs-fMRI data from the HCP. The results are compared to state-of-the-art hub node identification methods and a recently published meta-analysis of hub nodes in rs-fMRI [271].

Finally, conclusions and future directions are presented in Chapter 6, where we discuss extensions of the presented methods.

CHAPTER 2
COMMUNITY DETECTION IN MULTIPLEX NETWORKS

2.1 Introduction

Community detection methods for multiplex networks [170] can be grouped into three main classes. The first class of methods merges the layers in a multiplex network using a flattening algorithm and then applies single-layer community detection to the aggregated network [23, 52, 245]. While these methods are computationally efficient, they can only identify communities that are common across all layers. Moreover, due to the flattening process, some spurious communities may emerge. The second class of methods applies community detection to each layer individually and then merges the results [24, 243, 77]. These methods include nodes in the same community only when they are part of the same community in at least one layer. Finally, the third class of methods operates directly on the multiplex network model [145, 69, 182, 11, 290, 193].

Existing multiplex community detection approaches typically assume that the community structure is the same across layers and find the partition that best fits all layers. Thus, they do not differentiate communities that are common across layers from those that are unique to each layer. This is particularly important for real-world applications where the networks are heterogeneous and the different layers correspond to different modes of interaction. For example, in social networks, a group of individuals may be well connected via friendships on Facebook; however, this group of individuals will likely not work at the same company. Thus, in a situation like this, a given community will only be present in a subset of the layers, and different communities may be present in different subsets of layers.
In this work, common communities are defined as communities that are observed in more than one layer, i.e., communities that are common across any subset of two or more layers, and private communities as communities that are unique to each layer. The problem of detecting common and private communities is then formulated using a novel framework titled Multiplex Orthogonal Nonnegative Matrix Tri-Factorization (MX-ONMTF). In the proposed framework, each layer's adjacency matrix is represented as the sum of two low-rank matrix factorizations corresponding to the common and private communities, respectively. The resulting optimization problem is solved using an iterative multiplicative update algorithm. The proposed approach also addresses the problem of determining the number of communities. Unlike most existing work, where the number of communities is determined through a greedy search, in this work a two-step approach is proposed. The proposed algorithm is first evaluated on synthetic benchmark multiplex networks with different numbers of layers, nodes, communities, noise levels, and inter-layer dependency probabilities. Next, the proposed method is applied to real networks, including social and biological networks. Finally, the algorithm is evaluated for the multiview clustering task, where the communities across all layers are assumed to be the same.

The rest of the chapter is organized as follows. Section 2.2 presents a summary of related works. Sections 2.3 and 2.4 present the proposed multiplex community detection algorithm and its convergence analysis. Section 2.5 establishes the theoretical properties of the algorithm, while Section 2.6 illustrates results on both simulated and real networks. Section 2.7 presents an application of our method to functional connectivity networks that are extracted from multi-subject resting-state fMRI data. Finally, Section 2.8 provides conclusions and a discussion of future work.

2.2 Related works

The method proposed in this chapter belongs to the third class of algorithms, which operate directly on the multiplex network model. There are different types of algorithms that fall in this class: random walk, statistical generative network models, label propagation, objective function optimization, and nonnegative matrix factorization (NMF).

Methods based on random walkers model the dynamic process on networks as random walks, where the process is more likely to persist on the vertices in the same community and far less on the vertices in different communities. For instance, LART [145] is initialized by assigning each node in each layer to its own community. Hierarchical clustering is then used to merge nodes based on a distance matrix, and the partition with the highest multiplex modularity is chosen. In [69], Infomap, which is based on a compression of network flows, is proposed to identify communities within and across layers. However, Infomap tends to assign each physical node across layers to the same community, not accounting for the topological differences across layers.

Statistical methods use variants of the Stochastic Block Model (SBM) to model the latent variables. Among these, multilayer stochastic block models (MLSBM) [197, 25, 110, 149, 36] model each layer's adjacency matrix through a common community membership matrix, Z, and a layer-specific connectivity probability matrix B_m, where the goal is to infer Z.
Consistency properties of various methods, such as orthogonal linked matrix factorization (OLMF) [244, 197, 77] and spectral clustering under the MLSBM, have been investigated [197, 25, 110]. More recently, the mixture multilayer stochastic block model (MMLSBM) has been proposed to model the heterogeneity of multiplex networks [128, 89, 241, 233]. MMLSBM assumes that there is a mixture of m latent network models and that each network is sampled independently from this mixture, with each of the m classes following an SBM. While these methods provide some flexibility in modeling heterogeneous networks, they still assume that the layers can be clustered into subgroups where each subgroup has exactly the same community structure and connectivity. Similarly, the Weighted Stochastic Block Model (WSBM) [8] has been proposed to detect common and private communities in heterogeneous weighted networks. Although this method addresses the heterogeneity of networks across layers, it is limited to detecting only common communities that are shared by all layers, ignoring communities that may be shared by only a subset or different subsets of layers. In [66], the authors propose a generative model and an expectation-maximization algorithm for community detection and link prediction in multilayer networks. Although the method allows for different connectivity patterns in each layer, the interdependence between layers is only taken into account for link prediction, while the layers are assumed to share a common community structure.

The third class of methods, Label Propagation Algorithms (LPA), are based on the intuition that a label can become dominant in a densely connected group of nodes but will have trouble crossing a sparsely connected region. In [32], an LPA-based method for community detection in multidimensional networks is proposed to simultaneously identify communities and the subset of layers in which each of these communities is observed. However, this algorithm fails to detect communities that are private to each layer and communities that may be common among a small number of layers.

The fourth type of multiplex community detection methods is based on defining an objective function and identifying the community structure that maximizes/minimizes it. For example, Generalized Louvain (GenLouvain) [182] uses an extended definition of modularity and is one of the fastest methods for community detection in multiplex networks. As GenLouvain assigns each node-layer tuple to its own community, it cannot identify common communities across layers. More recently, multiobjective genetic and evolutionary algorithms such as MultiMOGA [11] and MOEA/D-TS [135] have been used to jointly maximize the modularity of each layer and the similarity between the community structures across layers. These methods find a shared community structure across all layers, not differentiating communities that may be unique to each layer. In [49], an extension of the normalized cut to multiplex networks is proposed by constructing a block Laplacian matrix with each block corresponding to a layer. This method relies on selecting a parameter β that controls the consistency of the community structure across different layers.

The last class of methods is based on NMF, which, because of its interpretability and good performance, has been broadly used for community detection in single-layer, multiplex, multilayer, and dynamic networks [74, 260, 172, 237].
In [169], Semi-Supervised joint Nonnegative Matrix Factorization (S2-jNMF) is proposed for detecting the common communities across layers in a multiplex network. A greedy search of dense subgraphs is performed, and these subgraphs are used as a priori information to create new adjacency matrices for each layer. In [101], a two-step approach is proposed, where first a nonnegative low-dimensional feature representation of each layer is found using one of four different NMF models. These community structures are then used to obtain a consensus community structure. Authors in [187] use NMF for detecting communities in multiplex social networks, where both unifying and coupling approaches are proposed. The unifying approach finds a common community structure by aggregating all layers, while the coupling approach finds mostly consistent community structures. Most of the aforementioned NMF-based methods find a common structure across all layers or for a majority of layers and do not consider cases where common communities may be present in different subsets of layers. Moreover, they do not detect private communities. These methods also require that the number of communities be provided a priori.

Figure 2.1: Illustration of the proposed community detection algorithm. (a) Toy example of a multiplex network with 3 layers, 8 nodes, and 2 common communities (purple and green) across different pairs of layers. (b) Community membership matrices obtained from the MX-ONMTF cost function. (c) Post-processing described in Algorithm 2.3 applied to H to detect which of the common communities are present in each layer, and the final community labels.

2.3 Proposed Method (MX-ONMTF)

The proposed method, MX-ONMTF, models each layer's adjacency matrix as a sum of low-rank representations of common and private communities using Orthogonal Nonnegative Matrix Tri-Factorization (ONMTF). Figure 2.1 illustrates the overview of the proposed algorithm for a multiplex network with three layers and two common communities. In this example, there are two common communities that do not exist in all layers. The purple community is common across layers 1 and 2, while the green one is common across layers 1 and 3. Figure 2.1b shows the low-rank representations corresponding to the common and private communities, and Figure 2.1c shows the post-processing applied to H in order to determine which layers each common community is present in.

2.3.1 Problem Formulation

In this work, we define common communities as communities that are observed in more than one layer.

Definition 1: An ideal common community C in a multiplex network {G_l} is defined as a subgraph with the same set of nodes for a subset of layers m ⊆ {1, 2, ..., L}, where |m| > 1. Mathematically, C can be defined as

C = {(V_l^C, E_l^C) : V_l^C ⊆ V_l, E_l^C ⊆ V_l^C × V_l^C, V_l^C = V_k^C, with l, k ∈ m, m ⊆ {1, 2, ..., L}, |m| > 1}.

Definition 2: A private community C in a multiplex network {G_l} is defined as any community that is not common across at least two layers.

For a multiplex network with L layers and adjacency matrices A_l ∈ R^{N×N}, l ∈ {1, 2, ..., L}, we model each layer's adjacency matrix in terms of common and individual communities using ONMTF. The resulting objective function can be formulated as

argmin_{H≥0, H_l≥0, S_l≥0, G_l≥0} ∑_{l=1}^{L} ‖A_l − H S_l H^⊤ − H_l G_l H_l^⊤‖_F^2
s.t. H^⊤H = I, H_l^⊤H_l = I, with l ∈ {1, 2, ..., L},   (2.1)

where H ∈ R^{N×k_c} and H_l ∈ R^{N×k_{p_l}}, l ∈ {1, 2, ..., L}, are the community membership matrices corresponding to the common and private communities, respectively, and S_l and G_l are diagonal matrices whose entries indicate the strength of the common and private communities across layers, respectively.
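For reference, the cost in Eq. (2.1) can be evaluated directly; a minimal NumPy sketch (the function name and the list-based layer representation are our choices):

import numpy as np

def mx_onmtf_objective(A_list, H, H_list, S_list, G_list):
    # Sum over layers of the squared Frobenius error between A_l and its
    # common-plus-private reconstruction H S_l H^T + H_l G_l H_l^T.
    total = 0.0
    for A, H_l, S_l, G_l in zip(A_list, H_list, S_list, G_list):
        R = H @ S_l @ H.T + H_l @ G_l @ H_l.T
        total += np.linalg.norm(A - R, 'fro') ** 2
    return total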
In this work, it is assumed that the L layers have a total of k_c common communities and k_{p_l} private communities in each layer l. The goal is to simultaneously identify communities that are common across any subset of two or more layers and communities that are unique to each layer. Therefore, H will contain information for all common communities.

2.3.2 Optimization solution

The ONMTF optimization problem in (2.1) can be solved using a multiplicative update algorithm (MUA) [75]. Multiplicative update algorithms for solving NMF problems were introduced in [148], while solving NMTF with orthogonal constraints was first addressed by [75]. In this work, we follow their approach to derive the multiplicative update rules for each variable. To find the update rules for H, H_l, S_l, and G_l, the following Lagrangian function with Lagrange multipliers Λ and Λ_l is minimized:

L(H, H_l, S_l, G_l) = ∑_{l=1}^{L} ‖A_l − H S_l H^⊤ − H_l G_l H_l^⊤‖_F^2 + tr(Λ(H^⊤H − I)) + ∑_{l=1}^{L} tr(Λ_l(H_l^⊤H_l − I)).   (2.2)

For updating H, we find ∇_H L as

∇_H L = ∑_{l=1}^{L} (4 H S_l^⊤ H^⊤ H S_l + 4 H_l G_l^⊤ H_l^⊤ H S_l − 4 A_l H S_l) + 4 H Λ.   (2.3)

Setting ∇_H L = 0 and ∇_Λ L = 0, we obtain:

(i) Λ = ∑_{l=1}^{L} (−S_l^⊤ S_l − H^⊤ H_l G_l^⊤ H_l^⊤ H S_l + H^⊤ A_l H S_l).
(ii) H^⊤H = I.

Substituting (i) and (ii) in Eq. (2.3), we get

∇_H L = ∑_{l=1}^{L} (4 H_l G_l^⊤ H_l^⊤ H S_l − 4 A_l H S_l + 4 H H^⊤ A_l H S_l − 4 H H^⊤ H_l G_l^⊤ H_l^⊤ H S_l).   (2.4)

As discussed in [280], if the gradient of an error function, ε, is of the form ∇ε = ∇ε⁺ − ∇ε⁻, where ∇ε⁺ > 0 and ∇ε⁻ > 0, then the multiplicative update for parameter Θ has the form Θ = Θ ⊙ (∇ε⁻ / ∇ε⁺). It can be easily seen that the multiplicative update preserves the nonnegativity of Θ, while ∇ε = 0 when convergence is achieved. Following this procedure, from the gradient of the error function in Eq. (2.4), we derive the following multiplicative update rule for H:

H ← H ⊙ [∑_{l=1}^{L} (A_l H S_l + H H^⊤ H_l G_l^⊤ H_l^⊤ H S_l)] / [∑_{l=1}^{L} (H_l G_l^⊤ H_l^⊤ H S_l + H H^⊤ A_l H S_l)],   (2.5)

where the multiplication and division are performed element-wise and both the numerator and denominator are positive. Note that the update in Eq. (2.5) satisfies the Karush-Kuhn-Tucker (KKT) complementary slackness condition for nonnegativity of H, (∇_H L)_{ij} H_{ij} = 0, given as

(∑_{l=1}^{L} (4 H_l G_l^⊤ H_l^⊤ H S_l − 4 A_l H S_l + 4 H H^⊤ A_l H S_l − 4 H H^⊤ H_l G_l^⊤ H_l^⊤ H S_l))_{ij} H_{ij} = 0.   (2.6)

This is the fixed point condition that any local minimum H* must satisfy. This shows that if the update rule (2.5) converges, the converged solution is a local minimum of the optimization problem. Similarly, we obtain the following update rules for H_l, S_l, and G_l, for each l ∈ {1, 2, ..., L}:

H_l ← H_l ⊙ (A_l H_l G_l + H_l H_l^⊤ H S_l^⊤ H^⊤ H_l G_l) / (H S_l^⊤ H^⊤ H_l G_l + H_l H_l^⊤ A_l H_l G_l),   (2.7)

S_l ← S_l ⊙ (H^⊤ A_l H) / (H^⊤ H S_l H^⊤ H + H^⊤ H_l G_l H_l^⊤ H),   (2.8)

G_l ← G_l ⊙ (H_l^⊤ A_l H_l) / (H_l^⊤ H_l G_l H_l^⊤ H_l + H_l^⊤ H S_l H^⊤ H_l).   (2.9)

Since NMF algorithms are initialized with random matrices, different runs yield different local minima. For this reason, we run the algorithm 50 times and report the best results [153, 168]. As shown in Algorithm 2.1, for each random initialization of H, H_l, S_l, and G_l, the multiplicative update rules described in Eqs. (2.5)-(2.9) are repeated for 1000 iterations or until convergence. We then select the solution that yields the maximum value of the performance metric across the different runs.
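A sketch of one full round of the updates in Eqs. (2.5) and (2.7)-(2.9), translated literally into NumPy; the sequential update order within the round and the eps guard against division by zero are implementation choices of ours, not part of the derivation:

import numpy as np

def mx_onmtf_round(A_list, H, H_list, S_list, G_list, eps=1e-9):
    # Eq. (2.5): update the common membership matrix H.
    num = sum(A @ H @ S + H @ H.T @ Hl @ G.T @ Hl.T @ H @ S
              for A, Hl, S, G in zip(A_list, H_list, S_list, G_list))
    den = sum(Hl @ G.T @ Hl.T @ H @ S + H @ H.T @ A @ H @ S
              for A, Hl, S, G in zip(A_list, H_list, S_list, G_list))
    H = H * num / (den + eps)
    # Eqs. (2.7)-(2.9): per-layer updates of H_l, S_l, G_l.
    for l, (A, Hl, S, G) in enumerate(zip(A_list, H_list, S_list, G_list)):
        num_l = A @ Hl @ G + Hl @ Hl.T @ H @ S.T @ H.T @ Hl @ G
        den_l = H @ S.T @ H.T @ Hl @ G + Hl @ Hl.T @ A @ Hl @ G
        Hl = Hl * num_l / (den_l + eps)
        H_list[l] = Hl
        S_list[l] = S * (H.T @ A @ H) / (
            H.T @ H @ S @ H.T @ H + H.T @ Hl @ G @ Hl.T @ H + eps)
        G_list[l] = G * (Hl.T @ A @ Hl) / (
            Hl.T @ Hl @ G @ Hl.T @ Hl + Hl.T @ H @ S_list[l] @ H.T @ Hl + eps)
    return H, H_list, S_list, G_list

Because every update multiplies the current iterate by a ratio of nonnegative matrices, all factors remain nonnegative throughout, as required by the constraints in (2.1).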
For synthetic networks for which a ground truth is available, Normalized Mutual Information (NMI) [65] is used. For networks without ground truth, Modularity Density (Q_D) [157] is used as the performance metric.

2.3.3 Number of communities

In most NMF-based community detection algorithms, the number of communities (k) is an input parameter [111]. This problem is usually addressed by detecting communities with different values of k and selecting the value that gives the solution with the best pre-determined performance metric, such as modularity [203]. In this chapter, a two-step approach is proposed to determine the number of communities per layer and the number of common communities. First, the numbers of communities per layer (k_1, k_2, ..., k_L) are found using the eigengap rule [160], which determines the number of communities as the value that maximizes the eigengap, i.e., the difference between consecutive eigenvalues. A suitable null model, e.g., Laplacianized Erdős–Rényi adjacency matrices L^{null} with size and density matching the Laplacian of the network, can be used. The threshold, δ, can be set to the 0.95 quantile of the largest eigengap (see lines 1 to 4 of Algorithm 2.2).

Algorithm 2.1: MX-ONMTF.
Input: Adjacency matrices A_l, l ∈ {1, 2, ..., L}.
Output: Community membership matrices H, H_l.
1: Use Algorithm 2.2 to find k_c, k_l, and k_{p_l}.
2: for r = 1 to 50 do
3:   Randomly initialize H, H_l, S_l, G_l > 0
4:   for 1000 iterations or until convergence do
5:     update H according to Eq. (2.5)
6:     update H_l for each l ∈ {1, 2, ..., L} according to Eq. (2.7)
7:     update S_l for each l ∈ {1, 2, ..., L} according to Eq. (2.8)
8:     update G_l for each l ∈ {1, 2, ..., L} according to Eq. (2.9)
9:   end for
10:  Apply Algorithm 2.3 with A_l, H, and k_{p_l} as inputs to find H_{c_l}.
11:  for each layer l do
12:    for each node i do
13:      j* ← argmax_j H_{c_l}(i, j)
14:      if H_{c_l}(i, j*) > H_l(i, j*) then
15:        idx(i) ← j*
16:      else
17:        idx(i) ← (argmax_j H_l(i, j)) + k_c + ∑_{n=1}^{l−1} k_{p_n}
18:      end if
19:    end for
20:  end for
21:  Compute NMI_r or Q_{D_r}.
22: end for
23: Choose the partition r* = argmax_r NMI_r (r* = argmax_r Q_{D_r}).

Next, ONMTF is applied to each layer [75] and the low-rank embedding matrices, U_l ∈ R^{N×k_l}, are obtained. Once we have the U_l corresponding to each layer l, our goal is to reduce the embedding subspace by finding columns that are similar to each other, i.e., embedding vectors for the common communities. Each element of the embedding matrices, U_l(i, j), represents the likelihood of node i belonging to community j. An agglomerative hierarchical clustering algorithm using Euclidean distance is applied to the columns of X = [U_1, U_2, ..., U_L] ∈ R^{N×m}, where m = ∑_{l=1}^{L} k_l, to obtain the number of common communities. At each step of the algorithm, the two columns with the smallest distance are aggregated, and the distances between the newly formed cluster and the remaining ones are updated. The assumption is that if two or more layers share a common community, the columns of the respective U_l's corresponding to this community will be close to each other. A dendrogram like the one shown in Figure 2.2b can be used to represent the different iterations of this algorithm. The m leaves of the dendrogram correspond to the total number of communities across the L layers.
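One possible reading of the eigengap step (lines 1-4 of Algorithm 2.2, given next) in code form; the null-model calibration details here — the number of random draws and matching only the edge density — are our assumptions, and the function names are illustrative:

import numpy as np

def norm_lap_eigvals(A):
    d = np.maximum(A.sum(axis=1), 1e-12)
    Dis = np.diag(1.0 / np.sqrt(d))
    return np.linalg.eigvalsh(np.eye(len(A)) - Dis @ A @ Dis)

def estimate_num_communities(A, n_null=100, q=0.95, seed=0):
    rng = np.random.default_rng(seed)
    N = len(A)
    p = A.sum() / (N * (N - 1))           # edge density of this layer
    null_gaps = []
    for _ in range(n_null):
        R = np.triu((rng.random((N, N)) < p), 1).astype(float)
        R = R + R.T                        # symmetric Erdos-Renyi draw
        null_gaps.append(np.diff(norm_lap_eigvals(R))[1:].max())
    delta = np.quantile(null_gaps, q)      # 0.95 quantile of largest null eigengaps
    gaps = np.diff(norm_lap_eigvals(A))
    big = np.where(gaps > delta)[0]        # gaps exceeding the null threshold
    return int(big.max()) + 1 if big.size else 1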
Algorithm 2.2: Finding k_l, k_c, and k_{p_l}, for l ∈ {1, 2, ..., L}.
Input: Adjacency matrices A_l for l ∈ {1, 2, ..., L}.
Output: Number of common communities k_c, total number of communities per layer k_l, and number of private communities per layer k_{p_l}, for l ∈ {1, 2, ..., L}.
1: Let L_l^{null} be the normalized Laplacian of the Erdős–Rényi null model.
2: L_l^{null} = V_l Λ_l V_l^⊤
3: δ ← quantile_{0.95}[max{|λ_i^{null}| − |λ_{i+1}^{null}|, i ≥ 2}]
4: k_l ← min{k : |λ_i^l| − |λ_{i+1}^l| > δ, ∀i > k}
5: Randomly initialize U_l ≥ 0, S_l ≥ 0.
6: for 1000 iterations or until convergence do
7:   update U_l using U_l = U_l ⊙ (A_l U_l S_l)_{ij} / (U_l U_l^⊤ A_l U_l S_l)_{ij}
8:   update S_l using S_l = S_l ⊙ (U_l^⊤ A_l U_l)_{ij} / (U_l^⊤ U_l S_l U_l^⊤ U_l)_{ij}
9: end for
10: X = [U_1, U_2, ..., U_L] ∈ R^{N×m}, m = ∑_{l=1}^{L} k_l
11: F ← AgglomerativeHierarchicalClustering(X)
12: k_c = 0
13: for i = 2 to m do
14:   if max{F(i, 1), F(i, 2)} ≤ m then
15:     d_i ← (F(i, 3) − F(i−1, 3)) / F(i−1, 3)
16:     if d_i ≥ 0.5 then
17:       cut ← i and stop for
18:     else
19:       k_c ← k_c + 1
20:     end if
21:   else
22:     k_c ← k_c
23:   end if
24: end for
25: C ← find(1 + ∑_{j=1}^{l−1} k_j ≤ F(1 : cut, 1 : 2) ≤ ∑_{j=1}^{l} k_j)
26: k_{p_l} ← k_l − |C|, for l ∈ {1, 2, ..., L}

This agglomerative hierarchical clustering algorithm outputs a matrix F ∈ R^{(m−1)×3}. The first two columns of F(i, :) correspond to the labels of the two leaves of the dendrogram that form cluster m + i, and the third column contains the distance between these two leaves. The number of clusters resulting from this procedure corresponds to k_c, while the columns of X corresponding to each layer l that are not assigned to any of the clusters correspond to k_{p_l} (see lines 13 to 26 in Algorithm 2.2). The algorithm iterates until the minimum distance between any two clusters increases by more than 50% of the minimum distance from the previous iteration. Figure 2.2 illustrates how the number of communities per layer and the common communities are obtained for an example of a 3-layer network with k_1 = 6, k_2 = 6, and k_3 = 5, and three common communities highlighted in red, green, and purple. For this example, k_c = 3, k_{p_1} = 3, k_{p_2} = 4, and k_{p_3} = 3.

2.3.4 Determining the common community labels for each layer

H ∈ R^{N×k_c} is the community membership matrix corresponding to the common communities. In order to determine whether a node from a particular layer belongs to any of the k_c common communities, H needs to go through some post-processing, as described in Algorithm 2.3. First, for each node i in each layer l, the common community membership matrix, H, and the layer-specific community membership matrix, H_l, are concatenated, and the column j with the maximum entry is identified (see line 3 in Algorithm 2.3). This determines the initial community assignment for that node. Next, we construct a binary common community membership matrix, Z_l, for each layer, where each entry is equal to 1 if a particular node belongs to one of the k_c common communities in that layer. For each layer, we compute the ratio of the average strength within a particular common community to the average strength outside that common community (lines 10-15). Finally, the common communities for that layer are determined as the ones with the top k_l − k_{p_l} ratios (lines 16-18). As shown in Figure 2.1c, the new embedding matrices corresponding to the common communities in each layer, H_{c_1}, H_{c_2}, and H_{c_3}, will only have the columns that contain the information corresponding to the common communities present in that layer. In this example, H_{c_1} keeps the two columns (purple and green communities) from H, while H_{c_2} keeps only the first column (purple), and H_{c_3} only the second column (green).
2.3.5 Time complexity

The time complexity of the proposed algorithm is mostly due to the multiplicative update rules, Eqs. (2.5)-(2.9). The time complexity for the product of two matrices, e.g., the product of an $m \times k$ matrix by a $k \times n$ matrix, is $O(mkn)$. Table 2.1 shows the time complexities of Eqs. (2.5)-(2.9) and the total complexity, with $l \in \{1, 2, \ldots, L\}$.

Algorithm 2.3: Identify the membership of $k_c$ common communities across layers $\{1, 2, \ldots, L\}$.
Input: Community membership matrices $\mathbf{H}$, $\mathbf{H}_l$, adjacency matrices $\mathbf{A}_l$, and numbers of communities $k_{p_l}$, $k_l$, and $k_c$, for each layer $l \in \{1, 2, \ldots, L\}$.
Output: Layer-specific community membership matrix $\mathbf{H}_{c_l}$ containing information about the common communities present in layer $l$.
1: for $l = 1$ to $L$ do
2:   for each node $i$ do
3:     $j^* \leftarrow \arg\max_j ([\mathbf{H}, \mathbf{H}_l])_{ij}$
4:     $idx(i) \leftarrow j^*$
5:   end for
6:   for $t = 1$ to $k_c$ do
7:     $\mathbf{Z}_l(\mathrm{find}(idx == t), t) = 1$
8:     $\mathbf{Z}_l(\mathrm{find}(idx \neq t), t) = 0$
9:   end for
10:  for $t = 1$ to $k_c$ do
11:    $m = |\mathbf{Z}_l(:, t) == 1|$
12:    $\mathbf{B}_0 \leftarrow \mathbf{A}_l \odot (\mathbf{Z}_l(:, t) \mathbf{Z}_l(:, t)^{\top})$
13:    $\mathbf{B}_1 \leftarrow \mathbf{A}_l \odot (\mathbf{1}_{n \times n} - \mathbf{Z}_l(:, t) \mathbf{Z}_l(:, t)^{\top})$
14:    $q(t) \leftarrow \frac{\sum_{v,w} (\mathbf{B}_0)_{vw} / (m(m-1))}{\sum_{v,w} (\mathbf{B}_1)_{vw} / (m(N-m))}$
15:  end for
16:  $[\mathbf{q}_{sorted}, J] \leftarrow \mathrm{sort}(\mathbf{q}, descending)$. $J$ contains the sorted indices of the elements of $\mathbf{q}$.
17:  $pos \leftarrow J(1 : (k_l - k_{p_l}))$
18:  $\mathbf{H}_{c_l} \leftarrow \mathbf{H}(:, [pos])$
19: end for

Table 2.1: Computational complexity of updating each variable per iteration.

| Update | Complexity |
| $\mathbf{H}$ (Eq. (2.5)) | $O(N^2 \max\{k_c, k_{p_1}, \cdots, k_{p_L}\})$ |
| $\mathbf{H}_l$ (Eq. (2.7)) | $O(N^2 \max\{k_c, k_{p_l}\})$ |
| $\mathbf{S}_l$ (Eq. (2.8)) | $O(N^2 \max\{k_c, k_{p_l}\})$ |
| $\mathbf{G}_l$ (Eq. (2.9)) | $O(N^2 \max\{k_c, k_{p_l}\})$ |
| Total | $O(N^2 \max\{k_c, k_{p_1}, \cdots, k_{p_L}\})$ |

2.3.6 Storage complexity

The storage complexity of our algorithm is determined by the sizes of the matrices $\mathbf{H}$, $\mathbf{H}_l$, $\mathbf{S}_l$, and $\mathbf{G}_l$, summarized in Table 2.2. It can be seen that the total storage complexity is $O(N \max\{k_c, k_{p_1}, \cdots, k_{p_L}\})$. For a multiplex network of size $N \times N \times L$, this is a significant reduction in memory cost.

Table 2.2: Storage complexity of each variable.

| Variable | Complexity |
| $\mathbf{H}$ | $O(N k_c)$ |
| $\mathbf{H}_l$ | $O(N k_{p_l})$ |
| $\mathbf{S}_l$ | $O(k_c^2)$ |
| $\mathbf{G}_l$ | $O(k_{p_l}^2)$ |
| Total | $O(N \max\{k_c, k_{p_1}, \cdots, k_{p_L}\})$ |

2.4 Convergence Analysis

In this section, we will prove the convergence of the multiplicative update rule defined by Eq. (2.5) using the auxiliary function approach. As the other update rules are similar, we will not explicitly prove their convergence. We first introduce the definition of an auxiliary function as follows.

Definition 1: A function $Z(\mathbf{H}, \mathbf{H}^t)$ is called an auxiliary function of $\mathcal{L}(\mathbf{H})$ if it satisfies $Z(\mathbf{H}, \mathbf{H}^t) \geq \mathcal{L}(\mathbf{H})$ and $Z(\mathbf{H}, \mathbf{H}) = \mathcal{L}(\mathbf{H})$.

The auxiliary function is a useful concept because of the following lemma, which is proved in [148].

Lemma 1. If $Z$ is an auxiliary function, then $\mathcal{L}$ is non-increasing under the update
$$\mathbf{H}^{t+1} = \underset{\mathbf{H}}{\arg\min}\ Z(\mathbf{H}, \mathbf{H}^t).$$
Theorem 1. Given $\mathbf{H}_l$, $\mathbf{S}_l$, and $\mathbf{G}_l$, the Lagrangian function $\mathcal{L}(\mathbf{H})$ is monotonically decreasing under the update rule (2.5).

Proof. For convenience, let $\mathcal{L}(h)$ denote the part of $\mathcal{L}(\mathbf{H})$ dependent on $H_{ij}$. From Eq. (2.4) we have
$$\mathcal{L}'(h) = \sum_{l=1}^{L} \left(4\mathbf{H}_l\mathbf{G}_l^{\top}\mathbf{H}_l^{\top}\mathbf{H}\mathbf{S}_l - 4\mathbf{A}_l\mathbf{H}\mathbf{S}_l + 4\mathbf{H}\mathbf{H}^{\top}\mathbf{A}_l\mathbf{H}\mathbf{S}_l - 4\mathbf{H}\mathbf{H}^{\top}\mathbf{H}_l\mathbf{G}_l^{\top}\mathbf{H}_l^{\top}\mathbf{H}\mathbf{S}_l\right)_{ij}.$$
The second-order derivative of $\mathcal{L}(h)$ with respect to $h_{ij}$ is
$$\mathcal{L}''(h) = \sum_{l=1}^{L} \Big\{ 4(\mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^{\top})_{ii}\mathbf{S}_{l_{jj}} - 4\mathbf{A}_{l_{ii}}\mathbf{S}_{l_{jj}} + 4\big[(\mathbf{H}^{\top}\mathbf{A}_l\mathbf{H}\mathbf{S}_l)_{ij} + h_{ij}(\mathbf{A}_l\mathbf{H}\mathbf{S}_l)_{ij} + (\mathbf{H}\mathbf{H}^{\top}\mathbf{A}_l)_{ii}\mathbf{S}_{l_{jj}}\big] - 4\big[(\mathbf{H}^{\top}\mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^{\top}\mathbf{H}\mathbf{S}_l)_{ij} + h_{ij}(\mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^{\top}\mathbf{H}\mathbf{S}_l)_{ij} + (\mathbf{H}\mathbf{H}^{\top}\mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^{\top})_{ii}\mathbf{S}_{l_{jj}}\big] \Big\}.$$
Let $h^t_{ij}$ denote the updated value of $h_{ij}$ after the $t$th iteration; then the Taylor series expansion of $\mathcal{L}(h)$ at $h^t_{ij}$ can be written as
$$\mathcal{L}(h) = \mathcal{L}(h^t_{ij}) + \mathcal{L}'(h^t_{ij})(h - h^t_{ij}) + \frac{1}{2}\mathcal{L}''(h^t_{ij})(h - h^t_{ij})^2.$$
Now, the key is to find an appropriate auxiliary function $Z(h, h^t_{ij})$. We choose the following $Z(h, h^t_{ij})$ and prove in Appendix A that it satisfies the conditions to be an auxiliary function of $\mathcal{L}(h)$:
$$Z(h, h^t_{ij}) = \mathcal{L}(h^t_{ij}) + 3\mathcal{L}'(h^t_{ij})(h - h^t_{ij}) + \sum_{l=1}^{L} \frac{3}{2} \frac{\left(4\mathbf{H}^t_l\mathbf{G}_l\mathbf{H}_l^{t\top}\mathbf{H}^t\mathbf{S}_l + 4\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l\right)_{ij}}{h^t_{ij}}(h - h^t_{ij})^2. \quad (2.10)$$
According to Lemma 1, we must find the minimum of $Z(h, h^t_{ij})$ with respect to $h$:
$$\frac{\partial Z(h, h^t_{ij})}{\partial h} = 3\mathcal{L}'(h^t_{ij}) + 3\sum_{l=1}^{L}\frac{(4\mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^{\top}\mathbf{H}^t\mathbf{S}_l)_{ij}}{h^t_{ij}}(h - h^t_{ij}) + 3\sum_{l=1}^{L}\frac{(4\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l)_{ij}}{h^t_{ij}}(h - h^t_{ij}) = 0.$$
Replacing $\mathcal{L}'(h^t_{ij})$ in the equation above and canceling the common terms, we obtain
$$\sum_{l=1}^{L}\left(-4\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l - 4\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{H}_l\mathbf{G}_l^{\top}\mathbf{H}_l^{\top}\mathbf{H}^t\mathbf{S}_l\right)_{ij} + \sum_{l=1}^{L}\left(4\mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^{\top}\mathbf{H}^t\mathbf{S}_l + 4\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l\right)_{ij}\frac{h}{h^t_{ij}} = 0.$$
Replacing $h$ by $h^{t+1}_{ij}$, we obtain the following update rule
$$h^{t+1}_{ij} = h^t_{ij}\, \frac{\sum_{l=1}^{L}\left(\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l + \mathbf{H}^t\mathbf{H}^{t\top}\mathbf{H}_l\mathbf{G}_l^{\top}\mathbf{H}_l^{\top}\mathbf{H}^t\mathbf{S}_l\right)_{ij}}{\sum_{l=1}^{L}\left(\mathbf{H}_l\mathbf{G}_l^{\top}\mathbf{H}_l^{\top}\mathbf{H}^t\mathbf{S}_l + \mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l\right)_{ij}},$$
which is the same as the update rule shown in Eq. (2.5).
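The derived rule updates every entry of $\mathbf{H}$ with a ratio of matrix products. A minimal numpy sketch of one such update is given below, assuming the layer factors are held fixed; the small constant added to the denominator is our own numerical guard and is not part of the derivation.

```python
import numpy as np

EPS = 1e-12  # our guard against division by zero in the multiplicative ratio

def update_H(H, layers):
    """One multiplicative update of the common membership matrix H,
    following the rule derived above for Eq. (2.5).

    layers: list of (A_l, H_l, S_l, G_l) tuples for l = 1..L.
    """
    num = np.zeros_like(H)
    den = np.zeros_like(H)
    for A, Hl, Sl, Gl in layers:
        HS = H @ Sl                       # H^t S_l
        HGH = Hl @ Gl.T @ Hl.T            # H_l G_l^T H_l^T
        num += A @ HS + H @ (H.T @ (HGH @ HS))
        den += HGH @ HS + H @ (H.T @ (A @ HS))
    return H * num / (den + EPS)
```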
2.5 Recovery and Consistency Analysis

In this section, we will establish the theoretical properties of the proposed community detection method. In particular, we investigate the recovery guarantees of the proposed objective function and the consistency of the algorithm as $N$ and $L$ increase under the MLSBM.

2.5.1 Recovery Guarantees

In this section, we will investigate the recovery guarantees of the global optimizer of the objective function under the MLSBM when there is no noise. The optimization problem in (2.1) can be rewritten as
$$\underset{\mathbf{H}'_l, \mathbf{F}_l}{\arg\min}\ \sum_{l=1}^{L} \|\mathbf{A}_l - \mathbf{H}'_l\mathbf{F}_l\mathbf{H}'^{\top}_l\|^2_F \quad \text{s.t.}\ \mathbf{H}'^{\top}_l\mathbf{H}'_l = \mathbf{I}, \quad (2.11)$$
where $\mathbf{F}_l$ is a block matrix defined as
$$\mathbf{F}_l = \begin{bmatrix} \mathbf{S}_l & \mathbf{0} \\ \mathbf{0} & \mathbf{G}_l \end{bmatrix}$$
and $\mathbf{H}'_l = [\mathbf{H} | \mathbf{H}_l]$ is the concatenation of the community membership matrices of the common and private communities, $\mathbf{H} \in \mathbb{R}^{N \times k_c}$ and $\mathbf{H}_l \in \mathbb{R}^{N \times k_{p_l}}$, respectively.

For the $N \times N \times L$ adjacency tensor $\mathcal{A} = \{\mathbf{A}_1, \ldots, \mathbf{A}_L\}$, we can define a multiplex SBM $[\mathcal{Z}, \boldsymbol{\Theta}]$ as in [8], with each of the $L$ slices $\mathbf{A}_l \in \mathbb{R}^{N \times N}$. The multiplex SBM with parameters $[\mathcal{Z} = \{\mathbf{Z}_1, \ldots, \mathbf{Z}_L\}, \mathcal{B} = \{\boldsymbol{\Theta}_1, \ldots, \boldsymbol{\Theta}_L\}]$ can be written in matrix form as
$$\mathbb{E}(\mathbf{A}_l) = \mathcal{A}_l = \mathbf{Z}_l\boldsymbol{\Theta}_l\mathbf{Z}_l^{\top},$$
with $\boldsymbol{\Theta}_l \in [0, 1]^{(k_c + k_{p_l}) \times (k_c + k_{p_l})}$ and $\mathbf{Z}_l \in \{0, 1\}^{N \times (k_c + k_{p_l})}$ for each layer $l$. $\boldsymbol{\Theta}_l$ is a block matrix defined as
$$\boldsymbol{\Theta}_l = \begin{bmatrix} \boldsymbol{\Theta}_c & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Theta}_{p_l} \end{bmatrix},$$
with $\boldsymbol{\Theta}_c$ and $\boldsymbol{\Theta}_{p_l}$ being the affinity probability matrices of the common and private communities, respectively. $\mathbf{Z}_l = [\mathbf{Z}_c | \mathbf{Z}_{p_l}]$ is the concatenation of the community membership matrices of the common and private communities, $\mathbf{Z}_c \in \mathbb{R}^{N \times k_c}$ and $\mathbf{Z}_{p_l} \in \mathbb{R}^{N \times k_{p_l}}$, respectively. $\mathcal{A}_l$ is the population adjacency matrix for the $l$-th layer, and the tensor $\mathcal{A} = \{\mathcal{A}_1, \cdots, \mathcal{A}_L\}$ is the population adjacency tensor.

To prove that our method can correctly recover the community assignments, we propose the following lemma, following the work in [197].

Lemma 2. The optimization problem in (2.11) applied to the population tensor $\mathcal{A}$ has $\mathbf{H}'_l = \mathbf{Z}_l(\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{-1/2}$ and $\mathbf{F}_l = (\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{1/2}\boldsymbol{\Theta}_l(\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{1/2}$, $l = 1, \ldots, L$, as the unique solution up to an orthogonal matrix, provided at least one of the $\boldsymbol{\Theta}_l$ is full rank.

Proof. To prove Lemma 2, we can show that $\mathbf{H}'_l = \mathbf{Z}_l(\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{-1/2}$, $\mathbf{F}_l = (\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{1/2}\boldsymbol{\Theta}_l(\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{1/2}$, $l = 1, \cdots, L$, is a solution to the optimization problem in (2.11) applied to the population tensor $\mathcal{A}$. Substituting the solution into (2.11), we have
$$\sum_{l=1}^{L}\|\mathcal{A}_l - \mathbf{Z}_l(\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{-1/2}(\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{1/2}\boldsymbol{\Theta}_l(\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{1/2}(\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{-1/2}\mathbf{Z}_l^{\top}\|^2_F = \sum_{l=1}^{L}\|\mathcal{A}_l - \mathbf{Z}_l\boldsymbol{\Theta}_l\mathbf{Z}_l^{\top}\|^2_F$$
and, since $\mathcal{A}_l = \mathbb{E}(\mathbf{A}_l) = \mathbf{Z}_l\boldsymbol{\Theta}_l\mathbf{Z}_l^{\top}$, the value of this minimization objective function is 0, and
$$\mathbf{H}'^{\top}_l\mathbf{H}'_l = (\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{-1/2}\mathbf{Z}_l^{\top}\mathbf{Z}_l(\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{-1/2} = \mathbf{I}.$$
Now, we need to show the uniqueness of this solution. By assumption, at least one of the $\boldsymbol{\Theta}_l$ is full rank. For a non-singular matrix $\mathbf{P} \in \mathbb{R}^{(k_c + k_{p_l}) \times (k_c + k_{p_l})}$, we can say that $\mathbf{H}'_l\mathbf{P}$ together with $\mathbf{P}^{-1}\boldsymbol{\Theta}_l\mathbf{P}^{-1\top}$ is also a solution. Due to the orthogonality constraint, we must have $(\mathbf{H}'_l\mathbf{P})^{\top}\mathbf{H}'_l\mathbf{P} = \mathbf{P}^{\top}(\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{-1/2}\mathbf{Z}_l^{\top}\mathbf{Z}_l(\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{-1/2}\mathbf{P} = \mathbf{I}$, which implies $\mathbf{P}^{\top}\mathbf{P} = \mathbf{I}$, and therefore the solution is unique up to an orthogonal matrix. Moreover, since $\mathbf{Q}^{-1/2} = (\mathbf{Z}_l^{\top}\mathbf{Z}_l)^{-1/2}$ is a diagonal matrix with positive elements and therefore invertible, we have that $\mathbf{Z}_i\mathbf{Q}^{-1/2} = \mathbf{Z}_j\mathbf{Q}^{-1/2}$ implies $\mathbf{Z}_i = \mathbf{Z}_j$.

2.5.2 Consistency

In this section, we investigate the asymptotic consistency of our method following [197]. The first step in proving consistency is to show that it is possible to recover the communities by maximizing the population version of the objective function, which was proven in the previous section. We consider the following asymptotic setup, where we let $N$ and $L$ grow and assume no relationship exists between their growth rates. We also let the number of communities per layer, $k_l$, grow with both $N$ and $L$. The following results are proven for a multiplex network with $L$ layers and the $N \times N \times L$ population adjacency tensor $\mathcal{A} = \{\mathcal{A}_1, \ldots, \mathcal{A}_L\}$ described in the previous section. $\lambda_l$ denotes the minimum (in absolute value) nonzero eigenvalue of the $l$-th layer population adjacency matrix, and $\Delta_l$ the maximum expected degree per layer, with $\bar{\Delta} = \frac{1}{L}\sum_{l=1}^{L}\Delta_l$ and $\bar{\Delta}' = \frac{1}{L}\sum_{l=1}^{L}\Delta^2_l$.

Theorem 2. Let $[(\hat{\mathbf{H}}'_1, \cdots, \hat{\mathbf{H}}'_L), (\hat{\mathbf{F}}_1, \cdots, \hat{\mathbf{F}}_L)]$ be the solution that minimizes the MX-ONMTF objective function in (2.11) applied to the adjacency tensor $\mathcal{A}$. Let $r_{MX}$ denote the fraction of misclustered nodes. Assume that $\Delta_l > \frac{4}{9}\log(2N/\epsilon)$, and that at least one of the $\boldsymbol{\Theta}_l$'s is of full rank; then with probability at least $1 - \epsilon$,
$$r_{MX} \leq \frac{96\, N_{max}\, k_{max}\, L^{1/4}\, \bar{\Delta}^{5/4} (\log(2N/\epsilon))^{1/2}}{N\, \frac{1}{L}\sum_{l=1}^{L}(\lambda_l)^2}.$$
Here, a four-parameter MLSBM defined by $\mathbf{p} = \{p_1, \cdots, p_L\}$, $\mathbf{q} = \{q_1, \cdots, q_L\}$, $k_{max}$, $N_{max}$ is considered. $p_l$ and $q_l$ are the connection probabilities within and between communities, respectively. It is assumed that $p_l \neq q_l$ but that they are of the same asymptotic order with respect to $N$, for all $l$. $k_{max} = \max\{k_1, k_2, \cdots, k_L\}$, and $N_{max}$ denotes the number of nodes in the largest true community, with $N_{max} \asymp N/k_{max}$. For this MLSBM, $\lambda_l = N_{max}(p_l - q_l)$. Let $a_l\frac{\Delta_l}{N} = p_l$ and $b_l\frac{\Delta_l}{N} = q_l$.
Then $\lambda_l = \frac{\Delta_l}{k_{max}}(a_l - b_l)$, and $a_l \asymp b_l \asymp 1$. Define $f(\mathbf{a}, \mathbf{b}) = \frac{1}{L}\sum_{l=1}^{L}(a_l - b_l)^2$. Under the four-parameter MLSBM, with the $\Delta_l$'s all being of the same order, $\Delta_l \asymp \bar{\Delta}$ and $\bar{\Delta}' \asymp \bar{\Delta}^2$, the bound in Theorem 2 can be simplified to
$$r_{MX} \lesssim \frac{L^{1/4}\bar{\Delta}^{5/4}(\log(2N/\epsilon))^{1/2}}{\frac{1}{L}\sum_{l=1}^{L}\frac{\Delta^2_l(a_l - b_l)^2}{k^2_{max}}} \asymp \frac{L^{1/4}k^2_{max}\bar{\Delta}^{5/4}(\log(2N/\epsilon))^{1/2}}{\bar{\Delta}^2 f(\mathbf{a}, \mathbf{b})} \asymp \frac{L^{1/4}k^2_{max}\bar{\Delta}^{5/4}(\log(2N/\epsilon))^{1/2}}{\bar{\Delta}' f(\mathbf{a}, \mathbf{b})} \asymp \frac{L^{1/4}k^2_{max}(\log(2N/\epsilon))^{1/2}}{\bar{\Delta}^{3/4} f(\mathbf{a}, \mathbf{b})}.$$
In the dense case where $\bar{\Delta} \asymp N$, we have
$$r_{MX} \lesssim \frac{L^{1/4}k^2_{max}(\log(2N/\epsilon))^{1/2}}{\bar{\Delta}^{3/4} f(\mathbf{a}, \mathbf{b})} \asymp \frac{k^2_{max}}{L^{-1/4}N^{3/4} f(\mathbf{a}, \mathbf{b})(\log(2N/\epsilon))^{-1/2}}$$
and
$$k_{max} = O\left((N^3 L)^{1/8}(f(\mathbf{a}, \mathbf{b}))^{1/2}(\log(2N/\epsilon))^{-1/4}\right).$$
Hence, consistent estimation is possible as long as $k_{max}$ grows more slowly than $(N^3 L)^{1/8}$. The proof of Theorem 2 is provided in Appendix B.

2.6 Experiments: Simulated Data and Real-world Networks

2.6.1 Synthetic Multiplex Networks

Multiplex benchmark networks based on the model described in [16, 124] were generated. The authors in [16] propose a two-step approach to generate multilayer networks with a community structure. First, a multilayer partition is generated with a user-defined number of nodes in each layer, a number of layers, and an interlayer dependency tensor that specifies the desired dependency structure between layers. Next, for the given multilayer partition, edges in each layer are generated following a degree-corrected block model [136] parameterized by the distribution of expected degrees and a community mixing parameter $\mu \in [0, 1]$. The mixing parameter $\mu$ controls the modularity of the network. When $\mu = 0$, all edges lie within communities, whereas $\mu = 1$ implies that edges are distributed independently. For multiplex networks, the probabilities in the interlayer dependency tensor are the same for all pairs of layers and are specified by $p \in [0, 1]$. When $p = 0$, the partitions are independent across layers, while $p = 1$ indicates an identical partition across layers.

In this work, we extend the model described above to generate multiplex benchmark networks with common and private communities. We first generate the common communities by randomly selecting $N_c$ nodes across all layers and setting the inter-layer dependency probability to $p_1$. For each common community, we decide whether it exists in a particular layer or not. Next, we independently generate the private communities for each layer with the remaining nodes in that layer. We generated 100 different random realizations of each multiplex network in order to report the average performance metric in the experiments.
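For illustration, a simplified SBM-style generator with common and private communities can be sketched as follows. This is not the degree-corrected benchmark of [16, 124]; the edge probabilities and the way $\mu$ enters are our own illustrative assumptions.

```python
import numpy as np

def multiplex_benchmark(N=256, L=3, k_c=2, k_p=3, N_c=100, mu=0.1, seed=0):
    """Simplified sketch of a multiplex benchmark with common and private
    communities (an SBM-style stand-in for the generator of [16, 124]).
    mu plays the role of the mixing parameter: larger mu moves edge mass
    from within-community to between-community pairs."""
    rng = np.random.default_rng(seed)
    common_nodes = rng.choice(N, size=N_c, replace=False)
    common_labels = rng.integers(0, k_c, size=N_c)
    adjacency = []
    for _ in range(L):
        labels = np.empty(N, dtype=int)
        labels[common_nodes] = common_labels              # shared partition
        private = np.setdiff1d(np.arange(N), common_nodes)
        labels[private] = k_c + rng.integers(0, k_p, size=private.size)
        p_in, p_out = 0.3 * (1 - mu), 0.3 * mu            # illustrative rates
        same = labels[:, None] == labels[None, :]
        P = np.where(same, p_in, p_out)
        A = (rng.random((N, N)) < P).astype(float)
        A = np.triu(A, 1); A = A + A.T                    # undirected, no loops
        adjacency.append(A)
    return adjacency
```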
Evaluation: We compared the performance of our method to well-known multiplex community detection algorithms. In particular, we compared with ONMTF applied to the aggregated multiplex network using the average of the adjacency matrices (Aggregated Average), Spectral Clustering on Multi-Layer graphs (SC-ML) [77], the Generalized Louvain (GenLouvain) multilayer community detection algorithm [130, 182], Infomap [69], Collective Symmetric Nonnegative Matrix Factorization (CSNMF) [101], Collective Projective Nonnegative Matrix Factorization (CPNMF) [101], Collective Symmetric Nonnegative Matrix Tri-factorization (CSNMTF) [101], and Orthogonal Link Matrix Factorization (OLMF) [244].

Experiment 1: In this experiment, we generated two different types of networks, one where the common communities are present across all layers and another where the common communities are present in different subsets of layers. Figure 2.3 illustrates a single realization of the adjacency matrices generated with $\mu = 0.1$ for two 5-layer networks, one of each type (first row and second row). The network in Figures 2.3a-2.3e has two common communities (the first two communities in each layer) across all layers and $k_{p_1} = 4$, $k_{p_2} = 4$, $k_{p_3} = 3$, $k_{p_4} = 2$, and $k_{p_5} = 2$, while the network in Figures 2.3f-2.3j has a total of 3 common communities (highlighted in red) that are present in different subsets of layers and $k_{p_1} = 3$, $k_{p_2} = 4$, $k_{p_3} = 3$, $k_{p_4} = 3$, and $k_{p_5} = 3$. In order to evaluate the performance of our algorithm at different noise levels, these two types of networks were generated with varying values of $\mu \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8\}$. The inter-layer dependency probability for the common communities is $p_1 = 1$.

Figure 2.3: Example illustrating a single realization of two 5-layer networks generated with $\mu = 0.1$. (a)-(e) Adjacency matrices for a network with two common communities (highlighted in red) across all layers; (f)-(j) Adjacency matrices for a network with three common communities (highlighted in red) across different subsets of layers.

Figure 2.4 shows the results for the networks with 2 common communities across all layers for 3 (2.4a), 4 (2.4b), and 5 layers (2.4c), and for the networks with 3 common communities across different subsets of layers for 3 (2.4d), 4 (2.4e), and 5 layers (2.4f).

Figure 2.4: Mean NMI over 100 realizations of (a)-(c) 3-layer, 4-layer, and 5-layer benchmark networks, respectively, for the scenario with 2 common communities across all layers; (d)-(f) 3-layer, 4-layer, and 5-layer benchmark networks, respectively, for the scenario with 3 common communities across different subsets of layers. All networks are generated with 8 different values of the mixing parameter $\mu$ and $N = 256$.

The results indicate that our method performs well both for networks with common communities across all layers and for networks with common communities that do not span all layers. Our method discovers the complete structure of the network rather than forcing it to have a consensus partition. Moreover, our method is more robust to noise at larger values of $\mu$ compared to the other methods. We can also conclude that our algorithm performs better when the common communities span all layers (see Figures 2.4a, 2.4b, 2.4c) than when the multiplex community structure is more complex, with common communities across subsets of layers (see Figures 2.4d, 2.4e, 2.4f), but it still outperforms the rest of the methods. From Figure 2.4 we can see that GenLouvain performs well when $\mu$ is small, but its performance deteriorates for $\mu$ values above 0.6. Another observation is that when the number of layers is small, the NMF-based methods perform close to GenLouvain, but when the number of layers increases, the NMF algorithms perform worse. This is because these NMF methods perform aggregation on either the adjacency or the community indicator matrices. When there is more variation across layers, these methods fail to capture this heterogeneity.
Experiment 2: In the second experiment, we evaluated the robustness of the algorithm against variations in the common community structure by fixing $\mu = 0.1$ and varying the inter-layer dependency probability, $p_1$, i.e., the common communities are allowed to vary across layers. The performance of all methods for a 5-layer network is reported in Figure 2.5, based on the average NMI over 100 realizations of the network. As we can see in Figure 2.5, our method still outperforms the other eight methods when there is some variation in the common community structure. This demonstrates that our method is robust to variations of the common community structure across layers. However, our algorithm is more sensitive to the drop in $p_1$ than the rest of the methods. This is because when the common communities have high variation, our algorithm may try to assign some of those nodes to the private communities.

Figure 2.5: 5-layer network generated with 6 different values of the interlayer dependency probability $p_1$, with $\mu = 0.1$ and $N = 256$.

Experiment 3: Another parameter in our model is the number of common communities, $k_c$. In this experiment, we fixed $N = 256$, $\mu = 0.3$, and the number of communities in each layer, and varied $k_c$ from 1 to 7. When $k_c = 7$, all communities are common across layers and there are no private communities. As we can see in Figure 2.6, as $k_c$ is increased, the performance of all the other methods improves, as expected, because these methods are designed to detect the common community structure. When all communities are common across layers, most of the methods, except Infomap, converge to the same NMI value. The performance of MX-ONMTF is not affected by increasing $k_c$, and when all communities are common it performs similarly to the other methods.

Figure 2.6: 5-layer network generated with different numbers of common communities $k_c$ across layers.

Scalability Analysis: In this experiment, we evaluate the effect of network size on the run time of the proposed algorithm. For this purpose, we fixed $\mu = 0.3$, $L = 5$, $k_c = 3$ and varied $N$ from 32 to 8192. From Figure 2.7, it can be seen that our method's run time is almost log-linear. This is comparable with all the other NMF-based methods; however, as shown in the previous experiments, our method performs better. Most of this time complexity is due to the multiplicative update rules used in NMF-based algorithms and can be reduced using alternative approaches, as discussed in [59].

Figure 2.7: 5-layer network with 9 different values of $N$, $k_c = 3$, and $\mu = 0.3$.

2.6.2 Ablation Study

In this section, we consider the importance of the different variables and constraints in our method, i.e., the orthogonality constraint and the tri-factorization, by modifying our cost function and its constraints.

• MX-NMTF: This is equivalent to our problem without the orthogonality constraints. The modified problem formulation is
$$\underset{\mathbf{H} \geq 0, \mathbf{H}_l \geq 0, \mathbf{S}_l \geq 0, \mathbf{G}_l \geq 0}{\arg\min}\ \sum_{l=1}^{L}\|\mathbf{A}_l - \mathbf{H}\mathbf{S}_l\mathbf{H}^{\top} - \mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^{\top}\|^2_F.$$

• MX-ONMF: This is equivalent to symmetric NMF without the tri-factorization, with the orthogonality constraints preserved:
$$\underset{\mathbf{H} \geq 0, \mathbf{H}_l \geq 0}{\arg\min}\ \sum_{l=1}^{L}\|\mathbf{A}_l - \mathbf{H}\mathbf{H}^{\top} - \mathbf{H}_l\mathbf{H}_l^{\top}\|^2_F, \quad \text{s.t.}\ \mathbf{H}^{\top}\mathbf{H} = \mathbf{I},\ \mathbf{H}_l^{\top}\mathbf{H}_l = \mathbf{I}.$$

• MX-NMF: This is equivalent to our method without the orthogonality constraints and the tri-factorization:
$$\underset{\mathbf{H} \geq 0, \mathbf{H}_l \geq 0}{\arg\min}\ \sum_{l=1}^{L}\|\mathbf{A}_l - \mathbf{H}\mathbf{H}^{\top} - \mathbf{H}_l\mathbf{H}_l^{\top}\|^2_F.$$

Table 2.3 shows the NMI results for MX-ONMTF and the different variations presented above for the multiplex network with five layers and three common communities across different subsets of layers used in Experiment 1. MX-ONMTF, which uses both orthogonality and tri-factorization, performs better than the other variations. It can also be concluded that orthogonality improves the results with respect to regular NMF more than adding only tri-factorization.

Table 2.3: Effect of orthogonality and tri-factorization in the proposed framework.
| $\mu$ | MX-ONMTF | MX-NMTF | MX-ONMF | MX-NMF |
| 0.1 | 0.9854 | 0.5391 | 0.9326 | 0.8384 |
| 0.2 | 0.9717 | 0.5466 | 0.9236 | 0.8484 |
| 0.3 | 0.9469 | 0.5435 | 0.9160 | 0.8298 |
| 0.4 | 0.9489 | 0.5357 | 0.9080 | 0.8114 |
| 0.5 | 0.8582 | 0.5202 | 0.8784 | 0.7835 |
| 0.6 | 0.8110 | 0.5084 | 0.8157 | 0.7404 |
| 0.7 | 0.6952 | 0.4868 | 0.6960 | 0.6394 |
| 0.8 | 0.6158 | 0.4750 | 0.6180 | 0.5705 |

2.6.3 Real-World Multiplex Networks

Lazega Law Firm Multiplex Social Network: Lazega Law Firm [147] is a multiplex social network with 71 nodes and three layers representing co-work, friendship, and advice relationships between partners and associates of a corporate law firm. This data set also includes information about attributes of each node, such as status, gender, office location, years with the firm, age, type of practice, and law school. Applying MX-ONMTF to this network, we obtain one common community across all layers, composed of the nodes colored in red, as well as private communities for each layer, as shown in Figure 2.8. This network does not have a ground truth community structure, but we can compute the NMI between the detected community structure and each type of node attribute, i.e., metadata, to gain better insight into the results and to be able to provide quantitative results [193]. For each of the attributes, the nodes are divided into communities based on that particular attribute. For example, for status, the network is divided into two communities, partners and associates. For age and seniority, the nodes were grouped into five-year bins. The community structure for each attribute is used as ground truth to compute the NMI between that attribute and the community structure detected by our method. The NMI values given in Table 2.4 for the partition obtained by our method suggest that office location and type of practice (litigation or corporate) are highly correlated with community membership across the co-work, friendship, and advice relationships. We can also see that the partition detected by MX-ONMTF has greater NMI values for each attribute. Therefore, our method detects a community structure that takes all of the attributes into account, instead of partitioning with respect to just one attribute, as the Aggregated Average does.

Figure 2.8: Communities detected in the Lazega Law Firm network across the three layers: (a) advice, (b) friendship, and (c) co-work relationships. Red nodes are in the common community across the three layers.

Table 2.4: NMI of the obtained community partition for each method and the metadata available for the Lazega Law Firm multiplex network.

| Method | Status | Gender | Office | Seniority | Age | Practice | Law School |
| GenLouvain | 0.0345 | 0.0307 | 0.5294 | 0.0807 | 0.0431 | 0.5468 | 0.0040 |
| Aggregated Average | 0.0383 | 0.0197 | 0.5379 | 0.1307 | 0.0798 | 0.4411 | 0.0201 |
| SC-ML | 0.0138 | 0.0259 | 0.0731 | 0.1225 | 0.0464 | 0.0249 | 0.0140 |
| Infomap | 0.0179 | 0.0043 | 0.1668 | 0.2880 | 0.0083 | 0.0003 | 0.0093 |
| CSNMF | 0.0418 | 0.0291 | 0.5732 | 0.1155 | 0.0736 | 0.4227 | 0.0172 |
| CPNMF | 0.0081 | 0.0524 | 0.1139 | 0.0514 | 0.0291 | 0.0187 | 0.0221 |
| CSNMTF | 0.0395 | 0.0217 | 0.0798 | 0.0795 | 0.0487 | 0.1335 | 0.0279 |
| OLMF | 0.0534 | 0.0296 | 0.4776 | 0.1023 | 0.0401 | 0.1665 | 0.0163 |
| MX-ONMTF | 0.4752 | 0.4906 | 0.7386 | 0.4135 | 0.4203 | 0.6162 | 0.4226 |
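The attribute-based evaluation described above can be sketched as follows, assuming scikit-learn is available; the binning of continuous attributes into five-year groups mirrors the treatment of age and seniority, while the helper name and input format are ours.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def nmi_against_metadata(labels, attributes, binned=("age", "seniority")):
    """NMI between a detected partition and attribute-induced partitions.

    labels: detected community label per node.
    attributes: dict mapping attribute name -> value per node; continuous
    attributes such as age/seniority are grouped into five-year bins.
    """
    scores = {}
    for name, values in attributes.items():
        values = np.asarray(values)
        if name in binned:
            groups = (values // 5).astype(int)        # five-year bins
        else:
            # map categorical values (e.g., office, practice) to integers
            _, groups = np.unique(values, return_inverse=True)
        scores[name] = normalized_mutual_info_score(labels, groups)
    return scores
```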
C. Elegans Network: The C. Elegans network [68, 48] is a multiplex network with 279 nodes and 3 layers representing different synaptic junctions (electric, chemical monadic, and polyadic) of 279 neurons of the Caenorhabditis elegans connectome. Information is available about different attributes of the neurons in this dataset, such as the group of neurons they belong to (bodywall, mechanosensory, ring interneurons, head motor neurons, etc.), the type of neuron (motor neurons, sensory neurons, interneurons), and the color (blue, red, yellow, orange, etc.).

Table 2.5 shows the NMI values between the community structures detected by each method and each of the three attributes available for this dataset. The partition detected by MX-ONMTF has greater NMI values for each of the attributes compared to the other eight methods.

Table 2.5: NMI of the obtained community partition for each method and the metadata available for the C. Elegans network.

| Method | Neuron Group | Neuron Type | Color |
| GenLouvain | 0.3756 | 0.1297 | 0.2977 |
| Aggregated Average | 0.3839 | 0.1590 | 0.2690 |
| SC-ML | 0.0185 | 0.0103 | 0.2355 |
| Infomap | 0.2265 | 0.2345 | 0.1211 |
| CSNMF | 0.1635 | 0.075 | 0.0628 |
| CPNMF | 0.0854 | 0.0277 | 0.0914 |
| CSNMTF | 0.1113 | 0.0402 | 0.2669 |
| OLMF | 0.3288 | 0.2149 | 0.4001 |
| MX-ONMTF | 0.4074 | 0.2362 | 0.4593 |

Yeast Landscape Multiplex Network: Yeast Landscape is a multiplex genetic interaction network of a species of yeast, Saccharomyces cerevisiae [60, 68]. This network has 4458 nodes and 4 layers representing the positive and negative interaction networks of genes in Saccharomyces cerevisiae, and positive and negative correlation-based networks in which genes with similar interaction profiles are connected to each other. For this work, we use the bioprocess annotations of the genes, available in supplementary data file S6 of [60], as ground truth. We divided the genes into 18 groups according to their primary bioprocess. There were 1580 genes in this network without attributes. Table 2.6 shows the NMI values between the community structures detected by each method and the bioprocess of the genes. MX-ONMTF gives the highest NMI value, followed by the other NMF-based community detection methods.

Table 2.6: NMI of the obtained community partition for each method with respect to the metadata available for the Yeast Landscape network.

| Method | Bioprocess |
| GenLouvain | 0.0794 |
| Aggregated Average | 0.1108 |
| SC-ML | 0.1564 |
| Infomap | 0.2987 |
| CSNMF | 0.3553 |
| CPNMF | 0.3559 |
| CSNMTF | 0.3549 |
| OLMF | 0.3145 |
| MX-ONMTF | 0.4123 |

2.6.4 Analysis of Overfitting

In this section, we evaluate the efficacy of MX-ONMTF in terms of overfitting/underfitting. We characterize the algorithm's model fitting performance in terms of link prediction and link description, as described in [98].

In link prediction, each layer of a multiplex network is sampled using an $\alpha$ fraction of the edges of that layer, creating a new sequence of graphs $\{\mathcal{G}'_l\}$, where $l \in \{1, 2, \ldots, L\}$. The goal of link prediction is to accurately distinguish missing links, $E_m = E \setminus E'$ (true positives), from non-existent edges, $E_{ne} = U \setminus E$ (true negatives), within the set of unobserved connections $U \setminus E'$, where $U$ is the set of all possible edges in $\mathcal{G}_l$ and $|E'| = \alpha|E|$ is a uniformly random subset of edges in the original graph $\mathcal{G}_l = (V_l, E_l)$. MX-ONMTF is then applied to the new multiplex network, and a model-specific score function $s_{ij}$, which estimates the likelihood that a pair of nodes $i, j$ is connected, is computed. In our case, we use the estimated adjacency matrix defined as $\hat{\mathbf{A}}_l = \mathbf{H}'\mathbf{S}'_l\mathbf{H}'^{\top} + \mathbf{H}'_l\mathbf{G}'_l\mathbf{H}'^{\top}_l$, using the embedding matrices obtained from MX-ONMTF, such that $s_{ij} = \hat{\mathbf{A}}_l(i, j)$. Given the ranking of all non-observed links, the AUC value can be interpreted as the probability that a randomly chosen missing link, i.e., a link in $E_m$, is given a higher score than a randomly chosen nonexistent link, i.e., a link in $E_{ne}$. Each time, we randomly pick a missing link and a nonexistent link and compare their scores; if among $n$ independent comparisons, there are $n'$ times the missing link has a higher score and $n''$ times they have the same score, the AUC value is
$$\text{AUC} = \frac{n' + 0.5\,n''}{n}.$$
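A sampling-based computation of this AUC can be sketched as follows; the function name and input conventions are ours, and the number of comparisons $n$ is a free parameter.

```python
import numpy as np

def link_prediction_auc(A_true, A_hat, observed_mask,
                        n_comparisons=10000, seed=0):
    """Sampling-based AUC for link prediction, following the comparison
    scheme above: missing links (true edges left out of the observed
    graph) should outscore nonexistent edges."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(A_true, k=1)   # node pairs (upper triangle)
    truth = A_true[iu] > 0
    seen = observed_mask[iu]
    scores = A_hat[iu]
    missing = np.flatnonzero(truth & ~seen)  # E_m = E \ E'
    nonexistent = np.flatnonzero(~truth)     # E_ne = U \ E
    s_m = scores[rng.choice(missing, n_comparisons)]
    s_n = scores[rng.choice(nonexistent, n_comparisons)]
    n_higher = np.sum(s_m > s_n)
    n_equal = np.sum(s_m == s_n)
    return (n_higher + 0.5 * n_equal) / n_comparisons
```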
The AUC curve as a function of $\alpha$ shows how MX-ONMTF performs across the sampled graphs, ranging from when very few edges are observed to when only a few edges are missing. Link description, on the other hand, evaluates how well the method learns an observed network. Its goal is to accurately distinguish observed edges $E'$ (true positives) from observed non-edges $U \setminus E'$ (true negatives) within the set of all possible edges $U$. Similar to link prediction, we employ the $\mathcal{G}'_l$'s and use the same scoring function $s_{ij}$ to evaluate our algorithm's accuracy at distinguishing edges from non-edges. In our evaluation, 20 randomly subsampled multiplex networks $\{\mathcal{G}'_l\}$ are generated for each value of $\alpha$, and the mean AUC is computed at each $\alpha$. This analysis is applied to all three real multiplex networks. As can be seen in Figure 2.9, based on the guidelines provided in [98], the performance of our method in link prediction can be described as good and its performance in link description as poor, and we can conclude that our method does not overfit or underfit.

Figure 2.9: AUC curves for the link prediction and description tasks. Each curve shows the mean AUC for MX-ONMTF over 20 realizations of a real-world network for a given fraction $\alpha$ of observed edges in the network.

2.6.5 Multiview Networks

In order to evaluate the performance of our method on networks where the communities are common across all layers, we use two multiview data sets, UCI Handwritten Digits [83] (https://archive.ics.uci.edu/ml/datasets/Multiple+Features) and Caltech [158].

The UCI Handwritten Digits data set consists of features of handwritten digits (0-9) extracted from a collection of Dutch utility maps. There is a total of 2000 patterns that have been digitized into binary images, 200 patterns per digit. These digits are represented by six different feature sets: Fourier coefficients of the character shapes, profile correlations, Karhunen-Loève coefficients, pixel averages in 2 × 3 windows, Zernike moments, and morphological features. Each layer of the multiplex network represents one of the 6 feature sets. The graphs are constructed as $k$-nearest-neighbor graphs with the nearest 50 neighbors and Euclidean distance.

Caltech-101 is a well-known object recognition dataset that consists of pictures of objects belonging to 102 categories. There are about 40 to 800 images per category, for a total of 9144 images. This dataset provides 6 types of features extracted from each image. A multiplex network with 6 layers representing each of the features, 102 classes, and 9144 nodes is constructed from this dataset using $k$-nearest-neighbor graphs with the nearest 50 neighbors. A smaller version of this dataset is also used in these experiments, where only 20 objects are selected, resulting in a multiplex network with 6 layers, 20 communities, and 2386 nodes.

Table 2.7: NMI of the obtained community partition for multiview networks.

| Method | Handwritten | Caltech-20 | Caltech-101 |
| GenLouvain | 0.8791 | 0.5921 | 0.3406 |
| Aggregated Average | 0.7957 | 0.5358 | 0.3941 |
| SC-ML | 0.8435 | 0.6476 | 0.5016 |
| Infomap | 0.5367 | 0.3876 | 0.2583 |
| CSNMF | 0.4499 | 0.4250 | 0.3842 |
| CPNMF | 0.4421 | 0.4208 | 0.3816 |
| CSNMTF | 0.4478 | 0.4264 | 0.3831 |
| OLMF | 0.4876 | 0.4321 | 0.3117 |
| MX-ONMTF | 0.9432 | 0.6861 | 0.5660 |

In this case, as we have the true class assignments, we compute the NMI with respect to this ground truth. As can be seen in Table 2.7, our method performs better than the rest of the methods on all three networks. This indicates that even in cases where there are no private communities, our method is successful at obtaining the consensus community structure and can thus be used as an alternative to multiview clustering.
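The $k$-nearest-neighbor layer construction used for both multiview datasets can be sketched with scikit-learn as follows; the symmetrization step is our own assumption about how the directed $k$-NN graph is turned into an undirected layer.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def multiview_knn_layers(feature_views, k=50):
    """Build one k-NN adjacency matrix per feature view, mirroring the
    construction used for the multiview networks above (50 nearest
    neighbors, Euclidean distance)."""
    layers = []
    for X in feature_views:                  # X: (n_samples, n_features)
        G = kneighbors_graph(X, n_neighbors=k, metric="euclidean",
                             mode="connectivity")
        A = G.toarray()
        A = np.maximum(A, A.T)               # symmetrize -> undirected layer
        np.fill_diagonal(A, 0.0)
        layers.append(A)
    return layers
```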
2.7 Application to fMRI data: Subgroup identification

Identifying homogeneous subgroups with similar symptoms or neuropsychological patterns is essential for understanding the heterogeneity of psychotic disorders and advancing precision medicine, which enables tailored treatments based on patients' unique profiles. Given the complexity of psychiatric disorders, exploring relationships across multiple functional networks can provide deeper insights into diagnostic heterogeneity. In this section, we apply our method, MX-ONMTF, to functional connectivity networks that are extracted from multi-subject resting-state fMRI data, with each network representing a distinct functional interaction pattern. These networks form a multiplex framework, where each layer corresponds to a functional network and nodes correspond to individual subjects. By applying the community detection method proposed in this chapter, we identify communities that capture shared functional patterns across multiple networks (common communities) while also preserving network-specific subgroup structures (private communities).

2.7.1 Resting-State fMRI Data

The resting-state fMRI datasets and corresponding clinical scores used in this study are obtained from the Bipolar-Schizophrenia Network on Intermediate Phenotypes (B-SNIP) study [239, 238]. The study follows a standardized diagnostic and recruitment process across multiple sites, including Baltimore, Chicago, Dallas, Detroit, and Hartford. All subjects underwent a single 5-minute resting-state fMRI session using a 3-T scanner while maintaining fixation on a crosshair displayed on a monitor to minimize motion artifacts. The first three time points were discarded, and head motion correction was performed, followed by slice-timing correction. The corrected fMRI data were then transformed into the standard Montreal Neurological Institute (MNI) space using an echo-planar imaging template and were resampled to 3 × 3 × 3 mm³ isotropic voxels. The resampled fMRI data were further smoothed using a Gaussian kernel with a full width at half maximum (FWHM) of 6 mm. Quality control procedures [82] were applied to select subjects. In this study, 464 individuals diagnosed with psychotic disorders were used, including 176 individuals with schizophrenia, 159 with psychotic bipolar disorder, and 129 with schizoaffective disorder.

Figure 2.10 illustrates the workflow of the proposed method for subgroup identification in this resting-state fMRI dataset. Figure 2.10a illustrates the preprocessing and decomposition pipeline used in this study. The fMRI data for each subject are first transformed into vectorized brain volumes, forming the input matrices $\mathbf{X}_l$ across multiple functional networks. These matrices are then constrained by reference templates and decomposed using entropy bound minimization (c-EBM) [275, 276] to obtain both estimated components and subject-specific spatially constrained variations (SCVs).
The spatial constraints, i.e., references, for c-EBM are generated based on the fSIG pipeline [276], which includes 49 resting-state networks (RSNs) spanning auditory (AUD: 1 RSN), sensorimotor (MOT: 8 RSNs), visual (VIS: 10 RSNs), default-mode (DMN: 11 RSNs), attentional (ATTN: 8 RSNs), frontal (FRONT: 8 RSNs), cerebellar (CB: 2 RSNs), and basal ganglia (BG: 1 RSN) networks. Next, as shown in Figure 2.10b, an element-wise squared partial correlation matrix $\hat{\mathbf{C}}$ is constructed from the estimated SCVs while removing reference effects. This matrix is then transformed into an undirected graph by setting the diagonal elements to zero and thresholding the partial correlation values to define the network edges. The resulting graph layers are categorized based on their topological properties, as explained in the next subsection, ensuring consistent classification across networks. Finally, community detection is performed on each group of layers using MX-ONMTF to identify both common and private communities.

Figure 2.10: (a) Flowchart of applying c-EBM to individual datasets. The estimated components $\mathbf{Y}^{[k]}$ are aligned across datasets by utilizing reference priors $\mathbf{r}_l$. The resulting SCVs are formed by concatenating the corresponding estimated components from all subjects, such as the default mode network component across individuals. Each SCV summarizes information about a specific functional network across all subjects. (b) Flowchart of the proposed multiplex community-based subgroup identification, which includes Element-Wise Squared Partial Correlation Matrix Formation: the matrix $\hat{\mathbf{C}}$ is constructed from the estimated SCVs derived from c-EBM, with the effects of references removed; Graph Transformation: $\hat{\mathbf{C}}$ is converted into an undirected graph with adjacency matrix $\mathbf{B}$ by removing its diagonal elements and thresholding and binarizing the partial correlation values to define the edges, with the subjects represented as nodes; Layer Categorization: based on the topological properties of the graphs, layers are classified into different layer categories, ensuring that connection patterns across subjects are consistent within each layer category across various functional networks; Community Detection: MX-ONMTF is applied to each layer category to identify common and private communities.

2.7.2 Layer Classification Based on Graph-Theoretical Metrics

First, the layers in the multiplex network are classified based on their topological properties using graph-theoretical metrics. The element-wise squared partial correlation matrix $\hat{\mathbf{C}}_l$ captures subject similarity within a functional network while excluding reference influences. Networks with similar correlation patterns are assigned to the same layer category, as they provide consistent subgroup information. To achieve this, $\hat{\mathbf{C}}_l$ is converted into an undirected binary graph $\mathcal{G}_l$ with adjacency matrix $\mathbf{B}_l$, where subjects are nodes and edges represent significant partial correlations after thresholding. The threshold $e$ is selected to maintain a link density ($\Gamma$) between 20% and 70%, ensuring balanced connectivity. Graph-theoretical metrics, including (1) path length, (2) global efficiency, (3) centrality, (4) clustering coefficient, and (5) small-worldness, are used to characterize each layer, and a feature matrix summarizing these topological characteristics of $\mathcal{G}_l$ is formed. $k$-means clustering is applied to this feature matrix, with the number of layer categories varying from 2 to 10.
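A sketch of this layer-categorization step is given below, assuming networkx and scikit-learn; small-worldness is omitted for brevity (it can be added via a degree-matched random-graph baseline), and the exact feature definitions in the original implementation may differ.

```python
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans

def layer_features(B):
    """Summarize one binarized layer with graph-theoretical metrics."""
    G = nx.from_numpy_array(B)
    # use the largest connected component so path length is well defined
    cc = max(nx.connected_components(G), key=len)
    H = G.subgraph(cc)
    return [
        nx.average_shortest_path_length(H),               # (1) path length
        nx.global_efficiency(G),                          # (2) global efficiency
        np.mean(list(nx.degree_centrality(G).values())),  # (3) centrality
        nx.average_clustering(G),                         # (4) clustering coeff.
    ]

def categorize_layers(B_list, n_categories=3, seed=0):
    """k-means over the per-layer feature matrix to assign categories."""
    F = np.array([layer_features(B) for B in B_list])
    return KMeans(n_clusters=n_categories, n_init=10,
                  random_state=seed).fit_predict(F)
```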
The final number of layer categories is determined based on the within-cluster sums of point-to-centroid distances. The purpose of dividing the 49 layers into different categories is to ensure that the connection pattern across subjects is similar across the different functional networks within a layer category. For this dataset, the layer categorization resulted in $G = 3$ categories with $L_1 = 6$, $L_2 = 23$, and $L_3 = 20$ layers, respectively. Once the 49 layers are divided into three categories, the number of communities in each category is determined by applying the algorithms described in Section 2.3.3. MX-ONMTF is then applied individually to the three multiplex networks, resulting in two common communities for each layer category.

2.7.3 Discussion on fMRI Results

To investigate the behavioral differences across the detected communities, i.e., subgroups, we analyzed the corresponding Positive and Negative Syndrome Scale (PANSS) [137] scores collected from the same group of individuals, since abnormal functional status observed through fMRI analysis is often associated with certain behavioral symptoms. The PANSS evaluates 30 distinct symptoms categorized into positive symptoms, negative symptoms, and general psychopathology symptoms, commonly observed in psychotic patients. Each symptom is rated on a scale of 1 to 7, where 1 signifies no symptoms and 7 represents severe symptoms. Positive symptoms, such as hallucinations and delusions, reflect an excess or distortion of normal functions and are typically more pronounced in Type 1 schizophrenia. Negative symptoms, including emotional withdrawal and difficulty in abstract thinking, represent a loss of normal functions and are more severe in Type 2 schizophrenia. General psychopathology symptoms encompass issues not specifically categorized as positive or negative, such as poor attention, anxiety, guilt, tension, lack of insight, and active social avoidance [122].

In Figure 2.11, we present the two common communities identified separately from Layer Categories 1 and 2. The detected common communities exhibit significant group differences compared to the remaining subjects, both in activations across functional networks and in PANSS clinical scores. A two-sample $t$-test was conducted to analyze voxel-wise activation values of the spatial maps for subjects within each subgroup, determining whether the spatial activation patterns of the RSNs showed significant differences between subgroups. False discovery rate (FDR) correction [20] is applied to all results.

Figure 2.11: The two identified common communities (I, II) show significant group differences compared to the remaining subjects, both in terms of meaningful functional brain areas and clinical scores. The two identified common communities are detected from a set of 23 layers and a set of 6 layers separately. The neuro-activity maps ($t$-maps) highlight the functional areas with significant activation differences, derived from two-sample $t$-tests with $p < 0.05$ after FDR correction, between the subgroups. From left to right, the visualization includes the element-wise squared partial correlation matrix of an SCV, $\hat{\mathbf{C}}$, the $t$-map, and the clinical score differences between the identified subgroups. Subjects belonging to the identified common communities are indicated by red and green blocks in $\hat{\mathbf{C}}$.

The common community in Figure 2.11 I, highlighted by a green square, exhibits elevated scores across multiple behavioral variables.
These include poor attention, disturbance of volition, poor impulse control, and active social avoidance from the general psychopathology subscale of the PANSS, as well as higher excitement scores from the positive subscale and poor rapport from the negative subscale. These behavioral differences can be linked to functional variations ($p = 1.14 \times 10^{-3}$) observed in specific brain regions, including the anterior prefrontal cortex (antPFC, BA 10), dorsolateral prefrontal cortex (dlPFC, BA 9), dorsal anterior cingulate cortex (dACC, BA 32), dorsal posterior cingulate cortex (dPCC, BA 31), angular gyrus (BA 39), and supramarginal gyrus (BA 40). Notably, the angular gyrus and supramarginal gyrus are part of the inferior parietal lobule (IPL), which has been consistently reported to be associated with attention deficits in psychiatric patients [178, 218]. This aligns with the clinical observation that individuals in the identified community display higher (worse) scores on poor attention symptoms. Additionally, Paulus et al. [198] reported significantly greater activation of the supramarginal gyrus in patients compared to controls during a decision-making study, further supporting its role in attention-related deficits. Dysfunction in the antPFC has also been frequently linked to psychiatric disorders, particularly negative symptoms such as avolition (lack of motivation) [235]. This is consistent with the observed higher scores for disturbance of volition in the identified community. Notably, this common community exhibits more severe symptoms compared to the community in II. This increased severity may be linked to dysfunctions observed in the default mode network (DMN), which includes regions such as the posterior cingulate cortex and bilateral inferior parietal lobule. The DMN has been reported to correlate positively with internally directed thought processes, including future planning and affective decision-making, while correlating negatively with task-related processes [38, 232, 261]. Mingoia et al. reported increased functional connectivity within the DMN in individuals with schizophrenia from resting-state fMRI [179].

In the other detected common community, shown in Figure 2.11 II and highlighted by a red square, subjects displayed significantly higher activation ($p = 6.28 \times 10^{-4}$) in the dorsolateral prefrontal cortex (dlPFC, BA 46, BA 9), middle temporal gyrus (BA 21), primary sensory cortex (BA 1), superior parietal lobule (BA 7), and Broca's operculum (BA 44). These brain regions are closely associated with various psychotic disorders such as schizophrenia. The dlPFC, in particular, plays a critical role in complex cognitive functions such as working memory, planning, and decision-making [285, 200]. Dysfunctional local connectivity in the dlPFC has been linked to psychiatric disorders, where patients often exhibit pronounced deficits in working memory [196]. Alterations in the dlPFC have also been associated with negative symptoms in schizophrenia [257]. This dysfunctionality may contribute to the observed higher (worse) negative symptom scores, such as difficulty in abstract thinking, in subjects within this common community.

2.8 Conclusions

In this chapter, we proposed a multiplex community detection method based on ONMTF. The proposed method, MX-ONMTF, is able to detect both common and private communities across layers, allowing us to differentiate between the topologies across layers.
The proposed algorithm is based on multiplicative update rules, and a proof of convergence is provided, along with an in-depth analysis of the algorithm, including studies of overfitting and ablation, recovery guarantees, and consistency. A new approach based on the eigengap criterion is introduced for determining the number of communities. Results for both synthetic and real-world networks show that our method performs better than existing community detection methods for multiplex networks, as it is able to handle the heterogeneity of the network topology across layers. Moreover, experiments on multiview networks show that our method also performs well in cases where a consensus community structure is needed. In addition, MX-ONMTF is applied to an fMRI dataset where the nodes are 464 psychotic patients and the layers represent different functional areas of the brain, identifying subgroups of subjects that exhibit significant differences in key functional areas, such as the default mode network (DMN) and anterior prefrontal cortex (antPFC), as well as in their corresponding clinical scores. These findings align with prior clinical studies, demonstrating the ability of the proposed approach to uncover clinically relevant subgroups and enhance understanding of psychotic disorder heterogeneity.

CHAPTER 3
DISCRIMINATIVE COMMUNITY DETECTION FOR MULTIPLEX NETWORKS

3.1 Introduction

A multiplex network is a multilayer network where all layers share the same set of nodes, with edges representing different interactions [140]. Multiplex networks model complex systems like living organisms, human societies, and transportation systems [229]. Community detection is a core task in network analysis, where communities are defined as groups of nodes that are more densely connected to each other than they are to the rest of the network [92]. Detecting the community structure is useful for understanding the structure and function of complex networks. In many applications, one may have multiplex networks constructed from two different datasets. For example, in the study of brain networks through multiplex representations [67], each layer may correspond to a different subject, and each group may correspond to a different population, e.g., healthy vs. disease. In these settings, one is interested not only in the community structure of each multiplex network but also in the network components, i.e., communities, that discriminate between the two groups. Moreover, in many cases the multiplex network structure may change with time, leading to temporal multiplex networks. For example, recent studies show that functional brain networks are dynamic, with the topology evolving over time, so one may also be interested in identifying network components that evolve differently across two groups over time.

In this chapter, we introduce a discriminative community detection approach based on spectral clustering for detecting community structures that distinguish between two multiplex networks in both the static and dynamic cases. In particular, we introduce three different formulations. The first approach, Multiplex Discriminative Spectral Clustering (MX-DSC), focuses on minimizing the normalized cut of the difference between the two groups with a regularization term that ensures that the projection distance between the discriminative subspaces is maximized.
The second method, Multiplex Discriminative and Consensus Spectral Clustering (MX-DCSC), extends this approach by simultaneously learning consensus, discriminative, and individual layerwise subspaces across both groups. The third method, Discriminative Subgraph Detection for Temporal Multiplex Networks (TMX-DiSG), identifies discriminative subgraphs between two temporal multiplex networks by minimizing the normalized cut over time, incorporating regularization terms to maximize the projection distance and to ensure temporal smoothness. These methods are evaluated on synthetic and real multiplex and temporal multiplex networks, including EEG and dynamic fMRI functional brain networks, comparing across experimental conditions and tasks.

3.1.1 Related Work

The problem of community detection in multiplex networks is closely tied to the literature on multiview clustering [46], which deals with the problem of clustering data points given multiple sets of features. The main approach to multiview clustering is to optimize an objective function to find the best clustering solution for the given data with $N$ samples and $m$ views, yielding a membership matrix $\mathbf{H} \in \mathbb{R}^{N \times k}$ that indicates group membership. Some examples of this approach include multiview spectral clustering [77], multiview subspace clustering [37], multiview NMF clustering [161], and canonical correlation analysis based methods [47]. The methods proposed here are most similar to multiview spectral clustering, which constructs a similarity matrix and minimizes the normalized cut between clusters. However, existing multiview spectral clustering methods focus on learning either consensus or both layer-specific and consensus cluster structures [144]. Thus, there is no direct emphasis on differentiating between two groups of multiview data.

Another class of methods closely related to the proposed frameworks are contrastive principal component analysis (cPCA) [2] and discriminative principal component analysis (dPCA) [50]. These methods deal with the dimensionality reduction problem, similar to PCA. However, unlike PCA, which copes with one dataset at a time, they analyze multiple datasets jointly. They extract the most discriminative information from one dataset of particular interest, i.e., the target data, relative to the other(s), i.e., the background data. The method proposed in this chapter can be thought of as an extension of cPCA and dPCA from the Euclidean domain to the graph domain, where the discriminative subspaces now correspond to the discriminative community structure.

The problem of learning from multiple datasets has also been addressed in the area of brain imaging, where the increasing availability of data across multiple tasks in recent years provides complementary information [236]. Methods that jointly analyze multiple datasets can leverage their complementary information, improving overall learning performance. However, a key challenge in analyzing these datasets is to distinguish between shared (joint) and unique (discriminative) patterns. Traditional methods in this domain focus on matrix and tensor decompositions [287], which use latent variable models that capture the interactions among the multiple datasets [5, 134]. Extensions of independent component analysis (ICA) such as linked ICA [105], tensor ICA [17], independent vector analysis (IVA) [4], group ICA [43], and multi-paradigm sparse tensor decomposition [287] have been used for multimodal data fusion, multi-subject, and multiple-task fMRI analysis.
More recently, deep learning methods have been employed for learning from multimodal and multi-task fMRI data [279]. While these methods extract useful biomarkers, their lack of interpretability, especially as network depth increases, limits their usefulness. In recent work, a supervised dictionary learning method has been proposed for multi-subject fMRI data analysis to extract brain activation maps that are common and discriminative across different groups of subjects [127]. However, these methods still have shortcomings. First, most of the current methods focus on finding the common information across different networks, i.e., data fusion. Second, existing methods focus on the whole-brain fMRI time series data and not the actual network. Thus, they cannot answer how the topological organization of the network changes between two tasks or populations. Finally, the current methods aggregate the data over time; thus, the temporal variation of the common and unique components cannot be determined.

3.2 Background

3.2.1 Multiplex Network Community Detection

Multiplex networks can be represented using a finite sequence of graphs $\{\mathcal{G}_l\}$, where $l \in \{1, 2, \ldots, L\}$ and $\mathcal{G}_l = (V, E_l, \mathbf{A}_l)$ [62]. $V$ is the set of nodes, which is the same for all layers, and $E_l$ and $\mathbf{A}_l \in \mathbb{R}^{N \times N}$ are the edge set and the adjacency matrix for layer $l$, respectively. A large group of community detection methods for multiplex networks aims to find a consensus community structure across all layers by first merging the layers and then applying a single-layer community detection algorithm to the aggregated network. When the networks are aggregated through the mean operation, the trace minimization problem in Eq. (1.8) can be written as [144]:
$$\underset{\mathbf{U}_* \in \mathbb{R}^{N \times k},\ \mathbf{U}_*^{\top}\mathbf{U}_* = \mathbf{I}}{\text{minimize}}\ \mathrm{tr}\Big(\mathbf{U}_*^{\top} \sum_{l=1}^{L} \mathbf{L}_l\, \mathbf{U}_*\Big), \quad (3.1)$$
where $\mathbf{L}_l \in \mathbb{R}^{N \times N}$ is the graph Laplacian matrix for layer $l$. The goal is to find a subspace $\mathbf{U}_*$ that is representative of all the layers in the multiplex network, and the consensus community structure can be found by applying $k$-means to this $\mathbf{U}_*$.

Multiplex networks that vary over time can be represented as $\mathcal{G}_{l,t} = (V, E_{l,t}, \mathbf{A}_{l,t})$, where $V$ is the set of nodes, $E_{l,t}$ is the edge set, and $\mathbf{A}_{l,t} \in \mathbb{R}^{N \times N}$ is the adjacency matrix for layer $l$ and time $t$, with $l \in \{1, \ldots, L\}$ and $t \in \{1, \ldots, T\}$. For temporal multiplex networks, the low-dimensional spectral embedding representative of all the layers at time $t$, $\mathbf{U}_{*,t}$, can be obtained by applying the trace minimization problem in Eq. (1.8) to the layer-aggregated Laplacian matrices as follows [144]:
$$\underset{\mathbf{U}_{*,t} \in \mathbb{R}^{N \times k},\ \mathbf{U}_{*,t}^{\top}\mathbf{U}_{*,t} = \mathbf{I}}{\text{minimize}}\ \mathrm{tr}\Big(\mathbf{U}_{*,t}^{\top} \sum_{l=1}^{L} \mathbf{L}_{l,t}\, \mathbf{U}_{*,t}\Big). \quad (3.2)$$
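The consensus formulation in Eq. (3.1) reduces to a single eigendecomposition followed by $k$-means. A minimal sketch, assuming the layer Laplacians are precomputed:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def consensus_spectral_clustering(laplacians, k, seed=0):
    """Consensus community detection via Eq. (3.1): the k eigenvectors of
    the summed layer Laplacians with smallest eigenvalues give U_*, and
    k-means on its rows gives the consensus partition."""
    L_sum = np.sum(laplacians, axis=0)            # sum of N x N Laplacians
    _, U = eigh(L_sum, subset_by_index=[0, k - 1])
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
```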
These two goals can be simultaneously satisfied through the 𝑙=1 L1 𝑚=1 L2 following optimization: minimize ¯U1, ¯U1⊤ ¯U1=I tr(cid:16) ¯U1⊤ (cid:16) 𝐿 ∑︁ 𝑙=1 (cid:17) ¯U1(cid:17) − 𝛼tr(cid:16) ¯U1⊤ (cid:16) L1 𝑙 L2 𝑚 (cid:17) ¯U1(cid:17) , 𝑀 ∑︁ 𝑚=1 54 (a) (b) Figure 3.1: Overview of the proposed multiplex discriminative community detection methods, MX-DSC and MX-DCSC. (a) Toy example of two multiplex networks with shared (non-discriminative) communities colored in pink and purpe, and discriminative communities colored in orange, green and gray. (b) Discriminative embedding matrices are learned for each group, ¯U1 and ¯U2. 𝑘-means with 𝑘 = 2 is applied to the degrees of |Z1| and |Z2| to separate the nodes in the shared and in the discriminative subspace, respectively. where ¯U1 captures what is discriminative in the first multiplex network with respect to the other. Similarly, we can define ¯U2 as the embedding matrix that contains information about the discrimi- native subspace of the second multiplex network with respect to the first. Considering the two embedding matrices jointly, ¯U1 ∈ R𝑁× ¯𝑘1 and ¯U2 ∈ R𝑁× ¯𝑘2, results in tr(cid:16) ¯U1⊤ (cid:16) 𝐿 ∑︁ L1 𝑙 − 𝛼 𝑀 ∑︁ L2 𝑚 (cid:17) ¯U1(cid:17) + tr(cid:16) ¯U2⊤ (cid:16) minimize ¯U1, ¯U2 𝑀 ∑︁ 𝑚=1 L2 𝑚 − 𝛼 (cid:17) ¯U2(cid:17) L1 𝑙 𝐿 ∑︁ 𝑙=1 (3.3) 𝑙=1 𝑚=1 +𝛾1tr(cid:16) ¯U1 ¯U1⊤ ¯U2 ¯U2⊤(cid:17) , s.t. ¯U1⊤ ¯U1 = I, ¯U2⊤ ¯U2 = I, where the first term determines ¯U1 that discriminates the first multiplex network from the second, the second term defines ¯U2 that discriminates the second network from the first, and the last term is a regularization that maximizes the projection distance between ¯U1 and ¯U2. The hyperparameters 𝛼 and 𝛾1 control the level of discrimination and the dissimilarity between the two subspaces, respectively. The optimization problem in (3.7) can be solved in an alternating manner, first solving for ¯U1 and then for ¯U2. 55 ¯U1(𝑘+1) := argmin ¯U1∈R𝑁 ×𝑘1 tr( ¯U1⊤( ¯U2(𝑘+1) := argmin ¯U2∈R𝑁 ×𝑘2 tr( ¯U2⊤( 𝐿 ∑︁ 𝑙=1 𝑀 ∑︁ 𝑚=1 𝑀 ∑︁ L1 𝑙 − 𝛼 𝑚 + 𝛾1 ¯U2(𝑘) ¯U2(𝑘)⊤) ¯U1), L2 𝑚=1 𝐿 ∑︁ L2 𝑚 − 𝛼 𝑙=1 𝑙 + 𝛾1 ¯U1(𝑘+1) ¯U1(𝑘+1)⊤) ¯U2) L1 (3.4) s.t. ¯U1⊤ ¯U1 = I, ¯U2⊤ ¯U2 = I. The solution to updating ¯U1 is the eigenvectors corresponding to the 𝑘 1 smallest eigenvalues of ((cid:205)𝐿 𝑙=1 L1 𝑙 − 𝛼 (cid:205)𝑀 𝑚 + 𝛾1 ¯U2 ¯U2⊤), which is the global optimum solution to the ¯U1 sub-problem in (3.8) [34]. The solution for ¯U2 can be found in a similar manner, and it is the global optimum 𝑚=1 L2 for the ¯U2 sub-problem in (3.8). We solve iteratively for both variables until convergence. 3.3.2 Multiplex Discriminative and Consensus Spectral Clustering (MX-DCSC) In this section, we propose a formulation where we learn both the discriminative subspaces between groups, ¯U1 and ¯U2, while also learning the consensus subspaces, U1 ∗ ∈ R𝑁×𝑘 2 and the individual layerwise embeddings, U1 U2 group. 𝑙 ∈ R𝑁×𝑘 1 and U2 ∗ ∈ R𝑁×𝑘 1 and 𝑚 ∈ R𝑁×𝑘 2, within each For the discriminative part, we propose to use a variation of Eq. (3.7), where we find ¯U1 that captures what is discriminative in the first multiplex network with respect to the other. We can define the squared projection distance between the target representative subspace ¯U1 of the first group and the individual subspaces of the second group, U2 𝑚 as in [77] 𝑝𝑟𝑜 𝑗 ( ¯U1, {U2 𝑑2 𝑚}𝑀 𝑚=1) = 𝑀 ∑︁ 𝑚=1 𝑝𝑟𝑜 𝑗 ( ¯U1, U2 𝑑2 𝑚) = 𝑀 ∑︁ (𝑘 − tr( ¯U1 ¯U1⊤U2 𝑚U2⊤ 𝑚 )) = 𝑘 𝑀 − tr( ¯U1 ¯U1⊤U2 𝑚=1 𝑚U2⊤ 𝑚 ). We want to find a ¯U1 that minimizes the trace in Eq. 
3.3.2 Multiplex Discriminative and Consensus Spectral Clustering (MX-DCSC)

In this section, we propose a formulation where we learn the discriminative subspaces between the groups, $\bar{\mathbf{U}}^1$ and $\bar{\mathbf{U}}^2$, while also learning the consensus subspaces, $\mathbf{U}^1_* \in \mathbb{R}^{N \times k^1}$ and $\mathbf{U}^2_* \in \mathbb{R}^{N \times k^2}$, and the individual layerwise embeddings, $\mathbf{U}^1_l \in \mathbb{R}^{N \times k^1}$ and $\mathbf{U}^2_m \in \mathbb{R}^{N \times k^2}$, within each group.

For the discriminative part, we propose to use a variation of Eq. (3.3), where we find $\bar{\mathbf{U}}^1$ that captures what is discriminative in the first multiplex network with respect to the other. Following [77], we define the squared projection distance between the target representative subspace $\bar{\mathbf{U}}^1$ of the first group and the individual subspaces of the second group, $\mathbf{U}^2_m$, as

$$d^2_{proj}\big(\bar{\mathbf{U}}^1, \{\mathbf{U}^2_m\}_{m=1}^{M}\big) = \sum_{m=1}^{M} d^2_{proj}\big(\bar{\mathbf{U}}^1, \mathbf{U}^2_m\big) = \sum_{m=1}^{M}\big(k - \operatorname{tr}(\bar{\mathbf{U}}^1\bar{\mathbf{U}}^{1\top}\mathbf{U}^2_m\mathbf{U}^{2\top}_m)\big) = kM - \sum_{m=1}^{M}\operatorname{tr}\big(\bar{\mathbf{U}}^1\bar{\mathbf{U}}^{1\top}\mathbf{U}^2_m\mathbf{U}^{2\top}_m\big).$$

We want to find a $\bar{\mathbf{U}}^1$ that minimizes the trace in Eq. (3.1) for the graph Laplacians of its own group while maximizing its projection distance from the individual subspaces of the second group. Combining these two goals yields the following cost function

$$\mathcal{L}_{dis}(\bar{\mathbf{U}}^1) = \operatorname{tr}\Big(\bar{\mathbf{U}}^{1\top}\Big(\sum_{l=1}^{L}\mathbf{L}^1_l + \alpha\sum_{m=1}^{M}\mathbf{U}^2_m\mathbf{U}^{2\top}_m\Big)\bar{\mathbf{U}}^1\Big),$$

where the regularization parameter $\alpha$ balances the trade-off between the two terms. To learn the community structure of each layer, we use the trace minimization corresponding to spectral clustering

$$\mathcal{L}_{lw}(\mathbf{U}^1_l) = \operatorname{tr}\big(\mathbf{U}^{1\top}_l\mathbf{L}^1_l\mathbf{U}^1_l\big).$$

Finally, in order to capture the consensus community structure of each group, we use the multiview spectral clustering formulation in [77]

$$\mathcal{L}_{con}(\mathbf{U}^1_*) = \operatorname{tr}\Big(\mathbf{U}^{1\top}_*\Big(\sum_{l=1}^{L}\mathbf{L}^1_l - \beta\sum_{l=1}^{L}\mathbf{U}^1_l\mathbf{U}^{1\top}_l\Big)\mathbf{U}^1_*\Big).$$

Combining these three terms, $\mathcal{L}_{dis}$, $\mathcal{L}_{lw}$, and $\mathcal{L}_{con}$, for each group, together with the regularization term that maximizes the projection distance between $\bar{\mathbf{U}}^1$ and $\bar{\mathbf{U}}^2$, we propose the following formulation for MX-DCSC, which finds $\bar{\mathbf{U}}^1 \in \mathbb{R}^{N \times \bar{k}^1}$, $\bar{\mathbf{U}}^2 \in \mathbb{R}^{N \times \bar{k}^2}$, $\mathbf{U}^1_l \in \mathbb{R}^{N \times k^1}$, $\mathbf{U}^2_m \in \mathbb{R}^{N \times k^2}$, $\mathbf{U}^1_* \in \mathbb{R}^{N \times k^1}$, and $\mathbf{U}^2_* \in \mathbb{R}^{N \times k^2}$:

$$\begin{aligned}
\underset{\bar{\mathbf{U}}^1, \bar{\mathbf{U}}^2, \mathbf{U}^1_l, \mathbf{U}^2_m, \mathbf{U}^1_*, \mathbf{U}^2_*}{\text{minimize}} \
& \operatorname{tr}\Big(\bar{\mathbf{U}}^{1\top}\Big(\sum_{l=1}^{L}\mathbf{L}^1_l + \alpha\sum_{m=1}^{M}\mathbf{U}^2_m\mathbf{U}^{2\top}_m\Big)\bar{\mathbf{U}}^1\Big) + \operatorname{tr}\Big(\bar{\mathbf{U}}^{2\top}\Big(\sum_{m=1}^{M}\mathbf{L}^2_m + \alpha\sum_{l=1}^{L}\mathbf{U}^1_l\mathbf{U}^{1\top}_l\Big)\bar{\mathbf{U}}^2\Big) \\
& + \gamma_1\operatorname{tr}\big(\bar{\mathbf{U}}^1\bar{\mathbf{U}}^{1\top}\bar{\mathbf{U}}^2\bar{\mathbf{U}}^{2\top}\big) + \sum_{l=1}^{L}\operatorname{tr}\big(\mathbf{U}^{1\top}_l\mathbf{L}^1_l\mathbf{U}^1_l\big) + \sum_{m=1}^{M}\operatorname{tr}\big(\mathbf{U}^{2\top}_m\mathbf{L}^2_m\mathbf{U}^2_m\big) \\
& + \operatorname{tr}\Big(\mathbf{U}^{1\top}_*\Big(\sum_{l=1}^{L}\mathbf{L}^1_l - \beta\sum_{l=1}^{L}\mathbf{U}^1_l\mathbf{U}^{1\top}_l\Big)\mathbf{U}^1_*\Big) + \operatorname{tr}\Big(\mathbf{U}^{2\top}_*\Big(\sum_{m=1}^{M}\mathbf{L}^2_m - \beta\sum_{m=1}^{M}\mathbf{U}^2_m\mathbf{U}^{2\top}_m\Big)\mathbf{U}^2_*\Big), \\
\text{s.t.} \ & \bar{\mathbf{U}}^{1\top}\bar{\mathbf{U}}^1 = \mathbf{I},\ \bar{\mathbf{U}}^{2\top}\bar{\mathbf{U}}^2 = \mathbf{I},\ \mathbf{U}^{1\top}_l\mathbf{U}^1_l = \mathbf{I},\ \mathbf{U}^{2\top}_m\mathbf{U}^2_m = \mathbf{I}, \ \text{for } l = 1, \ldots, L \text{ and } m = 1, \ldots, M.
\end{aligned} \qquad (3.5)$$

The optimization problem in (3.5) can be solved in an alternating manner as follows:

$$\begin{aligned}
\bar{\mathbf{U}}^{1(k+1)} &:= \underset{\bar{\mathbf{U}}^1}{\operatorname{argmin}} \ \operatorname{tr}\Big(\bar{\mathbf{U}}^{1\top}\Big(\sum_{l=1}^{L}\mathbf{L}^1_l + \alpha\sum_{m=1}^{M}\mathbf{U}^{2(k)}_m\mathbf{U}^{2(k)\top}_m + \gamma_1\bar{\mathbf{U}}^{2(k)}\bar{\mathbf{U}}^{2(k)\top}\Big)\bar{\mathbf{U}}^1\Big), \\
\bar{\mathbf{U}}^{2(k+1)} &:= \underset{\bar{\mathbf{U}}^2}{\operatorname{argmin}} \ \operatorname{tr}\Big(\bar{\mathbf{U}}^{2\top}\Big(\sum_{m=1}^{M}\mathbf{L}^2_m + \alpha\sum_{l=1}^{L}\mathbf{U}^{1(k)}_l\mathbf{U}^{1(k)\top}_l + \gamma_1\bar{\mathbf{U}}^{1(k)}\bar{\mathbf{U}}^{1(k)\top}\Big)\bar{\mathbf{U}}^2\Big), \\
\mathbf{U}^{1(k+1)}_l &:= \underset{\mathbf{U}^1_l}{\operatorname{argmin}} \ \operatorname{tr}\Big(\mathbf{U}^{1\top}_l\Big(\mathbf{L}^1_l + \alpha\bar{\mathbf{U}}^{2(k+1)}\bar{\mathbf{U}}^{2(k+1)\top} - \beta\mathbf{U}^{1(k)}_*\mathbf{U}^{1(k)\top}_*\Big)\mathbf{U}^1_l\Big), \\
\mathbf{U}^{2(k+1)}_m &:= \underset{\mathbf{U}^2_m}{\operatorname{argmin}} \ \operatorname{tr}\Big(\mathbf{U}^{2\top}_m\Big(\mathbf{L}^2_m + \alpha\bar{\mathbf{U}}^{1(k+1)}\bar{\mathbf{U}}^{1(k+1)\top} - \beta\mathbf{U}^{2(k)}_*\mathbf{U}^{2(k)\top}_*\Big)\mathbf{U}^2_m\Big), \\
\mathbf{U}^{1(k+1)}_* &:= \underset{\mathbf{U}^1_*}{\operatorname{argmin}} \ \operatorname{tr}\Big(\mathbf{U}^{1\top}_*\Big(\sum_{l=1}^{L}\mathbf{L}^1_l - \beta\sum_{l=1}^{L}\mathbf{U}^{1(k+1)}_l\mathbf{U}^{1(k+1)\top}_l\Big)\mathbf{U}^1_*\Big), \\
\mathbf{U}^{2(k+1)}_* &:= \underset{\mathbf{U}^2_*}{\operatorname{argmin}} \ \operatorname{tr}\Big(\mathbf{U}^{2\top}_*\Big(\sum_{m=1}^{M}\mathbf{L}^2_m - \beta\sum_{m=1}^{M}\mathbf{U}^{2(k+1)}_m\mathbf{U}^{2(k+1)\top}_m\Big)\mathbf{U}^2_*\Big), \\
\text{s.t.} \ & \bar{\mathbf{U}}^{1\top}\bar{\mathbf{U}}^1 = \mathbf{I},\ \bar{\mathbf{U}}^{2\top}\bar{\mathbf{U}}^2 = \mathbf{I},\ \mathbf{U}^{1\top}_l\mathbf{U}^1_l = \mathbf{I},\ \mathbf{U}^{2\top}_m\mathbf{U}^2_m = \mathbf{I}, \ \text{for } l = 1, \ldots, L \text{ and } m = 1, \ldots, M.
\end{aligned} \qquad (3.6)$$

3.3.3 Discriminative Subgraph Detection for Temporal Multiplex Networks (TMX-DiSG)

Given two temporal multiplex networks $\mathcal{G}^1_{l,t} = (V^1, E^1_{l,t}, \mathbf{A}^1_{l,t})$ and $\mathcal{G}^2_{m,t} = (V^2, E^2_{m,t}, \mathbf{A}^2_{m,t})$ at time $t$, for $l \in \{1, 2, \ldots, L\}$ and $m \in \{1, 2, \ldots, M\}$, with graph Laplacians $\mathbf{L}^1_{l,t}$ and $\mathbf{L}^2_{m,t}$, the goal is to find two spectral embedding matrices, $\bar{\mathbf{U}}^1_t$ and $\bar{\mathbf{U}}^2_t$, that discriminate between the two temporal multiplex networks at each time $t$. Figure 3.2a shows a toy example of two temporal multiplex networks.
Figure 3.2: Overview of TMX-DiSG. (a) Illustrative example of a temporal multiplex network. (b) Discriminative embedding matrices, $\bar{\mathbf{U}}^1_t$ and $\bar{\mathbf{U}}^2_t$, are learned for each group at each time point $t$. (c) $k$-means with $k = 2$ is applied to the degrees of $|\mathbf{Z}^1_t|$ and $|\mathbf{Z}^2_t|$ to separate the nodes into the shared and the discriminative subspaces, respectively.

Let $\bar{\mathbf{U}}^1_t \in \mathbb{R}^{N \times \bar{k}^1_t}$ be the embedding subspace that represents the first group, i.e., minimizes $\operatorname{tr}(\bar{\mathbf{U}}^{1\top}_t(\sum_{l=1}^{L}\mathbf{L}^1_{l,t})\bar{\mathbf{U}}^1_t)$, while maximizing its distinction from the second group, i.e., maximizes $\operatorname{tr}(\bar{\mathbf{U}}^{1\top}_t(\sum_{m=1}^{M}\mathbf{L}^2_{m,t})\bar{\mathbf{U}}^1_t)$. These two goals can be simultaneously satisfied through the following optimization:

$$\underset{\bar{\mathbf{U}}^1_t,\; \bar{\mathbf{U}}^{1\top}_t\bar{\mathbf{U}}^1_t = \mathbf{I}}{\text{minimize}} \ \operatorname{tr}\Big(\bar{\mathbf{U}}^{1\top}_t\Big(\sum_{l=1}^{L}\mathbf{L}^1_{l,t}\Big)\bar{\mathbf{U}}^1_t\Big) - \alpha\,\operatorname{tr}\Big(\bar{\mathbf{U}}^{1\top}_t\Big(\sum_{m=1}^{M}\mathbf{L}^2_{m,t}\Big)\bar{\mathbf{U}}^1_t\Big).$$

Similarly, we can define $\bar{\mathbf{U}}^2_t \in \mathbb{R}^{N \times \bar{k}^2_t}$ as the embedding matrix that contains information about the discriminative subspace of the second group with respect to the first. Considering the two embedding matrices jointly, $\bar{\mathbf{U}}^1_t$ and $\bar{\mathbf{U}}^2_t$, and introducing some regularizations, we propose the following optimization problem:

$$\begin{aligned}
\underset{\bar{\mathbf{U}}^1_t, \bar{\mathbf{U}}^2_t}{\text{minimize}} \ & \sum_{t=1}^{T}\operatorname{tr}\Big(\bar{\mathbf{U}}^{1\top}_t\Big(\sum_{l=1}^{L}\mathbf{L}^1_{l,t} - \alpha\sum_{m=1}^{M}\mathbf{L}^2_{m,t}\Big)\bar{\mathbf{U}}^1_t\Big) + \sum_{t=1}^{T}\operatorname{tr}\Big(\bar{\mathbf{U}}^{2\top}_t\Big(\sum_{m=1}^{M}\mathbf{L}^2_{m,t} - \alpha\sum_{l=1}^{L}\mathbf{L}^1_{l,t}\Big)\bar{\mathbf{U}}^2_t\Big) \\
& + \gamma_1\sum_{t=1}^{T}\operatorname{tr}\big(\bar{\mathbf{U}}^1_t\bar{\mathbf{U}}^{1\top}_t\bar{\mathbf{U}}^2_t\bar{\mathbf{U}}^{2\top}_t\big) - \gamma_2\sum_{t=2}^{T}\operatorname{tr}\big(\bar{\mathbf{U}}^1_t\bar{\mathbf{U}}^{1\top}_t\bar{\mathbf{U}}^1_{t-1}\bar{\mathbf{U}}^{1\top}_{t-1}\big) - \gamma_3\sum_{t=2}^{T}\operatorname{tr}\big(\bar{\mathbf{U}}^2_t\bar{\mathbf{U}}^{2\top}_t\bar{\mathbf{U}}^2_{t-1}\bar{\mathbf{U}}^{2\top}_{t-1}\big), \\
\text{s.t.} \ & \bar{\mathbf{U}}^{1\top}_t\bar{\mathbf{U}}^1_t = \mathbf{I},\ \bar{\mathbf{U}}^{2\top}_t\bar{\mathbf{U}}^2_t = \mathbf{I}, \ \text{for } t = 1, 2, \ldots, T,
\end{aligned} \qquad (3.7)$$

where the first term determines the subspace $\bar{\mathbf{U}}^1_t$ that discriminates the first group of networks from the second at time $t$, the second term determines $\bar{\mathbf{U}}^2_t$ that discriminates the second group from the first at time $t$, the third term is a regularization that maximizes the projection distance between $\bar{\mathbf{U}}^1_t$ and $\bar{\mathbf{U}}^2_t$, ensuring that the overlap between the two spectral embeddings is minimized, and the last two terms enforce temporal smoothness of the subspaces. The optimization problem in (3.7) can be solved in an alternating manner, first solving for $\bar{\mathbf{U}}^1_t$ and then for $\bar{\mathbf{U}}^2_t$:

$$\begin{aligned}
\bar{\mathbf{U}}^{1(k+1)}_t &:= \underset{\bar{\mathbf{U}}^1_t,\; \bar{\mathbf{U}}^{1\top}_t\bar{\mathbf{U}}^1_t = \mathbf{I}}{\operatorname{argmin}} \ \operatorname{tr}\Big(\bar{\mathbf{U}}^{1\top}_t\Big(\sum_{l=1}^{L}\mathbf{L}^1_{l,t} - \alpha\sum_{m=1}^{M}\mathbf{L}^2_{m,t} + \gamma_1\bar{\mathbf{U}}^{2(k)}_t\bar{\mathbf{U}}^{2(k)\top}_t - \gamma_2\bar{\mathbf{U}}^1_{t-1}\bar{\mathbf{U}}^{1\top}_{t-1}\Big)\bar{\mathbf{U}}^1_t\Big), \\
\bar{\mathbf{U}}^{2(k+1)}_t &:= \underset{\bar{\mathbf{U}}^2_t,\; \bar{\mathbf{U}}^{2\top}_t\bar{\mathbf{U}}^2_t = \mathbf{I}}{\operatorname{argmin}} \ \operatorname{tr}\Big(\bar{\mathbf{U}}^{2\top}_t\Big(\sum_{m=1}^{M}\mathbf{L}^2_{m,t} - \alpha\sum_{l=1}^{L}\mathbf{L}^1_{l,t} + \gamma_1\bar{\mathbf{U}}^{1(k)}_t\bar{\mathbf{U}}^{1(k)\top}_t - \gamma_3\bar{\mathbf{U}}^2_{t-1}\bar{\mathbf{U}}^{2\top}_{t-1}\Big)\bar{\mathbf{U}}^2_t\Big).
\end{aligned} \qquad (3.8)$$

The solution to the $\bar{\mathbf{U}}^1_t$ update is the set of eigenvectors corresponding to the $\bar{k}^1_t$ smallest eigenvalues of $(\sum_{l=1}^{L}\mathbf{L}^1_{l,t} - \alpha\sum_{m=1}^{M}\mathbf{L}^2_{m,t} + \gamma_1\bar{\mathbf{U}}^2_t\bar{\mathbf{U}}^{2\top}_t - \gamma_2\bar{\mathbf{U}}^1_{t-1}\bar{\mathbf{U}}^{1\top}_{t-1})$, which is the global optimum of the $\bar{\mathbf{U}}^1_t$ sub-problem in (3.8) [34]. The solution for $\bar{\mathbf{U}}^2_t$ can be found in a similar manner and is the global optimum of the $\bar{\mathbf{U}}^2_t$ sub-problem in (3.8). We solve iteratively for both variables until convergence.

3.3.4 Finding the embedding dimensions

In most clustering algorithms, the number of communities ($k$) is an input parameter. This is typically addressed by testing different $k$ values and selecting the best one based on a performance metric. In this chapter, the embedding dimensions, i.e., discriminative and consensus, are determined following the eigengap rule and the hierarchical clustering-based method proposed in Chapter 2. First, we compute embedding matrices $\mathbf{U}^1 \in \mathbb{R}^{N \times k^1}$ and $\mathbf{U}^2 \in \mathbb{R}^{N \times k^2}$ for each group, with $k^1$ and $k^2$ found using the eigengap rule on the graph Laplacian. We then concatenate the embeddings from both groups as $\mathbf{X} = [\mathbf{U}^1, \mathbf{U}^2]$ and apply a hierarchical clustering algorithm to the columns of $\mathbf{X}$, grouping similar eigenvectors that represent shared structure between the groups. The number of clusters corresponding to shared components is denoted as $k_c$. Finally, we compute the dimensions of the discriminative subspaces, $\bar{\mathbf{U}}^1$ and $\bar{\mathbf{U}}^2$, by subtracting the shared components from the original embedding dimensions: $\bar{k}^1 = k^1 - k_c$, $\bar{k}^2 = k^2 - k_c$. Thus, the final embeddings $\bar{\mathbf{U}}^1$ and $\bar{\mathbf{U}}^2$ capture only the distinctive features that differentiate the two groups. For MX-DSC and MX-DCSC, this process is computed once, producing dimensions $\bar{k}^1$, $\bar{k}^2$, $k^1$, and $k^2$. For TMX-DiSG, it is repeated at each time step $t$, leading to time-dependent embeddings $\bar{\mathbf{U}}^1_t$ and $\bar{\mathbf{U}}^2_t$ and dimensions $\bar{k}^1_t$, $\bar{k}^2_t$.
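The dimension-selection procedure above can be sketched as follows. The eigengap search range, the average-linkage and cosine-distance choices, and the absolute-value trick for the sign ambiguity of eigenvectors are illustrative assumptions rather than the thesis's exact settings.

import numpy as np
from scipy.linalg import eigh
from scipy.cluster.hierarchy import linkage, fcluster

def eigengap_dimension(L, k_max=15):
    # Embedding dimension = index of the largest gap between consecutive
    # eigenvalues of the graph Laplacian.
    w = eigh(L, eigvals_only=True, subset_by_index=[0, k_max])
    return int(np.argmax(np.diff(w))) + 1

def discriminative_dimensions(U1, U2, dist_thresh=0.5):
    # Cluster the concatenated eigenvector columns X = [U1, U2]; clusters
    # containing columns from both groups are treated as shared components.
    k1, k2 = U1.shape[1], U2.shape[1]
    X = np.abs(np.hstack([U1, U2])).T     # |.| sidesteps eigenvector sign flips
    Z = linkage(X, method='average', metric='cosine')
    labels = fcluster(Z, t=dist_thresh, criterion='distance')
    kc = len(set(labels[:k1]) & set(labels[k1:]))   # shared clusters k_c
    return k1 - kc, k2 - kc               # discriminative dimensions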
3.3.5 Subgraph Identification

Once the low-dimensional discriminative subspaces, $\bar{\mathbf{U}}^1$ and $\bar{\mathbf{U}}^2$, are learned, we can construct the $N \times N$ matrices $\mathbf{Z}^1 = \bar{\mathbf{U}}^1\bar{\mathbf{U}}^{1\top}$ and $\mathbf{Z}^2 = \bar{\mathbf{U}}^2\bar{\mathbf{U}}^{2\top}$, which capture the discriminative subgraphs of each group, as shown in the toy example in Figure 3.1. We compute the degrees of the nodes in both groups as $D^1_i = \sum_j |Z^1_{ij}|$ and $D^2_i = \sum_j |Z^2_{ij}|$. As seen in the histograms in Figure 3.1, there are two groups of nodes with different degree distributions. These two clusters, discriminative and non-discriminative nodes, are identified in each group by applying $k$-means with $k = 2$ to $\mathbf{D}^1$ and $\mathbf{D}^2$: the cluster with the low-degree nodes corresponds to the non-discriminative structure, and the cluster with the high-degree nodes corresponds to the discriminative subgraph. For TMX-DiSG, this process is repeated at each time step $t$, using the time-dependent embeddings $\bar{\mathbf{U}}^1_t$ and $\bar{\mathbf{U}}^2_t$.
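A minimal sketch of this identification step (the function name is ours):

import numpy as np
from sklearn.cluster import KMeans

def discriminative_nodes(U_bar):
    # Z = U_bar @ U_bar.T; node degrees D_i = sum_j |Z_ij|; 2-means on the
    # degrees separates discriminative (high-degree) from shared nodes.
    Z = U_bar @ U_bar.T
    degrees = np.abs(Z).sum(axis=1)
    km = KMeans(n_clusters=2, n_init=10).fit(degrees.reshape(-1, 1))
    disc_cluster = int(np.argmax(km.cluster_centers_.ravel()))
    return km.labels_ == disc_cluster     # boolean mask of discriminative nodes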
3.3.6 Computational Complexity

The computational complexity of the algorithm is mostly due to the eigendecompositions at each iteration. At each iteration, we find the embeddings by computing the eigenvectors corresponding to the $k$ smallest eigenvalues of an $N \times N$ matrix, which has a complexity of $O(N^2 k)$. Therefore, the total complexity of the algorithm is dominated by $O(N^2 \max\{k^1_t, k^2_t, \bar{k}^1_t, \bar{k}^2_t\})$.

3.4 Experiments: Multiplex Networks

In this section, MX-DSC and MX-DCSC are evaluated on both synthetic and real multiplex networks. For the synthetic data, three experiments are conducted in which different parameters of the simulated data are varied. MX-DSC is applied to two real multiplex network datasets to find the corresponding discriminative subgraphs.

3.4.1 Synthetic Multiplex Networks

Multiplex benchmark networks were generated based on the model described in [16, 124]. First, a multilayer partition is generated with a user-defined number of nodes and layers and an inter-layer dependency tensor specifying the layer relationships. Next, for the given multilayer partition, edges in each layer are generated following a degree-corrected block model [136] parameterized by the distribution of expected degrees and a community mixing parameter $\mu \in [0, 1]$ that controls the network modularity. When $\mu = 0$, all edges lie within communities, whereas $\mu = 1$ implies the edges are distributed uniformly. For multiplex networks, the probabilities in the inter-layer dependency tensor are the same for all pairs of layers and are specified by $p \in [0, 1]$. When $p = 0$, the partitions are independent across layers, while $p = 1$ indicates an identical partition across layers.

In this work, we extend the model described above to generate two multiplex benchmark networks with shared (non-discriminative) and discriminative communities between them. We first generate the shared communities by randomly selecting $n_c$ nodes across all layers and both groups and setting the inter-layer dependency probability to $p_1$. Next, we independently generate the discriminative communities for each group with the remaining nodes.

Evaluation: The performance of MX-DSC is evaluated based on the accuracy of detecting discriminative subgraphs, while MX-DCSC is assessed on both subgraph and community detection accuracy. To evaluate the performance in detecting the discriminative subgraphs, we use AUC-ROC as the evaluation metric. Three experiments with different parameters are each repeated 50 times, and the average AUC-ROC values for MX-DSC and MX-DCSC are reported in Table 3.1. We evaluate the accuracy of community detection in terms of Normalized Mutual Information (NMI) [65]. The accuracy of MX-DCSC, which learns the consensus community structure for each group, is compared with existing methods, as shown in Figure 3.3: SC-ML [77] is applied both to the combination of the two multiplex networks into one and to each multiplex network individually (SC-MLind), and GenLouvain [182] is applied to the combined networks. All experiments are run with $\alpha, \beta, \gamma_1 \in [0, 1]$, and the results with the highest performance are reported.

Experiment 1: Varying Noise Level, $\mu$: In this experiment, we generated two groups of multiplex networks with $N = 256$ nodes, 10 layers, and 6 and 5 communities per layer in each group, respectively. The two multiplex networks share two communities, and the number of discriminative communities per group is $\bar{k}^1 = 4$ and $\bar{k}^2 = 3$, respectively. To evaluate the algorithm's performance under different noise levels, these two networks are generated with varying values of the mixing parameter $\mu \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8\}$. The inter-layer dependency probability for the shared communities is $p_1 = 1$. Table 3.1 shows that both MX-DSC and MX-DCSC have high detection accuracy, with MX-DSC performing slightly better. From Figure 3.3, it can be seen that MX-DCSC outperforms existing community detection methods with increasing noise, as its consensus community structure takes the discriminative information into account.

Experiment 2: Change in Variability, $p_1$: In the second experiment, we evaluated the robustness of the algorithm against variations in the shared community structure by fixing $\mu = 0.3$ and varying the inter-layer dependency probability $p_1$, i.e., the shared communities are allowed to vary across groups and layers. Table 3.1 shows that both methods are robust to variations in the shared community structure even when $p_1 = 0.5$, implying that variations in the shared community structure between the two groups do not affect the discriminative subgraph detection accuracy. Additionally, MX-DCSC performs best in community detection accuracy.

Experiment 3: Change in $k_c$: In this experiment, we evaluated the robustness of our method to the number of shared communities by fixing $\mu = 0.3$ and the total number of communities per layer, while varying $k_c$ from 1 to 6. Both methods are robust to the value of $k_c$ except for $k_c = 1$, since in that case most of the nodes in the two groups have different community structures, which makes it hard to detect the discriminative subgraphs.
In terms of community detection accuracy, MX-DCSC performs well even when the two groups do not share a common community.

Table 3.1: Average AUC values for synthetic multiplex networks.

Experiment 1:
mu        0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8
MX-DSC    1.0000   0.9864   0.9763   0.9683   0.9652   0.9398   0.9167   0.8228
MX-DCSC   1.0000   0.9801   0.9761   0.9723   0.9722   0.9224   0.8781   0.8118

Experiment 2:
p1        0.90     0.80     0.70     0.60     0.50
MX-DSC    0.9980   0.9940   0.9992   0.9956   0.9821
MX-DCSC   0.9978   0.9988   0.9999   0.9835   0.9638

Experiment 3:
kc        1        2        3        4        5        6
MX-DSC    0.7115   0.9812   0.9941   0.9802   0.9949   0.9991
MX-DCSC   0.7052   0.9773   0.9877   0.9838   0.9958   0.9945

Figure 3.3: NMI results for MX-DCSC compared to other methods: (a) Experiment 1, (b) Experiment 2, (c) Experiment 3.

3.4.2 UCI Handwritten Dataset

In this section, we evaluate the performance of MX-DSC on a multiview dataset, UCI Handwritten Digits¹ [83]. The dataset consists of features of handwritten digits (0-9) extracted from a collection of Dutch utility maps. There are a total of 2,000 patterns digitized as binary images, 200 patterns per digit. The digits are represented by six different feature sets: Fourier coefficients of the character shapes, profile correlations, Karhunen-Loève coefficients, pixel averages in 2 × 3 windows, Zernike moments, and morphological features. Each layer of the multiplex network represents one of the six feature sets. The graphs are constructed as $k$-nearest neighbor graphs with the nearest 50 neighbors and Euclidean distance. For the purposes of this work, we selected the two groups corresponding to digits 1 and 7 to construct the first and second multiplex networks, respectively; these two digits are often misclustered due to their similar patterns.

Figure 3.4: UCI Handwritten dataset results. Images that were selected as (a) discriminative and (b) shared nodes for both groups (digits 1 and 7).

MX-DSC is applied to these two multiplex networks, each with 6 layers, with $\alpha = 0.5$, $\gamma_1 = 0.5$, $\bar{k}^1 = 2$, and $\bar{k}^2 = 2$. Figure 3.4 shows the images that were selected as the discriminative and non-discriminative samples between the two groups. The samples in the discriminative subgraph correspond to images where the digits 1 and 7 are clearly written and well-defined. On the other hand, the samples classified as non-discriminative are images with noisy patterns. We also evaluate the performance by applying spectral clustering separately to the features of the discriminative 1's and 7's and the non-discriminative 1's and 7's. The NMI for the discriminative samples is 0.7037, whereas it is 0.3091 for the non-discriminative samples. This shows that our method provides an accurate partition of samples into discriminative and non-discriminative groups, which offers better separability.

¹https://archive.ics.uci.edu/ml/datasets/Multiple+Features
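For reference, a sketch of the layer construction used in the UCI experiment: one 50-nearest-neighbor graph per feature view with Euclidean distance. The symmetrization of the directed k-NN graph is our assumption, since the text does not state how directedness is handled.

import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_multiplex(feature_views, k=50):
    # One layer per feature view: Euclidean 50-NN graph, symmetrized so
    # that each layer is an undirected adjacency matrix.
    layers = []
    for F in feature_views:               # F: (n_samples, n_features)
        A = kneighbors_graph(F, n_neighbors=k, metric='euclidean',
                             mode='connectivity')
        A = A.maximum(A.T)                # keep an edge if either side has it
        layers.append(A.toarray())
    return layers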
3.4.3 Electroencephalogram (EEG) Networks

In this work, we applied MX-DSC to functional connectivity networks (FCNs) of the brain constructed from EEG data collected in a cognitive control-related error processing study [108]. Each participant was presented with a string of five letters at each trial. Letters could be congruent (e.g., SSSSS) or incongruent (e.g., SSTSS) stimuli, and the participants were instructed to respond to the center letter with a mouse. The EEG was recorded following the international 10/20 system for the placement of 64 Ag-AgCl electrodes at a 512 Hz sampling frequency. For each response type (error and correct), the FCNs can be modeled as a multiplex network with 64 nodes (brain regions) and 17 layers (subjects). MX-DSC is applied with $\alpha = 0.5$, $\gamma_1 = 0.5$, $\bar{k}^1 = 3$, and $\bar{k}^2 = 2$ to the multiplex networks corresponding to error and correct responses.

Figure 3.5 shows the discriminative communities corresponding to error and correct responses. For the error response, we detect a discriminative community centered around fronto-central nodes (FCz, FC1, FC2, Cz, C1). This is consistent with prior work showing that medial frontal regions are more activated for error trials than for correct trials [191]. The other discriminative communities are similar in both response types and correspond to the parietal-occipital region, which is activated due to the visual stimulus.

Figure 3.5: Discriminative communities for error (left) and correct (right) responses.

3.5 Experiments: Temporal Multiplex Networks

In this section, TMX-DiSG is evaluated on synthetic temporal multiplex networks. Two experiments are conducted in which different parameters of the simulated data are varied. In addition, TMX-DiSG is applied to a real temporal multiplex network dataset to detect discriminative subgraphs across time.

3.5.1 Synthetic Temporal Multiplex Networks

In this section, we extended the model described in Section 3.4.1 to generate two temporal multiplex benchmark networks with non-discriminative and discriminative communities between the two groups. We use the same approach as in Section 3.4.1 to generate the multiplex networks, and the process is repeated for $T = 60$ time points.

Evaluation: The performance of TMX-DiSG in detecting the discriminative subgraphs is evaluated under two different settings using AUC-ROC. Both experiments are run with $\alpha, \gamma_1, \gamma_2, \gamma_3 \in [0, 1]$, and the results with the highest performance are reported. The accuracy of TMX-DiSG is compared to contrastive principal component analysis (cPCA) [2]. Since cPCA is applied to covariance matrices, adapting it to network data requires using the graph Laplacian in place of the covariance matrix. This adaptation makes cPCA equivalent to the first two terms of TMX-DiSG, i.e., learning the discriminative embedding matrices without the regularization terms.

For both experiments, we generated two groups of multiplex networks with $N = 300$ nodes, $L = M = 10$ layers in each group, and $k_c = 2$ shared communities. The number of discriminative communities, $\bar{k}^1_t$ and $\bar{k}^2_t$, is varied over time for both groups. Specifically, for the first group: $\bar{k}^1_t = 3$ for $t \in [1, 10]$, $\bar{k}^1_t = 4$ for $t \in [11, 40]$, and $\bar{k}^1_t = 3$ for $t \in [41, 60]$. For the second group: $\bar{k}^2_t = 3$ for $t \in [1, 40]$, $\bar{k}^2_t = 4$ for $t \in [41, 50]$, and $\bar{k}^2_t = 3$ for $t \in [51, 60]$. This variation tests the method's adaptability to temporal changes in the discriminative communities.

Experiment 1: Varying Noise Level: To evaluate robustness against noise, the two temporal multiplex networks are generated with varying values of $\mu \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7\}$. Inter-layer dependencies are set as $p_1 = 0.9$ to allow for some variation in the discriminative community structure over time, $p_2 = 1$ to preserve shared communities across layers and time, and $p_3 = 1$ to preserve the discriminative communities across layers. This ensures that shared communities remain unchanged, while discriminative subgraphs vary only over time. Table 3.2 presents the average AUC values across time and groups.
As expected, TMX-DiSG achieves higher detection accuracy than cPCA, which lacks the regularizations that ensure the dissimilarity between the two subspaces and the smoothness across time. The performance of both methods declines as the noise level increases.

Table 3.2: Average AUC values across time and groups for synthetic temporal multiplex networks.

Experiment 1:
mu         0.1      0.2      0.3      0.4      0.5      0.6      0.7
cPCA       0.9546   0.9367   0.9004   0.9017   0.8998   0.8760   0.8591
TMX-DiSG   0.9976   0.9786   0.9521   0.9513   0.9501   0.9245   0.8978

Experiment 2:
p1         0.90     0.80     0.70     0.60     0.50
cPCA       0.9004   0.8999   0.9002   0.8997   0.8867
TMX-DiSG   0.9521   0.9542   0.9501   0.9489   0.9297

Experiment 2: Change in Variability, $p_1$: In the second experiment, we evaluated the robustness of the algorithm against variations in the discriminative community structure across time by fixing $\mu = 0.3$ and varying the inter-layer dependency probability $p_1$, i.e., the discriminative communities are allowed to vary. Table 3.2 shows that TMX-DiSG is robust to variations in the discriminative community structure even when $p_1 = 0.5$, implying that the variations across time do not affect the discriminative subgraph detection accuracy. Figure 3.6 shows that even when the network structure changes every ten time points due to variations in $\bar{k}^1_t$ and $\bar{k}^2_t$, as evidenced by Figure 3.6a, the accuracy remains constant, as shown in Figure 3.6b.

Figure 3.6: Illustrative example of the temporal variation in network structure and its impact on performance. (a) Frobenius norm of the difference between consecutive embedding-based matrices $\mathbf{Z}^1_t = \bar{\mathbf{U}}^1_t\bar{\mathbf{U}}^{1\top}_t$, computed for a single experiment and one run. (b) AUC values across time.

3.5.2 Dynamic fMRI Dataset: Task vs. Resting State

In this section, we evaluated TMX-DiSG on dynamic multiplex brain networks constructed from the Midnight Scan Club (MSC) dataset, which includes fMRI data from ten subjects during resting state and three tasks [103]. Dynamic functional connectivity matrices were estimated using Pearson's correlation between windowed time courses (TCs) extracted with a sliding-window approach [10]. A tapered window was created by convolving a rectangular window (width = 22 repetition times (TRs)) with a Gaussian kernel ($\sigma$ = 3 TRs) and slid in steps of 1 TR. TCs were derived using the Gordon parcellation (333 parcels and 12 brain networks).

We compared the temporal multiplex graphs between the resting state and two different tasks. In the first case, we constructed two groups of dynamic multiplex graphs representing motor vs. rest conditions, with $N = 333$ brain regions (nodes), 10 subjects (layers), and 187 time points. In the second case, we constructed dynamic multiplex graphs representing memory vs. rest conditions, with $N = 333$ brain regions (nodes), 10 subjects (layers), and 342 time points. TMX-DiSG was applied to both sets of temporal multiplex networks using parameters $\alpha = 0.5$ and $\gamma_1 = \gamma_2 = \gamma_3 = 0.1$. For each comparison, we computed the percentage of discriminative nodes within each of the 12 brain systems at every time point.

Figure 3.7: The percentage of discriminative nodes for both groups across the Default, Visual, Cingulo-Opercular, and Somatomotor Medial (Hand) networks and time points for (a) the motor vs. rest case and (b) the memory vs. rest case, respectively.

Figure 3.8: The brain topology of the aggregated discriminative nodes for both groups across all time points for (a) the motor vs. rest case and (b) the memory vs. rest case, respectively.
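A sketch of the dynamic functional connectivity estimation described above, with the stated parameters (rectangular window of 22 TRs convolved with a Gaussian of sigma = 3 TRs, slid in steps of 1 TR); the normalization details of the taper are our own choices.

import numpy as np

def tapered_window(width=22, sigma=3.0):
    # Rectangular window (22 TRs) convolved with a Gaussian kernel (sigma = 3 TRs).
    t = np.arange(-3 * int(sigma), 3 * int(sigma) + 1)
    gauss = np.exp(-t**2 / (2.0 * sigma**2))
    w = np.convolve(np.ones(width), gauss / gauss.sum(), mode='same')
    return w / w.sum()

def dynamic_fc(tc, width=22, sigma=3.0):
    # Sliding-window Pearson correlation, step = 1 TR.
    # tc: (n_TRs, n_parcels) time courses for one subject.
    w = tapered_window(width, sigma)
    fcs = []
    for s in range(tc.shape[0] - width + 1):
        seg = tc[s:s + width] * w[:, None]      # taper the windowed segment
        fcs.append(np.corrcoef(seg.T))          # (n_parcels, n_parcels)
    return np.stack(fcs)                        # (n_windows, n_parcels, n_parcels)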
Figure 3.7a illustrates how the percentage of discriminative nodes fluctuates over time in the Default Mode Network (DMN), Visual, Cingulo-Opercular (CiO), and Somatomotor Medial (Hand) (SMm) networks during motor vs. rest. Notably, the CiO and SMm networks show distinct peaks, reflecting their critical role in motor task execution and attentional control. The DMN and Visual networks also demonstrate variability, suggesting their involvement in cognitive processes linked to motor function. Figure 3.8a visualizes the spatial distribution of discriminative nodes aggregated across all time points for motor vs. rest, highlighting the DMN, Visual, SMm, CiO, and Somatomotor Lateral (Mouth) (SMl) networks as key discriminative areas. These networks play a crucial role in the high-level cognitive and attentional control processes related to motor task execution, as expected.

Figure 3.7b presents the same analysis for memory vs. resting state. Compared to the first case, the fluctuations appear more evenly distributed across time, with the DMN, Visual, CiO, and SMm networks reflecting the continuous cognitive processing required for memory tasks. Figure 3.8b shows that the topology of discriminative nodes in memory-related tasks overlaps with motor-related networks but also involves additional regions, such as the Fronto-Parietal Network (FPN) and the Retrosplenial Temporal Network (ReT). This suggests that memory tasks engage broader network interactions, integrating sensorimotor and cognitive control systems to a greater extent than motor tasks alone. Moreover, these regions are involved in high-level cognitive and attentional processes, including visual processing, sustained attention, and decision-making, which are essential for recognizing faces, scenes, and words in memory tasks.

3.6 Conclusions

In this chapter, we introduced two spectral clustering based community detection methods for identifying the discriminative subspaces between two multiplex networks and one for identifying the discriminative subspaces between two temporal multiplex networks. The first method, MX-DSC, focuses only on learning the discriminative subspaces, while the second, MX-DCSC, learns the discriminative, consensus, and individual layerwise subspaces simultaneously. The evaluation of the methods on simulated data shows that the two methods perform similarly in terms of discriminative subspace detection accuracy. With MX-DCSC, one can also obtain more accurate community detection compared to existing multiplex community detection algorithms. The application of the proposed framework to real data illustrates the possible applications of the method to multivariate network data. The third method, TMX-DiSG, provides an extension to temporal multiplex networks and detects discriminative subgraphs between two groups across time. TMX-DiSG was evaluated on both synthetic temporal multiplex networks and dynamic functional connectivity networks constructed from two different task comparisons, i.e., motor- and memory-related tasks versus resting state. Our analysis demonstrated that TMX-DiSG effectively identifies task-specific discriminative subgraphs. Future work will consider the extension of these frameworks to more than two groups.

CHAPTER 4

GRAPH FILTERING FOR CLUSTERING ATTRIBUTED GRAPHS

4.1 Introduction

Many real-world systems with relational data, such as social interactions, citation and co-author relationships, and biological systems, are represented as networks [14].
An important aspect of analyzing networks is the discovery of communities [99], which allows us to identify groups of functionally related objects and the interactions between them [151, 6]. In most real-world networks, both the node attributes and graph connectivity are available. For example, known properties of proteins, users' social network profiles, or authors' publication histories may tell us which objects are similar and to which communities they may belong. Similarly, the set of edges between the nodes, such as friendship relationships between users, interactions between proteins, and collaborations between authors, can help identify groups based on connectivity.

Classical data clustering methods such as $K$-means assign class labels based only on attribute similarity [123]. On the other hand, community detection algorithms find groups of nodes that are densely connected [91, 248, 182], ignoring node attributes. Employing just one of these two sources of information can cause an algorithm to miss important structures in the data. For instance, while it would be hard to determine the community membership of a sparsely connected node by relying solely on the network's connectivity, attributes may help reveal the community affiliation. Conversely, the network structure may indicate that two nodes belong to the same community even if one of the nodes lacks attribute information. Hence, it is crucial to take both information sources into account and view network communities as clusters of closely linked nodes that also share common attributes.

4.1.1 Related Work

In recent years, several methods have been proposed for attributed graph clustering that combine node attributes and link information [151, 125, 286, 9, 277]. The first class of methods focuses on combining link and node information by formulating an objective function that integrates two types of similarity: the adjacency matrix, which captures link information, and a similarity matrix that quantifies the affinity between the attributes [166, 9, 171]. These methods are indirect in the sense that they rely on the construction of an arbitrary similarity matrix from the node attributes. The second class of methods incorporates the graph structure and node attributes simultaneously into the community detection framework, benefiting from the representation capability of graph neural networks (GNNs) [281]. The goal of these methods is to encode the nodes of the graph with neural networks and assign node labels. For example, methods such as graph autoencoders (GAE) [138], variational GAE (VGAE) [139], adversarially regularized graph autoencoder (ARGA), adversarially regularized variational graph autoencoder (ARVGA) [194], and marginalized graph autoencoder for graph clustering (MGAE) [259] have demonstrated state-of-the-art performance on attributed graph clustering. Although these methods achieve promising performance, they are not designed specifically for clustering, i.e., the network parameters are optimized to minimize the reconstruction error rather than to maximize the separation between different clusters. Moreover, in these methods, each convolutional layer is coupled with a projection layer, making it difficult to stack many layers and train a deep model. Thus, they only take into account neighbors of each node two or three hops away and hence may be inadequate for capturing the global cluster structure of large graphs.
Finally, these methods lack interpretability, as they do not explicitly show the relationship between the learned models and the structure of the graph.

In order to take advantage of graph convolutional features while addressing the shortcomings of GNN-based methods, the last class of methods relies on graph signal processing (GSP) and, in particular, spectral graph filtering [286, 152, 266, 133, 265]. Over the past decade, graph filters [120] have played a key role in different signal processing and machine learning tasks such as graph signal denoising [72, 190], smoothing [283], classification [219, 53], sampling [12], recovery [55], and graph clustering [249]. Different types of graph filter structures have been considered, including FIR graph filters [220, 225], ARMA filters [119], graph filter banks [186], and graph wavelets [109, 185].

For the task of graph node classification, spectral filtering based methods have certain advantages over spatial-based methods [29]. First, spectral filtering transforms all node features into weighted sums of different eigenvectors via the graph Fourier transform, which naturally captures global information, i.e., long-range dependencies, unlike spatial methods that emphasize local information and suffer from oversmoothing [174]. Second, spectral filtering methods provide interpretability, as the learned graph filters directly indicate the most important frequency information associated with the labels, e.g., low, medium, and high frequencies. Finally, spectral filters have been shown to obtain more cluster-friendly representations [289] and to deal with heterogeneity in the graph. Existing spectral graph filters either use pre-defined filters and learn the weights [291, 286, 267] or are based on adaptive spectral filter learning [112]. The first class of methods learns the weights of pre-determined filters, e.g., low-pass or high-pass. The second class learns the coefficients of polynomial filters with respect to different bases, e.g., ChebyNet [71], BernNet [112], ARMA [26]. However, in most cases the filters are designed empirically without the necessary constraints; as a result, these methods yield filters whose weights often have poor controllability.

Example applications of spectral graph filters include semi-supervised learning on graphs, where graph filters are used to weigh and propagate the label information of multi-hop neighbors to the unknown nodes. This problem is formulated as an optimization problem in which the filter parameters are estimated to minimize the error between the estimated and true labels on the labeled nodes, with regularization on the filter parameters or the output, e.g., smooth label variation [53, 88, 22]. These works consider only the node labels and graph connectivity and do not address the unsupervised clustering of attributed graphs. In the realm of unsupervised learning, GSP techniques have been used to address clustering and community mining problems. In [248], spectral graph wavelets are utilized to develop a fast, multiscale community mining protocol. In [249], graph-spectral filtering of random graph signals is used to construct feature vectors for each vertex so that the distances between vertices based on these feature vectors resemble those based on standard spectral clustering feature vectors. More recently, [79] uses spectral graph wavelets to learn structural embeddings that help identify vertices that have similar structural roles in the network.
In all of these cases, the problem of community detection is addressed only for graphs without attributes or for regular data clustering without graph structure; moreover, the graph filter parameters are fixed. More recently, adaptive graph convolution (AGC) [286] was proposed for attributed graph clustering. Instead of stacking layers as in GCN, a $K$-order graph convolution that acts as a low-pass graph filter on the node features is used to obtain smooth feature representations, followed by spectral clustering on the learned features. In [133], this approach is further refined by learning the best similarity graph from the filtered features rather than constructing a graph using pre-determined similarity metrics. While these methods are intuitive and provide some interpretability for the node features, the filters are always low-pass, ignoring the useful information in higher frequency bands [265], and they do not support learning arbitrary interpretable spectral filters optimized for the particular data. Moreover, the two steps of the algorithm, i.e., filtering and clustering, are completely decoupled from each other; thus, there is no guarantee that the extracted features are optimal for clustering.

In this chapter, we address the shortcomings of the existing methods by proposing two graph filtering based methods for community detection in attributed networks: Graph Filtering for Clustering Attributed Graphs (GraFiCA) and Multi-Scale Graph Wavelets for Clustering (MSGWC). A cost function quantifying the separability of the filtered attributes is proposed. For GraFiCA, a general framework for learning the parameters of both Finite Impulse Response (FIR) and Autoregressive Moving Average (ARMA) filters is introduced. FIR filters are the most general form of graph convolutional filters implementing a polynomial frequency response. Their descriptive power increases as the filter order $T$ grows; however, using higher orders implies handling higher matrix powers, which introduces numerical instabilities and in turn leads to poor performance. A more versatile class of filters is the family of ARMA filters [184], which offer a larger variety of frequency responses and can account for higher-order neighborhoods compared to polynomial filters with the same number of parameters. For MSGWC, we learn the optimal combination of the multi-scale features from graph spectral wavelet and scaling filters by minimizing the proposed cost function that quantifies the separability between clusters. Both methods are formulated as two-step alternating minimization problems, where the first step learns the optimal graph partitioning for the given node attributes and the second step learns the optimal graph filter parameters for FIR and ARMA filters (GraFiCA) or the optimal combination of graph filters at multiple scales (MSGWC). The main contributions of the proposed framework are as follows:

• GraFiCA is the first method that addresses the problem of parametric graph filter design, in the form of both FIR and IIR filters, for the purpose of attributed graph clustering. While there has been prior work on graph filter design for denoising or graph signal recovery [214], GraFiCA is the first unsupervised approach that learns the parameters of the filters for the purpose of clustering.
• Most of the methods based on spectral graph wavelets [248, 268] focus only on the output of the wavelet filter across scales, without considering the scaling filter, which captures the local neighborhood information. MSGWC learns the optimal combination of the multi-scale features from both graph spectral wavelet and scaling filters.

• The filters learned by the proposed approaches are not limited to low-pass filters, as the structure of the filter is determined directly by the data. The filters take into account the useful information in the middle and high frequency bands, i.e., higher-order neighborhoods, providing interpretability to the learned filters.

• The proposed cost function quantifies the discriminability between different classes, unlike GCN-based approaches where the loss function is usually the reconstruction error. Thus, the filter coefficients are updated at each step of the algorithm to ensure that the smoothed node attributes are representative of the node assignments.

4.2 Background

4.2.1 Graph Filtering

In graph signal processing, two fundamental filter types are considered [120]: Finite Impulse Response (FIR) and Autoregressive Moving Average (ARMA) [119] graph filters. The FIR polynomial graph filter is described by the linear operator

$$\mathcal{H}(\mathbf{L}) = \sum_{t=0}^{T-1} h_t\mathbf{L}^t = \mathbf{U}\Big(\sum_{t=0}^{T-1} h_t\mathbf{\Lambda}^t\Big)\mathbf{U}^\top,$$

where $T$ is the filter order and the $h_t$'s are the coefficients. On the other hand, an ARMA filter is defined as

$$\mathcal{H}(\mathbf{L}) = \mathbf{U}\,\frac{\sum_{t=0}^{T-1} a_t\mathbf{\Lambda}^t}{\mathbf{I} + \sum_{q=1}^{Q-1} b_q\mathbf{\Lambda}^q}\,\mathbf{U}^\top,$$

where $(T, Q)$ are the filter orders and the $a_t$'s and $b_q$'s are the filter coefficients. Note that the FIR polynomial filter is a special case of the ARMA filter with $Q = 1$.

Signals defined on the nodes of an attributed graph can be represented as a matrix $\mathbf{F} \in \mathbb{R}^{N \times P}$, where $P$ is the number of attributes per node. The filtered graph signal $\tilde{\mathbf{F}}$ is obtained as $\tilde{\mathbf{F}} = \mathcal{H}(\mathbf{L})\mathbf{F} = \mathbf{U}\mathcal{H}(\mathbf{\Lambda})\mathbf{U}^\top\mathbf{F}$, where $\mathcal{H}(\mathbf{\Lambda}) = \operatorname{diag}(\mathcal{H}(\lambda_1), \ldots, \mathcal{H}(\lambda_N))$ is the frequency response of the graph filter.

4.2.2 Spectral Graph Wavelet Transform

The Spectral Graph Wavelet Transform (SGWT) is a continuous multi-scale transform on graphs, enabling localization of graph signals in the vertex and spectral domains simultaneously [109]. The wavelet $\psi_{s,a}$ at scale $s$ centered at node $a$ is generated by stretching a band-pass filter kernel $g(\cdot)$ with a parameter $s > 0$. The frequency response of the wavelet at scale $s$ can be written as $\mathbf{G}_s(\mathbf{\Lambda}) = \operatorname{diag}(g(s\lambda_1), g(s\lambda_2), \ldots, g(s\lambda_N))$. The wavelet basis at scale $s$ is then given by $\mathbf{\Psi}_s = \{\psi_{s,1}|\psi_{s,2}|\ldots|\psi_{s,N}\} = \mathbf{U}\mathbf{G}_s(\mathbf{\Lambda})\mathbf{U}^\top$, and the wavelet coefficients at scale $s$ for a graph signal $\mathbf{F}$ are defined as $\mathbf{\Psi}_s^\top\mathbf{F}$. By this definition, a wavelet centered at node $a$ corresponds to a signal on the graph diffused away from that node, remaining highly localized in the vertex domain. At small scales, the kernel is stretched out and lets through high-frequency modes; the corresponding wavelet extends only to the close neighborhood of the node in the graph. At large scales, the filter function is localized around low-frequency modes, and the corresponding wavelet spans a larger neighborhood.

Figure 4.1: Framework of the proposed method.

Similarly, the scaling basis can be generated using a low-pass scaling filter kernel $h(\cdot)$, designed to smoothly represent the low-frequency content of the graph signal, as $\mathbf{\Phi}_s = \mathbf{U}\operatorname{diag}(h(s\lambda_1), h(s\lambda_2), \ldots, h(s\lambda_N))\mathbf{U}^\top = \mathbf{U}\mathbf{H}_s(\mathbf{\Lambda})\mathbf{U}^\top$, with corresponding scaling coefficients $\mathbf{\Phi}_s^\top\mathbf{F}$. The wavelet and scaling function kernels, $g(\cdot)$ and $h(\cdot)$, used in this work are the band-pass and low-pass filter kernels defined in [109, 248], along with their associated parameters.
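As a concrete example of the FIR case, the filter can be applied to the attribute matrix without any eigendecomposition by accumulating the shifted signals $\mathbf{L}_n^t\mathbf{F}$ iteratively. This dense NumPy sketch (hypothetical function name) assumes a symmetric adjacency matrix.

import numpy as np

def fir_filter_attributes(A, F, h):
    # Apply H(L) = sum_t h_t L_n^t to the attributes F by accumulating the
    # shifted signals S(t) = L_n^t F, never forming matrix powers explicitly.
    d = A.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(A.shape[0]) - d_is[:, None] * A * d_is[None, :]  # normalized Laplacian
    S = F.copy()                          # S(0) = F
    out = h[0] * S
    for h_t in h[1:]:
        S = L @ S                         # S(t) = L_n S(t-1)
        out += h_t * S
    return out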
4.3 Graph Filtering for Clustering Attributed Graphs (GraFiCA)

In this work, we propose to learn the optimal graph filter such that the within-cluster association of the filtered attributes, i.e., the total within-cluster dissimilarity, is minimized while the between-cluster distance of the filtered attributes is maximized. Using an alternating minimization approach, in the first step, given the node attributes, we find the best cluster assignment that minimizes the clustering cost function. In the second step, the graph partition is fixed and the cost function is optimized with respect to the filter coefficients. In this manner, the resulting graph filters are optimized for the clustering task (see Figure 4.1 for an overview of the method). This yields an attributed graph clustering method that takes into account both the topology and the node attributes. In this section, we first introduce the general problem formulation and the corresponding optimization problem; we then present solutions for the optimal filter design for two filter types, FIR and ARMA.

4.3.1 Problem Formulation

Given a graph $\mathcal{G}$ with normalized adjacency matrix $\mathbf{A}_n \in \mathbb{R}^{N \times N}$ and graph signal $\mathbf{F} \in \mathbb{R}^{N \times P}$, the goal is to find the best partition, i.e., $K$ non-overlapping clusters $\mathcal{C} = \{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_K\}$, and the optimal graph filter $\mathcal{H}(\mathbf{\Lambda}; \beta)$ with parameters $\beta$. We quantify the quality of the clustering based on the filtered graph attributes, $\tilde{\mathbf{F}} = \mathbf{U}\mathcal{H}(\mathbf{\Lambda})\mathbf{U}^\top\mathbf{F}$, as follows:

$$\mathcal{L}(\mathcal{C}, \mathcal{H}(\mathbf{\Lambda}; \beta)) = \sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i,j \in \mathcal{C}_k}\|\tilde{F}_{i\cdot} - \tilde{F}_{j\cdot}\|^2 - \gamma\sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i \in \mathcal{C}_k,\, j \notin \mathcal{C}_k}\|\tilde{F}_{i\cdot} - \tilde{F}_{j\cdot}\|^2, \qquad (4.1)$$

where the first and second terms quantify the dissimilarity of the filtered node attributes within and between clusters, respectively. $\mathcal{D}(\mathcal{C}_k)$ is the total dissimilarity of the nodes in $\mathcal{C}_k$ with respect to the whole graph, quantified by $\mathcal{D}(\mathcal{C}_k) = \sum_{i \in \mathcal{C}_k, j \in V}\|\tilde{F}_{i\cdot} - \tilde{F}_{j\cdot}\|^2$. Thus, we want to minimize the ratio of the dissimilarity attributed to within-cluster connections to the total dissimilarity across the whole graph, i.e., the association, while maximizing the separation between clusters, i.e., the cut. Defining the dissimilarity matrix based on $\tilde{\mathbf{F}}$ as $\tilde{W}_{ij} = \|\tilde{F}_{i\cdot} - \tilde{F}_{j\cdot}\|^2$, (4.1) can be rewritten as

$$\mathcal{L}(\mathcal{C}, \mathcal{H}(\mathbf{\Lambda}; \beta)) = \sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i,j \in \mathcal{C}_k}\tilde{W}_{ij} - \gamma\sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i \in \mathcal{C}_k,\, j \notin \mathcal{C}_k}\tilde{W}_{ij}. \qquad (4.2)$$

Our goal is to minimize this cost function with respect to both the graph partition $\mathcal{C}$ and the graph filter parameters $\beta$. The corresponding optimization problem can be formulated as

$$\underset{\mathcal{C},\, \beta}{\text{minimize}} \ \mathcal{L}(\mathcal{C}, \mathcal{H}(\mathbf{\Lambda}; \beta)) + \alpha\mathcal{R}(\mathcal{C}), \qquad (4.3)$$

where the regularization term $\mathcal{R}(\mathcal{C})$ will be specified to put additional constraints on the partition such that the connectivity information is also taken into account. We propose a two-step alternating minimization approach to solve this problem, where at each iteration $l$ the first step learns the optimal graph partitioning for the given graph signal and the second step learns the optimal graph filter parameters:

$$\begin{aligned}
\mathcal{C}^{(l+1)} &:= \underset{\mathcal{C}}{\operatorname{argmin}} \ \mathcal{L}(\mathcal{C}, \mathcal{H}(\mathbf{\Lambda}; \beta^{(l)})) + \alpha\mathcal{R}(\mathcal{C}), \\
\beta^{(l+1)} &:= \underset{\beta}{\operatorname{argmin}} \ \mathcal{L}(\mathcal{C}^{(l+1)}, \mathcal{H}(\mathbf{\Lambda}; \beta)).
\end{aligned} \qquad (4.4)$$
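A direct sketch of evaluating the cost in Eq. (4.1) for a given partition and filtered attributes (function name ours); it uses the identity that the between-cluster dissimilarity equals $\mathcal{D}(\mathcal{C}_k)$ minus the within-cluster dissimilarity.

import numpy as np

def clustering_cost(F_tilde, labels, gamma=0.5):
    # Eq. (4.1): normalized within-cluster dissimilarity minus gamma times
    # normalized between-cluster dissimilarity of the filtered attributes.
    sq = (F_tilde ** 2).sum(axis=1)
    W = np.maximum(sq[:, None] + sq[None, :] - 2 * F_tilde @ F_tilde.T, 0.0)
    cost = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        D_c = max(W[in_c].sum(), 1e-12)   # D(C_k): dissimilarity of C_k vs. all of V
        within = W[np.ix_(in_c, in_c)].sum()
        cost += within / D_c - gamma * (D_c - within) / D_c
    return cost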
4.3.2 C update: Clustering

For the clustering task, given the filtered attributes $\tilde{\mathbf{F}}$, we aim to find the graph partition $\mathcal{C}$. Fixing $\mathcal{H}(\mathbf{\Lambda}; \beta)$, the cost function in (4.2) can be rewritten as

$$\mathcal{L}(\mathcal{C}, \mathcal{H}(\mathbf{\Lambda}; \beta)) = \sum_{k=1}^{K}\frac{\operatorname{diss}(\mathcal{C}_k, \mathcal{C}_k)}{\mathcal{D}(\mathcal{C}_k)} - \gamma\sum_{k=1}^{K}\frac{\operatorname{diss}(\mathcal{C}_k, V \setminus \mathcal{C}_k)}{\mathcal{D}(\mathcal{C}_k)}, \qquad (4.5)$$

where $\operatorname{diss}(\mathcal{C}_k, \mathcal{C}_k) = \sum_{i,j \in \mathcal{C}_k}\tilde{W}_{ij}$ is the total dissimilarity within a cluster and $\operatorname{diss}(\mathcal{C}_k, V \setminus \mathcal{C}_k) = \sum_{i \in \mathcal{C}_k, j \notin \mathcal{C}_k}\tilde{W}_{ij}$ is the total dissimilarity of $\mathcal{C}_k$ with respect to the rest of the graph. Since it can be shown that $\operatorname{diss}(\mathcal{C}_k, V \setminus \mathcal{C}_k) = \mathcal{D}(\mathcal{C}_k) - \operatorname{diss}(\mathcal{C}_k, \mathcal{C}_k)$, we can rewrite (4.5) in terms of the within-cluster dissimilarity as

$$\mathcal{L}(\mathcal{C}, \mathcal{H}(\mathbf{\Lambda}; \beta)) = (1 + \gamma)\sum_{k=1}^{K}\frac{\operatorname{diss}(\mathcal{C}_k, \mathcal{C}_k)}{\mathcal{D}(\mathcal{C}_k)} - \gamma K. \qquad (4.6)$$

This problem can be rewritten as a trace optimization problem; minimizing the cost function in Eq. (4.6) is equivalent to minimizing

$$\mathcal{L}(\bar{\mathbf{Z}}, \mathcal{H}(\mathbf{\Lambda}; \beta)) = \operatorname{tr}\big(\bar{\mathbf{Z}}^\top\mathbf{D}^{-1/2}\tilde{\mathbf{W}}\mathbf{D}^{-1/2}\bar{\mathbf{Z}}\big), \qquad (4.7)$$

subject to $\bar{\mathbf{Z}}^\top\bar{\mathbf{Z}} = \mathbf{I}$.

Regularization $\mathcal{R}$: In order to incorporate the connectivity information into the clustering problem, we propose the regularization term $\mathcal{R}(\bar{\mathbf{Z}}) = -\operatorname{tr}(\bar{\mathbf{Z}}^\top\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}\bar{\mathbf{Z}})$, such that minimizing the regularization term is equivalent to maximizing NAssoc in Eq. (1.9). Thus, the $\mathcal{C}$ update in Eq. (4.4) can be equivalently expressed by the following $\bar{\mathbf{Z}}$ update:

$$\begin{aligned}
\bar{\mathbf{Z}}^{(l+1)} &:= \underset{\bar{\mathbf{Z}},\; \bar{\mathbf{Z}}^\top\bar{\mathbf{Z}} = \mathbf{I}}{\operatorname{argmin}} \ \operatorname{tr}\big(\bar{\mathbf{Z}}^\top\mathbf{D}^{-1/2}\tilde{\mathbf{W}}\mathbf{D}^{-1/2}\bar{\mathbf{Z}}\big) - \alpha\operatorname{tr}\big(\bar{\mathbf{Z}}^\top\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}\bar{\mathbf{Z}}\big) \\
&:= \underset{\bar{\mathbf{Z}},\; \bar{\mathbf{Z}}^\top\bar{\mathbf{Z}} = \mathbf{I}}{\operatorname{argmin}} \ \operatorname{tr}\big(\bar{\mathbf{Z}}^\top(\tilde{\mathbf{W}}_n - \alpha\mathbf{A}_n)\bar{\mathbf{Z}}\big).
\end{aligned} \qquad (4.8)$$

The first term in (4.8) corresponds to minimizing the normalized dissimilarity matrix constructed from the graph attributes, while the second term corresponds to the normalized association with respect to graph connectivity. While the two terms are mathematically similar, they correspond to two different sources of information: node attributes and graph topology, respectively. The optimal solution to this problem is the set of eigenvectors corresponding to the $K$ smallest eigenvalues of $\tilde{\mathbf{W}}_n - \alpha\mathbf{A}_n$. The graph partition $\mathcal{C}^{(l+1)}$ is then updated by applying $k$-means to the rows of $\bar{\mathbf{Z}}$.
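A sketch of the resulting C update (Eq. (4.8)); normalizing each matrix by its own degree or total-dissimilarity vector is our reading of the normalization, and the small clipping constants are numerical safeguards of our own.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def clustering_step(F_tilde, A, K, alpha=0.5):
    # Eq. (4.8): embed with the K smallest eigenvectors of W~_n - alpha A_n
    # and apply k-means to the rows of the embedding.
    sq = (F_tilde ** 2).sum(axis=1)
    W = np.maximum(sq[:, None] + sq[None, :] - 2 * F_tilde @ F_tilde.T, 0.0)
    dW = np.maximum(W.sum(axis=1), 1e-12)
    Wn = W / np.sqrt(np.outer(dW, dW))    # D^{-1/2} W~ D^{-1/2}
    dA = np.maximum(A.sum(axis=1), 1e-12)
    An = A / np.sqrt(np.outer(dA, dA))
    _, Zbar = eigh(Wn - alpha * An, subset_by_index=[0, K - 1])
    return KMeans(n_clusters=K, n_init=10).fit_predict(Zbar)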
4.3.3 $\beta$ Update: Optimal Filter Design

Once we have the cluster assignments, we want to determine the coefficients of the optimal filter $\mathcal{H}(\mathbf{\Lambda}; \beta)$. We present derivations for both FIR and IIR filter types, with $\beta = \mathbf{h}$ for the FIR filter and $\beta = \{\mathbf{a}, \mathbf{b}\}$ for the ARMA filter.

FIR Filter: In this section, we present the optimization problem for learning the parameters of the optimal polynomial filter with a given filter order $T$, $\mathcal{H}(\mathbf{\Lambda}) = \sum_{t=0}^{T-1} h_t\mathbf{\Lambda}^t$, for the clustering task. Following the definitions in [220], we define the $t$-th shifted input signal $\mathbf{S}^{(t)} \in \mathbb{R}^{N \times P}$ as $\mathbf{S}^{(t)} := \mathbf{U}\mathbf{\Lambda}^t\mathbf{U}^\top\mathbf{F} = \mathbf{L}_n^t\mathbf{F}$, and $\mathbf{S}_{(i)}$ as the $T \times P$ matrix corresponding to the $i$-th node, where each row is the $t$-th shifted input signal at that node, $[\mathbf{S}_{(i)}]_t := [\mathbf{S}^{(t)}]_{(i)}$. With $\tilde{\mathbf{F}}$ denoting the output of a graph filter for the input signal $\mathbf{F}$, it follows that

$$\tilde{\mathbf{F}} = \mathbf{U}\Big(\sum_{t=0}^{T-1} h_t\mathbf{\Lambda}^t\Big)\mathbf{U}^\top\mathbf{F} = \sum_{t=0}^{T-1} h_t\mathbf{S}^{(t)}. \qquad (4.9)$$

Hence, the filtered graph signal corresponding to the $i$-th node can be computed as $\tilde{\mathbf{F}}_i = \sum_{t=0}^{T-1} h_t[\mathbf{S}^{(t)}]_i = \mathbf{h}^\top\mathbf{S}_{(i)}$, with $\mathbf{h} = [h_0, h_1, \cdots, h_{T-1}]$. The cost function in (4.1) can then be rewritten in terms of the filter coefficient vector $\mathbf{h}$ as

$$\begin{aligned}
\mathcal{L}(\mathcal{C}, \mathbf{h}) &= \sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i,j \in \mathcal{C}_k}\|\mathbf{h}^\top\mathbf{S}_{(i)} - \mathbf{h}^\top\mathbf{S}_{(j)}\|^2 - \gamma\sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i \in \mathcal{C}_k,\, j \notin \mathcal{C}_k}\|\mathbf{h}^\top\mathbf{S}_{(i)} - \mathbf{h}^\top\mathbf{S}_{(j)}\|^2 \\
&= \sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i,j \in \mathcal{C}_k}\mathbf{h}^\top(\mathbf{S}_{(i)} - \mathbf{S}_{(j)})(\mathbf{S}_{(i)} - \mathbf{S}_{(j)})^\top\mathbf{h} - \gamma\sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i \in \mathcal{C}_k,\, j \notin \mathcal{C}_k}\mathbf{h}^\top(\mathbf{S}_{(i)} - \mathbf{S}_{(j)})(\mathbf{S}_{(i)} - \mathbf{S}_{(j)})^\top\mathbf{h}.
\end{aligned} \qquad (4.10)$$

Eq. (4.10) can be rewritten as

$$\mathcal{L}(\mathcal{C}, \mathbf{h}) = \mathbf{h}^\top(\mathbf{B} - \gamma\mathbf{C})\mathbf{h}, \qquad (4.11)$$

where $\mathbf{B}$ and $\mathbf{C}$ are the $T \times T$ matrices

$$\mathbf{B} = \sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i,j \in \mathcal{C}_k}(\mathbf{S}_{(i)} - \mathbf{S}_{(j)})(\mathbf{S}_{(i)} - \mathbf{S}_{(j)})^\top, \qquad (4.12)$$

$$\mathbf{C} = \sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i \in \mathcal{C}_k,\, j \notin \mathcal{C}_k}(\mathbf{S}_{(i)} - \mathbf{S}_{(j)})(\mathbf{S}_{(i)} - \mathbf{S}_{(j)})^\top. \qquad (4.13)$$

Our optimization problem becomes

$$\underset{\mathbf{h},\; \mathbf{h}^\top\mathbf{h} = 1}{\text{minimize}} \ \mathbf{h}^\top(\mathbf{B} - \gamma\mathbf{C})\mathbf{h}. \qquad (4.14)$$

The solution is the eigenvector of $\mathbf{B} - \gamma\mathbf{C}$ corresponding to the smallest eigenvalue. Once $\mathbf{h}$ is obtained, the filtered signal is updated as $\tilde{\mathbf{F}} = \mathbf{U}\sum_{t=0}^{T-1} h_t\mathbf{\Lambda}^t\mathbf{U}^\top\mathbf{F}$.

ARMA Filter: We now present the optimization problem for learning the parameters of an ARMA filter with the graph frequency response

$$\mathcal{H}(\mathbf{\Lambda}) = \frac{\sum_{t=0}^{T-1} a_t\mathbf{\Lambda}^t}{\mathbf{I} + \sum_{q=1}^{Q-1} b_q\mathbf{\Lambda}^q}, \qquad (4.15)$$

where $\mathbf{a} = [a_0, a_1, \ldots, a_{T-1}]$ and $\mathbf{b} = [b_1, b_2, \ldots, b_{Q-1}]$ are the filter coefficients and $(T, Q)$ is the pair of filter orders. In order to find the filter parameters $\mathbf{a}$ and $\mathbf{b}$, we introduce an auxiliary polynomial $\Gamma_M(\lambda) = \sum_{m=0}^{M-1} c_m\lambda^m$. $\Gamma_M(\lambda)$ can be viewed as the reciprocal polynomial of $(1 + \sum_{q=1}^{Q-1} b_q\lambda^q)$. In general, the reciprocal polynomial has a larger order than the denominator polynomial, i.e., $M \geq Q$ [269]. Letting $\sum_{m=0}^{M-1} c_m\mathbf{\Lambda}^m = \big(\mathbf{I} + \sum_{q=1}^{Q-1} b_q\mathbf{\Lambda}^q\big)^{-1}$, Eq. (4.15) becomes

$$\mathcal{H}(\mathbf{\Lambda}) = \sum_{m=0}^{M-1} c_m\mathbf{\Lambda}^m\sum_{t=0}^{T-1} a_t\mathbf{\Lambda}^t. \qquad (4.16)$$

To solve for the filter parameters, we introduce the constraint $(\sum_{m=0}^{M-1} c_m\mathbf{\Lambda}^m)(\mathbf{I} + \sum_{q=1}^{Q-1} b_q\mathbf{\Lambda}^q) = \mathbf{I}$. We can rewrite the polynomials as $\sum_{m=0}^{M-1} c_m\mathbf{\Lambda}^m = \operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})$ and $\sum_{q=1}^{Q-1} b_q\mathbf{\Lambda}^q = \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b})$, where $\mathbf{\Psi}_M$ and $\bar{\mathbf{\Psi}}_Q$ are the Vandermonde matrices

$$\mathbf{\Psi}_M = \begin{bmatrix} 1 & \lambda_1 & \lambda_1^2 & \cdots & \lambda_1^{M-1} \\ 1 & \lambda_2 & \lambda_2^2 & \cdots & \lambda_2^{M-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \lambda_N & \lambda_N^2 & \cdots & \lambda_N^{M-1} \end{bmatrix}, \qquad \bar{\mathbf{\Psi}}_Q = \begin{bmatrix} \lambda_1 & \lambda_1^2 & \cdots & \lambda_1^{Q-1} \\ \lambda_2 & \lambda_2^2 & \cdots & \lambda_2^{Q-1} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_N & \lambda_N^2 & \cdots & \lambda_N^{Q-1} \end{bmatrix}.$$

Using these definitions, we can rewrite $(\sum_{m=0}^{M-1} c_m\mathbf{\Lambda}^m)(\mathbf{I} + \sum_{q=1}^{Q-1} b_q\mathbf{\Lambda}^q) = \mathbf{I}$ as $\operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})(\mathbf{I} + \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b})) = \mathbf{I}$, and the optimization problem becomes

$$\underset{\mathbf{a}, \mathbf{b}, \mathbf{c}}{\text{minimize}} \ \mathcal{L}(\mathcal{C}, \mathcal{H}(\mathbf{\Lambda}; [\mathbf{a}, \mathbf{c}])) + \|\operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})[\mathbf{I} + \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b})] - \mathbf{I}\|_F^2. \qquad (4.17)$$

In order to find the parameters $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$, we propose an alternating minimization approach in which, to learn each variable, we fix the other two. Thus, the $\beta$ update in (4.4) becomes

$$\begin{aligned}
\mathbf{a}^{(l+1)} &:= \underset{\mathbf{a}}{\operatorname{argmin}} \ \mathcal{L}(\mathcal{C}, \mathcal{H}(\mathbf{\Lambda}; [\mathbf{a}, \mathbf{c}^{(l)}])), \\
\mathbf{b}^{(l+1)} &:= \underset{\mathbf{b}}{\operatorname{argmin}} \ \|\operatorname{diag}(\mathbf{\Psi}_M\mathbf{c}^{(l)})[\mathbf{I} + \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b})] - \mathbf{I}\|_F^2, \\
\mathbf{c}^{(l+1)} &:= \underset{\mathbf{c}}{\operatorname{argmin}} \ \mathcal{L}(\mathcal{C}, \mathcal{H}(\mathbf{\Lambda}; [\mathbf{a}^{(l+1)}, \mathbf{c}])) + \|\operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})[\mathbf{I} + \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b}^{(l+1)})] - \mathbf{I}\|_F^2.
\end{aligned} \qquad (4.18)$$
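A sketch of the FIR filter-design step ((4.12)-(4.14)); the O(N^2) double loop is written for clarity rather than speed, and the per-cluster totals D(C_k) are assumed to be precomputed from the current filtered attributes.

import numpy as np
from scipy.linalg import eigh

def fir_design_step(S_shift, labels, D_clu, gamma=0.5):
    # Build B and C (Eqs. (4.12)-(4.13)) from the shifted signals and take
    # the eigenvector of B - gamma*C with the smallest eigenvalue as h.
    # S_shift: (T, N, P) array with S_shift[t] = L_n^t F
    # D_clu:   dict mapping cluster label -> D(C_k) under the current filter
    T = S_shift.shape[0]
    S = S_shift.transpose(1, 0, 2)        # S_(i): one (T, P) block per node
    B = np.zeros((T, T))
    C = np.zeros((T, T))
    for c in np.unique(labels):
        inside = np.where(labels == c)[0]
        outside = np.where(labels != c)[0]
        for i in inside:                  # O(N^2) pairwise loop, for clarity
            for j in inside:
                diff = S[i] - S[j]
                B += diff @ diff.T / D_clu[c]
            for j in outside:
                diff = S[i] - S[j]
                C += diff @ diff.T / D_clu[c]
    _, V = eigh(B - gamma * C)
    return V[:, 0]                        # eigenvector of the smallest eigenvalue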
Update $\mathbf{a}$: In order to update $\mathbf{a}$, we define the $t$-th shifted input signal as $\mathbf{S}^{(t)} := \mathbf{U}(\sum_{m=0}^{M-1} c_m\mathbf{\Lambda}^m)\mathbf{\Lambda}^t\mathbf{U}^\top\mathbf{F}$, which can be rewritten as $\mathbf{S}^{(t)} := \sum_{m=0}^{M-1} c_m\mathbf{L}^{m+t}\mathbf{F}$, and $\mathbf{S}_{(i)}$ as the $T \times P$ matrix corresponding to the $i$-th node, where each row is the $t$-th shifted input signal at that node, $[\mathbf{S}^{(t)}]_{(i)}$. Therefore, $\tilde{\mathbf{F}}_i = \sum_{t=0}^{T-1} a_t[\mathbf{S}^{(t)}]_i = \mathbf{a}^\top\mathbf{S}_{(i)}$. The cost function in (4.1) can be rewritten in terms of the filter coefficients $\mathbf{a}$ and the newly defined $\mathbf{S}_{(i)}$ as

$$\begin{aligned}
\mathcal{L}(\mathcal{C}, \mathbf{a}, \mathbf{c}) &= \sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i,j \in \mathcal{C}_k}\|\mathbf{a}^\top\mathbf{S}_{(i)} - \mathbf{a}^\top\mathbf{S}_{(j)}\|^2 - \gamma\sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i \in \mathcal{C}_k,\, j \notin \mathcal{C}_k}\|\mathbf{a}^\top\mathbf{S}_{(i)} - \mathbf{a}^\top\mathbf{S}_{(j)}\|^2 \\
&= \sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i,j \in \mathcal{C}_k}\mathbf{a}^\top(\mathbf{S}_{(i)} - \mathbf{S}_{(j)})(\mathbf{S}_{(i)} - \mathbf{S}_{(j)})^\top\mathbf{a} - \gamma\sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i \in \mathcal{C}_k,\, j \notin \mathcal{C}_k}\mathbf{a}^\top(\mathbf{S}_{(i)} - \mathbf{S}_{(j)})(\mathbf{S}_{(i)} - \mathbf{S}_{(j)})^\top\mathbf{a}.
\end{aligned} \qquad (4.19)$$

Eq. (4.19) can be rewritten as

$$\mathcal{L}(\mathcal{C}, \mathbf{a}, \mathbf{c}) = \mathbf{a}^\top(\mathbf{B} - \gamma\mathbf{C})\mathbf{a}, \qquad (4.20)$$

where $\mathbf{B}$ and $\mathbf{C}$ are $T \times T$ matrices defined as in Eqs. (4.12) and (4.13), using the newly defined $\mathbf{S}_{(i)}$, which depends on both $\mathcal{C}$ and $\mathbf{c}$. Our optimization problem becomes

$$\underset{\mathbf{a},\; \mathbf{a}^\top\mathbf{a} = 1}{\text{minimize}} \ \mathbf{a}^\top(\mathbf{B} - \gamma\mathbf{C})\mathbf{a}, \qquad (4.21)$$

whose solution is the eigenvector of $\mathbf{B} - \gamma\mathbf{C}$ corresponding to the smallest eigenvalue.

Update $\mathbf{b}$: Vectorizing each term in (4.18) for the $\mathbf{b}$ update, we have the equivalent optimization problem

$$\underset{\mathbf{b}}{\text{minimize}} \ \|\mathbf{\Psi}_M\mathbf{c} + \operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})\bar{\mathbf{\Psi}}_Q\mathbf{b} - \mathbf{1}_N\|^2, \qquad (4.22)$$

where $\mathbf{1}_N \in \mathbb{R}^N$ is the all-ones vector. To find $\mathbf{b} = [b_1, b_2, \cdots, b_{Q-1}]$, the objective function $\mathcal{L}_1(\mathbf{b}) = \|\mathbf{\Psi}_M\mathbf{c} + \operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})\bar{\mathbf{\Psi}}_Q\mathbf{b} - \mathbf{1}_N\|^2$ is minimized with respect to $\mathbf{b}$ by setting $\nabla_{\mathbf{b}}\mathcal{L}_1 = 2\bar{\mathbf{\Psi}}_Q^\top\operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})[\mathbf{\Psi}_M\mathbf{c} + \operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})\bar{\mathbf{\Psi}}_Q\mathbf{b}] - 2\bar{\mathbf{\Psi}}_Q^\top\operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})\mathbf{1}_N = \mathbf{0}$. After some algebraic manipulation, $\mathbf{b}$ can be updated as

$$\mathbf{b} = \mathbf{Y}_1^{-1}\mathbf{v}_1, \qquad (4.23)$$

where $\mathbf{Y}_1 \in \mathbb{R}^{(Q-1) \times (Q-1)}$ and $\mathbf{v}_1 \in \mathbb{R}^{Q-1}$ are defined as $\mathbf{Y}_1 = \bar{\mathbf{\Psi}}_Q^\top\operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})\operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})\bar{\mathbf{\Psi}}_Q$ and $\mathbf{v}_1 = \bar{\mathbf{\Psi}}_Q^\top\operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})[\mathbf{1}_N - \mathbf{\Psi}_M\mathbf{c}]$, respectively. $\mathbf{Y}_1$ is full rank, and therefore invertible, if and only if the Vandermonde matrix $\bar{\mathbf{\Psi}}_Q$ has full rank. Since $Q - 1 \ll N$, $\bar{\mathbf{\Psi}}_Q$ is full rank if there are at least $Q - 1$ distinct eigenvalues of the normalized graph Laplacian. The full explanation and proof of this statement are provided in Appendix C.

Update $\mathbf{c}$: We define the $m$-th shifted input signal as $\mathbf{S}^{(m)} := \mathbf{U}\mathbf{\Lambda}^m(\sum_{t=0}^{T-1} a_t\mathbf{\Lambda}^t)\mathbf{U}^\top\mathbf{F}$ and $\tilde{\mathbf{F}}_i = \sum_{m=0}^{M-1} c_m[\mathbf{S}^{(m)}]_i = \mathbf{c}^\top\mathbf{S}_{(i)}$, with $\mathbf{S}_{(i)}$ being the $M \times P$ matrix corresponding to the $i$-th node, where each row is the $m$-th shifted input signal at that node, $[\mathbf{S}^{(m)}]_{(i)}$. The $\mathbf{c}$ update in (4.18) can be rewritten as

$$\underset{\mathbf{c}}{\text{minimize}} \ \mathbf{c}^\top(\mathbf{B} - \gamma\mathbf{C})\mathbf{c} + \|\mathbf{\Psi}_M\mathbf{c} + \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b})\mathbf{\Psi}_M\mathbf{c} - \mathbf{1}_N\|^2. \qquad (4.24)$$

To find $\mathbf{c} = [c_0, c_1, \cdots, c_{M-1}]$, the objective function $\mathcal{L}_2(\mathbf{c}) = \mathbf{c}^\top(\mathbf{B} - \gamma\mathbf{C})\mathbf{c} + \|\mathbf{\Psi}_M\mathbf{c} + \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b})\mathbf{\Psi}_M\mathbf{c} - \mathbf{1}_N\|^2$ is minimized with respect to $\mathbf{c}$ by setting $\nabla_{\mathbf{c}}\mathcal{L}_2 = 2(\mathbf{B} - \gamma\mathbf{C})\mathbf{c} + 2\mathbf{\Psi}_M^\top(\mathbf{I} + \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b}))^\top(\mathbf{I} + \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b}))\mathbf{\Psi}_M\mathbf{c} - 2\mathbf{\Psi}_M^\top(\mathbf{I} + \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b})^\top)\mathbf{1}_N = \mathbf{0}$. After some algebraic manipulation, $\mathbf{c}$ can be updated as

$$\mathbf{c} = \mathbf{Y}_2^{-1}\mathbf{v}_2, \qquad (4.25)$$

where $\mathbf{Y}_2 \in \mathbb{R}^{M \times M}$ and $\mathbf{v}_2 \in \mathbb{R}^M$ are defined as $\mathbf{Y}_2 = (\mathbf{B} - \gamma\mathbf{C}) + \mathbf{\Psi}_M^\top(\mathbf{I} + \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b}))^\top(\mathbf{I} + \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b}))\mathbf{\Psi}_M$ and $\mathbf{v}_2 = \mathbf{\Psi}_M^\top(\mathbf{I} + \operatorname{diag}(\bar{\mathbf{\Psi}}_Q\mathbf{b})^\top)\mathbf{1}_N$, respectively. $\mathbf{Y}_2$ is full rank, and therefore invertible, if there are at least $M$ ($M \ll N$) distinct eigenvalues of the normalized graph Laplacian.

We solve iteratively for the filter coefficients $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$ until convergence. Once the filter coefficients are obtained at the $l$-th iteration, we update the filtered signal $\tilde{\mathbf{F}}^{(l)}$. The two variable updates, $\mathcal{C}$ and $\beta$, corresponding to the clustering and optimal filter design steps, respectively, are repeated until convergence, as described in Algorithm 4.1.

4.4 Multi-Scale Graph Wavelets for Clustering (MSGWC)

In this section, we propose to learn the optimal combination of multi-scale features using graph scaling and wavelet bases. Given the input signal $\mathbf{F}$, let $\tilde{\mathbf{F}}$ denote the multi-scale features defined as

$$\tilde{\mathbf{F}} = \Big(w_1\mathbf{\Phi}_1^\top + \sum_{s=2}^{T} w_s\mathbf{\Psi}_s^\top\Big)\mathbf{F} = \sum_{s=1}^{T} w_s\mathbf{S}^{(s)}, \qquad (4.26)$$

where $\mathbf{S}^{(1)} := \mathbf{\Phi}_1^\top\mathbf{F}$ is the output of the graph scaling filter and $\mathbf{S}^{(s)} := \mathbf{\Psi}_s^\top\mathbf{F}$ is the output of the graph wavelet filter at scale $s$, for $s \geq 2$. The multi-scale graph signal at node $i$ can be written as $\tilde{\mathbf{F}}_i = \sum_{s=1}^{T} w_s[\mathbf{S}^{(s)}]_i = \mathbf{w}^\top\mathbf{S}_{(i)}$, where $\mathbf{S}_{(i)}$ is the $T \times P$ matrix corresponding to the $i$-th node, where each row is the $s$-th scale feature, $[\mathbf{S}_{(i)}]_s := [\mathbf{S}^{(s)}]_{(i)}$ [220].
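For the multi-scale features in Eq. (4.26), a sketch with simple stand-in kernels; the thesis uses the band-pass and low-pass kernels of [109, 248], so the g and h below are illustrative only, and the scale values are assumed inputs.

import numpy as np
from scipy.linalg import eigh

def msgwc_features(L, F, scales, w):
    # Eq. (4.26): one low-pass scaling filter plus band-pass wavelet filters
    # across scales, combined with the weights w.
    lam, U = eigh(L)                      # full spectrum of the Laplacian
    g = lambda x: x * np.exp(-x)          # band-pass stand-in: zero at lambda = 0
    h = lambda x: np.exp(-x ** 4)         # low-pass scaling stand-in
    F_tilde = w[0] * (U @ (h(scales[0] * lam)[:, None] * (U.T @ F)))
    for w_s, s in zip(w[1:], scales[1:]):
        F_tilde = F_tilde + w_s * (U @ (g(s * lam)[:, None] * (U.T @ F)))
    return F_tilde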
Algorithm 4.1: GraFiCA.
Input: Normalized adjacency matrix $\mathbf{A}_n$, graph signal $\mathbf{F}$, number of clusters $K$, parameters $\alpha$, $\gamma$, filter orders $(T, Q)$, and $M \geq Q$.
Output: Cluster partition $\mathcal{C}$, graph filter $\mathcal{H}(\mathbf{L})$.
1: $\mathbf{L}_n = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^\top$
2: $[\mathrm{NMI}^{(0)}, \mathcal{C}^{(0)}] = \mathrm{ClusteringStep}(\mathbf{F})$
3: Initialize $\mathbf{c}^{(0)}$ for ARMA
4: $l = 0$
5: while $|\mathrm{NMI}^{(l)} - \mathrm{NMI}^{(l-1)}| > 10^{-3}$ do
6:   if $Q = 1$ then  (FIR Filter)
7:     Compute $\mathbf{B}$ and $\mathbf{C}$ using (4.12) and (4.13)
8:     $\mathbf{S} \leftarrow (\mathbf{B} - \gamma\mathbf{C})$
9:     $\mathbf{S} = \mathbf{H}\mathbf{\Gamma}\mathbf{H}^\top$
10:    $\mathbf{h} \leftarrow \mathbf{H}_{\cdot 1}$
11:    $\tilde{\mathbf{F}} \leftarrow \sum_{t=0}^{T-1} h_t\mathbf{L}_n^t\mathbf{F}$
12:  else  (ARMA Filter)
13:    $r = 0$
14:    while $\|\mathbf{a}^{(r)} - \mathbf{a}^{(r-1)}\|_2 > 10^{-3}$, $\|\mathbf{b}^{(r)} - \mathbf{b}^{(r-1)}\|_2 > 10^{-3}$, and $\|\mathbf{c}^{(r)} - \mathbf{c}^{(r-1)}\|_2 > 10^{-3}$ do
15:      Compute $\mathbf{B}$ and $\mathbf{C}$ using (4.12) and (4.13)
16:      $\mathbf{S} \leftarrow (\mathbf{B} - \gamma\mathbf{C})$
17:      $\mathbf{S} = \mathbf{H}\mathbf{\Gamma}\mathbf{H}^\top$
18:      $\mathbf{a}^{(r)} \leftarrow \mathbf{H}_{\cdot 1}$
19:      $\mathbf{b}^{(r)} \leftarrow \mathbf{Y}_1^{-1}\mathbf{v}_1$ using (4.23)
20:      $\mathbf{c}^{(r)} \leftarrow \mathbf{Y}_2^{-1}\mathbf{v}_2$ using (4.25)
21:      $r = r + 1$
22:    end while
23:    $\tilde{\mathbf{F}} \leftarrow \mathbf{U}\,\frac{\sum_{t=0}^{T-1} a_t\mathbf{\Lambda}^t}{\mathbf{I} + \sum_{q=1}^{Q-1} b_q\mathbf{\Lambda}^q}\,\mathbf{U}^\top\mathbf{F}$
24:  end if
25:  $[\mathrm{NMI}^{(l)}, \mathcal{C}^{(l)}] = \mathrm{ClusteringStep}(\tilde{\mathbf{F}})$
26:  $l = l + 1$
27: end while
28: function ClusteringStep($\mathbf{F}$)
29:   $W_{ij} \leftarrow \|F_{i\cdot} - F_{j\cdot}\|^2$
30:   $\mathbf{W}' \leftarrow \mathbf{W}_n - \alpha\mathbf{A}_n$
31:   $\mathbf{W}' = \mathbf{V}\mathbf{\Gamma}\mathbf{V}^\top$
32:   $\mathbf{V} \leftarrow \mathbf{V}(:, 1{:}K)$
33:   $\mathcal{C} \leftarrow k\text{-means}(\mathbf{V}, K)$
34:   Compute NMI
35:   return NMI and $\mathcal{C}$
36: end function

To further interpret the multi-scale features $\tilde{\mathbf{F}}$ in the frequency domain, we can write $\tilde{\mathbf{F}}$ using the scaling and wavelet filter responses, $\mathbf{H}_s(\mathbf{\Lambda})$ and $\mathbf{G}_s(\mathbf{\Lambda})$, as

$$\tilde{\mathbf{F}} = \mathbf{U}\Big(w_1\mathbf{H}_1(\mathbf{\Lambda}) + \sum_{s=2}^{T} w_s\mathbf{G}_s(\mathbf{\Lambda})\Big)\mathbf{U}^\top\mathbf{F}. \qquad (4.27)$$

This formulation shows how the multi-scale signal $\tilde{\mathbf{F}}$ is formed by a combination of low-pass and band-pass filters across different scales, with each scale's contribution weighted by the $w_s$'s, resulting in a graph filter with frequency response $\mathcal{H}(\mathbf{\Lambda}, \mathbf{w}) = w_1\mathbf{H}_1(\mathbf{\Lambda}) + \sum_{s=2}^{T} w_s\mathbf{G}_s(\mathbf{\Lambda})$.

4.4.1 Problem Formulation

Given a graph $\mathcal{G}$ with normalized adjacency matrix $\mathbf{A}_n \in \mathbb{R}^{N \times N}$ and graph signal $\mathbf{F} \in \mathbb{R}^{N \times P}$, our goal is to find the best partition, i.e., $K$ non-overlapping clusters $\mathcal{C} = \{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_K\}$, and the optimal weights $\mathbf{w} = [w_1, w_2, \cdots, w_T]$. The quality of the clustering is quantified based on the separability of the multi-scale features $\tilde{\mathbf{F}}$ in Eq. (4.27), using the cost function proposed in Eq. (4.1):

$$\mathcal{L}(\mathcal{C}, \mathcal{H}(\mathbf{\Lambda}; \mathbf{w})) = \sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i,j \in \mathcal{C}_k}\|\tilde{F}_{i\cdot} - \tilde{F}_{j\cdot}\|^2 - \gamma\sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i \in \mathcal{C}_k,\, j \notin \mathcal{C}_k}\|\tilde{F}_{i\cdot} - \tilde{F}_{j\cdot}\|^2.$$

As in Section 4.3, our goal is to minimize this cost function with respect to both the graph partition $\mathcal{C}$ and the optimal weights $\mathbf{w}$. The corresponding optimization problem can be formulated as

$$\underset{\mathcal{C},\, \mathbf{w}}{\text{minimize}} \ \mathcal{L}(\mathcal{C}, \mathcal{H}(\mathbf{\Lambda}; \mathbf{w})) + \alpha\mathcal{R}(\mathcal{C}), \qquad (4.28)$$

where the regularization term $\mathcal{R}(\mathcal{C})$ will be specified to put additional constraints on the partition such that the connectivity information is also taken into account. We propose a two-step alternating minimization approach to solve this problem, where at each iteration $l$ the first step learns the optimal graph partitioning for the given graph signal and the second step learns the optimal weights:

$$\begin{aligned}
\mathcal{C}^{(l+1)} &:= \underset{\mathcal{C}}{\operatorname{argmin}} \ \mathcal{L}(\mathcal{C}, \mathcal{H}(\mathbf{\Lambda}; \mathbf{w}^{(l)})) + \alpha\mathcal{R}(\mathcal{C}), \\
\mathbf{w}^{(l+1)} &:= \underset{\mathbf{w}}{\operatorname{argmin}} \ \mathcal{L}(\mathcal{C}^{(l+1)}, \mathcal{H}(\mathbf{\Lambda}; \mathbf{w})).
\end{aligned} \qquad (4.29)$$

4.4.2 Learning the partition, $\mathcal{C}$

As in Section 4.3, the $\mathcal{C}$ update in Eq. (4.29) can be equivalently expressed by the $\bar{\mathbf{Z}}$ update equation in Eq. (4.8):

$$\begin{aligned}
\bar{\mathbf{Z}}^{(l+1)} &:= \underset{\bar{\mathbf{Z}},\; \bar{\mathbf{Z}}^\top\bar{\mathbf{Z}} = \mathbf{I}}{\operatorname{argmin}} \ \operatorname{tr}\big(\bar{\mathbf{Z}}^\top\mathbf{D}^{-1/2}\tilde{\mathbf{W}}\mathbf{D}^{-1/2}\bar{\mathbf{Z}}\big) - \alpha\operatorname{tr}\big(\bar{\mathbf{Z}}^\top\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}\bar{\mathbf{Z}}\big) \\
&:= \underset{\bar{\mathbf{Z}},\; \bar{\mathbf{Z}}^\top\bar{\mathbf{Z}} = \mathbf{I}}{\operatorname{argmin}} \ \operatorname{tr}\big(\bar{\mathbf{Z}}^\top(\tilde{\mathbf{W}}_n - \alpha\mathbf{A}_n)\bar{\mathbf{Z}}\big),
\end{aligned}$$

whose optimal solution is the set of eigenvectors corresponding to the $K$ smallest eigenvalues of $\tilde{\mathbf{W}}_n - \alpha\mathbf{A}_n$. The graph partition $\mathcal{C}^{(l+1)}$ is then updated by applying $k$-means to the rows of $\bar{\mathbf{Z}}$.
4.4.3 Learning optimal weights, w

Once the partition C is learned, the goal is to determine the optimal weights for the multi-scale features that achieve the best separability between clusters. Using F̃_i = Σ_{s=1}^{T} w_s[S(s)]_i = w^⊤S(i), where S(1) := Φ_1^⊤F is the output of the graph scaling filter and S(s) := Ψ_s^⊤F is the output of the graph wavelet filter at scale s for s ≥ 2, the cost function in (4.1) can be expressed in terms of the weights as

$$\sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i,j\in\mathcal{C}_k}\mathbf{w}^\top(S(i)-S(j))(S(i)-S(j))^\top\mathbf{w} - \gamma\sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{\substack{i\in\mathcal{C}_k \\ j\notin\mathcal{C}_k}}\mathbf{w}^\top(S(i)-S(j))(S(i)-S(j))^\top\mathbf{w}, \tag{4.30}$$

which can be rewritten to obtain the following optimization problem:

$$\min_{\mathbf{w},\,\mathbf{w}^\top\mathbf{w}=1}\; \mathbf{w}^\top(\mathbf{B}-\gamma\mathbf{C})\mathbf{w}, \tag{4.31}$$

where B and C are the following T × T matrices:

$$\mathbf{B} = \sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{i,j\in\mathcal{C}_k}(S(i)-S(j))(S(i)-S(j))^\top, \tag{4.32}$$

$$\mathbf{C} = \sum_{k=1}^{K}\frac{1}{\mathcal{D}(\mathcal{C}_k)}\sum_{\substack{i\in\mathcal{C}_k \\ j\notin\mathcal{C}_k}}(S(i)-S(j))(S(i)-S(j))^\top. \tag{4.33}$$

The solution to this optimization problem is the eigenvector of B − γC corresponding to its smallest eigenvalue. Once the optimal weights w are obtained, the multi-scale features are updated as F̃ = (w_1Φ_1^⊤ + Σ_{s=2}^{T} w_sΨ_s^⊤)F.

4.5 Computational Complexity

The computational complexity of the algorithms is mostly due to the eigendecompositions and inverse operations at each iteration. There is only one full eigendecomposition, at the beginning of the algorithm, for the normalized Laplacian of the graph. Eigendecomposition of an N × N matrix has, in general, complexity on the order of O(N³). However, this is the worst-case scenario; in practice, there are algorithms that approximate the spectral decomposition of graphs [61], reducing the computational complexity to O(N²). The work in [61] derives complexity bounds for spectral decomposition by restricting the analysis to graph families with good recursive separators. These graphs can be partitioned into roughly equal subgraphs with a small separator, allowing for efficient recursive spectral decomposition. We conducted an empirical analysis of the graphs used in this chapter using the Matlab Mesh Partitioning and Graph Separator Toolbox. Specifically, we applied the recursive spectral partitioning function and found that our graphs can be divided into multiple smaller subgraphs of nearly equal size, consistent with the properties required in [61]. Thus, the computational complexity of finding the spectral decomposition can be reduced to O(N²).

At each iteration, we find the clusters by computing the eigenvectors corresponding to the K smallest eigenvalues of W̃_n − αA_n, which has complexity O(N²K). The filter coefficients h for FIR and a for ARMA, as well as the optimal weights for MSGWC, are found by computing the eigenvector corresponding to the smallest eigenvalue of T × T matrices, with computational complexity O(T²). Finding b and c requires an inverse operation; the standard matrix inversion algorithm has time complexity O((Q − 1)³) for b and O(M³) for c. It is important to note that K, T, Q, M ≪ N, so the total complexity of both steps is dominated by O(N²K).

Table 4.1: Computational complexities of the optimization steps.

                      FIR         ARMA                                MSGWC
Clustering Step       O(N²K)      O(N²K)                              O(N²K)
Filter Design Step    h: O(T²)    a: O(T²), b: O((Q−1)³), c: O(M³)    w: O(T²)
Total                 O(N²K)      O(N²K)                              O(N²K)
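Before turning to the experiments, the following minimal Python sketch illustrates the filter-design step shared by the two methods (Eqs. (4.21) and (4.31)–(4.33)): accumulate the scatter matrices B and C from the per-node feature matrices S(i) and take the smallest-eigenvalue eigenvector of B − γC. The array layout and the use of cluster size in place of D(C_k) are simplifying assumptions.

import numpy as np

def filter_design_step(S, labels, gamma):
    """S: N x T x P stacked per-node features [S(i)]; labels: length-N partition."""
    T = S.shape[1]
    B = np.zeros((T, T))
    C = np.zeros((T, T))
    for k in np.unique(labels):
        members = np.flatnonzero(labels == k)
        others = np.flatnonzero(labels != k)
        Dk = len(members)  # stand-in for D(C_k)
        for i in members:
            for j in members:
                d = S[i] - S[j]
                B += (d @ d.T) / Dk   # within-cluster scatter, Eq. (4.32)
            for j in others:
                d = S[i] - S[j]
                C += (d @ d.T) / Dk   # between-cluster scatter, Eq. (4.33)
    # Unit-norm minimizer of w'(B - gamma*C)w: smallest-eigenvalue eigenvector
    _, vecs = np.linalg.eigh(B - gamma * C)
    return vecs[:, 0]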
4.6 Experimental Results on Real Networks

4.6.1 Datasets

We evaluate the proposed graph filter learning method for both FIR and ARMA filters on five attributed networks. The first three, Cora, Citeseer, and PubMed [223], are citation networks where the nodes correspond to publications and the edges correspond to citations among the papers. Cora has 2,708 machine learning papers classified into seven classes: case-based reasoning, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning, and theory. Citeseer has 3,327 machine learning publications classified into six classes: agents, artificial intelligence, database, information retrieval, machine learning, and human–computer interaction. PubMed has 19,717 papers classified into one of three classes: Diabetes Mellitus Experimental, Diabetes Mellitus Type 1, and Diabetes Mellitus Type 2. Wiki [272] is a webpage network where the nodes are webpages and the edges are the links between them. It contains 2,405 long text documents classified into 17 classes. Sinanet [125] is a microblog users' relationship network where the edges between the users represent follower/followee relationships. It contains 3,490 users from 10 major forums, including finance and economics, literature and arts, fashion and vogue, current events and politics, sports, science and technology, entertainment, parenting and education, public welfare, and normal life. The nodes in Cora and Citeseer are associated with binary word vectors indicating the presence or absence of certain words, and the nodes in PubMed and Wiki are associated with tf-idf weighted word vectors. Each node in Sinanet is associated with a tf-idf weighted vector indicating the user's interest distribution over the 10 forums. Table 4.2 summarizes the details of each dataset.

Table 4.2: Dataset statistics.

Dataset     Nodes    Links    Features    Classes
Cora        2708     5429     1433        7
Citeseer    3327     4732     3703        6
Sinanet     3490     30282    10          10
Wiki        2405     17981    4973        17
PubMed      19717    44338    500         3

4.6.2 Baseline Methods and Metrics

In order to evaluate the performance of our methods and the importance of taking into account both the topology and the attributes of a graph, we compare our methods with three classes of methods.

• The first class of methods, k-means, spectral clustering on the features (SC-F), and CGFKM [81], uses only the node attributes. CGFKM is a k-means based method that learns a Chebyshev polynomial approximation graph filter similar to our graph learning stage; however, it does not use the connectivity of the network.

• The second class of methods uses only the graph topology. Spectral clustering on the graph (SC-G) uses the eigendecomposition of the graph Laplacian, Louvain [28] is a well-known community detection method, and Multi-Scale Community Detection (MS-CD) [248] uses graph spectral wavelets to compute a similarity metric between nodes and find communities at different scales.

• Finally, the third class of methods includes AGC [286], GAE and VGAE [139], EGAE [284], and ARGE and ARVGE [194], which use both node attributes and graph structure. AGC is based on spectral graph filtering, GAE and VGAE are benchmark autoencoder-based methods, while ARGE and ARVGE are benchmark adversarial GAE methods. EGAE is a GAE-based model designed specifically for graph clustering.
The clustering performance of the different methods is quantified using Normalized Mutual Information (NMI) [65] and Adjusted Rand Index (ARI) [117]. NMI provides a quantitative measure of the agreement between the true class labels and the labels assigned by a clustering algorithm, taking into account both precision and recall, offering a balanced perspective on the quality of the clustering solution. It is computed as

$$\mathrm{NMI} = \frac{2\,I(Y; C)}{H(Y) + H(C)},$$

where Y is the set of true class labels, C is the set of cluster labels assigned by the algorithm, I(Y; C) is the mutual information between Y and C, and H(Y) and H(C) are the entropies of Y and C, respectively. NMI ranges from 0 to 1, where 0 indicates no mutual information and 1 implies perfect agreement between the true and predicted labels. ARI is a widely used metric in cluster analysis and machine learning for evaluating the similarity between two clustering solutions. It measures the agreement between the true class labels and the labels assigned by a clustering algorithm while correcting for chance, and is given by

$$\mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]},$$

where RI is the Rand Index, which measures the proportion of agreements between the true and predicted clusterings; E[RI] is the expected Rand Index under the assumption of random clustering; and the max(RI) term in the denominator is the maximum possible Rand Index, which normalizes ARI to the range [−1, 1].
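Both metrics are available in standard libraries; a small usage sketch with made-up label vectors, assuming scikit-learn, is shown below.

from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]   # illustrative ground-truth labels
y_pred = [0, 0, 1, 2, 2, 2]   # illustrative clustering output
nmi = normalized_mutual_info_score(y_true, y_pred)  # 2*I(Y;C)/(H(Y)+H(C))
ari = adjusted_rand_score(y_true, y_pred)           # (RI - E[RI])/(max(RI) - E[RI])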
4.6.3 Hyperparameter Selection

The input parameters required by our methods are the number of clusters K, the filter orders (T, Q) for GraFiCA, the number of scales T for MSGWC, and the hyperparameters α for the clustering step and γ for the filter optimization step in both methods. For a fixed γ, the optimal solution h in Eq. (4.14) is given by the eigenvector corresponding to the smallest eigenvalue of the matrix B − γC, as expressed in equations (4.14) and (4.31). Since B and C correspond to the within- and between-cluster scatter of the filtered attributes, this formulation is analogous to the scatter difference criterion used in linear discriminant analysis [95]. The solution to this problem is equivalent to that of the linear discriminant function when γ is the Lagrange multiplier for the following optimization:

$$\min_{\mathbf{h}}\; \frac{\mathbf{h}^\top\mathbf{B}\mathbf{h}}{\mathbf{h}^\top\mathbf{C}\mathbf{h}}.$$

Due to the homogeneity of the Rayleigh quotient, we can normalize h^⊤Ch = 1, and the problem can be rewritten as the constrained minimization

$$\min_{\mathbf{h}}\; \mathbf{h}^\top\mathbf{B}\mathbf{h} \quad \text{s.t.} \quad \mathbf{h}^\top\mathbf{C}\mathbf{h} = 1,$$

which leads to the generalized eigenvalue problem Bh = γCh, where γ is the generalized eigenvalue. While the solution to this minimization problem would be the eigenvector corresponding to the smallest eigenvalue, we propose to consider all eigenvalues of the generalized eigenvalue problem as candidate values for γ. This broader selection allows us to explore different trade-offs in the balance between the B and C scatter terms. To mitigate the effects of noise and optimize accuracy, we perform a local grid search for γ between the minimum and maximum eigenvalues of the generalized eigendecomposition problem, γ ∈ [λ_min, λ_max]. For selecting the best α, we tested α ∈ [0, 1], and the results for the filter with the best performance in terms of NMI are reported. The filter orders are determined within a range through exhaustive search. For the FIR filter, we evaluated filter orders T between 3 and 10 to determine the order that gives the best NMI value for each dataset. For ARMA, we assume that Q > T for a pair (T, Q) and T + Q ≤ 10 [162]. For MSGWC we used T = 10, and Φ_1 is generated using s_min = 1/λ_2 as the scale, as defined in [248]. The optimal values of the parameters α and γ for both methods, and the filter orders T for FIR and (T, Q) for ARMA, are listed for each dataset in Table 4.3. Our experimental results indicate that for sparse graphs, small α values perform best, whereas for fully connected dense graphs, α values close to 1 are optimal. It was also observed that the performance of GraFiCA is robust to the filter order. In particular, the performance of ARMA filters is less sensitive to the order of the filter compared to FIR filters. In all of the tested datasets, the number of clusters is known. For the baseline methods, the parameter settings reported in the original papers are used [286, 139, 194].

Table 4.3: The optimal parameters for the different datasets.

            FIR                     ARMA                       MSGWC
Dataset     α       γ       T       α       γ       (T, Q)     α       γ
Cora        0.056   0.074   3       0.05    0.08    (2,3)      0.03    0.072
Citeseer    0.06    0.10    3       0.058   0.101   (2,3)      0.110   0.124
Sinanet     0.044   0.022   3       0.05    0.03    (2,3)      0.041   0.03
Wiki        0.001   0.03    3       0.001   0.028   (3,4)      0.001   0.032
PubMed      0.01    0.42    3       0.01    0.40    (2,3)      −       −

4.6.4 Performance Evaluation

Table 4.4 shows the performance of all the methods, with the top three results in each case highlighted in bold. For all datasets, our methods perform better than the state-of-the-art methods in terms of NMI. In particular, GraFiCA and MSGWC outperform all methods that use only the node attributes (first class of methods) or only the graph topology (second class of methods) in all cases. Our methods consider both the graph topology and node attributes, which complement each other and consequently enhance the clustering. In addition, GraFiCA and MSGWC outperform conventional GCN-based methods such as GAE, VGAE, ARGE, and ARVGE. This is due to the fact that in GraFiCA the filters are optimized for the clustering task, rather than for reconstruction, and the filter shape is not pre-determined. This allows us to consider larger neighborhoods, unlike the traditional GCN-based methods, which exploit information from only 2-hop neighborhoods. GraFiCA and MSGWC outperform spectral graph filtering based methods such as AGC in most cases, as AGC only considers lowpass filters. For PubMed, AGC performs the best in terms of ARI, but GraFiCA performs the best in terms of NMI. Similarly, GraFiCA and MSGWC outperform EGAE, which incorporates relaxed k-means into the decoder of GAE to learn embeddings optimized for graph clustering. This shows that while optimizing the weights in GAE for the clustering task improves the results compared to traditional GCN-based approaches, providing additional flexibility to the graph filter structure, as in GraFiCA and MSGWC, yields better clustering results. For Wiki, EGAE performs better than GraFiCA using FIR in terms of ARI, but GraFiCA gives higher NMI. This difference in the ranking of the different methods under ARI vs. NMI may be due to the fact that Wiki is the most imbalanced dataset; in such cases, ARI may be biased towards the larger clusters. For most datasets, the performance of the FIR and ARMA filters is very similar. In the next section, we discuss the interpretation of these results in terms of the learned filters.
Table 4.4: Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) results. The top three performance metrics in each case are shown in bold. OM indicates cases where the method ran out of memory.

                  Cora             Citeseer         Sinanet          Wiki             PubMed
Algorithms        NMI     ARI      NMI     ARI      NMI     ARI      NMI     ARI      NMI     ARI
k-means           0.2825  0.1621   0.2856  0.1615   0.6413  0.5395   0.0578  0.2053   0.3974  0.0954
SC-F              0.3597  0.3279   0.3244  0.2898   0.5828  0.3323   0.1810  0.3802   0.0164  0.0098
CGFKM [81]        0.1327  0.0522   0.0773  0.0011   0.6415  0.5664   0.0765  0.0025   0.1347  0.0911
SC-G              0.0872  0.0185   0.0573  0.0051   0.1653  0.0444   0.1709  0.0073   0.0075  0.0013
Louvain [28]      0.5044  0.3433   0.3645  0.3457   0.2490  0.2044   0.4105  0.2369   0.2059  0.1099
MS-CD [248]       0.5072  0.3453   0.4064  0.3926   0.4064  0.1997   0.3548  0.2677   0.1773  0.1465
AGC [286]         0.5170  0.3982   0.4086  0.4124   0.5573  0.3892   0.4268  0.1440   0.3158  0.4209
GAE [139]         0.4346  0.3160   0.1706  0.1018   0.3523  0.1316   0.3105  0.0781   0.2374  0.1955
VGAE [139]        0.4276  0.3188   0.2112  0.1440   0.4567  0.3881   0.2982  0.1143   0.2403  0.2119
EGAE [284]        0.5401  0.4723   0.4122  0.4324   0.3404  0.2773   0.4815  0.3781   0.3205  0.2893
ARGE [194]        0.4562  0.3865   0.2967  0.2781   0.4711  0.3715   0.3308  0.1129   0.2359  0.2258
ARVGE [194]       0.4657  0.3895   0.3124  0.3022   0.4854  0.3993   0.3987  0.1084   0.0826  0.0373
GraFiCA_FIR       0.5465  0.4743   0.4228  0.4283   0.6578  0.6721   0.5125  0.2771   0.3279  0.3265
GraFiCA_ARMA      0.5421  0.4746   0.4261  0.4365   0.6561  0.6107   0.5150  0.2807   0.2995  0.2915
MSGWC             0.5662  0.4928   0.4223  0.4356   0.6736  0.5919   0.5005  0.2912   OM      OM

4.6.5 Interpretation of the Filters Learned by GraFiCA

Figure 4.2: (a)-(e) Spectral Energy Distributions (SED) of the graph signal F. (f)-(j) Optimal FIR graph filters for each dataset. (k)-(o) Optimal ARMA graph filters for each dataset. From left to right: Cora, Citeseer, Sinanet, Wiki, and PubMed.

Figures 4.2f–4.2o show the frequency responses of the optimal FIR and ARMA filters for each dataset. For most datasets, the optimal filters have both a low-pass and a high-pass region, thus extracting both smooth and non-smooth features from the node attributes. This is in contrast to GAE-based methods, which are limited to first-order low-pass filtering, and AGC, which is a higher-order low-pass filter. While the clustering performance of the FIR and ARMA filters is similar, the filter shapes obtained by ARMA are smoother, as they fit IIR filters to approximate the same frequency response. It is also interesting to note that the optimal filter for Sinanet is an all-pass filter, as the original node attributes carry most of the class information. In this case, filtering the attributes may not improve the accuracy of clustering, as seen in the second row of Figure 4.3, where the attributes before and after filtering are very similar. This also explains why methods like k-means and CGFKM, which rely only on the attributes, perform well in clustering this dataset. On the contrary, for Cora, the clusters become better separated after filtering, as seen in the first row of Figure 4.3. This is in line with the poor clustering performance of methods that use only node attributes. These results indicate that our method adapts to the characteristics of the data and yields interpretable filters.
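To make the filter interpretation concrete, the sketch below evaluates FIR and ARMA frequency responses of the forms used by GraFiCA over the normalized-Laplacian spectrum; the coefficient values are illustrative, not the learned ones.

import numpy as np

lam = np.linspace(0, 2, 200)            # spectrum of the normalized Laplacian lies in [0, 2]
h = np.array([0.6, -0.5, 0.4])          # illustrative FIR coefficients, T = 3
H_fir = sum(h[t] * lam ** t for t in range(len(h)))

a = np.array([0.7, 0.5])                # illustrative ARMA numerator coefficients
b = np.array([0.3, 0.2])                # illustrative ARMA denominator coefficients
H_arma = sum(a[t] * lam ** t for t in range(len(a))) / (
    1.0 + sum(b[q - 1] * lam ** q for q in range(1, len(b) + 1)))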
Except for the filters for Sinanet, which are all-pass, the filters for most of the other datasets are mostly low-pass with additional content in the middle and high frequencies. The fact that our filters do not entirely suppress high-frequency components suggests a balance between preserving the cohesiveness of clusters and highlighting crucial differences across various regions of the graph.

Figure 4.3: t-SNE visualization [254] of the original (left) and filtered (right) attributes for Cora (top row) and Sinanet (bottom row) for GraFiCA with FIR filter. Each color represents a cluster.

In order to better understand and interpret the frequency responses of the learned filters, we examine the Spectral Energy Distribution (SED) of the graph signals with respect to the eigenvalues of the graph Laplacian. For a graph signal F and its graph Fourier transform F̂ = U^⊤F, the spectral energy distribution at λ_i is defined as the average of

$$\hat{F}_{ip}^2 \Big/ \sum_{i'=1}^{N} \hat{F}_{i'p}^2$$

across the P attributes. Figures 4.2a–4.2e show the spectral energy distribution of the graph signals for each dataset. A concentration of energy in the lower end of the spectrum suggests that the graph signal is smooth and varies slowly across the graph. This can also indicate strong connectivity within certain graph regions or the presence of well-defined communities. Significant energy in the higher frequencies indicates that the graph signal has high variability or rapid changes across edges. This might also suggest that the graph contains regions of sparse connectivity and other regions with higher density. As we can see in Figures 4.2a and 4.2b, Cora and Citeseer are similar in terms of their SED. They both have SED distributed across all frequencies, with increased SED in the low frequencies. Both datasets represent papers in different machine learning areas, where it is fair to assume that a paper could easily be related to more than one of those areas; therefore, the clusters are not very well defined. Due to this distribution of SED, the optimal filters for Cora and Citeseer have significant power in the middle frequencies. On the other hand, in Sinanet, the main areas of the 10 different forums range from parenting, history and arts, and politics to science and technology, and the clusters are well separated; hence there is a significant peak at the low frequencies, as seen in Figure 4.2c. As for the peak in the high frequencies of the SED of Sinanet, this might be due to the different densities of the clusters. Wiki and PubMed show similar behavior with respect to their SEDs, with most of the SED concentrated in the lowest frequencies, as seen in Figure 4.2. Similar to their SED profiles, the shapes of the optimal filters for these two datasets are similar and primarily lowpass.

We also evaluated the relationship between the cluster quality and the learned filter at each iteration of GraFiCA. Figure 4.4 shows the clustering at each iteration of the algorithm through t-SNE visualization, along with the corresponding FIR filter, for Cora. From this figure, it can be seen that when the clusters are not well separated from each other, the learned filter emphasizes the high frequencies. After the first iteration, while the points within clusters move closer to each other, the distance between clusters is also small; this results in a filter with significant low- and high-frequency parts. As the distance between clusters starts growing after iteration 2, the high-frequency part of the filter is attenuated. Thus, the scatter captured by B and C determines the low- and high-frequency parts of the learned filter, respectively.
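A minimal sketch of the SED computation used in this analysis, assuming a dense normalized Laplacian and numpy (names are illustrative):

import numpy as np

def spectral_energy_distribution(L_n, F):
    """L_n: N x N normalized Laplacian, F: N x P graph signal; returns SED per eigenvalue."""
    lam, U = np.linalg.eigh(L_n)                    # eigenpairs of the normalized Laplacian
    F_hat = U.T @ F                                 # GFT of each attribute column
    energy = F_hat ** 2 / (F_hat ** 2).sum(axis=0)  # normalize per attribute
    return lam, energy.mean(axis=1)                 # average across the P attributes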
4.6.6 Interpretation of the Multi-scale Optimal Graph Filters

Figure 4.5 shows the frequency responses of the optimal filters obtained by MSGWC for Cora, Citeseer, Sinanet, and Wiki. Most of the filter responses are low frequency, except for Sinanet, where the learned weighting vector results in a wideband bandpass filter, which is consistent with the earlier description of Sinanet, where the clusters are already well separated. Figure 4.6 illustrates the application of MSGWC to Cora, where both the frequency responses of the graph filters at different scales and their corresponding outputs are given. Additionally, the final filter after learning the optimal weight vector w and the multi-scale features F̃ are shown. From the t-SNE maps, it can be seen that the clusters are better separated using the optimal multi-scale features. The t-SNE maps corresponding to the features from individual scales do not show as much separation as the optimal multi-scale features. This suggests that multi-scale processing of the input attributes across a range of scales provides a more flexible representation and a better separation of clusters.

Figure 4.4: t-SNE visualization of the original and filtered signals and the corresponding learned FIR filter at each iteration of GraFiCA for the Cora dataset. Each color in the t-SNE visualization represents a cluster.

Figure 4.5: Optimal graph filters for each dataset. From left to right: Cora, Citeseer, Sinanet, and Wiki.

4.6.7 Parameter Sensitivity

We investigated the effect of the two hyperparameters, α and γ, on clustering accuracy for FIR filters with T = 3, as seen in Figure 4.7. For sparse networks such as Cora and Citeseer, the connectivity information does not help to improve the performance, making the performance invariant to the choice of α. On the other hand, for dense networks such as Wiki, the performance is more sensitive to the choice of α. Finally, for Sinanet, where most of the community information is reflected by the attributes, the performance is sensitive to the choice of γ, as it quantifies the trade-off between within- and between-class association of the filtered attributes.

4.6.8 Convergence Analysis

The optimization problem in (4.1) is solved in an alternating way, i.e., we fix one variable and optimize the other. When h is fixed, the solution Z̄ of (4.8), obtained by selecting the K eigenvectors corresponding to the K smallest eigenvalues of W̃_n − αA_n, is the global optimum of the problem in (4.8). Similarly, when Z̄ is fixed, i.e., the community structure is known, the solution h of (4.14) is the eigenvector corresponding to the smallest eigenvalue of B − γC and is a global solution to the problem in (4.14). Although Z̄ is the global optimum of (4.8), the partition C is found by applying k-means to Z̄. While k-means is known to converge quickly, it does not guarantee convergence to a global optimum. Thus, while each step of the algorithm converges to a global optimum, the final partition may not be globally optimal. Figure 4.8 illustrates the empirical convergence behavior, where the value of the cost function is shown as a function of the number of iterations for the different datasets. Despite the local nature of k-means convergence, our overall algorithm consistently converges within a few iterations.
Figure 4.6: Application of the proposed framework to the Cora dataset. From left to right: original signal F, frequency responses of the graph filters at various scales with their respective outputs, and the final filter and the multi-scale features F̃.

Figure 4.7: Parameter sensitivity for Cora, Citeseer, Sinanet, Wiki, and PubMed with T = 3.

Figure 4.8: Cost function value vs. the number of iterations.

This analysis was done for GraFiCA with FIR filters, but similar arguments can be made for ARMA and MSGWC.

4.6.9 Application to Brain Functional Connectivity Networks

We applied GraFiCA to functional connectivity networks of the brain. Electroencephalogram (EEG) data collected from a cognitive control-related error processing study [108], i.e., a Flanker task, was used to construct both the graphs and the graph signals. The EEG was recorded following the international 10/20 system for the placement of 64 Ag–AgCl electrodes. The sampling frequency was 512 Hz. After the removal of the trials with artifacts, the Current Source Density (CSD) Toolbox [246] was employed to minimize volume conduction. In this study, trials corresponding to the Error-Related Negativity (ERN) after an error response were used. Each trial was one second long. The total number of trials was 480, in which the total number of error trials across participants varied from 20 to 61.

As previous studies indicate that neural oscillations in the theta band (4–7 Hz) may be one mechanism that underlies functional communication between networks involving the medial prefrontal cortex (mPFC) and lateral prefrontal cortex (lPFC) regions during the ERN (25–75 ms time window) [108, 250, 44, 30], all analysis was performed for this time and frequency range. The average phase synchrony corresponding to the theta band and the 25–75 ms time window was computed to construct 64 × 64 connectivity matrices for each subject. The graph signal for subject l, F^l ∈ R^(64×512), is defined as the average time series across trials for each electrode. In this chapter, we consider data from 20 participants. The FCNs across subjects can be modeled as a multiplex network with 64 nodes and 20 layers, corresponding to the number of brain regions and subjects, respectively. GraFiCA is extended to multiplex networks with L layers with adjacency matrices A^l ∈ R^(N×N) and corresponding graph signals F^l ∈ R^(N×P) for the FIR filter. The consensus community structure across layers can be learned by extending the proposed cost function in (4.1) as

$$\sum_{l=1}^{L}\sum_{k=1}^{K}\frac{1}{\mathcal{D}^{l}(\mathcal{C}_k)}\sum_{i,j\in\mathcal{C}_k}\|\tilde{F}^{l}_{i\cdot}-\tilde{F}^{l}_{j\cdot}\|^2 - \gamma\sum_{l=1}^{L}\sum_{k=1}^{K}\frac{1}{\mathcal{D}^{l}(\mathcal{C}_k)}\sum_{\substack{i\in\mathcal{C}_k \\ j\notin\mathcal{C}_k}}\|\tilde{F}^{l}_{i\cdot}-\tilde{F}^{l}_{j\cdot}\|^2. \tag{4.34}$$

The partition C and the FIR filter coefficients h can be found by extending the corresponding optimization problems derived above as

$$\bar{\mathbf{Z}} := \arg\min_{\bar{\mathbf{Z}},\,\bar{\mathbf{Z}}^\top\bar{\mathbf{Z}}=\mathbf{I}}\; \sum_{l=1}^{L}\mathrm{tr}\big(\bar{\mathbf{Z}}^\top(\tilde{\mathbf{W}}^{l}_n-\alpha\mathbf{A}^{l}_n)\bar{\mathbf{Z}}\big), \qquad \mathbf{h} := \arg\min_{\mathbf{h},\,\mathbf{h}^\top\mathbf{h}=1}\; \mathbf{h}^\top\Big(\sum_{l=1}^{L}(\mathbf{B}^{l}-\gamma\mathbf{C}^{l})\Big)\mathbf{h}. \tag{4.35}$$

For the selection of the parameters α and γ, we performed a grid search as described in Section 4.6.3 and selected the pair (α, γ) that achieves the highest modularity [164], since ground truth is not available for this dataset and NMI cannot be computed. Figure 4.9 shows the multiplex community structure for different numbers of clusters K, from 4 to 8, across the 20 subjects, with α = 1, γ = 0.1, and T = 3. As shown in Figure 4.9, for each value of K, we consistently obtain a community comprised of the frontal-central nodes corresponding to the medial prefrontal cortex (mPFC), e.g., FC1, FCz, FC2.
Frontal-central connectivity in theta oscillations is known to play a critical role in the flexible management of cognitive control [176]. In addition, the community structure reveals distinct communities corresponding to the visual and lateral prefrontal cortex (lPFC) areas.

Figure 4.9: Community structure for K = 4, 5, 6, 7, 8.

Next, we evaluated the consistency of the community structure obtained from the multiplex network with the community structures of individual subjects. We applied GraFiCA to each layer individually, and the optimal FIR filter with T = 3 and the corresponding community structure for each subject were learned for different values of K ranging from 4 to 8. Scaled inclusivity (SI) [234] was employed as a metric to evaluate the consistency of the community structure across subjects. SI is calculated by measuring the overlap of communities across multiple networks while penalizing the disjunction of communities [234, 181]. The Global Scaled Inclusivity (GSI) [234, 181] across these 20 community structures was calculated. Figure 4.10 shows the GSI for K = 4 and K = 6. In both cases, the central nodes FC1, FCz, FC2, C1, Cz, and C2 are among the 10 nodes with the highest GSI values. As we can see from Figure 4.9, these 6 nodes are consistently detected in the same community, which indicates that the multiplex extension of the algorithm obtained communities that are consistent with the individual subjects' community structures.

Figure 4.10: Global Scaled Inclusivity.

4.7 Conclusions

As the amount of large-scale network data with node attributes increases, it is important to develop efficient and interpretable graph clustering methods that identify the node labels. In this chapter, we proposed two community detection methods, GraFiCA and MSGWC, for attributed networks. The proposed methods were evaluated on real-world networks with both binary and numerical attributes. The proposed methods make several key contributions to the field. First, GraFiCA and MSGWC learn the parameters of polynomial graph filters and the optimal linear combination of multi-scale features, respectively, with respect to a loss function that quantifies both the within-cluster dissimilarity and the separation between clusters. Thus, the filter parameters are optimized for the clustering task. Second, the proposed methods do not constrain the filters to be lowpass. Results indicate that the learned filters take into account the useful information in the middle and high-frequency bands, and the structure of the filter, i.e., whether the filter is lowpass, highpass, or bandpass, is determined directly by the data. Third, GraFiCA is formulated for both FIR and IIR filters, providing similar performance across datasets. While FIR filters are computationally less expensive, IIR filters provide a smoother frequency response. Finally, the learned filters are evaluated with respect to the spectral energy distribution of the attributed graphs, providing interpretability to the proposed design procedure. The interpretability of MSGWC is illustrated through both the distribution of the learned features and the corresponding multi-scale filters. Future work will consider extensions to other nonlinear filters.

CHAPTER 5
GRAPH FILTERING LEARNING FOR STRUCTURE-FUNCTION COUPLING BASED HUB NODE IDENTIFICATION

5.1 Introduction

The recent developments in the field of human connectome research [41, 231] provide us with the opportunity to unravel the topological characteristics of brain networks using graph-theoretic approaches.
Both the structural and functional brain systems can be characterized using tools from complex network theory, such as small-world topology, highly connected hubs, and modularity [15, 230]. In this line of research, the brain is modeled as a graph composed of nodes and edges. The nodes represent neurons or brain regions, and the edges represent physical connections or statistical associations between regions [27] for structural and functional networks, respectively. Graph-theoretic methods offer measures to characterize the features of the network, including modules [115, 278] and hubs [39, 202]. Hubs are defined as densely connected regions in human brain networks; they play a crucial role in global brain communication [253] and support a broad range of cognitive tasks, such as working memory [163] and semantic processing [270]. Growing evidence suggests that these highly connected brain hubs are preferentially targeted by many neuropsychiatric disorders [64], providing critical clues for understanding the biological mechanisms of disorders and establishing biomarkers for disease diagnosis and treatment [90].

Traditionally, hubs have been defined as nodes with high degree or high centrality based on functional connectivity networks. Node degree is the simplest and most commonly used means of identifying hubs in graphs. However, it has been shown that this approach is problematic in correlation-based networks such as functional connectivity networks [202]. The influence of community size on degree and the susceptibility of degree to distortion in volume-based brain networks result in biased estimates of hub nodes. For this reason, other centrality metrics based on the combination of degree and path length, e.g., betweenness, eigenvector, and PageRank centralities, have been proposed to characterize hubs [129]. Others have used the node role approach, which relies on the community structure of the functional network: hubs are identified using the within-module degree z-score, and the participation coefficient is then used to classify the hub type, e.g., provincial vs. connector hubs [115, 199].

While these centrality measures are easy to implement, they rely only on the functional connectivity network without considering the coupling between the brain's anatomical wiring, i.e., the structural network, and its dynamic functional properties, e.g., the BOLD signal [93]. In particular, they have been designed for networks represented only by simple graphs and do not take additional information about nodes, i.e., the node attributes, into account. The idea of integrating graph topology with node attributes for hub node identification first originated in the social sciences literature [21, 87]. In this line of work, centrality measures are either weighted by the norm of the node attributes or modified by the homophily of nodes defined by the attributes. Hub nodes integrate and distribute information over the network through their high number of connections made with a diverse set of nodes. Thus, using the attribute information along with the network topology can better reveal hubs, as it allows differentiating highly connected nodes based on homophily, i.e., whether most of a node's connections are within a group or with nodes with similar attributes, versus heterophily. Specifically, the heterophilic nodes are of interest as they are not only highly connected but also connected to nodes from different groups, indicating their importance for the global communication of the network.
Although the aforementioned methods utilize this benefit of node attributes for hub identification, they can only handle categorical attributes. In this chapter, we extend this idea to the case where the node attributes are continuous, which we treat as graph signals following the GSP literature. In particular, inspired by recent work in graph anomaly detection using graph neural networks (GNNs) [13, 96, 242], we utilize the spectral content of graph signals with respect to the network topology to differentiate nodes as homophilic or heterophilic. Recent work shows the relationship between homophily and the smoothness of graph signals, while associating heterophily with non-smoothness [96]. In terms of spectral content, this indicates that homophilic and heterophilic activity concentrates at low-frequency and high-frequency spectral components, respectively. We adapt this prior research to hub node identification by using graph filters to separate homophilic content from heterophilic content based on the graph signal spectrum, and by using the heterophilic content through a novel hub scoring function to identify hubs.

Based on the previous discussion, in this chapter we introduce a GSP-based framework for hub node identification in brain networks utilizing both the structural connectome and functional BOLD signals. The proposed approach is based on learning the optimal graph filter for detecting hub nodes under the following assumptions: (i) hub nodes are sparse and have high activation patterns along with a more diverse set of connections (heterophilic), i.e., their activity corresponds to the high-frequency component of the BOLD signal; and (ii) the non-hub nodes' activation patterns are low-frequency/smooth with respect to the structural connectome and thus can be modeled as the output of a low-pass polynomial graph filter. These assumptions are incorporated into a general optimization framework where the smoothness and sparsity are quantified by graph total variation and the ℓ1 norm, respectively. Once the optimal graph filter is learned, a hub scoring function based on the local gradient of the nodes is introduced to identify the hub nodes. The participation coefficient is used to further identify the connector hubs. The proposed method is evaluated on both simulated data and rs-fMRI data from the HCP. The results are compared to state-of-the-art hub node identification methods and a recently published meta-analysis of hub nodes in rs-fMRI [271].

The main contributions of the proposed work are as follows. First, unlike existing hub node detection methods that rely only on connectivity graphs, the proposed method, GraFHub, incorporates the structural connectivity and the functional activation signals into the same framework, thus taking the structure-function coupling in the brain into account [93]. Second, in addition to introducing a GSP-based learning framework, this chapter also introduces a new smoothness-based metric for hub node scoring, as well as two different methods for identifying the hubs, i.e., thresholding vs. rank ordering. The proposed smoothness metric quantifies the local gradient of the graph signal with respect to the graph, and thus quantifies the hub score by taking both the graph signal and the graph structure into account. Moreover, the smoothness metric is in line with the proposed cost function for hub node learning. Finally, by learning the optimal graph filter for separating hub nodes from non-hub nodes, GraFHub provides interpretability to hub node identification.
In particular, we show a strong correlation between the average hub score for a given brain network, the graph spectrum of the functional activation signals, and the shape of the learned filters.

5.2 Related Work

5.2.1 Graph Signal Processing for the Brain

Tools from GSP have been adapted to study brain neuroimaging data in order to characterize anatomical, functional, and pathological features of the brain. A common approach in this line of work is to employ structural and diffusion MRI data to construct a brain graph representing anatomical features such as cortical morphology or white matter fiber architecture. The functional and pathological neuroimaging data are then treated as graph signals defined on the constructed structural brain graph. GSP concepts such as the graph Fourier transform (GFT) and graph filtering have been utilized to analyze these functional and pathological signals with respect to the spectrum of the underlying structural connectivity. Early work focuses on extracting the graph Fourier modes of functional brain signals (collected with fMRI [116, 204, 205] or electroencephalogram (EEG) [102]) with the structural connectivity graph estimated from diffusion MRI. This analysis reveals low (high) frequency modes that are aligned (disaligned) with respect to the underlying graph structure. For example, in [116], it is shown how the eigenmodes of the structural connectome constrain spatiotemporal patterns of neural dynamics in humans. Within this setting, [205] quantified the degree of structure-function dependency for each brain region by means of the Structural-Decoupling Index (SDI).

Another line of work explored the use of graph filtering for neuroimaging data. Early works employ graph filters derived from the adjacency or Laplacian matrices of the brain structural network for decoding brain states [177, 226]. In follow-up work, fMRI data were processed by polynomial-approximated graph filters with different kernel functions for computationally efficient filtering on large voxel-based graphs [154, 18]. This line of work was extended to local filter design to concentrate energy onto specific graph nodes or predefined subgraphs. Kernel functions such as the Slepian basis were applied to construct graph filters adapted to diffusion MRI to focus energy on predefined subnetworks [31]. It was shown that the Slepian kernel and the localized GFT, which enable spatial and graph spectral localization, classify fMRI task data better than the traditional GFT. Finally, GSP-based approaches have been used for feature extraction for subsequent learning tasks. These features include the signal or signal energy decomposed into different frequency bands [97], signal smoothness across the underlying graph structure [180], and the eigenvalues of the graph Laplacian matrix. For instance, in [35], the projection of resting-state fMRI (rs-fMRI) time series on a structural brain graph was used as features for autism spectrum disorder classification. In [288], a functional graph Laplacian embedding of deep neural networks is used to classify task fMRI time series in a joint GSP-deep learning framework. Similarly, projections of recorded scalp EEG [132] and MEG data [217] onto the lower and higher graph frequency bands have been used to reduce data dimensionality and extract features for Brain Computer Interface (BCI) and visual stimuli classification tasks, respectively.

5.2.2 Hub Node Identification

Current hub node identification methods can be grouped into two categories.
The first category of methods determines hubs based on node centrality. These methods sequentially select a set of hub nodes by ranking a nodal centrality metric such as degree [188], clustering coefficient [189], vulnerability [131], betweenness [282], and eigenvector centrality [165]. However, detecting hubs using only high nodal centrality ignores the interdependencies in the networks, resulting in the detection of provincial hubs instead of connector hubs, which predominantly connect nodes across different modules. The second category uses module-based methods that identify hub nodes based on the network modularity [253]. These methods detect hub nodes by first identifying the modular organization of the network using a community detection algorithm [91]. Connector hub nodes are then detected based on the diversity of connections associated with the module partition. Although this approach initially considers global network properties, the final hub detection uses a sorting-based method. Moreover, the optimality of the detected modules is not guaranteed [91]. In recent years, alternatives to these two categories of methods have been proposed by combining multiple approaches, such as degree and participation coefficient, to obtain more reliable estimates of hubs [126]. Recently, graph spectral methods have been proposed to detect the hubs such that the removal of the identified hubs results in a network with multiple connected components, or equivalently an increase in the number of 0-eigenvalues in the graph Laplacian spectrum [274, 273]. The GFT has also been employed to define a measure of centrality called GFT centrality (GFT-C) [227]. GFT-C first defines an importance signal for each node based on the shortest paths between that node and the other nodes. Hub scores are then determined by the weighted sum of the GFT coefficients of the importance signal, where the weight function is a pre-determined high-pass filter. Both the graph spectral methods and GFT-C still rely only on the connectivity graph, i.e., the structural or functional connectome, without considering the coupling between the two modalities.

5.2.3 Graph Filter Learning

Graph filtering offers an extension of conventional filtering approaches to signals defined in the non-Euclidean domain, e.g., irregular data structures arising in biological, financial, social, economic, and sensor networks [78, 121]. Graph filters are information processing architectures tailored to graph-structured data and have been used for many signal processing tasks, such as denoising [54], signal recovery [173, 240], classification [53], and anomaly detection [80]. The design of graph filters to obtain a desired graph frequency response has been studied and analyzed in prior work [162, 220]. More recently, the problem of learning the optimal graph filter for a given task has been addressed. For example, in [210], the problem of blind deconvolution is addressed, where the observed signals are modeled as the output of a graph filter and both the filter and the input signal are learned simultaneously. In [143], the problem of random graph signal estimation from a nonlinear observation model is addressed with an estimator that is parameterized in terms of shift-invariant graph filters. In all of these cases, the goal is graph signal recovery or reconstruction, minimizing the mean square error, and not the identification of outlying nodes.
Closely related to the problem of hub node identification, spectral graph filtering has recently been employed in node-based anomaly detection. In particular, the spectral properties of anomalies have been analyzed using graph signal processing concepts and graph spectral filters [84, 94, 242, 96]. In [84], graph-based filters are employed to project graph signals onto normal and anomaly subspaces, and a thresholding mechanism is used to label anomalous instances. In [94], a community-based anomaly detection approach is proposed using spectral graph filters. However, both of these methods use pre-determined filters, such as ideal low-pass/high-pass filters, with no optimization of the filter shapes. More recently, the problem of estimating network centrality from the data observed on the nodes has been addressed [114, 212]. The data supported on the nodes is modeled as the output of a graph filter applied to white noise, and the centrality rank is learned without inferring the graph structure. This formulation reduces to determining centrality based on the principal components of the observed data's covariance matrix, without taking the graph structure into account. Our proposed approach is similar to this line of work in the way it models the non-hub activity as the output of a graph filter. However, in our framework the graph structure is known, and the non-hub activity is the output of an unknown graph filter whose input is observed.

5.3 Optimal Graph Filtering for Hub Node Identification

5.3.1 Problem Formulation

Given a graph G = (V, E, A) with N nodes and P graph signals F ∈ R^(N×P) defined on G, in this chapter we focus on the detection of hub nodes. First, we model the observed graph signal F as F̃ + F_h, where F̃ and F_h correspond to the non-hub part and the hub part of the observed graph signal, respectively. Note that with this decomposition, each node will have both non-hub and hub activity, and whether a node is a hub node or not will be determined by the strength of F_h at that particular node. We characterize the spectral properties of these two parts of the observed graph signal through the following assumptions.

Assumption 1. F_h is high-frequency with respect to G. This assumption is inspired by the hypothesis that "hub regions possess the highest level of activity" [107]. This implies that hub nodes have different levels of activity compared to their neighbors, which leads to a 'right-shift' in spectral energy, i.e., the energy of the graph signals concentrates less in low frequencies and more in high frequencies [242, 45].

Assumption 2. F̃ is the output of a low-pass graph filter defined in terms of the GSO, L_n. This is a direct consequence of Assumption 1. As the high-frequency part of the graph signals is attributed to F_h, the remaining spectral content is mostly concentrated in the low-frequency components. While this implies F̃'s smoothness, Assumption 2 further models F̃ as the output of a low-pass graph filter based on prior work in the GSP literature. Prior work shows that the observed graph signals in many applications can be viewed as the output of an unknown graph-based filter excited by an input [113, 85, 221]. This modeling provides flexibility and does not assume precise knowledge of graph generative models. In particular, graph-based filters model the smoothness of signals defined on a designated graph. Some examples include diffusion kernels defined on graphs [141] and polynomials of graph Laplacian matrices used to define localized diffusion operators on graphs. The assumption that observed graph signals are filtered by a low-pass graph filter is commonly encountered in applications such as economics, social networks, power systems, and brain connectomics [209].

Assumption 3. The number of hub nodes is much smaller than the number of non-hub nodes, i.e., hub nodes are sparse. This assumption implies that F_h is a sparse matrix and is based on the fact that graph data generally include only a small number of hub nodes.

We propose an optimization problem based on these three assumptions to learn F̃, which is later employed to identify hub nodes. In particular, using Assumption 2, F̃ can be learned by filtering out the high-frequency components of the observed signals F, i.e., F̃ = H(L_n)F, where H(L_n) = Σ_{t=0}^{T−1} h_t L_n^t is a low-pass graph filter to be learned. Since F̃ is considered to be smooth by Assumption 2, the coefficients of H(L_n) are learned such that the total variation of F̃, as calculated by (1.1), is minimized. Finally, using Assumption 3, the hub node activity, i.e., F_h = F − F̃, needs to be sparse. This leads to the following optimization problem:

$$\min_{\mathbf{h},\,\mathbf{h}^\top\mathbf{h}=1}\; \alpha\|\mathbf{F}-\tilde{\mathbf{F}}\|_1 + \mathrm{tr}(\tilde{\mathbf{F}}^\top\mathbf{L}_n\tilde{\mathbf{F}}), \quad \text{s.t.}\quad \tilde{\mathbf{F}} = \mathcal{H}(\mathbf{L}_n)\mathbf{F} = \sum_{t=0}^{T-1}h_t\mathbf{L}_n^t\mathbf{F}, \tag{5.1}$$
The assumption that observed graph signals are filtered by a low pass graph filter is commonly encountered in applications such as economics, social networks, power systems, and brain connectomics [209]. Assumption 3. The number of hub nodes is much smaller than the non-hub nodes, i.e., hub nodes are sparse. This assumption implies Fℎ is a sparse matrix and is based on the fact that the graph data generally includes only a small number of hub nodes. We propose an optimization problem based on these three assumptions to learn ˜F, which is later employed to identify hub nodes. In particular, using Assumption 2, ˜F can be learned by filtering out high-frequency components in observed signals F, i.e., ˜F = H (L𝑛)F, where H (L𝑛) = (cid:205)𝑇−1 𝑛 is a low-pass graph filter to be learned. Since ˜F is considered to be smooth in Assumption 2, the coefficients of H (L𝑛) are learned such that the total variation of ˜F as calculated by (1.1) is minimized. Finally, using Assumption 3, hub node activity, i.e., Fℎ = F − ˜F, needs to be sparse. This leads to the following optimization problem: 𝑡=0 ℎ𝑡L𝑡 min h,h⊤h=1 𝛼||F − ˜F||1 + tr( ˜F⊤L𝑛 ˜F), s.t. ˜F = H (L𝑛)F = 𝑇−1 ∑︁ 𝑡=0 ℎ𝑡L𝑡 𝑛F, 113 (5.1) where h = [ℎ0, ℎ1, · · · , ℎ𝑇−1] is the vector of filter coefficients with the added constraint that h⊤h = 1, i.e., the filter coefficients are normalized. The first term enforces the sparsity of the hub nodes (Assumption 3), the second term quantifies the smoothness of the filtered signal (Assumption 2), and 𝛼 controls the trade-off between these two terms. Expressing the filtered signal as ˜F = H (L𝑛)F, the problem in (5.1) becomes: min h,h⊤h=1 𝛼||F − 𝑇−1 ∑︁ 𝑡=0 ℎ𝑡L𝑡 𝑛F||1 + tr (cid:32)𝑇−1 ∑︁ (cid:32) F⊤ (cid:33) (cid:32)𝑇−1 ∑︁ (cid:33) (cid:33) F . ℎ𝑡L𝑡 𝑛 L𝑛 ℎ𝑡L𝑡 𝑛 (5.2) 𝑡=0 𝑡=0 5.3.2 Optimization In this section, we derive the solution to (5.2) using ADMM [195]. By introducing an auxiliary variable Z = F − (cid:16)(cid:205)𝑇−1 𝑡=0 ℎ𝑡L𝑡 𝑛 (cid:17) F, the optimization problem is rewritten as tr min h,Z (cid:32)𝑇−1 ∑︁ (cid:32) F⊤ (cid:33) L𝑛 ℎ𝑡L𝑡 𝑛 𝑡=0 (cid:32)𝑇−1 ∑︁ 𝑡=0 ℎ𝑡L𝑡 𝑛 (cid:33) (cid:33) F + 𝛼||Z||1, s.t h⊤h = 1, Z = F − (cid:32)𝑇−1 ∑︁ (cid:33) F. ℎ𝑡L𝑡 𝑛 𝑡=0 (5.3) The corresponding scaled augmented Lagrangian is L (Z, h, V) = 𝜌 2 ||Z − (F − (cid:33) ℎ𝑡L𝑡 𝑛 (cid:32)𝑇−1 ∑︁ 𝑡=0 F) + V||2 𝐹 + tr (cid:32)𝑇−1 ∑︁ (cid:32) F⊤ (cid:33) L𝑛 ℎ𝑡L𝑡 𝑛 𝑡=0 (cid:32)𝑇−1 ∑︁ 𝑡=0 ℎ𝑡L𝑡 𝑛 (cid:33) (cid:33) F + 𝛼||Z||1, where V ∈ R𝑁×𝑃 is the Lagrange multiplier. The ADMM steps are then as follows. 1. Z update: The variable Z can be updated as Z(𝑙+1) = argmin L (Z, h(𝑙), V(𝑙)), Z = argmin Z (cid:32) =𝑆 𝛼 𝜌 F − 𝛼||Z||1 + 𝜌 2 ||Z − F + (cid:33) ℎ(𝑙) 𝑡 L𝑡 𝑛 (cid:32)𝑇−1 ∑︁ 𝑡=0 F − V(𝑙) 𝑡=0 (cid:33) , (cid:32)𝑇−1 ∑︁ (cid:33) ℎ(𝑙) 𝑡 L𝑡 𝑛 F + V(𝑙) ||2 𝐹, (5.4) (5.5) (·) is the elementwise thresholding operator, which is the proximal operator of ℓ1 norm where 𝑆 𝛼 𝜌 [195]. 114 2. h update: The filter coefficients h can be updated using: h(𝑙+1) = argmin h,h⊤h=1 L (Z(𝑙+1), h, V(𝑙)), = argmin h,h⊤h=1 𝜌 2 ||Z(𝑙+1) − F + (cid:33) ℎ𝑡L𝑡 𝑛 (cid:32)𝑇−1 ∑︁ 𝑡=0 F + V(𝑙) ||2 𝐹 + tr (cid:32)𝑇−1 ∑︁ (cid:32) F⊤ (cid:33) (cid:32)𝑇−1 ∑︁ (cid:33) (cid:33) F . ℎ𝑡L𝑡 𝑛 L𝑛 ℎ𝑡L𝑡 𝑛 𝑡=0 𝑡=0 (5.6) Following the definitions in [220], we define the 𝑡 times shifted input signal as S(𝑡) := UΛ𝑡U⊤F. We also define S(𝑖) ∈ R𝑇×𝑃 corresponding to node 𝑖 where each row is the 𝑡 times shifted input := [S(𝑡)]𝑖·. Hence, the filtered graph signal at node 𝑖 is signal at the 𝑖-th node, i.e. 
2. h update: The filter coefficients h are updated using:

$$\mathbf{h}^{(l+1)} = \arg\min_{\mathbf{h},\,\mathbf{h}^\top\mathbf{h}=1}\; \mathcal{L}(\mathbf{Z}^{(l+1)},\mathbf{h},\mathbf{V}^{(l)}) = \arg\min_{\mathbf{h},\,\mathbf{h}^\top\mathbf{h}=1}\; \frac{\rho}{2}\Big\|\mathbf{Z}^{(l+1)} - \mathbf{F} + \Big(\sum_{t=0}^{T-1}h_t\mathbf{L}_n^t\Big)\mathbf{F} + \mathbf{V}^{(l)}\Big\|_F^2 + \mathrm{tr}\bigg(\mathbf{F}^\top\Big(\sum_{t=0}^{T-1}h_t\mathbf{L}_n^t\Big)\mathbf{L}_n\Big(\sum_{t=0}^{T-1}h_t\mathbf{L}_n^t\Big)\mathbf{F}\bigg). \tag{5.6}$$

Following the definitions in [220], we define the t-times shifted input signal as S(t) := UΛ^tU^⊤F. We also define S(i) ∈ R^(T×P) corresponding to node i, where each row is the t-times shifted input signal at the i-th node, i.e., [S(i)]_(t·) := [S(t)]_(i·). Hence, the filtered graph signal at node i is F̃_(i·) = Σ_{t=0}^{T−1} h_t [S(t)]_(i·) = h^⊤S(i), and we have F̃_(ip) = h^⊤s_i^p, where s_i^p = [S(i)]_(·p). The objective function in (5.6) can then be written element-wise:

$$\mathcal{L}(\mathbf{Z}^{(l+1)},\mathbf{h},\mathbf{V}^{(l)}) = \frac{\rho}{2}\sum_{p=1}^{P}\sum_{i=1}^{N}\big(Z_{ip}^{(l+1)} - F_{ip} + \mathbf{h}^\top\mathbf{s}_i^p + V_{ip}^{(l)}\big)^2 + \sum_{p=1}^{P}\sum_{i,j=1}^{N}(\mathbf{h}^\top\mathbf{s}_i^p)\,L_{ij}\,(\mathbf{h}^\top\mathbf{s}_j^p).$$

Taking the derivative of L(Z^(l+1), h, V^(l)) with respect to h and equating it to 0 yields h^(l+1) = −Y^(−1)b, where

$$\mathbf{b} = \rho\sum_{p=1}^{P}\sum_{i=1}^{N}\mathbf{s}_i^p\big(Z_{ip}^{(l+1)} - F_{ip} + V_{ip}^{(l)}\big), \qquad \mathbf{Y} = \sum_{p=1}^{P}\bigg(2\sum_{i,j=1}^{N}\mathbf{s}_i^p L_{ij}\mathbf{s}_j^{p\top} + \rho\sum_{i=1}^{N}\mathbf{s}_i^p\mathbf{s}_i^{p\top}\bigg).$$

Finally, in order to satisfy the constraint h^⊤h = 1, h^(l+1) is projected onto the set defined by h^⊤h = 1.

3. V update: The Lagrange multiplier is updated using:

$$\mathbf{V}^{(l+1)} = \mathbf{V}^{(l)} + \rho\bigg(\mathbf{Z}^{(l+1)} - \mathbf{F} + \Big(\sum_{t=0}^{T-1}h_t^{(l+1)}\mathbf{L}_n^t\Big)\mathbf{F}\bigg). \tag{5.7}$$

These three variables are updated until convergence, as described in Algorithm 5.1. Since our problem is a non-smooth convex optimization over the non-convex manifold h^⊤h = 1, there are no formal global convergence guarantees when applying ADMM to the proposed optimization problem. However, ADMM is known to perform well on non-convex problems, often converging to locally optimal solutions [33, 263]. Recent works also empirically show the convergence of ADMM for non-smooth problems over non-convex manifolds [142].

Algorithm 5.1: GraFHub.
Input: Adjacency matrix A, graph signal F, parameters α, ρ, and filter order T.
Output: F̃, graph filter H(Λ).
1:  L_n ← I − A_n
2:  [U, Λ] ← EVD(L_n)
3:  S(t) ← UΛ^tU^⊤F, t ∈ {0, 1, ..., T − 1}
4:  [S(i)]_(t·) := [S(t)]_(i·), for each node i ∈ V
5:  Initialize h = rand(T, 1), V = rand(N, P)
6:  while ||h^(l+1) − h^(l)||_2 > 10^(−3) do
7:      update Z^(l+1) according to Eq. (5.5)
8:      update h^(l+1) according to Eq. (5.6)
9:      update V^(l+1) according to Eq. (5.7)
10: end while
11: F̃ ← U(Σ_{t=0}^{T−1} h_t^(l+1) Λ^t)U^⊤F

5.3.3 Hub Scoring

Once the filtered graph signal F̃ is obtained, hubs are scored using the graph signal's local smoothness, i.e., the node gradient [228], before and after filtering:

$$\mathrm{scores}(i) = E(i) - \tilde{E}(i),$$

where E(i) = Σ_{j=1}^{N} A_(ij) ||F_(i·) − F_(j·)||² and Ẽ(i) = Σ_{j=1}^{N} A_(ij) ||F̃_(i·) − F̃_(j·)||² are the node gradients at node i for the original and filtered signals, respectively. This metric quantifies the difference in the similarity of a node's value to its neighbors before and after filtering. It is hypothesized that this difference will be larger for hub nodes, as a hub node's activity tends to be dissimilar to that of its neighbors (Assumption 1). Once the scores for each node i ∈ V are computed, hubs are detected in two different ways: (i) thresholding and (ii) top-K hubs. For thresholding, we use the z-score approach, i.e., nodes whose z-score is larger than 3 are denoted as hubs. For the top-K hub approach, we consider the top K nodes with the highest scores as hubs [273]. In our experiments, K is chosen as the point where there is a significant drop in the hub scoring metric, similar to the elbow criterion [183], as detailed in Section 5.4.
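A minimal sketch of the scoring and thresholding steps, assuming numpy and a dense adjacency matrix (names are illustrative):

import numpy as np

def hub_scores(A, F, F_filt):
    """A: adjacency, F: original signal, F_filt: filtered signal; returns per-node scores."""
    def node_gradient(X):
        # E(i) = sum_j A_ij * ||X_i. - X_j.||^2, computed for all nodes at once
        sq = (X ** 2).sum(axis=1)
        D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
        return (A * D2).sum(axis=1)
    return node_gradient(F) - node_gradient(F_filt)

def hubs_by_zscore(scores, z=3.0):
    # Thresholding variant: nodes whose score z-score exceeds z
    return np.flatnonzero((scores - scores.mean()) / scores.std() > z)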
5.4 Experiments on Simulated Data

5.4.1 Benchmark Methods

In this section, we evaluate the performance of our method, GraFHub, on simulated data. We compare the accuracy of GraFHub to three groups of commonly used hub node detection methods. The first group of methods relies only on the graph topology. This group includes graph-theoretic centrality measures such as degree, eigenvector, and betweenness centrality [253]; Graph Fourier Transform Centrality (GFT-C) [227], a GSP-based method that uses the GFT coefficients of an importance signal derived from the shortest paths with respect to a particular node; and the joint hub identification (JHI) method [273], which uses the spectrum of the graph to detect connector hubs. The second group of methods relies only on the graph signals and does not consider graph connectivity. This class includes anomaly detection methods such as Isolation Forest [159]. These methods learn an anomaly region and classify the nodes based on whether the node resides within the region or not. Unlike our method, none of these methods utilizes both the graph topology and the graph signals. Finally, we compare our learning-based method to a fixed graph filtering based method, graph high-pass filtering (GHF) [291]. GHF utilizes the connectivity information and the graph signal to detect hub nodes by solving the following optimization problem to obtain the graph signals F̃ corresponding to the non-hub activity:

$$\min_{\tilde{\mathbf{F}}}\; \frac{1}{2}\big\|(\mathbf{I}+\beta\mathbf{L}_n)^{1/2}(\tilde{\mathbf{F}}-\mathbf{F})\big\|_F^2 + \frac{\xi}{2}\,\mathrm{tr}\big(\tilde{\mathbf{F}}^\top\mathbf{L}_n\tilde{\mathbf{F}}\big), \tag{5.8}$$

where β is a parameter that controls the cutoff frequency of the high-pass filter and ξ is the regularization parameter. Similar to GraFHub, GHF learns an F̃ that is smooth with respect to the underlying graph. Unlike GraFHub, this method does not learn the filter shape and does not enforce sparsity on the hub nodes.

5.4.2 Simulated Data

The simulated data are generated by first constructing a graph G from either the Erdős–Rényi (ER) or Barabási–Albert (BA) model. The ER model creates a random graph where each edge is generated independently with probability p, resulting in a graph with no inherent structure, where edges are uniformly distributed. In our simulations, an ER random graph with N nodes and edge probability p = 0.1 is generated. The BA model follows a preferential attachment mechanism with growth parameter m, producing a scale-free graph. In particular, a graph with N nodes is generated iteratively, starting from an initial graph with m + 1 nodes. At each iteration, a new node is added to the graph with m edges, which preferentially attach to higher-degree nodes. The generated graph has a few highly connected nodes and many nodes with fewer connections, mimicking real-world networks such as social and biological networks. In the following, the BA parameter m is set to 3. Once G is constructed, P smooth graph signals X = [x_1 | ··· | x_P] are generated as the non-hub nodes' activity using Tikhonov filtering, i.e., x_p = (γL_n + I)^(−1)x_0 ∈ R^N, p ∈ {1, ..., P}, where x_0 ∼ N(0, I) and γ is the degree of smoothness. We then add synthetic hub nodes to X by selecting a fixed percentage of the nodes as hubs. For the ER graph, these hub nodes are selected randomly, while for the BA graphs they are selected in two different ways: (i) hub nodes are selected as the nodes with the highest degree in the generated graph (BAdegree); (ii) half of the hub nodes are selected randomly and the other half are selected from the nodes with the highest degree (BAmixed). The signal values of the selected hub nodes are perturbed by adding uniform noise in the interval [−uσ, uσ], where σ is the standard deviation of the norms of the columns X_(·i) and u is the strength of the hubs.
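A minimal sketch of this generation process for the ER case, assuming numpy and networkx; the BA variants differ only in how the hub set is chosen (by degree rather than at random).

import numpy as np
import networkx as nx

def simulate_er(N=1000, P=100, gamma=30.0, u=1.0, hub_frac=0.10, p=0.1, seed=0):
    rng = np.random.default_rng(seed)
    G = nx.erdos_renyi_graph(N, p, seed=seed)
    L_n = nx.normalized_laplacian_matrix(G).toarray()
    # Tikhonov filtering: x_p = (gamma * L_n + I)^{-1} x_0, with x_0 ~ N(0, I)
    X = np.linalg.solve(gamma * L_n + np.eye(N), rng.standard_normal((N, P)))
    # Randomly selected hubs (ER case)
    hubs = rng.choice(N, size=int(hub_frac * N), replace=False)
    sigma = np.linalg.norm(X, axis=0).std()  # spread of the column norms
    X[hubs] += rng.uniform(-u * sigma, u * sigma, size=(len(hubs), P))
    return G, X, hubs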
Unless noted otherwise, N = 1000, P = 100, γ = 30, u = 1, and the hub nodes represent 10% of N. For all experiments, we report the performance of the aforementioned methods and our proposed method, GraFHub. The best α and T values for GraFHub are determined from α ∈ {0.001, 0.01, 0.1, 1, 10, 50, 100, 1000, 2000} and T between 2 and 6, and the results with the best AUC performance are reported. For GHF, ξ/2 = 1/α and β = 1. For GFT-C, the weights of the weighting function are computed as described in [227]. For JHI, ρ = 2 and μ is selected from the same set of values as α above, and the results with the best AUC values are reported. For the signal-based anomaly detection method, Isolation Forest, the default parameters are used, with the number of estimators set to 100.

The performance is quantified using the Area Under the Receiver Operating Characteristic curve (AUC-ROC), where the hub scores returned by the methods are used without any thresholding or ranking. The average AUC-ROC over 50 runs is reported.

Figure 5.1: Performance of GraFHub on synthetic attributed graphs. AUC vs. hub signal strength (u) (first row) and AUC vs. percentage of nodes that are hubs (second row). From left to right: ER, BAdegree, and BAmixed models.
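The (α, T) selection described above amounts to a small grid search scored by AUC-ROC. A minimal sketch is given below; score_fn is a hypothetical stand-in for running Algorithm 5.1 and scoring nodes as in Section 5.3.3:

```python
from itertools import product
from sklearn.metrics import roc_auc_score

ALPHAS = [0.001, 0.01, 0.1, 1, 10, 50, 100, 1000, 2000]
ORDERS = range(2, 7)  # filter orders T = 2, ..., 6

def select_params(A, F, hub_labels, score_fn):
    """Return the (alpha, T) pair with the best AUC-ROC.

    hub_labels is a binary vector marking the injected hubs;
    score_fn(A, F, alpha, T) returns the hub scores of Section 5.3.3."""
    results = {}
    for alpha, T in product(ALPHAS, ORDERS):
        scores = score_fn(A, F, alpha, T)
        results[(alpha, T)] = roc_auc_score(hub_labels, scores)
    best = max(results, key=results.get)
    return best, results[best]
```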
Experiment 1: Strength of hubs
In the first experiment, we vary the strength of the hubs, u. The top row of Figure 5.1 shows the results for ER (Figure 5.1 (a)), BAdegree (Figure 5.1 (b)), and BAmixed (Figure 5.1 (c)) as u is increased from 0.5 to 3. It can be seen that the AUC-ROC score improves as u increases for most of the methods that utilize graph signals for hub identification, since the hubs become better separated. Methods that rely on network connectivity alone, such as the centrality measures and JHI, show no change in performance as hub strength increases. These methods are inherently limited because they do not take the graph signal into account. In the ER and BAmixed models, GraFHub outperforms all of the other methods even when the strength of the hub signal is weak. GraFHub effectively captures the differences between hub and non-hub nodes by learning an optimal graph filter, allowing for better hub identification even when the hub nodes' strength is weak. For the BAdegree case, the performances of GraFHub, GHF, and degree centrality are close to each other, as shown in Figure 5.1 (b). This is expected since hubs in BAdegree are defined by high connectivity, which aligns with the definition of hubs in degree centrality. GHF performs similarly to our method for BAdegree; however, GraFHub outperforms GHF for the ER and BAmixed cases, as it learns the filter from the graph and the observed node activity, unlike GHF, which uses a pre-determined filter.

Experiment 2: Percentage of hub nodes
In the second experiment, we evaluate the performance of the methods as the percentage of hub nodes increases. The bottom row of Figure 5.1 shows the AUC-ROC for ER (Figure 5.1 (d)), BAdegree (Figure 5.1 (e)), and BAmixed (Figure 5.1 (f)), where the percentage of hub nodes is increased from 5% to 30%. The performance of most methods decreases as the percentage of hub nodes increases. In the case of GraFHub, this drop in performance can be attributed to one of the core assumptions of our method, i.e., the sparsity of hub nodes. The methods that rely only on connectivity information or node signal information are not affected by the number of hubs. However, they still perform worse than methods that use both connectivity and graph signals. In general, GraFHub has higher AUC-ROC scores than the other methods. For BAdegree, GraFHub and degree centrality show similar performance since the hub node definition aligns with degree centrality, as discussed above. For ER and BAmixed, GraFHub performs the best, followed by GHF. In particular, when the number of hub nodes is small, the performance gain by GraFHub is apparent thanks to our sparsity assumption.

5.5 Application to Resting State fMRI Data
5.5.1 HCP Data
The proposed method is applied to structural and functional neuroimaging data from 56 subjects collected as part of the Human Connectome Project (HCP; db.humanconnectome.org). The subjects are selected from HCP's healthy young adult study (HCP 900) and include 34 females and 22 males in the age range 26-35. This subject group was selected in a previous study [205] as the data is complete and does not have any missing sessions. The consent forms, including consent to share de-identified data, were collected for all subjects (within the HCP) and approved by the Washington University institutional review board. All methods were carried out in accordance with relevant guidelines and regulations.

Data acquisition is performed using a Siemens 3T Skyra with a 32-channel head coil [255]. The scanning protocol includes high-resolution T1-weighted scans (256 slices, 0.7 mm isotropic resolution, TE = 2.14 ms, TR = 2400 ms, TI = 1000 ms, flip angle = 8°, FOV = 224 × 224 mm², BW = 210 Hz/px, iPAT = 2) [100]. Diffusion data is collected with spin-echo EPI: TR = 5520 ms, TE = 89.5 ms, flip angle = 78°, refocusing flip angle = 160°, FOV = 210 × 180 mm² (RO × PE), matrix = 168 × 144 (RO × PE), 111 slices with a thickness of 1.25 mm and 1.25 mm isotropic voxel size, multiband factor 3, echo spacing = 0.78 ms, BW = 1488 Hz/px, phase partial Fourier = 6/8, b-values = 1000, 2000, and 3000 s/mm². Functional scans were collected using a multi-band sequence with MB factor 8, 2.0 mm isotropic voxels, TE = 33 ms, TR = 720 ms, flip angle = 52°, FOV = 208 × 180 mm² (RO × PE), 72 slices, BW = 290 Hz/px, and echo spacing = 0.58 ms [100]. One hour of resting state data was acquired per subject in 15-minute intervals over two separate sessions, with eyes open and fixation on a crosshair. Within each session, oblique axial acquisitions alternated between phase encoding in a right-to-left (RL) direction in one run and left-to-right (LR) in the other run. Minimally preprocessed, ICA-FIX cleaned resting-state fMRI (rs-fMRI) and diffusion-weighted preprocessed images from HCP are used in the following analysis.

5.5.2 Preprocessing
The images in the HCP dataset were minimally preprocessed as described in [100]. Briefly, each image was corrected for gradient distortion and motion and aligned to a corresponding T1-weighted (T1w) image with one spline interpolation step. This was further corrected for intensity bias, normalized to a mean of 10,000, projected to a 32k fs_LR mesh, excluding outliers, and aligned to a common space using a multi-modal surface registration [211]. Building on this preprocessing framework, diffusion-weighted scans are analyzed using MRtrix3 (https://www.mrtrix.org/) to construct the structural connectomes. The following operations are employed: multi-shell multi-tissue response function estimation, Glasser's multimodal cortical atlas parcellation, constrained spherical deconvolution, and tractogram generation with 10⁶ output streamlines.
The volume is split into N = 360 regions across the two hemispheres (180 areas on the right and 180 areas on the left). In each hemisphere, the Glasser atlas groups the 180 "areas" into 22 separate "regions", which are referred to here as the 22 larger partition cortices; each of the 180 areas belongs to one of the 22 cortices. The number of fibers connecting two regions, divided by the volume of the atlas regions, is used to quantify the structural connectivity.

As an additional preprocessing step, the resting-state fMRI data is cleaned of structured noise using ICA-FIX, a method that combines independent component analysis with the FSL tool FIX to automatically remove artifactual or "bad" components [104]. Following this denoising step, functional volumes are spatially smoothed with a Gaussian kernel (5 mm full-width at half-maximum). The first 10 volumes are discarded so that the fMRI signal achieves steady-state magnetization, resulting in P = 1190 time points. Voxel fMRI time courses are detrended and band-pass filtered to [0.01 - 0.15] Hz to improve the signal-to-noise ratio for typical resting-state fluctuations. Finally, Glasser's multimodal parcellation (the same used for the structural connectome), resliced to the fMRI resolution, is used to parcellate the fMRI volumes and compute regionally averaged fMRI signals. These were z-scored and stored in an N × P matrix. The functional connectivity network for each subject is also constructed by computing the pairwise Pearson correlation between the time-series data corresponding to each region. For baseline comparison with respect to functional connectivity based hub node detection, node strengths, i.e., degrees, of the functional connectome are computed as the sum of absolute correlation values.

5.5.3 Hub Node Detection
GraFHub is applied to the HCP data from the two sessions, where the structural networks correspond to the graphs and the BOLD signals to the graph signals. The hub nodes for all subjects and sessions are detected separately based on the thresholding or top K-hub methods described in Section 5.3.3. For the thresholding method, hubs are nodes with a z-score greater than 3. For the top K-hub method, the value of K is determined by calculating the average hub score of each node across subjects, and K is set to the value where there is a significant drop in the average hub score, as shown in Figure 5.2. Based on Figure 5.2, we select K = 8.

Figure 5.2: Average hub scores across subjects, sorted in decreasing order.

The detected hub nodes are further filtered through the participation coefficient to identify the connector hubs. For this purpose, the Louvain algorithm [28] is applied to each subject's structural connectivity graph to detect the community structure. Each node's participation coefficient, which quantifies how evenly a node's connections are distributed with respect to the community structure, is computed as [106]:

P_i = 1 - \sum_{s=1}^{N_M} \Big( \frac{k_{is}}{k_i} \Big)^2, \quad (5.9)

where N_M is the number of identified modules (communities), k_i is the degree of node i, and k_{is} is the total strength of the connections node i makes with nodes in module s. The hub nodes with participation coefficient 0.35 < P_i < 0.72 are considered connector hubs [106].
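This connector-hub filtering step can be sketched as follows (an illustration assuming NetworkX's Louvain implementation; edge weights default to 1 when absent):

```python
import numpy as np
import networkx as nx

def participation_coefficients(G, communities):
    """P_i = 1 - sum_s (k_is / k_i)^2 (Eq. 5.9), with weighted degrees."""
    node_comm = {v: s for s, comm in enumerate(communities) for v in comm}
    P = {}
    for i in G:
        k_i = G.degree(i, weight="weight")
        if k_i == 0:
            P[i] = 0.0
            continue
        # Strength of i's connections into each module s.
        k_is = np.zeros(len(communities))
        for j, data in G[i].items():
            k_is[node_comm[j]] += data.get("weight", 1.0)
        P[i] = 1.0 - np.sum((k_is / k_i) ** 2)
    return P

def connector_hubs(G, hubs, lo=0.35, hi=0.72):
    # Keep detected hubs whose participation coefficient lies in (lo, hi).
    comms = nx.community.louvain_communities(G, weight="weight", seed=0)
    P = participation_coefficients(G, comms)
    return [i for i in hubs if lo < P[i] < hi]
```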
Since hub node detection is an unsupervised task, we determined the optimal values for the filter order, T, and the hyperparameter, α, in (5.1) based on the consistency of hubs across subjects, inspired by [271]. For each (T, α) pair, a 56 × 22 matrix is constructed, where each row indicates the number of hub nodes in each brain cortex (based on the Glasser parcellation) for a given subject across the two sessions. Next, the consistency of the detected hub nodes across subjects is quantified by computing the 56 × 56 correlation matrix, C, and ||C||_F is computed to quantify the average correlation of the detected hub nodes in each region across subjects. The (T, α) pair that yields the highest norm is selected.
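A minimal sketch of this consistency-based selection is given below, assuming the hub counts for each (T, α) pair have already been collected into 56 × 22 matrices:

```python
import numpy as np

def consistency_score(hub_counts):
    """hub_counts: (56, 22) array; entry (s, c) is the number of hubs
    detected in cortex c for subject s across the two sessions."""
    C = np.corrcoef(hub_counts)        # 56 x 56 subject-by-subject correlation
    return np.linalg.norm(C, "fro")    # ||C||_F

def select_by_consistency(counts_by_params):
    # counts_by_params: dict mapping (T, alpha) -> (56, 22) hub-count matrix.
    return max(counts_by_params,
               key=lambda k: consistency_score(counts_by_params[k]))
```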
In Figure 5.3, we show the effect of T and α on the consistency of the detected hubs across subjects in different brain networks. In the top row, we show the effect of varying the filter order for α = 0.5 and α = 5. When α is small, i.e., the error term is weighted less heavily, the number of hub nodes in each brain region does not vary much. On the other hand, when α is large, i.e., the importance of sparsity is high, the variation of hub nodes increases for the Default Mode Network. In the bottom row, we show the effect of varying α for T = 7 and T = 3. When T = 7, varying α does not change the number of hub nodes in each brain region. When T = 3, the variation increases for the somatomotor and default mode networks. This result aligns with our understanding of graph filters and brain networks: T = 3 only captures local neighborhoods and may not be sufficient for correctly identifying the hub nodes. Moreover, the somatomotor and default mode networks play an important role in rs-fMRI and are known to have more hub nodes; thus, a change to the filter order T or the sparsity parameter α affects these regions more than others. For the thresholding method, the optimal parameters are found as T = 6 and α = 5. For the top K-hub method, the optimal parameters are T = 6 and α = 0.2.

Figure 5.3: Robustness of GraFHub to the choice of the hyperparameters α and T. Top row: variation of hub nodes in each brain region with respect to T for fixed α ((a) α = 0.5, (b) α = 5). Bottom row: variation of hub nodes in each brain region with respect to α for fixed T ((c) T = 7, (d) T = 3).

Figure 5.4 illustrates the top-K connector hubs detected by GraFHub across all subjects for one run. In particular, the size of a hub denotes how many times that node has been detected as a hub across the 56 subjects. The color of a hub denotes the resting-state brain network, determined by Yeo's parcellation [278], to which the node belongs. From this figure, it can be seen that nodes in the default mode network (DMN), frontal parietal and dorsal attention networks are consistently selected as hubs across subjects.

Figure 5.4: Consistency of the top-K hubs detected by GraFHub across all subjects plotted over the brain topomap. The size and the color of the nodes correspond to the number of times across 56 subjects a particular node has been detected as a hub and the brain network (Yeo's parcellation networks) to which the node belongs, respectively.

For both detection methods of GraFHub, the percentage of hubs within a brain network across the two sessions and all subjects is calculated similarly to a recently published meta-analysis of hub nodes in rs-fMRI [271]. While hub detection is performed on the Glasser parcellation with 360 regions, the percentage of hubs is reported for the larger brain networks determined by Yeo's parcellation with 7 networks (Table 5.1). We compare the different implementations of GraFHub, denoted GraFHubZ and GraFHubK for the thresholding and top K-hub approaches, respectively, with the methods discussed in Section 5.4. For the centrality-based methods, i.e., degree, eigenvector, and betweenness, nodes with top-10 hub scores are selected as hubs. For methods that are based on graph connectivity, i.e., the centrality-based measures, GHFC, GFT-C, and JHI, hubs are further filtered using the participation coefficient to find connector hubs, similar to GraFHub. For Isolation Forest, all detected hubs are used since it only employs graph signals.

Table 5.1: Percentage of hub nodes detected in brain networks defined by Yeo's parcellation [278].

Networks (area%)           Degree  Eigenvector  Betweenness  Isolation Forest  GHFC  GFT-C  JHI   Meta Analysis [271]  GraFHubZ  GraFHubK
Visual (14.8%)             38.2    40.4         7.5          3.9               4.8   14.6   35.3  9.9                  8.0       6.3
SomatoMotor (20.2%)        26.7    26.5         8.2          3.6               15.9  10.7   23.0  14.4                 16.0      18.7
Dorsal Attention (11.4%)   13.1    12.3         6.8          5.7               15.5  22.9   12.0  16.5                 14.9      17.7
Ventral Attention (12.1%)  15.4    15.2         6.4          4.8               18.8  9.1    10.1  15.6                 11.2      12.4
Limbic (7.8%)              0.1     0.0          39.4         58.9              0.5   9.9    0.0   0.2                  0.0       0.5
FrontalParietal (12.9%)    3.0     2.4          8.2          4.5               12.6  9.1    6.9   15.9                 9.6       10.9
Default (20.8%)            3.7     3.3          23.5         18.8              31.9  23.7   12.6  27.5                 40.4      33.4

Figure 5.5: (a) Average filter response across subjects. (b) Average Spectral Energy Distribution (SED) of the graph signal F across all subjects. (c) SED of the graph signal F of subject 13 (Average Hub Score = 3.7354). (d) SED of the graph signal F of subject 36 (Average Hub Score = 2.7851).

5.5.4 Frequency Response of the Learned Filters
In Figure 5.5a, the frequency response of the learned filter, H(Λ), averaged across subjects, is shown. As expected, the frequency response is low-pass, since F̃ corresponds to the non-hub, or smooth, activity on the graph (Assumption 2). The standard deviation of the filter across subjects shows that, while there is subject-to-subject variation in the magnitude response of the filter, the overall shape does not change across subjects. In order to better understand and interpret the frequency response of the learned filters, we examine the Spectral Energy Distribution (SED) of the graph signals with respect to the eigenvalues of the normalized graph Laplacian. For a graph signal F and its GFT F̂ = U⊤F, the spectral energy distribution at the i-th eigenvalue λ_i is defined as the average of F̂_{ip}^2 / \sum_{i=1}^{N} F̂_{ip}^2 across the P graph signals. Figure 5.5b shows the average SED of the graph signals across all subjects. As can be seen, the spectral energy distribution averaged across all subjects is mostly localized in the low-frequency range, similar to the learned filter in Figure 5.5a. This alignment of the filter shape with the SED profile is expected, as the filter is learned to capture the non-hub activity, i.e., the low-frequency content.
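The SED computation can be sketched as follows, assuming U contains the eigenvectors of the normalized Laplacian (sorted by eigenvalue) and F is the N × P signal matrix:

```python
import numpy as np

def spectral_energy_distribution(U, F):
    """SED(lambda_i): per-eigenvalue share of signal energy,
    averaged over the P graph signals (columns of F)."""
    F_hat = U.T @ F                      # graph Fourier transform, (N, P)
    energy = F_hat ** 2
    return (energy / energy.sum(axis=0)).mean(axis=1)  # length-N vector

def sed_ratio(sed, lam, cutoff=1.0):
    # High-to-low frequency energy ratio used in Section 5.5.4.
    return sed[lam >= cutoff].sum() / sed[lam < cutoff].sum()
```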
In order to further investigate the relationship between the underlying signal's graph spectrum and the learned hub nodes, we compute the ratio of the total SED in the high-frequency band to the total SED in the low-frequency band, \sum_{\lambda_i \geq 1} \mathrm{SED}(\lambda_i) / \sum_{\lambda_i < 1} \mathrm{SED}(\lambda_i), for each subject. In addition, the hub activity for each subject is quantified by the average score of the top-K hub nodes. Figure 5.5c shows the SED of subject 13, which is the subject with the highest hub activity and the highest SED ratio. On the other hand, Figure 5.5d shows the SED of subject 36, which is the subject with the lowest hub scores and the lowest SED ratio. From these figures, it can be seen that subject 13 has significant high-frequency activity compared to subject 36, whose SED is mostly concentrated in the low frequencies. The average hub activity z-score for subject 13 is 3.7354, compared to an average hub activity z-score of 3.0959 across all subjects; the hub activity z-score for subject 36, on the other hand, is 2.7851. These results validate Assumption 1, as the subjects with high hub activity have more high-frequency content.

5.5.5 Inter-Subject Variability
The consistency of the hubs detected by GraFHub across subjects is quantified using normalized entropy. For each cortex r, we construct a vector x^r ∈ R^{56×1} whose entries correspond to the number of hub nodes within cortex r for a particular subject. After normalizing this vector, p_i^r = x_i^r / \sum_{i=1}^{56} x_i^r, we calculate the normalized entropy for cortex r as:

S_r = - \frac{\sum_{i=1}^{56} p_i^r \log_2(p_i^r)}{\log_2(56)}. \quad (5.10)

The higher the entropy, the more consistent the number of hubs is in that cortex across subjects.

In order to quantify the significance of the normalized entropy estimates, we utilize bias-corrected and accelerated (BCa) bootstrapping with 9,999 samples and calculate the normalized entropy of each sample. We report the cortices with significant normalized entropy values at the 95% confidence interval in Table 5.2. Cortices with high normalized entropy correspond to regions with higher consistency across subjects. For example, the posterior cingulate cortex (PCC) has the highest normalized entropy for GraFHub and is part of the DMN, which has the highest percentage of hubs according to Table 5.1. Similarly, the anterior cingulate and medial prefrontal cortices have high normalized entropy and correspond to the frontoparietal and ventral attention networks. Thus, the statistical significance of the brain networks with a high percentage of hub nodes in Table 5.1 is established through normalized entropy, as the cortices with significantly high normalized entropy correspond to these networks.
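A minimal sketch of the normalized entropy and its BCa bootstrap interval, assuming SciPy's bootstrap routine, is given below:

```python
import numpy as np
from scipy.stats import bootstrap

def normalized_entropy(x):
    """S_r in Eq. (5.10): entropy of hub counts across 56 subjects,
    normalized by log2(56) so that S_r lies in [0, 1]."""
    x = np.asarray(x, dtype=float)
    if x.sum() == 0:
        return 0.0
    p = x / x.sum()
    p = p[p > 0]                      # 0 * log(0) is treated as 0
    return -(p * np.log2(p)).sum() / np.log2(len(x))

def entropy_ci(x, n_resamples=9999, seed=0):
    # BCa bootstrap 95% confidence interval for one cortex's entropy.
    res = bootstrap((np.asarray(x, dtype=float),), normalized_entropy,
                    n_resamples=n_resamples, method="BCa",
                    random_state=seed, vectorized=False)
    return res.confidence_interval
```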
Table 5.2: Normalized entropy of the hub nodes determined by GraFHub for each cortex. "−" refers to cortices where no hub nodes are detected for any of the subjects. "∗" refers to cortices that have statistically significant normalized entropy values.

Cortices                                    GraFHubZ  GraFHubK
Anterior Cingulate and Medial Prefrontal    0.64*     0.77*
Auditory Association                        0.00      0.46
Dorsal Stream Visual                        0.17      0.50
Dorsolateral Prefrontal                     −         −
Early Auditory                              0.34      0.00
Early Visual                                0.00      0.33
Inferior Frontal                            −         −
Inferior Parietal                           0.17      0.00
Insular and Frontal Opercular               0.58      0.30
Lateral Temporal                            0.67*     −
MT+ Complex and Neighboring Visual Areas    0.17      0.53
Medial Temporal                             0.49      0.16
Orbital and Polar Frontal                   0.47      0.00
Paracentral Lobular and Mid Cingulate       −         0.60
Posterior Cingulate                         0.80*     0.90*
Posterior Opercular                         0.16      0.00
Premotor                                    −         0.50
Primary Visual                              0.56*     −
Somatosensory and Motor                     −         0.34
Superior Parietal                           0.39      −
Temporo Parieto Occipital Junction          0.61      0.70*
Ventral Stream Visual                       0.00      −

5.5.6 Verification of Hubs Through Global Efficiency
Global Efficiency (GE) is a metric that characterizes the efficiency of a parallel working system, in which all the nodes in the network exchange information simultaneously [146, 3]; it is, therefore, a measure of integration and global communication efficiency. Given a graph, GE is defined as the average inverse shortest path length in the network, which is inversely related to the characteristic path length [216, 215]:

\mathrm{GE} = \frac{1}{N(N-1)} \sum_{i,j \in V,\, i \neq j} \frac{1}{d_{ij}}, \quad (5.11)

where d_{ij} is the length of the shortest path between nodes i and j. A small-world network will have a GE greater than that of a regular lattice but less than that of a random network. In order to verify that the detected hub nodes are important for the overall information processing in the brain network, we calculate the change in GE when a hub node is removed from the network. In particular, we remove each node and its connections from the network and calculate the GE of the resulting network. We then compute the difference in the global efficiency of the network before and after node removal, and we repeat this procedure for each node. A larger difference in GE implies that the removed node is important for information processing, i.e., it has "hub-like" characteristics. For each subject, we calculate the average difference in the global efficiency of the network before and after removing non-hub and hub nodes, as shown in the first two columns of Table 5.3.

Table 5.3: Average difference in GE for hub nodes detected by GraFHub compared to hub nodes removed by other methods.

Method       GraFHub (Non-Hubs)  GraFHub (Hubs)  Degree      Eigenvector
ΔGE (10⁻⁴)   0.59 ± 0.13         7.40 ± 4.46     1.02 ± 0.6  1.06 ± 0.6

As expected, the difference in global efficiency when hub nodes are removed is larger than when non-hub nodes are removed for every subject. The average difference is about 12.1 times the global efficiency loss observed when non-hub nodes are removed. This result shows that the detected hub nodes are indeed more important for information processing in the brain and contribute more to the small-world characteristics. In addition, we compared the change in GE when hub nodes detected by GraFHub are removed versus the change for hub nodes detected by two other centrality measures. The loss in global efficiency was greater for GraFHub compared to the loss when hub nodes detected by eigenvector and degree centrality are removed. This indicates that our method detects hub nodes that contribute more to the network organization and information transfer compared to traditional methods (see the last three columns of Table 5.3).
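This node-removal verification can be sketched as follows (an illustration using NetworkX's global_efficiency, which computes Eq. (5.11) with unweighted shortest paths):

```python
import networkx as nx

def global_efficiency_drop(G, node):
    """Change in global efficiency (Eq. 5.11) when `node` and its
    connections are removed from the network."""
    ge_full = nx.global_efficiency(G)
    H = G.copy()
    H.remove_node(node)
    return ge_full - nx.global_efficiency(H)

def average_drops(G, hubs):
    # Compare the average GE drop for detected hubs vs. non-hub nodes.
    drops = {i: global_efficiency_drop(G, i) for i in G}
    non_hubs = [i for i in G if i not in hubs]
    hub_mean = sum(drops[i] for i in hubs) / len(hubs)
    non_mean = sum(drops[i] for i in non_hubs) / len(non_hubs)
    return hub_mean, non_mean
```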
5.6 Discussion
From Table 5.1, it can be seen that graph-theoretic methods such as degree and eigenvector centrality detect hub nodes that are concentrated in the visual and somatomotor networks. These networks comprise the primary sensory-motor cortices and have been shown to have high global connectivity in prior studies [58]. This high connectivity may either be due to the relatively large size of these networks or reflect the privileged placement of visual processing in the human brain [251]. Similarly, JHI detects hub nodes consistent with degree centrality, as it relies directly on the graph spectrum of the connectivity graph without using the BOLD signal. On the other hand, the betweenness centrality measure detects hub nodes in the limbic and default mode networks. While the DMN is known to be a critical network during resting state [70], less is known about the limbic network. Similarly, methods that utilize only the functional BOLD signals primarily detect hub nodes in the limbic network. This network consists of regions outside the cerebral cortex and is important for emotion, reward, and other valence processing [192]. Unfortunately, the areas that form this resting state network are usually poorly visualized with fMRI, as nearby portions of the skull create susceptibility artifacts [222]. Thus, the BOLD signal values in this network may be very different from those in other regions, causing methods that utilize only the signals to detect these regions as hub nodes.

As there is no ground truth for hub nodes, in this chapter we compare our results to a recent harmonized meta-connectomic analysis [271] of resting-state functional MRI data from 5212 healthy young adults across 61 independent cohorts. The majority of the hub nodes detected by GraFHub are in the DMN, followed by the somatomotor and dorsal attention networks, similar to the ordering in [271]. These results are also consistent with prior studies which report components of the DMN as hubs [58, 247, 70]. The DMN has been noted to be active primarily in studies of resting state activity [208] and is engaged by mind wandering [175], prospective and retrospective self-reflection [63], and memory retrieval [40], suggesting that the 'default mode' involves ongoing processing of information for relevance to the self. In prior studies, the DMN has been shown to have the highest global brain connectivity, which may reflect connections necessary to implement the wide variety of cognitive functions the network is involved in. In conjunction with the DMN, another large-scale network implementing a variety of cognitive functions, the cognitive control network (CCN), is also among the highest connectivity networks [58]. While our results indicate that the majority of the hub nodes are in the DMN, this is followed by the somatomotor, dorsal and ventral attention networks, which are part of the CCN.

The proposed graph filter learning framework provides additional interpretability regarding the importance of structure-function coupling in hub node identification. In particular, the frequency response of the optimal graph filter and the spectral energy distribution of the BOLD signal are shown to be closely related to the average hub score for a given subject. Thus, subjects whose BOLD signals have higher graph frequency content, i.e., reduced structure-function coupling, tend to have more hub node activity. In addition to introducing a learning framework, the proposed approach also introduces a new hub scoring metric and new hub node detection methods. Comparing the thresholding and top-K approaches in Table 5.1, we can see that the top-K approach distributes the hub nodes more evenly across resting state networks.
This is because the same number of hub nodes, K, is selected for every subject, treating each subject equally, whereas with the thresholding method one may detect more hubs for one subject than for another, capturing only the highest activity regions, such as the DMN, and neglecting other important networks.

5.7 Conclusions
In this chapter, we introduced a graph signal processing based framework for identifying the hub nodes in the brain. The proposed framework relies on the assumption that hub nodes are highly connected and have high activity levels with respect to their neighbors. From the perspective of GSP, this assumption results in modeling the hub nodes' activity as high-frequency with respect to the underlying graph, while the non-hub nodes have low-frequency, or smooth, activity. This model is implemented through an optimization problem that learns the optimal graph filter for detecting hub nodes. The proposed framework, GraFHub, is applied to both simulated and real brain network data. It is shown that GraFHub performs better than existing connectivity-based hub node identification methods for both simulated and real brain networks, as it takes into account the coupling between the graph topology and the graph signals defined on it. Moreover, the learned graph filters are low-pass, and the filter response is highly correlated with the spectral energy distribution of the signals. Thus, learning the optimal filter provides interpretability to the spectrum of the underlying graph signal and can be used as a predictor for the number of hubs in a given brain network.

CHAPTER 6
CONCLUSIONS
In this thesis, methods for community detection and hub node identification problems in complex networks using graph-based learning techniques are presented. The contributions of this work span multiple aspects of network analysis, including community detection in multiplex networks, discriminative subgraph identification between different multiplex networks, and graph filtering for clustering attributed graphs and hub node identification.

In Chapter 2, we presented an algorithm for community detection in multiplex networks that identifies both common and private communities. We also proposed an algorithm for determining the number of communities, which is an input parameter in most community detection methods. The experimental results indicate that our method is superior to existing multiplex community detection methods, as it does not enforce a consensus community structure. A proof of convergence is provided, along with an in-depth analysis of the algorithm, including studies of overfitting and ablation, recovery guarantees, and consistency. The proposed algorithm is evaluated on synthetic and real multiplex networks, as well as for multiview clustering applications, and compared to state-of-the-art techniques. MX-ONMTF consistently outperformed established approaches across a range of synthetic scenarios with varying complexity, noise levels, and inter-layer dependencies. Additionally, when applied to real-world multiplex networks, including social networks and biological data, our method identified meaningful community structures that aligned closely with known metadata.
In addition, the application of MX-ONMTF to an fMRI dataset, where the nodes are subjects and the layers represent different functional areas of the brain, reveals subgroups of subjects that exhibit significant differences in key functional areas, such as the default mode network (DMN) and the anterior prefrontal cortex (antPFC), as well as in their corresponding clinical scores.

In Chapter 3, we introduced a spectral clustering-based discriminative community detection framework designed to identify communities that distinguish structural differences between distinct groups or conditions. Unlike traditional methods that focus on finding shared community structures, our framework explicitly focuses on capturing discriminative network substructures across different multiplex networks. We presented three methods: the first, MX-DSC, identifies discriminative subspaces between two static multiplex networks; the second, MX-DCSC, extends this by simultaneously learning consensus, discriminative, and layer-specific subspaces; and the third, TMX-DiSG, further adapts this discriminative framework to temporal multiplex networks, finding discriminative communities between two groups across time. These methods are evaluated on synthetic networks through extensive experiments under various conditions, including changes in the noise level, variability across layers and time, and the number of shared communities. Real-world applications to EEG and dynamic fMRI brain networks demonstrated that our framework effectively identifies task-specific discriminative subgraphs.

In Chapter 4, we proposed two graph signal processing-based methods for clustering attributed networks, GraFiCA and MSGWC. GraFiCA learns finite impulse response (FIR) and infinite impulse response (IIR) graph filters, whereas MSGWC optimally combines multi-scale features derived from graph wavelet transforms. Both approaches optimize the filter parameters specifically for clustering by minimizing a novel loss function that simultaneously quantifies the within-cluster and between-cluster dissimilarity of the filtered attributes. Thus, the filter parameters are optimized for the clustering task. Experiments on various real-world datasets, including EEG brain networks, citation networks, and social graphs, revealed that both methods significantly outperform state-of-the-art techniques, yielding more accurate and interpretable clusters and learning filters that adapt to the characteristics of the datasets.

In Chapter 5, we introduced a graph signal processing based framework for identifying the hub nodes in the human brain. The proposed framework relies on the assumption that hub nodes are highly connected and have high activity levels with respect to their neighbors. From the perspective of GSP, this assumption results in modeling the hub nodes' activity as high-frequency with respect to the underlying graph, while the non-hub nodes have low-frequency, or smooth, activity. This model is implemented through an optimization problem that learns the optimal graph filter for detecting hub nodes. The proposed framework, GraFHub, is applied to both simulated and real brain network data. It is shown that GraFHub performs better than existing connectivity-based hub node identification methods for both simulated and real brain networks, as it takes into account the coupling between the graph topology and the graph signals defined on it.
Moreover, the learned graph filters are low-pass, and the filter response is highly correlated with the spectral energy distribution of the signals. Thus, learning the optimal filter provides interpretability to the spectrum of the underlying graph signal and can be used as a predictor for the number of hubs in a given brain network.

6.1 Future Work
The work in this thesis suggests new research directions. In this section, we summarize some potential areas of future work for each chapter.

6.1.1 Multiplex Community Detection
In Chapter 2, we introduced a multiplex community detection method that identifies both common and layer-specific communities in multiplex networks. While our method effectively captures the heterogeneous structure of different network layers, an important future direction is extending this model to handle signed graphs, where edges can take positive or negative values. Signed graphs naturally arise in many real-world applications. In brain network analysis, for instance, functional connectivity graphs are often signed, representing both positive (correlated) and negative (anti-correlated) interactions between brain regions. Standard community detection methods often assume only positive links, which limits their applicability to neuroscientific datasets where functional interactions are inherently signed. Furthermore, signed multiplex networks extend beyond neuroscience and have applications in social network analysis, where friendships and antagonisms co-exist, as well as in financial networks, where assets may exhibit both positive and negative correlations [150]. By developing a signed multiplex community detection framework, we can enhance our method's applicability across diverse domains. Thus, future work will focus on adapting the Multiplex Orthogonal Nonnegative Matrix Tri-Factorization framework to integrate signed adjacency matrices using one of the existing Semi-NMF variants [156, 76], ensuring that both positive and negative interactions are appropriately handled during the community detection process.

6.1.2 Discriminative Subgraph Identification between Multiple Groups
In Chapter 3, we proposed a framework for finding discriminative subgraphs between two static and dynamic multiplex networks. Future work will focus on extending this approach to multiple datasets, where the goal would be to differentiate between multiple groups of networks at the subgraph level. There are two main approaches for extending our framework: (1) identifying a discriminative subspace that uniquely characterizes each group in relation to the rest of the multiplex networks collectively, and (2) performing pairwise discrimination among multiple groups, systematically identifying subgraph features that differentiate between pairs of groups.

Beyond identifying discriminative subgraphs, this framework can also be adapted into a supervised classification framework. Currently, our method is fully unsupervised, focusing on learning discriminative and consensus subspaces between and within two multiplex networks, respectively. A key future direction is to leverage the obtained discriminative subspaces to develop a classification model capable of predicting the group affiliation of new, unseen brain connectivity networks. By training on networks from distinct groups (e.g., healthy versus diseased individuals, or patients with different neurological conditions), a supervised learning extension could enable more precise classification.
This adaptation holds significant potential for clinical applications, particularly in the early detection of neurodegenerative diseases and cognitive impairments.

6.1.3 Graph Filtering for Clustering Attributed Graphs
In Chapter 4, we introduced two methods for clustering in attributed graphs, where the parameters of FIR and IIR graph filters, along with the optimal linear weights for combining multi-scale wavelet transforms, were learned by minimizing a cost function specifically designed for the clustering task. Future work will focus on developing a family of cost functions for clustering in attributed graphs. Our proposed cost function in Chapter 4, in its general form, has two components: (i) L(C, H(Λ; β)), which quantifies the quality of the partition based on the filtered attributes, F̃ = U H(Λ) U⊤ F; and (ii) R(C, A_n), which quantifies the alignment between the community structure C and the input connectivity matrix A_n, thus explicitly taking the connectivity information into account. Moreover, L(C, H(Λ; β)) may be decomposed into two parts: ℓ_internal(C, F̃), which quantifies the cohesiveness within communities, and ℓ_external(C, F̃), which quantifies the separability between communities, resulting in the following form:

\underbrace{f\big(\ell_{internal}(\mathcal{C}, \tilde{\mathbf{F}}),\, \ell_{external}(\mathcal{C}, \tilde{\mathbf{F}})\big)}_{\mathcal{L}(\mathcal{C},\, \mathcal{H}(\boldsymbol{\Lambda}; \beta))} + \alpha\, \mathcal{R}(\mathcal{C}, \mathbf{A}_n), \quad (6.1)

where f is a mapping that combines the internal and external cluster quality functions. In the cost function proposed in Chapter 4, the quality of the clusters is determined by the Euclidean distance of the filtered attributes within and between clusters. As future work, we will consider different quality functions to quantify ℓ_internal and ℓ_external, different mappings, f, to combine them, and different regularization functions, R(C, A_n), for incorporating the connectivity information. For instance, two quality metrics we can consider for quantifying the "goodness" of a community structure are the sum of squared errors (SSE), commonly used as the cost function for k-means, and modularity. The regularization function, R(C, A_n), in (6.1), which quantifies the quality of the partition with respect to the observed connectivity matrix, will be chosen to match the chosen quality function. For example, for the k-means quality function, we will employ the Euclidean distance between A_n and the community membership matrix. Similarly, for modularity-based clustering, the regularization function will be based on the spectral clustering of the modularity matrix, B.

6.1.4 Graph Filtering for Hub Node Identification
In Chapter 5, we presented a graph signal processing based framework for identifying hub nodes in brain networks. Future work will consider several extensions of the proposed framework. First, we will consider the dynamic change in hub nodes across time.
It is well-known that rs-fMRI is a dynamic process; thus, the hub nodes may change across time, similar to network connectivity states [10]. In our current framework, the graph signal F ∈ R^{N×P} was constructed using the BOLD signals, where N represents the number of nodes (brain regions) and P represents the number of time points. However, extending this approach to a dynamic setting requires a fundamental change in the construction of F. Instead of a single F representing all time points, we need to construct a separate graph signal F_t for each time point t = 1, 2, . . . , P. To achieve this, we can redefine our graph signals using subjects as independent samples. This would result in a set of P graph signal matrices, F_t for t = 1, 2, . . . , P, where each F_t ∈ R^{N×S} represents a graph signal for a given time point across S subjects. Second, we will more closely study the relationship between hub nodes and the graph frequency spectrum. For example, the contribution of different hub nodes to the SED in different frequency bands can be quantified and used as predictors for hub node identification. Third, we will explore alternative hub scoring metrics to complement our current approach. A comparative analysis of these metrics could refine our identification strategy and improve the robustness of hub detection across different datasets and conditions.

BIBLIOGRAPHY
[1] Emmanuel Abbe. “Community detection and stochastic block models: recent developments”. In: The Journal of Machine Learning Research 18.1 (2017), pp. 6446–6531.
[2] Abubakar Abid et al. “Contrastive principal component analysis”. In: arXiv preprint arXiv:1709.06716 (2017).
[3] Sophie Achard and Ed Bullmore. “Efficiency and cost of economical brain functional networks”. In: PLoS Computational Biology 3.2 (2007), e17.
[4] Tülay Adali, Matthew Anderson, and Geng-Shen Fu. “Diversity in independent component and vector analyses: Identifiability, algorithms, and applications in medical imaging”. In: IEEE Signal Processing Magazine 31.3 (2014), pp. 18–33.
[5] Tülay Adali, MABS Akhonda, and Vince D Calhoun. “ICA and IVA for data fusion: An overview and a new approach based on disjoint subspaces”. In: IEEE Sensors Letters 3.1 (2018), pp. 1–4.
[6] Yong-Yeol Ahn, James P Bagrow, and Sune Lehmann. “Link communities reveal multiscale complexity in networks”. In: Nature 466.7307 (2010), pp. 761–764.
[7] Alberto Aleta, Sandro Meloni, and Yamir Moreno. “A multilayer perspective for the analysis of urban transportation systems”. In: Scientific Reports 7.1 (2017), pp. 1–9.
[8] Hafiz Tiomoko Ali et al. “Latent heterogeneous multilayer community detection”. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 8142–8146.
[9] Esmaeil Alinezhad et al. “Community detection in attributed networks considering both structural and attribute similarities: two mathematical programming approaches”. In: Neural Computing and Applications 32.8 (2020), pp. 3203–3220.
[10] Elena A Allen et al. “Tracking whole-brain connectivity dynamics in the resting state”. In: Cerebral Cortex 24.3 (2014), pp. 663–676.
[11] Alessia Amelio and Clara Pizzuti. “Community detection in multidimensional networks”. In: 2014 IEEE 26th International Conference on Tools with Artificial Intelligence. IEEE, 2014, pp. 352–359.
[12] Aamir Anis, Akshay Gadde, and Antonio Ortega. “Efficient sampling set selection for bandlimited graph signals using graph spectral proxies”. In: IEEE Transactions on Signal Processing 64.14 (2016), pp. 3775–3789.
[13] Muhammet Balcilar et al. “Analyzing the expressive power of graph neural networks in a spectral perspective”. In: Proceedings of the International Conference on Learning Representations (ICLR). 2021.
[14] Albert-László Barabási. “Network science”. In: Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371.1987 (2013), p. 20120375.
[15] Danielle S Bassett and Edward T Bullmore. “Small-world brain networks revisited”. In: The Neuroscientist 23.5 (2017), pp. 499–516.
[16] Marya Bazzi et al. “Generative benchmark models for mesoscale structure in multilayer networks”. In: arXiv preprint arXiv:1608.06196 (2016), p. 20.
[17] Christian F Beckmann and Stephen M Smith. “Tensorial extensions of independent component analysis for multisubject FMRI analysis”. In: NeuroImage 25.1 (2005), pp. 294–311.
[18] Hamid Behjat and Martin Larsson. “Spectral characterization of functional MRI data on voxel-resolution cortical graphs”. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE, 2020, pp. 558–562.
[19] Berrabah Bendoukha and Hafida Bendahmane. “Inequalities between the sum of powers and the exponential of sum of positive and commuting selfadjoint operators”. In: Archivum Mathematicum 47.4 (2011), pp. 257–262.
[20] Yoav Benjamini and Daniel Yekutieli. “False discovery rate-adjusted multiple confidence intervals for selected parameters”. In: JASA 100.469 (2005), pp. 71–81.
[21] Oualid Benyahia and Christine Largeron. “Centrality for graphs with numerical attributes”. In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015. 2015, pp. 1348–1353.
[22] Dimitris Berberidis and Georgios B Giannakis. “Adaptive-similarity node embedding for scalable learning over graphs”. In: arXiv preprint arXiv:1811.10797 (2018).
[23] Michele Berlingerio, Michele Coscia, and Fosca Giannotti. “Finding and characterizing communities in multidimensional networks”. In: 2011 International Conference on Advances in Social Networks Analysis and Mining. IEEE, 2011, pp. 490–494.
[24] Michele Berlingerio, Fabio Pinelli, and Francesco Calabrese. “ABACUS: frequent pattern mining-based community discovery in multidimensional networks”. In: Data Mining and Knowledge Discovery 27.3 (2013), pp. 294–320.
[25] Sharmodeep Bhattacharyya and Shirshendu Chatterjee. “Spectral clustering for multiple sparse networks: I”. In: arXiv preprint arXiv:1805.10594 (2018).
[26] Filippo Maria Bianchi et al. “Graph neural networks with convolutional ARMA filters”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 44.7 (2021), pp. 3496–3507.
[27] Bharat Biswal et al. “Functional connectivity in the motor cortex of resting human brain using echo-planar MRI”. In: Magnetic Resonance in Medicine 34.4 (1995), pp. 537–541.
[28] Vincent D Blondel et al. “Fast unfolding of communities in large networks”. In: Journal of Statistical Mechanics: Theory and Experiment 2008.10 (2008), P10008.
[29] Deyu Bo et al. “A survey on spectral graph neural networks”. In: arXiv preprint arXiv:2302.05631 (2023).
[30] Marcos Bolanos et al. “A weighted small world network measure for assessing functional connectivity”. In: Journal of Neuroscience Methods 212.1 (2013), pp. 133–142.
[31] Thomas AW Bolton et al. “Graph Slepians to strike a balance between local and global network interactions: Application to functional brain imaging”. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE, 2018, pp. 1239–1243.
[32] Oualid Boutemine and Mohamed Bouguessa. “Mining community structures in multidimensional networks”. In: ACM Transactions on Knowledge Discovery from Data (TKDD) 11.4 (2017), pp. 1–36.
[33] S. Boyd et al. “Distributed optimization and statistical learning via the alternating direction method of multipliers”. In: Foundations and Trends in Machine Learning 3.1 (Jan. 2011), pp. 1–122.
[34] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[35] Abdelbasset Brahim and Nicolas Farrugia. “Graph Fourier transform of fMRI temporal signals based on an averaged structural connectome for the classification of neuroimaging”. In: Artificial Intelligence in Medicine 106 (2020), p. 101870.
[36] Guillaume Braun, Hemant Tyagi, and Christophe Biernacki. “Clustering multilayer graphs with missing nodes”. In: International Conference on Artificial Intelligence and Statistics. PMLR, 2021, pp. 2260–2268.
[37] Maria Brbić and Ivica Kopriva. “Multi-view low-rank sparse subspace clustering”. In: Pattern Recognition 73 (2018), pp. 247–258.
[38] Randy L Buckner, Jessica R Andrews-Hanna, and Daniel L Schacter. “The brain's default network: anatomy, function, and relevance to disease”. In: Annals of the New York Academy of Sciences 1124.1 (2008), pp. 1–38.
[39] Randy L Buckner et al. “Cortical hubs revealed by intrinsic functional connectivity: mapping, assessment of stability, and relation to Alzheimer's disease”. In: Journal of Neuroscience 29.6 (2009), pp. 1860–1873.
[40] Randy L Buckner et al. “Molecular, structural, and functional characterization of Alzheimer's disease: evidence for a relationship between default activity, amyloid, and memory”. In: Journal of Neuroscience 25.34 (2005), pp. 7709–7717.
[41] Ed Bullmore and Olaf Sporns. “Complex brain networks: graph theoretical analysis of structural and functional systems”. In: Nature Reviews Neuroscience 10.3 (2009), pp. 186–198.
[42] Hongyun Cai, Vincent W Zheng, and Kevin Chang. “A comprehensive survey of graph embedding: problems, techniques and applications”. In: IEEE Transactions on Knowledge and Data Engineering 30.9 (2018), pp. 1616–1637.
[43] Vince D Calhoun, Jingyu Liu, and Tülay Adalı. “A review of group ICA for fMRI data and ICA for joint inference of imaging, genetic, and ERP data”. In: NeuroImage 45.1 (2009), S163–S172.
[44] James F Cavanagh and Michael J Frank. “Frontal theta as a mechanism for cognitive control”. In: Trends in Cognitive Sciences 18.8 (2014), pp. 414–421.
[45] Ziwei Chai et al. “Can Abnormality be Detected by Graph Neural Networks?” In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), Vienna, Austria. 2022, pp. 23–29.
[46] Guoqing Chao, Shiliang Sun, and Jinbo Bi. “A survey on multiview clustering”. In: IEEE Transactions on Artificial Intelligence 2.2 (2021), pp. 146–168.
[47] Kamalika Chaudhuri et al. “Multi-view clustering via canonical correlation analysis”. In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, pp. 129–136.
[48] Beth L Chen, David H Hall, and Dmitri B Chklovskii. “Wiring optimization can relate neuronal structure and function”. In: Proceedings of the National Academy of Sciences 103.12 (2006), pp. 4723–4728.
[49] Chuan Chen, Michael K Ng, and Shuqin Zhang. “Block spectral clustering methods for multiple graphs”. In: Numerical Linear Algebra with Applications 24.1 (2017), e2075.
[50] Jia Chen, Gang Wang, and Georgios B Giannakis. “Nonlinear dimensionality reduction for discriminative analytics of multiple datasets”. In: IEEE Transactions on Signal Processing 67.3 (2018), pp. 740–752.
[51] Mingming Chen, Konstantin Kuzmin, and Boleslaw K Szymanski. “Community detection via maximization of modularity and its variants”. In: IEEE Transactions on Computational Social Systems 1.1 (2014), pp. 46–65.
[52] Pin-Yu Chen and Alfred O Hero. “Multilayer spectral graph clustering via convex layer aggregation: Theory and algorithms”. In: IEEE Transactions on Signal and Information Processing over Networks 3.3 (2017), pp. 553–567.
[53] Siheng Chen et al. “Semi-supervised multiresolution classification using adaptive graph filtering with application to indirect bridge structural health monitoring”. In: IEEE Transactions on Signal Processing 62.11 (2014), pp. 2879–2893.
[54] Siheng Chen et al. “Signal denoising on graphs via graph filtering”. In: 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 2014, pp. 872–876.
[55] Siheng Chen et al. “Signal recovery on graphs: Variation minimization”. In: IEEE Transactions on Signal Processing 63.17 (2015), pp. 4609–4624.
[56] Fan Chung and Mary Radcliffe. “On the spectra of general random graphs”. In: The Electronic Journal of Combinatorics (2011), P215.
[57] F. R. K. Chung. Spectral Graph Theory. Providence, RI: American Mathematical Society, 1997.
[58] Michael W Cole, Sudhir Pathak, and Walter Schneider. “Identifying the brain's most globally connected regions”. In: NeuroImage 49.4 (2010), pp. 3132–3148.
[59] Andrej Čopar, Blaž Zupan, and Marinka Žitnik. “Fast optimization of non-negative matrix tri-factorization”. In: PLoS ONE 14.6 (2019), e0217994.
[71] Micha¨el Defferrard, Xavier Bresson, and Pierre Vandergheynst. “Convolutional neural net- works on graphs with fast localized spectral filtering”. In: Advances in neural information processing systems 29 (2016). [72] Shay Deutsch, Antonio Ortega, and G´erard Medioni. “Robust denoising of piece-wise smooth manifolds”. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2018, pp. 2786–2790. [73] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. A unified view of kernel k-means, spectral clustering and graph cuts. Citeseer, 2004. [74] Chris Ding, Xiaofeng He, and Horst D Simon. “On the equivalence of nonnegative matrix In: Proceedings of the 2005 SIAM international factorization and spectral clustering”. conference on data mining. 2005, pp. 606–610. 143 [75] Chris Ding et al. “Orthogonal nonnegative matrix t-factorizations for clustering”. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006, pp. 126–135. [76] Chris HQ Ding, Tao Li, and Michael I Jordan. “Convex and semi-nonnegative matrix factorizations”. In: IEEE transactions on pattern analysis and machine intelligence 32.1 (2008), pp. 45–55. [77] Xiaowen Dong et al. “Clustering on multi-layer graphs via subspace analysis on Grassmann manifolds”. In: IEEE Transactions on signal processing 62.4 (2013), pp. 905–918. [78] Xiaowen Dong et al. “Graph signal processing for machine learning: A review and new perspectives”. In: IEEE Signal processing magazine 37.6 (2020), pp. 117–127. [79] Claire Donnat et al. “Learning Structural Node Embeddings via Diffusion Wavelets”. In: 2018. [80] Elisabeth Drayer and Tirza Routtenberg. “Detection of false data injection attacks in smart grids based on graph signal processing”. In: IEEE Systems Journal 14.2 (2019), pp. 1886– 1896. [81] Liang Du et al. “K-Means Clustering Based on Chebyshev Polynomial Graph Filtering”. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2024, pp. 7175–7179. [82] Yuhui Du et al. “Evidence of shared and distinct functional and structural brain signatures in schizophrenia and autism spectrum disorder”. In: Communications biology 4.1 (2021), pp. 1–16. [83] Dheeru Dua and Casey Graff. UCI Machine Learning Repository. 2017. url: http: //archive.ics.uci.edu/ml. [84] Hilmi E Egilmez and Antonio Ortega. “Spectral anomaly detection using graph-based fil- tering for wireless sensor networks”. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2014, pp. 1085–1089. [85] Hilmi E Egilmez, Eduardo Pavez, and Antonio Ortega. “Graph learning from filtered In: IEEE Transactions on signals: Graph system and diffusion kernel identification”. Signal and Information Processing over Networks 5.2 (2018), pp. 360–374. [86] Justine Eustace, Xingyuan Wang, and Yaozu Cui. “Overlapping community detection using neighborhood ratio matrix”. In: Physica A: Statistical Mechanics and its Applications 421 (2015), pp. 510–521. [87] Martin G Everett and Stephen P Borgatti. “Categorical attribute based centrality: E–I and G–F centrality”. In: Social Networks 34.4 (2012), pp. 562–569. [88] Jie Fan, Cihan Tepedelenlioglu, and Andreas Spanias. “Graph-based classification with multiple shift matrices”. In: IEEE Transactions on Signal and Information Processing over Networks 8 (2022), pp. 160–172. [89] Xing Fan et al. “ALMA: alternating minimization algorithm for clustering mixture multi- 144 layer network”. 
In: The Journal of Machine Learning Research 23.1 (2022), pp. 14855–14900.
[90] Alex Fornito, Andrew Zalesky, and Michael Breakspear. “The connectomics of brain disorders”. In: Nature Reviews Neuroscience 16.3 (2015), pp. 159–172.
[91] Santo Fortunato. “Community detection in graphs”. In: Physics Reports 486.3-5 (2010), pp. 75–174.
[92] Santo Fortunato and Darko Hric. “Community detection in networks: A user guide”. In: Physics Reports 659 (2016), pp. 1–44.
[93] Panagiotis Fotiadis et al. “Structure–function coupling in macroscale human brain networks”. In: Nature Reviews Neuroscience (2024), pp. 1–17.
[94] Rodrigo Francisquini, Ana Carolina Lorena, and María CV Nascimento. “Community-based anomaly detection using spectral graph filtering”. In: Applied Soft Computing 118 (2022), p. 108489.
[95] Keinosuke Fukunaga. Introduction to statistical pattern recognition. Elsevier, 2013.
[96] Yuan Gao et al. “Addressing heterophily in graph anomaly detection: A perspective of graph spectrum”. In: Proceedings of the ACM Web Conference 2023. 2023, pp. 1528–1538.
[97] K Georgiadis et al. “Connectivity steered graph Fourier transform for motor imagery BCI decoding”. In: Journal of Neural Engineering 16.5 (2019), p. 056021.
[98] Amir Ghasemian, Homa Hosseinmardi, and Aaron Clauset. “Evaluating overfit and underfit in models of network community structure”. In: IEEE Transactions on Knowledge and Data Engineering 32.9 (2019), pp. 1722–1735.
[99] Michelle Girvan and Mark EJ Newman. “Community structure in social and biological networks”. In: Proceedings of the National Academy of Sciences 99.12 (2002), pp. 7821–7826.
[100] Matthew F Glasser et al. “The minimal preprocessing pipelines for the Human Connectome Project”. In: Neuroimage 80 (2013), pp. 105–124.
[101] Vladimir Gligorijević, Yannis Panagakis, and Stefanos Zafeiriou. “Non-negative matrix factorizations for multiplex network analysis”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 41.4 (2018), pp. 928–940.
[102] Katharina Glomb et al. “Connectome spectral analysis to track EEG task dynamics on a subsecond scale”. In: NeuroImage 221 (2020), p. 117137.
[103] Evan M Gordon et al. “Precision functional mapping of individual human brains”. In: Neuron 95.4 (2017), pp. 791–807.
[104] Ludovica Griffanti et al. “ICA-based artefact removal and accelerated fMRI acquisition for improved resting state network imaging”. In: Neuroimage 95 (2014), pp. 232–247.
[105] Adrian R Groves et al. “Linked independent component analysis for multimodal data fusion”. In: Neuroimage 54.3 (2011), pp. 2198–2217.
[106] Roger Guimera and Luís A Nunes Amaral. “Cartography of complex networks: modules and universal roles”. In: Journal of Statistical Mechanics: Theory and Experiment 2005.02 (2005), P02001.
[107] W de Haan et al. “Activity Dependent Degeneration Explains Hub Vulnerability in Alzheimer’s Disease”. In: PLoS Computational Biology 8.8 (2012), e1002582.
[108] Jason R Hall, Edward M Bernat, and Christopher J Patrick. “Externalizing psychopathology and the error-related negativity”. In: Psychological Science 18.4 (2007), pp. 326–333.
[109] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. “Wavelets on graphs via spectral graph theory”. In: Applied and Computational Harmonic Analysis 30.2 (2011), pp. 129–150.
[110] Qiuyi Han, Kevin S Xu, and Edoardo M Airoldi. “Consistent estimation of dynamic and multi-layer networks”. In: arXiv preprint arXiv:1410.8597 (2014).
[111] Chaobo He et al.
“A survey of community detection in complex networks using nonnegative matrix factorization”. In: IEEE Transactions on Computational Social Systems 9.2 (2021), pp. 440–457.
[112] Mingguo He, Zhewei Wei, Hongteng Xu, et al. “BernNet: Learning arbitrary graph spectral filters via Bernstein approximation”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 14239–14251.
[113] Yiran He and Hoi-To Wai. “Detecting central nodes from low-rank excited graph signals via structured factor analysis”. In: IEEE Transactions on Signal Processing 70 (2022), pp. 2416–2430.
[114] Yiran He and Hoi-To Wai. “Estimating centrality blindly from low-pass filtered graph signals”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 5330–5334.
[115] Yong He et al. “Uncovering intrinsic modular organization of spontaneous brain activity in humans”. In: PLoS ONE 4.4 (2009), e5226.
[116] Weiyu Huang et al. “A graph signal processing perspective on functional brain imaging”. In: Proceedings of the IEEE 106.5 (2018), pp. 868–885.
[117] Lawrence Hubert and Phipps Arabie. “Comparing partitions”. In: Journal of Classification 2.1 (1985), pp. 193–218.
[118] Muhammad U Ilyas et al. “A distributed algorithm for identifying information hubs in social networks”. In: IEEE Journal on Selected Areas in Communications 31.9 (2013), pp. 629–640.
[119] Elvin Isufi et al. “Autoregressive moving average graph filtering”. In: IEEE Transactions on Signal Processing 65.2 (2016), pp. 274–288.
[120] Elvin Isufi et al. “Graph filters for signal processing and machine learning on graphs”. In: arXiv preprint arXiv:2211.08854 (2022).
[121] Elvin Isufi et al. “Graph filters for signal processing and machine learning on graphs”. In: IEEE Transactions on Signal Processing (2024).
[122] Assen Jablensky. “The diagnostic concept of schizophrenia: its history, evolution, and future prospects”. In: Dialogues Clin. Neurosci. 12.3 (2010), p. 271.
[123] Anil K Jain. “Data clustering: 50 years beyond K-means”. In: Pattern Recognition Letters 31.8 (2010), pp. 651–666.
[124] LGS Jeub and M Bazzi. A generative model for mesoscale structure in multilayer networks implemented in MATLAB. 2016. url: https://github.com/MultilayerGM/MultilayerGM-MATLAB.
[125] Caiyan Jia et al. “Node attribute-enhanced community detection in complex networks”. In: Scientific Reports 7.1 (2017), pp. 1–15.
[126] Zhuqing Jiao et al. “Hub recognition for brain functional networks by using multiple-feature combination”. In: Computers & Electrical Engineering 69 (2018), pp. 740–752.
[127] Rui Jin et al. “Dictionary learning-based fMRI data analysis for capturing common and individual neural activation maps”. In: IEEE Journal of Selected Topics in Signal Processing 14.6 (2020), pp. 1265–1279.
[128] Bing-Yi Jing et al. “Community detection on mixture multilayer networks via regularized tensor decomposition”. In: The Annals of Statistics 49.6 (2021), pp. 3181–3205.
[129] Karen E Joyce et al. “A new measure of centrality for brain networks”. In: PLoS ONE 5.8 (2010), e12200.
[130] Inderjit S Jutla, Lucas GS Jeub, Peter J Mucha, et al. A generalized Louvain method for community detection implemented in MATLAB. 2011–2019. url: https://github.com/GenLouvain/GenLouvain.
[131] Marcus Kaiser and Claus C Hilgetag. “Edge vulnerability in neural and metabolic networks”. In: Biological Cybernetics 90.5 (2004), pp. 311–317.
[132] Golnar Kalantar and Arash Mohammadi.
“Graph-based dimensionality reduction of EEG signals via functional clustering and total variation measure for BCI systems”. In: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE. 2018, pp. 4603–4606.
[133] Zhao Kang et al. “Fine-grained attributed graph clustering”. In: Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). SIAM. 2022, pp. 370–378.
[134] Esin Karahan et al. “Tensor analysis and fusion of multimodal brain images”. In: Proceedings of the IEEE 103.9 (2015), pp. 1531–1559.
[135] Fatemeh Karimi, Shahriar Lotfi, and Habib Izadkhah. “Multiplex community detection in complex networks using an evolutionary approach”. In: Expert Systems with Applications 146 (2020), p. 113184.
[136] Brian Karrer and Mark EJ Newman. “Stochastic blockmodels and community structure in networks”. In: Physical Review E 83.1 (2011), p. 016107.
[137] Stanley R Kay, Abraham Fiszbein, and Lewis A Opler. “The positive and negative syndrome scale (PANSS) for schizophrenia”. In: Schizophr. Bull. 13.2 (1987), pp. 261–276.
[138] Thomas N Kipf and Max Welling. “Semi-supervised classification with graph convolutional networks”. In: Toulon, France, Apr. 2017.
[139] Thomas N Kipf and Max Welling. “Variational Graph Auto-Encoders”. In: NIPS Workshop on Bayesian Deep Learning (2016).
[140] Mikko Kivelä et al. “Multilayer networks”. In: Journal of Complex Networks 2.3 (2014), pp. 203–271.
[141] R. I. Kondor and J. Lafferty. “Diffusion kernels on graphs and other discrete structures”. In: Sydney, Australia, July 2002, pp. 315–322.
[142] Artiom Kovnatsky, Klaus Glashoff, and Michael M Bronstein. “MADMM: a generic algorithm for non-smooth optimization on manifolds”. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. Springer. 2016, pp. 680–696.
[143] Ariel Kroizer, Tirza Routtenberg, and Yonina C Eldar. “Bayesian estimation of graph signals”. In: IEEE Transactions on Signal Processing 70 (2022), pp. 2207–2223.
[144] Abhishek Kumar, Piyush Rai, and Hal Daume. “Co-regularized multi-view spectral clustering”. In: Advances in Neural Information Processing Systems 24 (2011).
[145] Zhana Kuncheva and Giovanni Montana. “Community detection in multiplex networks using locally adaptive random walks”. In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015. 2015, pp. 1308–1315.
[146] Vito Latora and Massimo Marchiori. “Economic small-world behavior in weighted networks”. In: The European Physical Journal B-Condensed Matter and Complex Systems 32 (2003), pp. 249–263.
[147] Emmanuel Lazega et al. The collegial phenomenon: The social mechanisms of cooperation among peers in a corporate law partnership. Oxford University Press on Demand, 2001.
[148] Daniel Lee and Hyunjune Seung. “Algorithms for Non-negative Matrix Factorization”. In: Adv. Neural Inform. Process. Syst. 13 (2001), pp. 535–541.
[149] Jing Lei, Kehui Chen, and Brian Lynch. “Consistent community detection in multi-layer network data”. In: Biometrika 107.1 (2020), pp. 61–73.
[150] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. “Predicting positive and negative links in online social networks”. In: Proceedings of the 19th International Conference on World Wide Web. 2010, pp. 641–650.
[151] Jure Leskovec and Julian Mcauley. “Learning to discover social circles in ego networks”.
In: Advances in Neural Information Processing Systems 25 (2012).
[152] Qimai Li et al. “Label efficient semi-supervised learning via graph filtering”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 9582–9591.
[153] Tao Li and Chris Ding. “Nonnegative matrix factorizations for clustering: A survey”. In: Data Clustering. Chapman and Hall/CRC, 2018, pp. 149–176.
[154] Yang Li and Gonzalo Mateos. “Identifying structural brain networks from functional connectivity: A network deconvolution approach”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 1135–1139.
[155] Yixuan Li et al. “Local spectral clustering for overlapping community detection”. In: ACM Transactions on Knowledge Discovery from Data (TKDD) 12.2 (2018), pp. 1–27.
[156] Zhen Li et al. “Community detection based on regularized semi-nonnegative matrix tri-factorization in signed networks”. In: Mobile Networks and Applications 23 (2018), pp. 71–79.
[157] Zhenping Li et al. “Quantitative function for community detection”. In: Physical Review E 77.3 (2008), p. 036109.
[158] Li F. F., Andreeto M., Ranzato M., and Perona P. Caltech 101 (1.0) [Data set]. CaltechDATA. 2022. url: https://doi.org/10.22002/D1.20086.
[159] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. “Isolation forest”. In: 2008 Eighth IEEE International Conference on Data Mining. IEEE. 2008, pp. 413–422.
[160] Fuchen Liu et al. “Global spectral clustering in dynamic networks”. In: Proceedings of the National Academy of Sciences 115.5 (2018), pp. 927–932.
[161] Jialu Liu et al. “Multi-view clustering via joint nonnegative matrix factorization”. In: Proceedings of the 2013 SIAM International Conference on Data Mining. SIAM. 2013, pp. 252–260.
[162] Jiani Liu, Elvin Isufi, and Geert Leus. “Filter design for autoregressive moving average graph filters”. In: IEEE Transactions on Signal and Information Processing over Networks 5.1 (2018), pp. 47–60.
[163] Jin Liu et al. “Intrinsic brain hub connectivity underlies individual differences in spatial working memory”. In: Cerebral Cortex 27.12 (2017), pp. 5496–5508.
[164] Jin Xia Liu et al. “Quantitative function for community detection”. In: Advanced Materials Research 433 (2012), pp. 6441–6446.
[165] Gabriele Lohmann et al. “Eigenvector centrality mapping for analyzing connectivity patterns in fMRI data of the human brain”. In: PLoS ONE 5.4 (2010), e10232.
[166] Dan-Dan Lu et al. “Community detection combining topology and attribute information”. In: Knowledge and Information Systems (2022), pp. 1–22.
[167] Hong Lu et al. “Community detection algorithm based on nonnegative matrix factorization and pairwise constraints”. In: Physica A: Statistical Mechanics and its Applications 545 (2020), p. 123491.
[168] Xin Luo et al. “Symmetric nonnegative matrix factorization-based community detection models and their convergence analysis”. In: IEEE Transactions on Neural Networks and Learning Systems 33.3 (2021), pp. 1203–1215.
[169] Xiaoke Ma, Di Dong, and Quan Wang. “Community detection in multi-layer networks using joint nonnegative matrix factorization”. In: IEEE Transactions on Knowledge and Data Engineering 31.2 (2018), pp. 273–286.
[170] Matteo Magnani et al. “Community detection in multiplex networks”. In: ACM Computing Surveys (CSUR) 54.3 (2021), pp. 1–35.
[171] Deepanshu Malhotra and Anuradha Chug.
“A modified label propagation algorithm for community detection in attributed networks”. In: International Journal of Information Management Data Insights 1.2 (2021), p. 100030.
[172] Shawn Mankad and George Michailidis. “Structural and functional discovery in dynamic networks with non-negative matrix factorization”. In: Physical Review E 88.4 (2013), p. 042812.
[173] Antonio G Marques et al. “Sampling of graph signals with successive local aggregations”. In: IEEE Transactions on Signal Processing 64.7 (2015), pp. 1832–1843.
[174] Sohir Maskey et al. “A fractional graph laplacian approach to oversmoothing”. In: Advances in Neural Information Processing Systems 36 (2024).
[175] Malia F Mason et al. “Wandering minds: the default network and stimulus-independent thought”. In: Science 315.5810 (2007), pp. 393–395.
[176] Gráinne McLoughlin et al. “Midfrontal theta activity in psychiatric illness: an index of cognitive vulnerabilities across disorders”. In: Biological Psychiatry 91.2 (2022), pp. 173–182.
[177] John D Medaglia et al. “Functional alignment with anatomical networks is associated with cognitive flexibility”. In: Nature Human Behaviour 2.2 (2018), pp. 156–164.
[178] Marek-Marsel Mesulam and Norman Geschwind. “On the possible role of neocortex and its limbic connections in the process of attention and schizophrenia: clinical cases of inattention in man and experimental anatomy in monkey.” In: Journal of Psychiatric Research (1978).
[179] Gianluca Mingoia et al. “Default mode network activity in schizophrenia studied at resting state using probabilistic ICA”. In: Schizophrenia Research 138.2-3 (2012), pp. 143–149.
[180] Sepehr Mortaheb et al. “A graph signal processing approach to study high density EEG signals in patients with disorders of consciousness”. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE. 2019, pp. 4549–4553.
[181] Malaak N Moussa et al. “Consistency of network modules in resting-state FMRI connectome data”. In: (2012).
[182] Peter J Mucha et al. “Community structure in time-dependent, multiscale, and multiplex networks”. In: Science 328.5980 (2010), pp. 876–878.
[183] Rena Nainggolan et al. “Improved the performance of the K-means cluster using the sum of squared error (SSE) optimized by using the Elbow method”. In: Journal of Physics: Conference Series. Vol. 1361. 1. IOP Publishing. 2019, p. 012015.
[184] Sunil K Narang, Akshay Gadde, and Antonio Ortega. “Signal processing techniques for interpolation in graph structured data”. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE. 2013, pp. 5445–5449.
[185] Sunil K Narang and Antonio Ortega. “Compact support biorthogonal wavelet filterbanks for arbitrary undirected graphs”. In: IEEE Transactions on Signal Processing 61.19 (2013), pp. 4673–4685.
[186] Sunil K Narang and Antonio Ortega. “Perfect reconstruction two-channel wavelet filter banks for graph structured data”. In: IEEE Transactions on Signal Processing 60.6 (2012), pp. 2786–2799.
[187] Hung T Nguyen, Thang N Dinh, and Tam Vu. “Community detection in multiplex social networks”. In: 2015 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE. 2015, pp. 654–659.
[188] Emil HJ Nijhuis, Anne-Marie van Cappellen van Walsum, and David G Norris. “Topographic hub maps of the human structural neocortical network”. In: PLoS ONE 8.6 (2013), e65511.
[189] Jukka-Pekka Onnela et al.
“Intensity and coherence of motifs in weighted complex networks”. In: Physical Review E—Statistical, Nonlinear, and Soft Matter Physics 71.6 (2005), p. 065103.
[190] Masaki Onuki et al. “Graph signal denoising via trilateral filter on graph spectral domain”. In: IEEE Transactions on Signal and Information Processing over Networks 2.2 (2016), pp. 137–148.
[191] Alp Ozdemir et al. “Hierarchical spectral consensus clustering for group analysis of functional brain networks”. In: IEEE Transactions on Biomedical Engineering 62.9 (2015), pp. 2158–2169.
[192] Camillo Padoa-Schioppa and John A Assad. “Neurons in the orbitofrontal cortex encode economic value”. In: Nature 441.7090 (2006), pp. 223–226.
[193] A Roxana Pamfil et al. “Relating modularity maximization and stochastic block models in multilayer networks”. In: SIAM Journal on Mathematics of Data Science 1.4 (2019), pp. 667–698.
[194] Shirui Pan et al. “Adversarially Regularized Graph Autoencoder for Graph Embedding”. In: IJCAI. 2018, pp. 2609–2615.
[195] Neal Parikh, Stephen Boyd, et al. “Proximal algorithms”. In: Foundations and Trends® in Optimization 1.3 (2014), pp. 127–239.
[196] Sohee Park and Philip S Holzman. “Schizophrenics show spatial working memory deficits”. In: Arch. Gen. Psychiatry 49.12 (1992), pp. 975–982.
[197] Subhadeep Paul and Yuguo Chen. “Spectral and matrix factorization methods for consistent community detection in multi-layer networks”. In: The Annals of Statistics 48.1 (2020), pp. 230–250.
[198] Martin P Paulus et al. “Parietal dysfunction is associated with increased outcome-related decision-making in schizophrenia patients”. In: Biological Psychiatry 51.12 (2002), pp. 995–1004.
[199] Mangor Pedersen et al. “Reducing the influence of intramodular connectivity in participation coefficient”. In: Network Neuroscience 4.2 (2020), pp. 416–431.
[200] JB Pochon et al. “The neural system that bridges reward and cognition in humans: an fMRI study”. In: Proc. Natl. Acad. Sci. 99.8 (2002), pp. 5669–5674.
[201] Filippo Pompili et al. “Two algorithms for orthogonal nonnegative matrix factorization with application to clustering”. In: Neurocomputing 141 (2014), pp. 15–25.
[202] Jonathan D Power et al. “Evidence for hubs in human functional brain networks”. In: Neuron 79.4 (2013), pp. 798–813.
[203] Soumajit Pramanik et al. “Discovering community structure in multilayer networks”. In: 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE. 2017, pp. 611–620.
[204] Maria Giulia Preti, Thomas AW Bolton, and Dimitri Van De Ville. “The dynamic functional connectome: State-of-the-art and perspectives”. In: Neuroimage 160 (2017), pp. 41–54.
[205] Maria Giulia Preti and Dimitri Van De Ville. “Decoupling of brain function from structure reveals regional behavioral specialization in humans”. In: Nature Communications 10.1 (2019), p. 4747.
[206] Ioannis Psorakis et al. “Overlapping community detection using Bayesian non-negative matrix factorization”. In: Physical Review E 83.6 (2011), p. 066114.
[207] Guo-Jun Qi et al. “Exploring context and content links in social media: A latent space method”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 34.5 (2011), pp. 850–862.
[208] Marcus E Raichle et al. “A default mode of brain function”. In: Proceedings of the National Academy of Sciences 98.2 (2001), pp. 676–682.
[209] Raksha Ramakrishna, Hoi-To Wai, and Anna Scaglione. “A user guide to low-pass graph signal processing and its applications: Tools and applications”.
In: IEEE Signal Processing Magazine 37.6 (2020), pp. 74–85.
[210] David Ramírez, Antonio G Marques, and Santiago Segarra. “Graph-signal reconstruction and blind deconvolution for structured inputs”. In: Signal Processing 188 (2021), p. 108180.
[211] Emma C Robinson et al. “MSM: a new flexible framework for multimodal surface matching”. In: Neuroimage 100 (2014), pp. 414–426.
[212] T Mitchell Roddenberry and Santiago Segarra. “Blind inference of eigenvector centrality rankings”. In: IEEE Transactions on Signal Processing 69 (2021), pp. 3935–3946.
[213] Martin Rosvall and Carl T Bergstrom. “Maps of random walks on complex networks reveal community structure”. In: Proceedings of the National Academy of Sciences 105.4 (2008), pp. 1118–1123.
[214] Tirza Routtenberg. “Non-Bayesian estimation framework for signal recovery on graphs”. In: IEEE Transactions on Signal Processing 69 (2021), pp. 1169–1184.
[215] Mikail Rubinov and Olaf Sporns. “Complex network measures of brain connectivity: uses and interpretations”. In: Neuroimage 52.3 (2010), pp. 1059–1069.
[216] Mikail Rubinov and Olaf Sporns. “Weight-conserving characterization of complex functional brain networks”. In: Neuroimage 56.4 (2011), pp. 2068–2079.
[217] Liu Rui, Hossein Nejati, and Ngai-Man Cheung. “Dimensionality reduction of brain imaging data using graph signal processing”. In: 2016 IEEE International Conference on Image Processing (ICIP). IEEE. 2016, pp. 1329–1333.
[218] Pilar Salgado-Pineda et al. “Sustained attention impairment correlates to gray matter decreases in first episode neuroleptic-naive schizophrenic patients”. In: Neuroimage 19.2 (2003), pp. 365–375.
[219] Aliaksei Sandryhaila and Jose MF Moura. “Classification via regularization on graphs”. In: 2013 IEEE Global Conference on Signal and Information Processing. IEEE. 2013, pp. 495–498.
[220] Santiago Segarra, Antonio G Marques, and Alejandro Ribeiro. “Optimal graph-filter design and applications to distributed linear network operators”. In: IEEE Transactions on Signal Processing 65.15 (2017), pp. 4117–4131.
[221] Santiago Segarra et al. “Network topology inference from spectral templates”. In: IEEE Transactions on Signal and Information Processing over Networks 3.3 (2017), pp. 467–483.
[222] Benjamin A Seitzman et al. “The state of resting state networks”. In: Topics in Magnetic Resonance Imaging 28.4 (2019), pp. 189–196.
[223] Prithviraj Sen et al. “Collective classification in network data”. In: AI Magazine 29.3 (2008), p. 93.
[224] D. I. Shuman et al. “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains”. In: IEEE Signal Processing Magazine 30.3 (May 2013), pp. 83–98. issn: 1053-5888.
[225] David I Shuman et al. “Distributed signal processing via Chebyshev polynomial approximation”. In: IEEE Transactions on Signal and Information Processing over Networks 4.4 (2018), pp. 736–751.
[226] Saurabh Sihag et al. “Multimodal dynamic brain connectivity analysis based on graph signal processing for former athletes with history of multiple concussions”. In: IEEE Transactions on Signal and Information Processing over Networks 6 (2020), pp. 284–299.
[227] Rahul Singh, Abhishek Chakraborty, and BS Manoj. “GFT centrality: A new node importance measure for complex networks”. In: Physica A: Statistical Mechanics and its Applications 487 (2017), pp. 185–195.
[228] Keith Smith et al. “Locating temporal functional dynamics of visual short-term memory binding using graph modular Dirichlet energy”.
In: Scientific Reports 7.1 (2017), p. 42013.
[229] Sandra E Smith-Aguilar et al. “Using multiplex networks to capture the multidimensional nature of social structure”. In: Primates 60.3 (2019), pp. 277–295.
[230] Olaf Sporns and Richard F Betzel. “Modular brain networks”. In: Annual Review of Psychology 67.1 (2016), pp. 613–640.
[231] Olaf Sporns, Christopher J Honey, and Rolf Kötter. “Identification and classification of hubs in brain networks”. In: PLoS ONE 2.10 (2007), e1049.
[232] R Nathan Spreng, Raymond A Mar, and Alice SN Kim. “The common neural basis of autobiographical memory, prospection, navigation, theory of mind, and the default mode: a quantitative meta-analysis”. In: Journal of Cognitive Neuroscience 21.3 (2009), pp. 489–510.
[233] Natalie Stanley et al. “Clustering network layers with the strata multilayer stochastic block model”. In: IEEE Transactions on Network Science and Engineering 3.2 (2016), pp. 95–105.
[234] Matthew Steen et al. “Assessing the consistency of community structure in complex networks”. In: Physical Review E 84.1 (2011), p. 016111.
[235] Gregory P Strauss, James A Waltz, and James M Gold. “A review of reward processing and motivational impairment in schizophrenia”. In: Schizophrenia Bulletin 40.Suppl 2 (2014), S107–S116.
[236] Jing Sui et al. “An ICA-based method for the identification of optimal FMRI features and components using combined group-discriminative techniques”. In: Neuroimage 46.1 (2009), pp. 73–86.
[237] Bing-Jie Sun et al. “A non-negative symmetric encoder-decoder approach for community detection”. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 2017, pp. 597–606.
[238] Carol A Tamminga et al. “Bipolar and schizophrenia network for intermediate phenotypes: outcomes across the psychosis continuum”. In: Schizophr. Bull. 40.Suppl 2 (2014), S131–S137.
[239] Carol A Tamminga et al. “Clinical phenotypes of psychosis in the Bipolar-Schizophrenia Network on Intermediate Phenotypes (B-SNIP)”. In: Am. J. Psychiatry 170.11 (2013), pp. 1263–1274.
[240] Yuichi Tanaka et al. “Sampling signals on graphs: From theory to applications”. In: IEEE Signal Processing Magazine 37.6 (2020), pp. 14–30.
[241] Fengqin Tang, Xuejing Zhao, and Cuixia Li. “Community Detection in Multilayer Networks Based on Matrix Factorization and Spectral Embedding Method”. In: Mathematics 11.7 (2023), p. 1573.
[242] Jianheng Tang et al. “Rethinking graph neural networks for anomaly detection”. In: International Conference on Machine Learning. PMLR. 2022, pp. 21076–21089.
[243] Lei Tang, Xufei Wang, and Huan Liu. “Community detection via heterogeneous interaction analysis”. In: Data Mining and Knowledge Discovery 25.1 (2012), pp. 1–33.
[244] Wei Tang, Zhengdong Lu, and Inderjit S Dhillon. “Clustering with multiple graphs”. In: 2009 Ninth IEEE International Conference on Data Mining. IEEE. 2009, pp. 1016–1021.
[245] Dane Taylor, Rajmonda S Caceres, and Peter J Mucha. “Super-resolution community detection for layer-aggregated multilayer networks”. In: Physical Review X 7.3 (2017), p. 031056.
[246] Craig E Tenke and Jürgen Kayser. “Generator localization by current source density (CSD): implications of volume conduction and field closure at intracranial and scalp resolutions”. In: Clinical Neurophysiology 123.12 (2012), pp. 2328–2345.
[247] Dardo Tomasi and Nora D Volkow. “Association between functional connectivity hubs and brain networks”. In: Cerebral Cortex 21.9 (2011), pp. 2003–2013.
[248] Nicolas Tremblay and Pierre Borgnat.
“Graph wavelets for multiscale community mining”. In: IEEE Transactions on Signal Processing 62.20 (2014), pp. 5227–5239.
[249] Nicolas Tremblay et al. “Compressive spectral clustering”. In: International Conference on Machine Learning. PMLR. 2016, pp. 1002–1011.
[250] Logan T Trujillo and John JB Allen. “Theta EEG dynamics of the error-related negativity”. In: Clinical Neurophysiology 118.3 (2007), pp. 645–668.
[251] Leslie G Ungerleider and James V Haxby. “‘What’ and ‘where’ in the human brain”. In: Current Opinion in Neurobiology 4.2 (1994), pp. 157–165.
[252] Ravishankar R Vallabhajosyula et al. “Identifying hubs in protein interaction networks”. In: PLoS ONE 4.4 (2009), e5344.
[253] Martijn P Van den Heuvel and Olaf Sporns. “Network hubs in the human brain”. In: Trends in Cognitive Sciences 17.12 (2013), pp. 683–696.
[254] Laurens Van der Maaten and Geoffrey Hinton. “Visualizing data using t-SNE”. In: Journal of Machine Learning Research 9.11 (2008).
[255] David C Van Essen et al. “The WU-Minn human connectome project: an overview”. In: Neuroimage 80 (2013), pp. 62–79.
[256] Yves Van Gennip et al. “Community detection using spectral clustering on sparse geosocial data”. In: SIAM Journal on Applied Mathematics 73.1 (2013), pp. 67–83.
[257] América Vera-Montecinos et al. “Analysis of networks in the dorsolateral prefrontal cortex in chronic schizophrenia: Relevance of altered immune response”. In: Frontiers in Pharmacology 14 (2023), p. 1003557.
[258] Ulrike Von Luxburg. “A tutorial on spectral clustering”. In: Statistics and Computing 17.4 (2007), pp. 395–416.
[259] Chun Wang et al. “Mgae: Marginalized graph autoencoder for graph clustering”. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 2017, pp. 889–898.
[260] Hua Wang et al. “Nonnegative matrix tri-factorization based high-order co-clustering and its fast implementation”. In: 2011 IEEE 11th International Conference on Data Mining. IEEE. 2011, pp. 774–783.
[261] Huaning Wang et al. “Evidence of a dissociation pattern in default mode subnetwork functional connectivity in schizophrenia”. In: Scientific Reports 5.1 (2015), p. 14655.
[262] Shuai Wang et al. “Clustering by orthogonal NMF model and non-convex penalty optimization”. In: IEEE Transactions on Signal Processing 69 (2021), pp. 5273–5288.
[263] Yu Wang, Wotao Yin, and Jinshan Zeng. “Global convergence of ADMM in nonconvex nonsmooth optimization”. In: Journal of Scientific Computing 78 (2019), pp. 29–63.
[264] Wenhui Wu et al. “Nonnegative matrix factorization with mixed hypergraph regularization for community detection”. In: Information Sciences 435 (2018), pp. 263–281.
[265] Zonghan Wu et al. “Beyond low-pass filtering: Graph convolutional networks with automatic filtering”. In: IEEE Transactions on Knowledge and Data Engineering (2022).
[266] Tian Xie, Bin Wang, and C-C Jay Kuo. “Graphhop: An enhanced label propagation method for node classification”. In: IEEE Transactions on Neural Networks and Learning Systems (2022).
[267] Bingbing Xu et al. “Graph convolutional networks using heat kernel for semi-supervised learning”. In: arXiv preprint arXiv:2007.16002 (2020).
[268] Bingbing Xu et al. “Graph wavelet neural network”. In: arXiv preprint arXiv:1904.07785 (2019).
[269] Luming Xu et al. “ARMA Graph Filter Design by Least Squares Method Using Reciprocal Polynomial and Second-order Factorization”. In: 2023 16th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE.
2023, pp. 1–5.
[270] Yangwen Xu et al. “Intrinsic functional network architecture of human semantic processing: Modules and hubs”. In: Neuroimage 132 (2016), pp. 542–555.
[271] Zhilei Xu et al. “Meta-connectomic analysis maps consistent, reproducible, and transcriptionally relevant functional connectome hubs in the human brain”. In: Communications Biology 5.1 (2022), p. 1056.
[272] Cheng Yang et al. “Network representation learning with rich text information”. In: Twenty-Fourth International Joint Conference on Artificial Intelligence. 2015.
[273] Defu Yang et al. “Joint hub identification for brain networks by multivariate graph inference”. In: Medical Image Analysis 73 (2021), p. 102162.
[274] Defu Yang et al. “Joint identification of network hub nodes by multivariate graph inference”. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part III 22. Springer. 2019, pp. 590–598.
[275] H Yang et al. “Constrained Independent Component Analysis Based on Entropy Bound Minimization for Subgroup Identification from Multi-subject fMRI Data”. In: ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2023, pp. 1–5.
[276] Hanlu Yang et al. “Identification of Homogeneous Subgroups from Resting-State fMRI Data”. In: Sensors 23.6 (2023), p. 3264.
[277] Jaewon Yang, Julian McAuley, and Jure Leskovec. “Community detection in networks with node attributes”. In: 2013 IEEE 13th International Conference on Data Mining. IEEE. 2013, pp. 1151–1156.
[278] BT Thomas Yeo et al. “The organization of the human cerebral cortex estimated by intrinsic functional connectivity”. In: Journal of Neurophysiology (2011).
[279] Wutao Yin, Longhai Li, and Fang-Xiang Wu. “Deep learning for brain disorder diagnosis based on fMRI images”. In: Neurocomputing 469 (2022), pp. 332–345.
[280] Jiho Yoo and Seungjin Choi. “Orthogonal nonnegative matrix tri-factorization for co-clustering: Multiplicative updates on stiefel manifolds”. In: Information Processing & Management 46.5 (2010), pp. 559–570.
[281] Liu Yue et al. “A survey of deep graph clustering: Taxonomy, challenge, and application”. In: arXiv preprint arXiv:2211.12875 (2022).
[282] Andrew Zalesky, Alex Fornito, and Edward T Bullmore. “Network-based statistic: identifying differences in brain networks”. In: Neuroimage 53.4 (2010), pp. 1197–1207.
[283] Fan Zhang and Edwin R Hancock. “Graph spectral image smoothing using the heat kernel”. In: Pattern Recognition 41.11 (2008), pp. 3328–3342.
[284] Hongyuan Zhang et al. “Embedding graph auto-encoder for graph clustering”. In: IEEE Transactions on Neural Networks and Learning Systems (2022).
[285] John X Zhang, Hoi-Chung Leung, and Marcia K Johnson. “Frontal activations associated with accessing and evaluating information in working memory: an fMRI study”. In: Neuroimage 20.3 (2003), pp. 1531–1539.
[286] Xiaotong Zhang et al. “Attributed graph clustering via adaptive graph convolution”. In: arXiv preprint arXiv:1906.01210 (2019).
[287] Yipu Zhang et al. “Multi-paradigm fMRI fusion via sparse tensor decomposition in brain functional connectivity study”. In: IEEE Journal of Biomedical and Health Informatics 25.5 (2020), pp. 1712–1723.
[288] Yu Zhang et al. “Functional annotation of human cognitive states using deep graph convolution”. In: NeuroImage 231 (2021), p. 117847.
[289] Peng Zhou and Liang Du. “Learnable graph filter for multi-view clustering”.
In: Proceedings of the 31st ACM International Conference on Multimedia. 2023, pp. 3089–3098.
[290] Guangyao Zhu and Kan Li. “A unified model for community detection of multiplex networks”. In: International Conference on Web Information Systems Engineering. Springer. 2014, pp. 31–46.
[291] Meiqi Zhu et al. “Interpreting and unifying graph neural networks with an optimization framework”. In: Proceedings of the Web Conference 2021. 2021, pp. 1215–1226.

APPENDIX A: AUXILIARY FUNCTION PROOF

Proposition 1. The following function,
$$Z(h, h_{ij}^t) = \mathcal{L}(h_{ij}^t) + 3\mathcal{L}'(h_{ij}^t)(h - h_{ij}^t) + \frac{3}{2}\,\frac{\sum_{l=1}^{L}\big(4\mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^\top\mathbf{H}^t\mathbf{S}_l + 4\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l\big)_{ij}}{h_{ij}^t}\,(h - h_{ij}^t)^2,$$
is an auxiliary function of $\mathcal{L}(\mathbf{H})$, where
$$\mathcal{L}(h) = \mathcal{L}(h_{ij}^t) + \mathcal{L}'(h_{ij}^t)(h - h_{ij}^t) + \frac{1}{2}\mathcal{L}''(h_{ij}^t)(h - h_{ij}^t)^2.$$

Proof. First, when $h = h_{ij}^t$, the equality $Z(h, h) = \mathcal{L}(h)$ holds. Next, we need to show that $Z(h, h_{ij}^t) \geq \mathcal{L}(h)$. The first and second terms of $Z(h, h_{ij}^t)$ are greater than or equal to the corresponding terms of $\mathcal{L}(h)$. Therefore, it suffices to show that
$$3\,\frac{\sum_{l=1}^{L}\big(4\mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^\top\mathbf{H}^t\mathbf{S}_l + 4\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l\big)_{ij}}{h_{ij}^t} \geq \mathcal{L}''(h_{ij}^t).$$
It can be shown that
$$\frac{(\mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^\top\mathbf{H}^t\mathbf{S}_l)_{ij}}{h_{ij}^t} = \frac{\sum_{p,q}(\mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^\top)_{ip}\,h_{pq}^t\,(\mathbf{S}_l)_{qj}}{h_{ij}^t} \geq (\mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^\top)_{ii}(\mathbf{S}_l)_{jj},$$
$$\frac{(\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l)_{ij}}{h_{ij}^t} = \frac{\sum_{p} h_{ip}^t\,(\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l)_{pj}}{h_{ij}^t} \geq (\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l)_{jj},$$
$$\frac{(\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l)_{ij}}{h_{ij}^t} = \frac{\sum_{p,q} h_{ip}^t\,h_{qp}^t\,(\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l)_{qj}}{h_{ij}^t} \geq h_{ij}^t(\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l)_{ij},$$
$$\frac{(\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l)_{ij}}{h_{ij}^t} = \frac{\sum_{m,b}(\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l)_{im}\,h_{mb}^t\,(\mathbf{S}_l)_{bj}}{h_{ij}^t} \geq (\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l)_{ii}(\mathbf{S}_l)_{jj}.$$
Therefore,
$$3\,\frac{(\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l)_{ij}}{h_{ij}^t} \geq (\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l)_{ij} + h_{ij}^t(\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l)_{ij} + (\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l)_{ii}(\mathbf{S}_l)_{jj},$$
and thus
$$3\,\frac{\sum_{l=1}^{L}\big(4\mathbf{H}_l\mathbf{G}_l\mathbf{H}_l^\top\mathbf{H}^t\mathbf{S}_l + 4\mathbf{H}^t\mathbf{H}^{t\top}\mathbf{A}_l\mathbf{H}^t\mathbf{S}_l\big)_{ij}}{h_{ij}^t} \geq \mathcal{L}''(h_{ij}^t).$$
Therefore, Eq. (2.10) is an auxiliary function of $\mathcal{L}(h)$.
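Proposition 1 is the standard majorization–minimization (MM) guarantee: once $Z(h, h^t)$ matches $\mathcal{L}$ at $h^t$ and upper-bounds it everywhere, minimizing $Z$ over $h$ can never increase $\mathcal{L}$, which is what ensures the monotone convergence of the multiplicative updates. Below is a minimal numerical sketch of this principle only, using a generic quartic loss and a hypothetical constant-curvature surrogate, not the specific $Z$ of Proposition 1:

```python
# Majorization-minimization (MM) illustration: a quadratic surrogate Z that
# touches L at h_t and upper-bounds it nearby guarantees monotone descent.
# Toy loss L(h) = (h^2 - 1)^2; kappa upper-bounds L''(h) = 12h^2 - 4 on [0, 1.5].

def L(h):
    return (h**2 - 1.0) ** 2

def L_prime(h):
    return 4.0 * h * (h**2 - 1.0)

kappa = 50.0   # curvature bound for the surrogate (hypothetical choice)
h_t = 1.4      # initial iterate

for _ in range(50):
    # Z(h, h_t) = L(h_t) + L'(h_t)(h - h_t) + (kappa/2)(h - h_t)^2, minimized at:
    h_next = h_t - L_prime(h_t) / kappa
    assert L(h_next) <= L(h_t) + 1e-12   # MM property: L never increases
    h_t = h_next

print(h_t, L(h_t))   # converges toward the minimizer h = 1, where L = 0
```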
APPENDIX B: CONSISTENCY PROOF

The proof of Theorem 2 consists of three steps. The first step was addressed in Lemma 2, where it was shown that the true community labels can be recovered from the solution of the objective function applied to the population adjacency tensor $\mathcal{A}$. The rest of the proof is given below.

Proof. We refer to the objective function in (2.11) in the manuscript as $F(\mathbf{A}, (\mathbf{H}_1', \dots, \mathbf{H}_L'))$. For any feasible solution $(\mathbf{H}_1', \dots, \mathbf{H}_L')$, we have
$$\big|F(\mathcal{A}, (\mathbf{H}_1', \dots, \mathbf{H}_L')) - F(\mathbf{A}, (\mathbf{H}_1', \dots, \mathbf{H}_L'))\big| = \Big|\sum_{l=1}^{L}\big(\|\mathbf{H}_l'^\top\mathcal{A}_l\mathbf{H}_l'\|_F^2 - \|\mathbf{H}_l'^\top\mathbf{A}_l\mathbf{H}_l'\|_F^2\big)\Big|$$
$$= \Big|\sum_{l=1}^{L}\Big\{\big(\|\mathbf{H}_l'^\top\mathcal{A}_l\mathbf{H}_l'\|_F - \|\mathbf{H}_l'^\top\mathbf{A}_l\mathbf{H}_l'\|_F\big)^2 + 2\big(\|\mathbf{H}_l'^\top\mathcal{A}_l\mathbf{H}_l'\|_F - \|\mathbf{H}_l'^\top\mathbf{A}_l\mathbf{H}_l'\|_F\big)\,\|\mathbf{H}_l'^\top\mathbf{A}_l\mathbf{H}_l'\|_F\Big\}\Big|.$$
For each layer $l$, the term $\|\mathbf{H}_l'^\top\mathbf{A}_l\mathbf{H}_l'\|_F$ is upper bounded as
$$\|\mathbf{H}_l'^\top\mathbf{A}_l\mathbf{H}_l'\|_F \leq \sqrt{k_l}\,\|\mathbf{H}_l'^\top\mathbf{A}_l\mathbf{H}_l'\|_2 \leq \sqrt{k_l}\,\|\mathbf{H}_l'\|_2^2\,\|\mathbf{A}_l\|_2 \leq \sqrt{k_l}\,\Delta_l.$$
The first inequality is due to the relationship between the Frobenius norm and the spectral norm, since $\mathbf{H}_l'^\top\mathbf{A}_l\mathbf{H}_l'$ is a $k_l \times k_l$ matrix. The second and third inequalities follow from the submultiplicativity of the spectral norm, $\|\mathbf{AB}\|_2 \leq \|\mathbf{A}\|_2\|\mathbf{B}\|_2$, and the bound $\|\mathbf{A}_l\|_2 \leq \Delta_l$, respectively.

Since $\|\mathbf{A}\|_F - \|\mathbf{B}\|_F \leq \|\mathbf{A} - \mathbf{B}\|_F$, for each layer $l$ we have
$$\big|\big(\|\mathbf{H}_l'^\top\mathcal{A}_l\mathbf{H}_l'\|_F - \|\mathbf{H}_l'^\top\mathbf{A}_l\mathbf{H}_l'\|_F\big)\,\|\mathbf{H}_l'^\top\mathbf{A}_l\mathbf{H}_l'\|_F\big| \leq \sqrt{k_l}\,\Delta_l\,\|\mathbf{H}_l'^\top(\mathcal{A}_l - \mathbf{A}_l)\mathbf{H}_l'\|_F$$
$$= \sqrt{k_l}\,\Delta_l\,\sqrt{\operatorname{tr}\big(\mathbf{H}_l'^\top(\mathcal{A}_l - \mathbf{A}_l)\mathbf{H}_l'\mathbf{H}_l'^\top(\mathcal{A}_l - \mathbf{A}_l)\mathbf{H}_l'\big)}$$
$$\leq \sqrt{k_l}\,\Delta_l\,\sqrt{\|\mathbf{H}_l'\mathbf{H}_l'^\top\|_2\,\operatorname{tr}\big((\mathcal{A}_l - \mathbf{A}_l)\mathbf{H}_l'\mathbf{H}_l'^\top(\mathcal{A}_l - \mathbf{A}_l)\big)}$$
$$= \sqrt{k_l}\,\Delta_l\,\sqrt{\operatorname{tr}\big((\mathcal{A}_l - \mathbf{A}_l)\mathbf{H}_l'\mathbf{H}_l'^\top(\mathcal{A}_l - \mathbf{A}_l)\big)}$$
$$\leq \sqrt{k_l}\,\Delta_l\,\sqrt{k_l\,\|\mathbf{H}_l'\|_2^2\,\|\mathcal{A}_l - \mathbf{A}_l\|_2}$$
$$\leq \sqrt{k_l}\,\Delta_l\,\sqrt{k_l\,\big(4\Delta_l\log(2N/\epsilon)\big)^{1/2}} = \sqrt{2}\,k_l\,\Delta_l^{5/4}\big(\log(2N/\epsilon)\big)^{1/4}$$
with probability at least $1 - \epsilon$. The third line is due to the bound on the trace of the product of the positive semi-definite matrix $(\mathcal{A}_l - \mathbf{A}_l)\mathbf{H}_l'\mathbf{H}_l'^\top(\mathcal{A}_l - \mathbf{A}_l)$ with the Hermitian matrix $\mathbf{H}_l'\mathbf{H}_l'^\top$ [197]. The fifth line is due to the relation $\operatorname{tr}(\mathbf{AB}) \leq k\|\mathbf{A}\|_2\|\mathbf{B}\|_2$, and the last line comes from results presented in [56] for single graphs, where $\|\mathcal{A}_l - \mathbf{A}_l\|_2 \leq \big(4\Delta_l\log(2N/\epsilon)\big)^{1/2}$. Similarly,
$$\big(\|\mathbf{H}_l'^\top\mathcal{A}_l\mathbf{H}_l'\|_F - \|\mathbf{H}_l'^\top\mathbf{A}_l\mathbf{H}_l'\|_F\big)^2 \leq \|\mathbf{H}_l'^\top(\mathcal{A}_l - \mathbf{A}_l)\mathbf{H}_l'\|_F^2 \leq \operatorname{tr}\big(\mathbf{H}_l'^\top(\mathcal{A}_l - \mathbf{A}_l)\mathbf{H}_l'\mathbf{H}_l'^\top(\mathcal{A}_l - \mathbf{A}_l)\mathbf{H}_l'\big) \leq 2k_l\,\Delta_l^{1/2}\big(\log(2N/\epsilon)\big)^{1/2}.$$
Combining the above results, we have
$$\big|F(\mathcal{A}, (\mathbf{H}_1', \dots, \mathbf{H}_L')) - F(\mathbf{A}, (\mathbf{H}_1', \dots, \mathbf{H}_L'))\big| \leq \sum_{l=1}^{L}\Big\{2k_l\,\Delta_l^{1/2}\big(\log(2N/\epsilon)\big)^{1/2} + 2\sqrt{2}\,k_l\,\Delta_l^{5/4}\big(\log(2N/\epsilon)\big)^{1/4}\Big\}$$
$$\leq \sum_{l=1}^{L} 6k_l\,\Delta_l^{5/4}\big(\log(2N/\epsilon)\big)^{1/2} \leq 6k_{\max}\big(\log(2N/\epsilon)\big)^{1/2}\sum_{l=1}^{L}\Delta_l^{5/4} \leq 6k_{\max}\big(\log(2N/\epsilon)\big)^{1/2}\Big(\sum_{l=1}^{L}\Delta_l\Big)^{5/4} \leq 6k_{\max}\big(\log(2N/\epsilon)\big)^{1/2}(L\bar{\Delta})^{5/4}.$$
The fourth inequality follows from the relation $\sum_{i=1}^{n} x_i^p \leq \big(\sum_{i=1}^{n} x_i\big)^p$, for $x_i > 0$ and real $p \geq 1$, proved in [19]. The last inequality is due to $\bar{\Delta} = \frac{1}{L}\sum_{l=1}^{L}\Delta_l$.

Finally, let $(\hat{\mathbf{H}}_1', \dots, \hat{\mathbf{H}}_L')$ be the solution of the optimization problem in (2.11), and let $(\bar{\mathbf{H}}_1', \dots, \bar{\mathbf{H}}_L')$ maximize the population version of the objective function, $F(\mathcal{A}, (\mathbf{H}_1', \dots, \mathbf{H}_L'))$. Then $F(\mathbf{A}, (\hat{\mathbf{H}}_1', \dots, \hat{\mathbf{H}}_L')) \geq F(\mathbf{A}, (\bar{\mathbf{H}}_1', \dots, \bar{\mathbf{H}}_L'))$ and $F(\mathcal{A}, (\bar{\mathbf{H}}_1', \dots, \bar{\mathbf{H}}_L')) \geq F(\mathcal{A}, (\hat{\mathbf{H}}_1', \dots, \hat{\mathbf{H}}_L'))$. Therefore, with probability at least $1 - \epsilon$,
$$F(\mathcal{A}, (\bar{\mathbf{H}}_1', \dots, \bar{\mathbf{H}}_L')) - F(\mathcal{A}, (\hat{\mathbf{H}}_1', \dots, \hat{\mathbf{H}}_L')) \leq F(\mathcal{A}, (\bar{\mathbf{H}}_1', \dots, \bar{\mathbf{H}}_L')) - F(\mathbf{A}, (\bar{\mathbf{H}}_1', \dots, \bar{\mathbf{H}}_L')) + F(\mathbf{A}, (\hat{\mathbf{H}}_1', \dots, \hat{\mathbf{H}}_L')) - F(\mathcal{A}, (\hat{\mathbf{H}}_1', \dots, \hat{\mathbf{H}}_L')) \leq 12k_{\max}\big(\log(2N/\epsilon)\big)^{1/2}(L\bar{\Delta})^{5/4}.$$
From [197], we have
$$F(\mathcal{A}, (\bar{\mathbf{H}}_1', \dots, \bar{\mathbf{H}}_L')) - F(\mathcal{A}, (\hat{\mathbf{H}}_1', \dots, \hat{\mathbf{H}}_L')) \geq \frac{N\,r_{MX}}{8N_{\max}}\sum_{l=1}^{L}(\lambda_l)^2.$$
Hence, with probability at least $1 - \epsilon$,
$$r_{MX} \leq \frac{96\,N_{\max}\,k_{\max}\,L^{1/4}\,\bar{\Delta}^{5/4}\,\big(\log(2N/\epsilon)\big)^{1/2}}{N\,\frac{1}{L}\sum_{l=1}^{L}(\lambda_l)^2}.$$
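The perturbation argument above leans on two generic matrix-norm facts: $\|\mathbf{M}\|_F \leq \sqrt{k}\,\|\mathbf{M}\|_2$ for a $k \times k$ matrix $\mathbf{M}$, and submultiplicativity $\|\mathbf{AB}\|_2 \leq \|\mathbf{A}\|_2\|\mathbf{B}\|_2$. A small numerical spot-check of these two inequalities, with a random symmetric matrix and a column-orthonormal matrix standing in for $\mathbf{A}_l$ and $\mathbf{H}_l'$ (illustrative only, not the data setting of the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 200, 5

# Random symmetric matrix playing the role of A_l, and a column-orthonormal
# H (so ||H||_2 = 1) playing the role of the membership matrix H'_l.
A = rng.standard_normal((N, N))
A = (A + A.T) / 2.0
H, _ = np.linalg.qr(rng.standard_normal((N, k)))

M = H.T @ A @ H                      # the k x k compressed matrix
fro = np.linalg.norm(M, "fro")
spec = np.linalg.norm(M, 2)          # largest singular value

# ||M||_F <= sqrt(k) ||M||_2, since M is k x k (rank at most k)
assert fro <= np.sqrt(k) * spec + 1e-10
# ||H^T A H||_2 <= ||H||_2^2 ||A||_2 by submultiplicativity
assert spec <= np.linalg.norm(H, 2) ** 2 * np.linalg.norm(A, 2) + 1e-10
print(fro, np.sqrt(k) * spec)
```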
APPENDIX C: INVERTIBILITY PROOF

The structure of $\mathbf{Y}_1$ involves a Vandermonde-like matrix, which depends on the eigenvalues of the normalized Laplacian $\mathbf{L}_n$. Specifically, the matrix $\mathbf{Y}_1 \in \mathbb{R}^{(Q-1)\times(Q-1)}$ is defined as
$$\mathbf{Y}_1 = \bar{\mathbf{\Psi}}_Q^\top\operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})\operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})\bar{\mathbf{\Psi}}_Q,$$
where
$$\bar{\mathbf{\Psi}}_Q = \begin{bmatrix} \lambda_1 & \lambda_1^2 & \cdots & \lambda_1^{Q-1} \\ \lambda_2 & \lambda_2^2 & \cdots & \lambda_2^{Q-1} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_N & \lambda_N^2 & \cdots & \lambda_N^{Q-1} \end{bmatrix}, \qquad \mathbf{\Psi}_M = \begin{bmatrix} 1 & \lambda_1 & \cdots & \lambda_1^{M-1} \\ 1 & \lambda_2 & \cdots & \lambda_2^{M-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \lambda_N & \cdots & \lambda_N^{M-1} \end{bmatrix}$$
are $N \times (Q-1)$ and $N \times M$ Vandermonde-like matrices, respectively, based on the eigenvalues $\{\lambda_i\}_{i=1}^{N}$ of $\mathbf{L}_n$. Let $\mathbf{C} = \operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})\operatorname{diag}(\mathbf{\Psi}_M\mathbf{c})$, with entries $C_{ii} = \big(\sum_{m=0}^{M-1} c_m\lambda_i^m\big)^2 > 0$.

Proposition 2. For a symmetric and positive definite matrix $\mathbf{C} \in \mathbb{R}^{N\times N}$, $\mathbf{Y}_1 = \bar{\mathbf{\Psi}}_Q^\top\mathbf{C}\bar{\mathbf{\Psi}}_Q \in \mathbb{R}^{(Q-1)\times(Q-1)}$ is invertible if and only if $\bar{\mathbf{\Psi}}_Q$ has full rank.

Proof. Let $\mathbf{x} \in \mathbb{R}^{Q-1}\setminus\{\mathbf{0}\}$ be an arbitrary vector. If $\bar{\mathbf{\Psi}}_Q$ has full (column) rank, then it is injective. Hence, define $\mathbf{y} := \bar{\mathbf{\Psi}}_Q\mathbf{x} \in \mathbb{R}^{N}\setminus\{\mathbf{0}\}$. The positive definiteness of $\mathbf{C}$ yields
$$\mathbf{x}^\top\big(\bar{\mathbf{\Psi}}_Q^\top\mathbf{C}\bar{\mathbf{\Psi}}_Q\big)\mathbf{x} = \big(\bar{\mathbf{\Psi}}_Q\mathbf{x}\big)^\top\mathbf{C}\big(\bar{\mathbf{\Psi}}_Q\mathbf{x}\big) = \mathbf{y}^\top\mathbf{C}\mathbf{y} > 0,$$
i.e., $\bar{\mathbf{\Psi}}_Q^\top\mathbf{C}\bar{\mathbf{\Psi}}_Q$ is (symmetric and) positive definite and thus invertible. Conversely, if $\bar{\mathbf{\Psi}}_Q$ does not have full rank, it is not injective, and there exists a vector $\mathbf{x} \in \mathbb{R}^{Q-1}\setminus\{\mathbf{0}\}$ such that $\bar{\mathbf{\Psi}}_Q\mathbf{x} = \mathbf{0}$. Hence, $\bar{\mathbf{\Psi}}_Q^\top\mathbf{C}\bar{\mathbf{\Psi}}_Q\mathbf{x} = \mathbf{0}$, so $\bar{\mathbf{\Psi}}_Q^\top\mathbf{C}\bar{\mathbf{\Psi}}_Q$ is not injective and thus not invertible.

Therefore, for $\mathbf{Y}_1$ to be invertible, the Vandermonde matrix must have full rank. In our case, $Q - 1 < N$, so the Vandermonde matrix is full rank if $\mathbf{L}_n$ has at least $Q - 1$ distinct eigenvalues. This is a reasonable assumption given that $Q - 1 \ll N$. Similarly, $\mathbf{Y}_2$ depends on $\mathbf{\Psi}_M$. Since the same Vandermonde-like structure is present, the same reasoning applies: as long as $M \ll N$, it suffices for $\mathbf{L}_n$ to have $M$ distinct eigenvalues for $\mathbf{Y}_2$ to be invertible. These conditions are typically satisfied in practical scenarios where $Q, M \ll N$.
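Proposition 2 is easy to verify numerically. The sketch below builds the eigenvalues of a normalized Laplacian of a random graph (distinct with high probability), forms the Vandermonde-like matrices, and checks that $\mathbf{Y}_1$ has full rank; the sizes $N$, $Q$, $M$ and the coefficients $\mathbf{c}$ are hypothetical, and the small offset added to $\mathbf{C}$ only enforces the strict positivity the proposition assumes:

```python
import numpy as np

rng = np.random.default_rng(1)
N, Q, M = 60, 8, 6   # hypothetical sizes with Q - 1, M << N

# Normalized Laplacian L_n = I - D^{-1/2} A D^{-1/2} of a random graph;
# its eigenvalues are distinct with high probability.
A = (rng.random((N, N)) < 0.2).astype(float)
A = np.triu(A, 1)
A = A + A.T
deg = np.maximum(A.sum(axis=1), 1.0)
Ln = np.eye(N) - A / np.sqrt(np.outer(deg, deg))
lam = np.linalg.eigvalsh(Ln)

# Vandermonde-like matrices: powers 1..Q-1 and 0..M-1 of the eigenvalues
Psi_bar_Q = np.vander(lam, Q, increasing=True)[:, 1:]   # N x (Q-1)
Psi_M = np.vander(lam, M, increasing=True)              # N x M

c = rng.standard_normal(M)                  # hypothetical filter coefficients
C = np.diag((Psi_M @ c) ** 2 + 1e-9)        # strictly positive diagonal C

Y1 = Psi_bar_Q.T @ C @ Psi_bar_Q
# Full column rank of Psi_bar_Q (>= Q-1 distinct eigenvalues) makes Y1 invertible
print(np.linalg.matrix_rank(Psi_bar_Q), np.linalg.matrix_rank(Y1))   # both Q-1
```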