SIGNAL PROCESSING ON GRAPHS: COMMUNITY DETECTION AND GRAPH LEARNING FOR MULTILAYER NETWORKS

By

Abdullah Karaaslanli

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering—Doctor of Philosophy

2023

ABSTRACT

Community detection and graph learning are two important problems in graph analysis. The former deals with the topological analysis of graphs to identify their mesoscale organization, while graph learning aims to infer the interactions between the nodes of a graph from data when the graph topology is not known a priori. Existing community detection and graph learning methods are mostly limited to single-layer graphs, where nodes are assumed to be connected by a single static edge. However, this assumption ignores the fact that many real-world relational data have multiple dimensions, which can be better represented with multilayer graphs. In this thesis, we propose various community detection and graph learning methods for different types of multilayer graphs.

In Chapter 2, we tackle the community detection problem in dynamic networks. Specifically, we focus on evolutionary spectral clustering, which extends spectral clustering to dynamic networks to learn a community structure that changes smoothly over time. We show the equivalence of evolutionary spectral clustering to a variant of the dynamic stochastic blockmodel. For this purpose, we first introduce a novel dynamic SBM where the evolution of communities over time is modeled with pairwise Markov random fields. We then show that the log-posterior of the proposed model is equivalent to the quality function of evolutionary spectral clustering. This equivalence is used to determine the forgetting factor in evolutionary spectral clustering and to develop two new algorithms for dynamic community detection. The proposed algorithms are applied to both simulated and real-world dynamic networks, and their performances are compared to state-of-the-art dynamic community detection methods.

Chapter 3 introduces a multilayer community detection method, which is especially tailored to handle multilayer brain networks constructed from electroencephalogram (EEG) data. In particular, we first construct functional multilayer networks from EEG data, where layers correspond to different frequency bands and interlayer edges are allowed between all brain regions. Next, a new multilayer modularity metric is defined based on a multilayer null model that preserves the layer-wise node degrees while randomizing the remaining characteristics of the network. The proposed modularity is parameterized with a resolution parameter, to handle the resolution limit of modularity, and an interlayer scale parameter, to control the importance of interlayer edges in community formation. Third, a group community detection method is proposed to find the common community structure for a set of subjects. The proposed multilayer community detection method is employed to identify group-level differences between the two response types during the Flanker task, i.e. error and correct.

In Chapter 4, we present an algorithm to learn signed graphs, which we represent as a two-layer multiplex network where one layer corresponds to positive edges and the other to negative edges. The algorithm is based on graph learning approaches developed using graph signal processing. Existing graph learning methods rely on the smoothness of graph signals over the graph; however, they are only capable of learning unsigned graphs.
To this end, we propose a signed graph learning approach that learns signed graphs based on the assumption of smoothness and non-smoothness of graph signals over positive and negative edges, respectively. The proposed method is further extended using kernels to take the nonlinear relations between nodes into account. From a GSP perspective, this extension corresponds to assuming smoothness/non-smoothness of graph signals in a higher dimensional space defined by the kernel. The proposed approach is applied to the problem of gene regulatory network inference from single cell gene expression data. Experiments on simulated and real single cell datasets show that the method compares favorably with other single cell gene regulatory network reconstruction algorithms.

Chapter 5 addresses the problem of how to learn multiple signed graphs simultaneously. Existing GSP based GL approaches for this problem are limited to unsigned graph topologies. Therefore, we extend the algorithm developed in Chapter 4 to learn multiple signed graphs. In particular, given multiple datasets, each of which includes graph signals associated with a signed graph, we assume smoothness and non-smoothness of graph signals as in Chapter 4. Furthermore, we assume that the signed graphs are similar to each other, which is ensured by regularizing the learned signed graphs through a learned signed consensus graph. The proposed method is employed for the joint inference of multiple gene regulatory networks from single cell gene expression data. Experiments on simulated and real single cell datasets show that the method performs better than methods that can learn a single graph at a time and previous joint gene regulatory network reconstruction algorithms.

In Chapter 6, we tackle the problem of learning multiple unsigned graphs from a heterogeneous dataset, which requires clustering graph signals while learning a graph for each cluster. Namely, we present an optimization problem for joint graph signal clustering and graph topology inference. The approach extends graph cut based clustering by partitioning the graph signals not only based on their pairwise similarities but also on their smoothness with respect to the graphs associated with the clusters. The proposed method also learns the representative graph for each cluster using the smoothness of the graph signals with respect to the graph topology. Results on simulated and real data indicate the effectiveness of the proposed method.

ACKNOWLEDGEMENTS

This thesis was written with the help of many great people, and I would like to express my appreciation to them. First, I would like to give a special thank you to my advisor, Dr. Selin Aviyente, for giving me the opportunity to work with her and for introducing the fascinating topic of this thesis to me. Without her guidance, patience and encouragement, this thesis would not have been possible. Secondly, I want to thank Dr. Tapabrata Maiti, whose co-advising helped me greatly during the preparation of parts of this thesis. I would also like to thank my co-authors, Dr. Satabdi Saha, Dr. Tamanna Munia and Meiby Ortizbouza. It was a great pleasure to work with them and to learn from their expertise. Furthermore, I would like to thank the committee members of this thesis for their time and effort. They provided comments and suggestions, which gave this thesis its final shape. Finally, I am grateful for my family and my friends, whose valuable support gave me courage while pursuing my degree.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Background and Notations
1.2 Community Detection
1.3 Graph Signal Processing
1.4 Organization and Contributions of the Thesis
CHAPTER 2 COMMUNITY DETECTION IN DYNAMIC NETWORKS
2.1 Introduction
2.2 Background
2.3 Dynamic MRF-DCSBM and Log-Posterior Formulation
2.4 Dynamic Spectral Clustering
2.5 Results
2.6 Conclusions
CHAPTER 3 COMMUNITY DETECTION IN MULTILAYER NETWORKS
3.1 Introduction
3.2 Multi-frequency EEG Networks
3.3 Multilayer Modularity
3.4 Results
3.5 Discussion
3.6 Conclusions
CHAPTER 4 LEARNING SIGNED GRAPHS
4.1 Introduction
4.2 Learning Signed Graphs from Graph Signals
4.3 Results
4.4 Conclusions
CHAPTER 5 LEARNING MULTIVIEW SIGNED GRAPHS
5.1 Introduction
5.2 Methods
5.3 Results
5.4 Conclusion
CHAPTER 6 SIMULTANEOUS GRAPH SIGNAL CLUSTERING AND GRAPH LEARNING
6.1 Introduction
6.2 Method
6.3 Results
6.4 Conclusions
CHAPTER 7 CONCLUSIONS
7.1 Future Work
BIBLIOGRAPHY
CHAPTER 1
INTRODUCTION

Many real-world applications consist of networked systems, i.e. they include entities that are related to each other in different ways. For example, users on a social media platform connect through messaging, or genes and proteins within a cell interact through regulatory relations. Such systems can be modeled as graphs (or networks; throughout this thesis, the terms graph and network are used interchangeably), where entities and their interactions are represented by nodes and edges, respectively [166, 5, 27]. Although graphs have successfully been used in many disciplines, existing work is generally limited to single-layer graphs, where nodes are assumed to be connected by a single static edge. This assumption ignores the fact that many real-world relational data have multiple dimensions. For instance, a user on a social media platform can connect to another user through friendship, messaging or post-sharing. Similarly, in brain networks, the functional connectivity between brain regions occurs across multiple frequency bands [59, 248]. Multilayer graphs were developed to represent and study this multiplicity of interactions simultaneously [117, 28, 6]. In a multilayer graph, different interactions are represented by layers, as depicted in Figure 1.1c. Layers consist of nodes and intralayer edges, representing entities and interactions, respectively. Besides intralayer edges, a multilayer network may include interlayer edges that connect nodes from different layers.

Despite their possible oversimplification, single-layer graphs have been used to reveal many structural and dynamic properties of networked systems: centrality of nodes [29, 53], small-worldness [254, 172], the scale-free property [15, 16], etc. One of the fundamental properties of graphs is community structure, where the nodes are partitioned into tightly connected groups [79, 81]. Many algorithms have been developed for the detection of communities, as the identification of communities has important applications in recommendation systems [197], social sciences [149] and network neuroscience [232]. However, most of these methods were developed for single-layer graphs and are not directly applicable to community detection in multilayer graphs. Considering the importance of communities in graph analysis, there is a need to develop community detection algorithms for multilayer graphs.

The topological analysis of graphs characterizes a networked system by studying the interactions between entities. However, the nodes of a graph can be associated with a significant amount of data that also needs to be studied. For instance, nodes in a transportation network can have attributes related to logistic data describing how goods are traded, and people in a social network are associated with various data such as age, gender, etc. [227]. Graph Signal Processing (GSP) is a recent research field that aims to learn from this data by incorporating the graph topology into learning algorithms. In GSP, node data is represented as a graph signal, which can be considered as a vector whose entries are indexed by the graph nodes. Graph signals can then be studied with tools that extend classical signal processing concepts such as the Fourier transform, filtering, sampling and imputation [178]. In many applications of network science and GSP, the graph topology is assumed to be known. This assumption holds in some areas, e.g. friendship networks or citation networks.
However, there are many cases where the graph topology is not readily available. For instance, in network neuroscience, the functional interactions between brain regions are not known and need to be learned from data collected by functional magnetic resonance imaging (fMRI) or electroencephalogram (EEG) recordings. To this end, various graph learning (GL) methods have been developed to infer the graph topology from graph signals. Traditional GL methods include statistical modeling, such as probabilistic graphical models [82, 14], or physically motivated methods, where graph signals are modeled as the product of dynamic processes on the graph [196, 161]. Recently, the graph learning problem has been considered from a GSP perspective, where the graph Fourier transform (GFT) of the signals is employed [68, 137]. Due to the explicit representation of graph signals with the GFT, these methods provide great flexibility and are observed to perform better than traditional GL methods [69, 107, 23]. However, most existing graph learning methods are limited to inferring only a single type of connection between nodes, and only a few works consider learning multilayer networks [164, 110].

1.1 Background and Notations

In this thesis, scalars, vectors and matrices are indicated by letters ($x$ or $N$), bold lowercase letters ($\mathbf{x}$) and bold uppercase letters ($\mathbf{X}$), respectively. Entries of a vector are denoted as $x_i$ and entries of a matrix are denoted as $X_{ij}$. The $i$th row and column of $\mathbf{X}$ are indicated as $\mathbf{X}_{i\cdot}$ and $\mathbf{X}_{\cdot i}$, respectively, and both are assumed to be column vectors. The superscript $\top$ indicates the transpose of vectors and matrices. $\langle \cdot, \cdot \rangle$ is used to represent the inner product. The identity matrix is shown by $\mathbf{I}$. All-ones and zero vectors and matrices are shown as $\mathbf{1}$ and $\mathbf{0}$, respectively; if their dimensions are not clear from context, they are shown with a subscript indicating the dimensions, e.g. $\mathbf{1}_n$ or $\mathbf{0}_{n \times n}$. The operator $\mathrm{diag}(\cdot)$ either takes a matrix $\mathbf{X}$ and returns a vector $\mathbf{x}$ with $x_i = X_{ii}$, or takes a vector $\mathbf{x}$ and returns a diagonal matrix $\mathbf{X}$ with $X_{ii} = x_i$. The operator $\mathrm{upper}(\cdot) : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ returns the upper triangular part of the input matrix, where $m = n(n-1)/2$. For an $n \times n$ symmetric matrix $\mathbf{A}$, the matrix $\mathbf{S} \in \mathbb{R}^{n \times m}$ is defined such that $\mathbf{S}\,\mathrm{upper}(\mathbf{A}) = \mathbf{A}\mathbf{1} - \mathrm{diag}(\mathbf{A})$. Finally, $\delta_{ij}$ is the Kronecker delta, which is 1 if $i = j$ and 0 otherwise.

1.1.1 Single-layer Graphs

A single-layer network is denoted by $G = (V, E)$, where $V$ is the node set with $|V| = n$ and $E \subseteq V \times V$ is the edge set. An edge from node $u$ to $v$ is represented by $e_{uv}$ and is associated with a weight $w_{uv}$. If $e_{uv} = e_{vu}$, the graph is said to be undirected; otherwise, it is a directed graph. In this thesis, graphs are assumed to be undirected unless otherwise stated. If $w_{uv} = 1$, $\forall e_{uv} \in E$, the graph is binary; otherwise, it is weighted. $G$ is an unsigned graph if the edge weights are constrained to only positive values. Finally, if the edge weights can take on both positive and negative values, the graph is said to be signed. Algebraically, an unsigned graph $G$ can be represented by a symmetric adjacency matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, where $A_{uv} = w_{uv}$ if $e_{uv} \in E$ and 0 otherwise. The degree of a node $u$ is the sum of the weights of the edges connected to it, i.e. $d_u = \mathbf{A}_{u\cdot}^\top \mathbf{1}$. The degree vector of $G$ is $\mathbf{d} = \mathbf{A}\mathbf{1}$, and $\mathbf{D} = \mathrm{diag}(\mathbf{d})$ is its degree matrix. The combinatorial Laplacian matrix of $G$ is $\mathbf{L} = \mathbf{D} - \mathbf{A}$. $\mathbf{L}$ is a positive semi-definite matrix and has the eigendecomposition $\mathbf{L} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top$, where $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues and the columns of $\mathbf{V}$ are the eigenvectors. The eigenvalues of $\mathbf{L}$ are assumed to be sorted in ascending order, i.e. $0 = \Lambda_{11} \leq \Lambda_{22} \leq \dots \leq \Lambda_{nn}$.
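To make the notation concrete, the following minimal numpy sketch (an illustration, not part of the thesis; the adjacency matrix is hypothetical example data) builds the degree vector $\mathbf{d} = \mathbf{A}\mathbf{1}$, the degree matrix $\mathbf{D} = \mathrm{diag}(\mathbf{d})$ and the combinatorial Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{A}$, and checks the ascending-eigenvalue convention:

```python
import numpy as np

# Hypothetical weighted, undirected single-layer graph on n = 4 nodes.
A = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.5, 1.0, 0.0, 2.0],
              [0.0, 0.0, 2.0, 0.0]])

d = A @ np.ones(A.shape[0])   # degree vector d = A1
D = np.diag(d)                # degree matrix D = diag(d)
L = D - A                     # combinatorial Laplacian L = D - A

# Eigendecomposition L = V Lambda V^T; eigh returns eigenvalues in
# ascending order, matching 0 = Lambda_11 <= ... <= Lambda_nn.
lam, V = np.linalg.eigh(L)
assert np.isclose(lam[0], 0.0)                  # L is PSD with a zero eigenvalue
assert np.allclose(V @ np.diag(lam) @ V.T, L)   # reconstruction check
```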
1.1.2 Multilayer Graphs

A multilayer network is a quadruplet $M = (\mathcal{V}, \mathcal{L}, V, E)$, where $\mathcal{V}$ is the set of entities, e.g. people or brain regions, and $\mathcal{L}$ is the set of layers with $|\mathcal{L}| = L$ [117]. $V \subseteq \mathcal{V} \times \mathcal{L}$ with $|V| = n$ is the set of nodes, which are the representations of entities in layers, and $E \subseteq V \times V$ is the edge set. Nodes are indicated as $u^h$, where $u \in \mathcal{V}$ and $h \in \mathcal{L}$. An edge from $u^h$ to $v^k$ is indicated by $e^{hk}_{uv}$ and is associated with the weight $w^{hk}_{uv}$. Similar to single-layer networks, $M$ can be undirected/directed, binary/weighted or unsigned/signed. In this thesis, multilayer graphs are assumed to be undirected unless otherwise stated.

$V$ can be partitioned based on layers, that is, $V = \bigcup_{h=1}^{L} V^h$, where $V^h$ with $|V^h| = n^h$ is the set of nodes in layer $h$. Similarly, $E$ can be partitioned as $E = \bigcup_{h=1}^{L} E^h \cup \bigcup_{h \neq k} E^{hk}$, where $E^h$ is the set of intralayer edges in layer $h$ and $E^{hk}$ is the set of interlayer edges between nodes in layers $h$ and $k$. From the partitioning of $V$ and $E$, one can define intralayer graphs $G^h = (V^h, E^h)$ and bipartite interlayer graphs $G^{hk} = (V^h, V^k, E^{hk})$. (A bipartite graph $G = (V_1, V_2, E)$ consists of two node sets $V_1$ and $V_2$ with $|V_1| = n_1$ and $|V_2| = n_2$, and edges are only allowed between the two sets, i.e. $E \subseteq V_1 \times V_2$. The incidence matrix of $G$ is $\mathbf{A} \in \mathbb{R}^{n_1 \times n_2}$, where $A_{ij} = w_{ij}$ if $e_{ij} \in E$ and 0 otherwise.) Let $\mathbf{A}^h$ be the adjacency matrix of $G^h$ and $\mathbf{A}^{hk}$ be the incidence matrix of $G^{hk}$.

[Figure 1.1: Types of graphs used in this thesis. a) A two-layer dynamic network, where layers are ordered and correspond to time points. b) A two-layer multiplex graph, where interlayer edges are only allowed between nodes that represent the same entity. c) A two-layer multilayer graph, where interlayer edges can occur between any pair of nodes.]

$M$ is represented by a supra-adjacency matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, a symmetric block matrix defined as follows:

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}^{1} & \mathbf{A}^{12} & \dots & \mathbf{A}^{1L} \\ \mathbf{A}^{21} & \mathbf{A}^{2} & \dots & \mathbf{A}^{2L} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{A}^{L1} & \mathbf{A}^{L2} & \dots & \mathbf{A}^{L} \end{bmatrix}. \quad (1.1)$$

Using the supra-adjacency matrix of $M$, its supra-Laplacian matrix can be defined analogously to the Laplacian matrix of single-layer networks. In a multilayer graph, there are no constraints on the sets $E^{hk}$; that is, there can be an edge between any $u^h$ and $v^k$, as depicted in Figure 1.1c. In this thesis, we will also use two other graph types, which can be considered constrained versions of multilayer graphs. If interlayer connections are allowed only between nodes representing the same entity, i.e. $E^{hk} = \{e^{hk}_{uu} \mid u^h \in V^h, u^k \in V^k\}$ for all $h \neq k$, the network is a multiplex (or multiview) network (Figure 1.1b); throughout the thesis, the terms multiplex and multiview are used interchangeably. A dynamic network is a type of multiplex network whose layers are ordered and correspond to time points, with interlayer edges allowed only between consecutive time points, i.e. $E^{hk} = \{e^{hk}_{uu} \mid u^h \in V^h, u^k \in V^k\}$ if $k = h + 1$ and $E^{hk} = \emptyset$ otherwise (Figure 1.1a).
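As a concrete illustration of the supra-adjacency matrix in (1.1), the sketch below (a hypothetical example, not from the thesis; it assumes all layers have the same number of nodes) assembles $\mathbf{A}$ from intralayer adjacency matrices and interlayer incidence matrices, using an identity coupling so that the result is a two-layer multiplex network:

```python
import numpy as np

def supra_adjacency(intra, inter):
    """Assemble the supra-adjacency matrix of (1.1) from per-layer
    adjacency matrices intra[h] and incidence matrices inter[(h, k)]
    (for h < k); a sketch for equally sized layers."""
    L = len(intra)
    n_h = intra[0].shape[0]
    A = np.zeros((L * n_h, L * n_h))
    for h in range(L):
        A[h*n_h:(h+1)*n_h, h*n_h:(h+1)*n_h] = intra[h]      # diagonal blocks A^h
        for k in range(h + 1, L):
            Ahk = inter[(h, k)]
            A[h*n_h:(h+1)*n_h, k*n_h:(k+1)*n_h] = Ahk       # off-diagonal A^hk
            A[k*n_h:(k+1)*n_h, h*n_h:(h+1)*n_h] = Ahk.T     # symmetry: A^kh = (A^hk)^T
    return A

# Two-layer multiplex example: interlayer edges only couple the two
# copies of the same entity, so A^12 is diagonal (identity coupling).
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A2 = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]], dtype=float)
A = supra_adjacency([A1, A2], {(0, 1): np.eye(3)})
assert np.allclose(A, A.T)   # the supra-adjacency matrix is symmetric
```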
1.2 Community Detection

The edges of many real-world networks are distributed heterogeneously, with a high number of edges within groups of nodes and a low number of edges between groups. This feature is called community structure [79]. The community structure of a single-layer graph $G$ can be one of the following types: non-overlapping, overlapping, hierarchical or local [39]. In this thesis, the focus is on non-overlapping community detection, which is the partitioning of the node set $V$ as $\mathcal{P} = \{C_1, \dots, C_K\}$, where $K$ is the number of communities. The community structure $\mathcal{P}$ can be represented by various mathematical objects: the community membership vector $\mathbf{g} \in \mathbb{R}^n$, whose entries are $g_i = r$ if $i \in C_r$; or the binary indicator matrix $\mathbf{Z} \in \mathbb{R}^{n \times K}$, which is defined with entries $Z_{ir} = 1$ if $g_i = r$ and 0 otherwise. The aim of community detection is the algorithmic identification of $\mathcal{P}$. This task is usually performed by optimizing a quality function that quantifies how good a given partition is as a community structure. A plethora of quality functions have been proposed for single-layer graphs [79], and an overview of the ones used in this thesis is given below.

1.2.1 Graph Cut and Association

As mentioned, a community structure is defined as the partitioning of the nodes into well-connected groups that are sparsely connected to each other. Therefore, one way to measure the goodness of a partition is to count the number of inter-community edges, referred to as the cut of a partition, which is defined as [225, 249]:

$$\mathrm{cut}(\mathcal{P}) = \sum_{i,j=1}^{n} A_{ij} (1 - \delta_{g_i g_j}) = \mathrm{tr}(\mathbf{Z}^\top \mathbf{L} \mathbf{Z}). \quad (1.2)$$

Instead of minimizing the cut, one can also maximize the number of intra-community edges, referred to as the association of a partition [64]:

$$\mathrm{assoc}(\mathcal{P}) = \sum_{i<j}^{n} A_{ij} \delta_{g_i g_j} = \frac{1}{2} \mathrm{tr}(\mathbf{Z}^\top \mathbf{A} \mathbf{Z}). \quad (1.3)$$

Optimizing the cut or association with respect to $\mathbf{Z}$ leads to the trivial solution where all nodes are assigned to the same community. To prevent this, $\mathbf{Z}$ is further constrained to make sure communities have similar sizes. However, due to the discrete structure of $\mathbf{Z}$, the optimization problem is NP-hard [249]; therefore, $\mathbf{Z}$ is relaxed to take on real values, which leads to the following optimization problem:

$$\begin{aligned} \underset{\mathbf{Z}}{\text{minimize}} \quad & f(\mathbf{Z}) & (1.4) \\ \text{subject to} \quad & \mathbf{Z} \in \mathcal{D}, & (1.5) \end{aligned}$$

where $f(\mathbf{Z})$ is either the cut or the association, and $\mathbf{Z}$ is constrained to be in a set $\mathcal{D}$ to ensure that $\mathbf{Z}$ preserves some properties of its discrete form. These properties can include positivity ($\mathbf{Z} \geq \mathbf{0}$), a row-sum constraint ($\mathbf{Z}\mathbf{1} = \mathbf{1}$) or orthogonality ($\mathbf{Z}^\top \mathbf{Z} = \mathbf{I}$) [249, 223, 267]. Once a real-valued $\mathbf{Z}$ is learned, clustering algorithms such as $k$-means can be employed to identify the community structure.

1.2.2 Modularity

Another popular quality function for community detection is the modularity function, which quantifies the quality of a community structure by comparing intra-community connections to those expected under a specified null model. It is calculated as [171]:

$$Q = \sum_{i,j} \left[ A_{ij} - P_{ij} \right] \delta_{g_i g_j}, \quad (1.6)$$

where $P_{ij}$ is the expected connection between nodes $i$ and $j$ under a null model. Depending on the graph under study, different expressions for $P_{ij}$ can be assumed. The most commonly used null models are the configuration null model and the Erdős–Rényi null model [21]. Despite its popularity, modularity is known to suffer from the resolution limit, which restricts the size of detectable communities: communities smaller than some size are mathematically undetectable. In order to detect communities of all sizes, modularity has been extended to include a resolution parameter, $\gamma$, which is tuned to uncover communities of different sizes [199]:

$$Q = \sum_{i,j} \left[ A_{ij} - \gamma P_{ij} \right] \delta_{g_i g_j}. \quad (1.7)$$

By varying the value of $\gamma$, one can detect communities of different sizes: large values of $\gamma$ make maximizing modularity return small communities, while small values return large ones, resulting in a multi-scale community structure.
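The sketch below evaluates the resolution-parameterized modularity (1.7) under the configuration null model, for which $P_{ij} = d_i d_j / 2m$; the choice of null model and the toy graph are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def modularity(A, g, gamma=1.0):
    """Resolution-parameterized modularity (1.7) with the configuration
    null model P_ij = d_i d_j / (2m); unnormalized, as in (1.6)-(1.7)."""
    d = A.sum(axis=1)                  # degrees
    P = np.outer(d, d) / d.sum()       # expected connections; d.sum() = 2m
    same = g[:, None] == g[None, :]    # delta_{g_i g_j}
    return ((A - gamma * P) * same).sum()

# Toy graph: two triangles joined by a single edge, one community each.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
g = np.array([0, 0, 0, 1, 1, 1])
print(modularity(A, g, gamma=1.0))   # positive: partition beats the null model
print(modularity(A, g, gamma=5.0))   # larger gamma penalizes large communities
```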
1.2.3 Stochastic Blockmodeling

Stochastic blockmodeling (SBM) is a generative network model developed to study networks with block structure, where nodes are assigned to one of $K$ blocks. Given the block assignments, edges are sampled independently from a Bernoulli distribution with a $K \times K$ edge probability matrix $\boldsymbol{\theta}$, where $\theta_{rs}$ is the probability of connectivity between blocks $r$ and $s$ [97, 87]. For networks with a community structure, $\theta_{rr} > \theta_{rs}$, $\forall r \neq s$. In this thesis, we employ a restricted version of SBM called the planted-partition model, where $\theta_{rs} = \theta_{in}$ if $r = s$ and $\theta_{rs} = \theta_{out}$ otherwise [49]. $\theta_{in}$ and $\theta_{out}$ are the intra- and inter-community connectivity probabilities, respectively.

Besides generating networks, SBM is also used for the statistical inference of community structure [231, 113, 2, 170, 3]. In [113], community detection with the standard SBM is shown to fail in networks with a heterogeneous degree distribution. To overcome this problem, the degree corrected SBM (DCSBM) was introduced, where edge probabilities are modified by the degrees of nodes. Given the community assignment $\mathbf{g}$, edges are sampled independently from a Poisson distribution with mean $\lambda_{ij} = d_i d_j \theta_{g_i g_j}$. Community detection is then performed by maximizing the likelihood function, which can be written as:

$$P(\mathbf{A} \mid \mathbf{g}, \boldsymbol{\theta}) = \prod_{i<j}^{n} \frac{\lambda_{ij}^{A_{ij}} \exp(-\lambda_{ij})}{A_{ij}!}. \quad (1.8)$$

Different techniques, such as heuristic methods [113], variational inference [2, 3] and Markov Chain Monte Carlo methods [231, 184], are employed to maximize the log-likelihood function.

1.3 Graph Signal Processing

A graph signal over a graph $G$ is a function $x : V \to \mathbb{R}$ and can be represented by a vector $\mathbf{x} \in \mathbb{R}^n$, where each $x_i$ is the signal value on node $i$. An important concept in the processing of graph signals is their representation in the graph frequency domain through the graph Fourier transform (GFT). This representation allows us to characterize $\mathbf{x}$ in terms of its graph spectral content as either low- or high-frequency, where low- (high-) frequency graph signals have small (large) variation with respect to the graph [210]. For an unsigned graph $G$, the GFT is defined as the expansion of $\mathbf{x}$ in terms of the eigenbasis of the graph Laplacian [227]. Let $\mathbf{L}$ be the combinatorial Laplacian of an unsigned graph $G$ with eigendecomposition $\mathbf{L} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top$, as described in Section 1.1.1. The GFT of $\mathbf{x}$ is then $\hat{\mathbf{x}} = \mathbf{V}^\top \mathbf{x}$ and the inverse GFT is [227]:

$$\mathbf{x} = \mathbf{V}\hat{\mathbf{x}} = \sum_{i=1}^{n} \hat{x}_i \mathbf{V}_{\cdot i}. \quad (1.9)$$

Thus, $\mathbf{x}$ is a linear combination of the eigenvectors of $\mathbf{L}$ with coefficients equal to the entries of $\hat{\mathbf{x}}$. Eigenvectors of $\mathbf{L}$ corresponding to small eigenvalues have small variation over the graph. Thus, if most of the energy of $\hat{\mathbf{x}}$ lies in the entries $\hat{x}_i$ corresponding to the small eigenvalues, then $\mathbf{x}$ varies little over $G$, i.e. it is smooth. On the other hand, if most of the energy of $\hat{\mathbf{x}}$ lies in the entries corresponding to the large eigenvalues, $\mathbf{x}$ has high variation over $G$, i.e. it is non-smooth. The total variation of $\mathbf{x}$ over $G$ is then quantified as [227]:

$$\hat{\mathbf{x}}^\top \boldsymbol{\Lambda} \hat{\mathbf{x}} = \mathbf{x}^\top \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top \mathbf{x} = \mathbf{x}^\top \mathbf{L} \mathbf{x}, \quad (1.10)$$

which is small for low-frequency graph signals and large for high-frequency ones.
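A minimal sketch of the GFT and the total variation measure (1.10), using a path graph as a hypothetical example: a low-frequency eigenvector yields a small $\mathbf{x}^\top \mathbf{L} \mathbf{x}$, and a high-frequency one a large value:

```python
import numpy as np

n = 8
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)  # path graph
L = np.diag(A.sum(axis=1)) - A
lam, V = np.linalg.eigh(L)            # ascending eigenvalues; V is the GFT basis

for x in (V[:, 1], V[:, -1]):         # low- and high-frequency eigenvectors
    x_hat = V.T @ x                   # GFT: x_hat = V^T x
    assert np.allclose(V @ x_hat, x)  # inverse GFT (1.9) recovers x
    print(f"total variation x^T L x = {x @ L @ x:.3f}")  # small, then large
```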
1.3.1 Unsigned Graph Learning

An unknown unsigned graph $G$ can be learned from a set of observed graph signals based on the assumptions made about the relation between the graph signals and the topology of $G$. In GSP based GL, two major approaches are followed: smoothness based methods [68], where the graph is learned under the assumption that graph signals vary smoothly with respect to $G$; and stationarity based methods, where the graph is learned from signals that are assumed to be stationary on $G$ [137]. In this thesis, we focus on learning graphs with the smoothness assumption for the following reasons. First, smooth signals admit low-pass and sparse representations in the graph Fourier domain; thus, the GL problem is equivalent to finding efficient information processing transforms for graph signals. Second, many graph-based machine learning tasks, such as spectral clustering and graph regularized learning, are developed based on the smoothness of graph signals. Finally, smooth graph signals are observed ubiquitously in real-world applications [137].

Smoothness based GL was first considered in [69] by modeling graph signals using factor analysis, where the transformation from factors to observed signals exploits the graph topology. By imposing a suitable prior on the factors, the graph signals are modeled to have a low-frequency representation in the graph Fourier domain. This analysis results in an optimization problem where $G$ is learned by minimizing (1.10) with respect to $\mathbf{L}$, given a set of graph signals $\{\mathbf{x}_i\}_{i=1}^{p}$, as follows:

$$\begin{aligned} \underset{\mathbf{L} \in \mathcal{L}}{\text{minimize}} \quad & \mathrm{tr}(\mathbf{X}^\top \mathbf{L} \mathbf{X}) + \alpha \|\mathbf{L}\|_F^2 \\ \text{subject to} \quad & \mathrm{tr}(\mathbf{L}) = 2n, \end{aligned} \quad (1.11)$$

where $\mathbf{X} \in \mathbb{R}^{n \times p}$ is the data matrix whose columns are the $\mathbf{x}_i$'s and $\mathcal{L} = \{\mathbf{L} : L_{ij} = L_{ji} \leq 0 \; \forall i \neq j, \; \mathbf{L}\mathbf{1} = \mathbf{0}\}$ is the set of valid Laplacian matrices. The first term in (1.11) measures the total variation of the graph signals. The second term is the Frobenius norm of $\mathbf{L}$ and controls the density of the learned graph, such that larger values of $\alpha$ result in denser graphs. Finally, the last constraint is added to prevent the trivial solution $\mathbf{L} = \mathbf{0}$.
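Because (1.11) is convex in $\mathbf{L}$, it can be prototyped with a generic convex solver. The sketch below assumes cvxpy is available and uses random placeholder signals; it is a direct encoding of (1.11), not the algorithm of [69]:

```python
import numpy as np
import cvxpy as cp

n, p, alpha = 10, 200, 0.5
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))       # placeholder graph signals

L = cp.Variable((n, n), symmetric=True)
# tr(X^T L X) is linear in L and equals the inner product <L, X X^T>.
objective = cp.Minimize(cp.sum(cp.multiply(L, X @ X.T)) + alpha * cp.sum_squares(L))
constraints = [cp.trace(L) == 2 * n,          # excludes the trivial solution L = 0
               L @ np.ones(n) == 0]           # rows sum to zero (L1 = 0)
constraints += [L[i, j] <= 0                  # non-positive off-diagonals
                for i in range(n) for j in range(i + 1, n)]
cp.Problem(objective, constraints).solve()
A_learned = np.diag(np.diag(L.value)) - L.value   # recover adjacency A = D - L
```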
1.4 Organization and Contributions of the Thesis

In this thesis, we develop methods for two important problems in multilayer networks: community detection and graph learning. Methods for community detection are developed by answering questions such as what constitutes a community in a multilayer network and how to incorporate information from multiple layers to detect meaningful communities. In the current literature, there is no consensus on the definition of communities, and how to incorporate data from different layers is still an open problem. This thesis aims to answer these questions by extending quality functions defined for single-layer graphs to multilayer networks in a principled way. Graph learning approaches are developed based on recent advances in GSP, where the graph frequency representation of graph signals is exploited for topology inference. Most of the existing graph learning approaches are limited to cases where the observed data is assumed to be homogeneous and low frequency with respect to a single common graph topology. Thus, we extend graph learning to multilayer network settings. These extensions lead to optimization problems which are solved by efficient algorithms.

In Chapter 2, we tackle the community detection problem in dynamic networks. Specifically, we focus on evolutionary spectral clustering, which extends spectral clustering to dynamic networks by incorporating information from past time points to improve community detection at a given time point. In order to answer the question of how to incorporate the past information, we show the equivalence of evolutionary spectral clustering to a variant of the dynamic stochastic blockmodel. Namely, we first introduce a novel dynamic SBM (MRF-DCSBM), where the evolution of communities over time is modeled with pairwise Markov random fields. We then show that the log-posterior of the proposed model is equivalent to the quality function of evolutionary spectral clustering. This equivalence is used to determine the forgetting factor in evolutionary spectral clustering and to develop two new algorithms for dynamic community detection. The proposed algorithms are applied to both simulated and real-world dynamic networks, and their performances are compared to state-of-the-art dynamic community detection methods.

Chapter 3 introduces a multilayer community detection method, which is especially tailored to handle multilayer brain networks constructed from EEG data. In particular, we first construct functional multilayer networks from EEG data, where layers correspond to different frequency bands and interlayer edges are allowed between all brain regions. Next, a new multilayer modularity metric is defined based on a multilayer null model that preserves the layer-wise node degrees while randomizing the remaining characteristics of the network. The proposed modularity is parameterized with a resolution parameter, to handle the resolution limit of modularity, and an interlayer scale parameter, to control the importance of interlayer edges in community formation. Third, a group community detection method is proposed to find the common community structure for a set of subjects. The proposed multilayer community detection method is employed to identify group-level differences between the two response types during the Flanker task, i.e. error and correct.

In Chapter 4, we present an algorithm to learn signed graphs, which we represent as a two-layer multiplex network where one layer corresponds to positive edges and the other to negative edges. The algorithm is based on graph learning approaches developed using graph signal processing. Existing graph learning methods rely on the smoothness of graph signals over the graph; however, they are only capable of learning unsigned graphs. To this end, we propose a signed graph learning approach that learns signed graphs based on the assumption of smoothness and non-smoothness of graph signals over positive and negative edges, respectively. The proposed method is further extended with kernels to take the nonlinear relations between nodes into account. From a GSP perspective, this extension corresponds to assuming smoothness/non-smoothness of graph signals in a higher dimensional space defined by the kernel. The proposed approach is applied to the problem of gene regulatory network inference from single cell gene expression data. Experiments on simulated and real single cell datasets show that the method compares favorably with other single cell gene regulatory network reconstruction algorithms.

Chapter 5 addresses the problem of how to learn multiple signed graphs simultaneously. Existing GSP based GL approaches for this problem are limited to unsigned graph topologies. Therefore, we extend the algorithm developed in Chapter 4 to learn multiple signed graphs. In particular, given multiple datasets, each of which includes graph signals associated with a signed graph, we assume smoothness and non-smoothness of graph signals as in Chapter 4. Furthermore, we assume that the signed graphs are similar to each other, which is ensured by regularizing the learned signed graphs through a learned signed consensus graph.
The proposed method is employed for the joint inference of multiple gene regulatory networks from single cell gene expression data. Experiments on simulated and real single cell datasets show that the method performs better than methods that can learn a single graph at a time and previous joint gene regulatory network reconstruction algorithms.

In Chapter 6, we tackle the problem of learning multiple unsigned graphs from a heterogeneous dataset, which includes graph signals that are clustered, with each cluster associated with a different graph. Namely, we present an optimization problem for joint graph signal clustering and graph topology inference. The approach extends graph cut based clustering by partitioning the graph signals not only based on their pairwise similarities but also on their smoothness with respect to the graphs associated with the clusters. The proposed method also learns the representative graph for each cluster using the smoothness of the graph signals with respect to the graph topology. Results on simulated and real data indicate the effectiveness of the proposed method.

Finally, concluding remarks are presented in Chapter 7, where we summarize the contributions of the thesis and discuss future work that will extend the community detection and graph learning methods presented throughout the thesis.

CHAPTER 2
COMMUNITY DETECTION IN DYNAMIC NETWORKS

2.1 Introduction

An important problem in the study of networks is community detection, where the nodes of a network are partitioned into tightly connected groups [79]. The identification of communities has important applications in recommendation systems [197], social sciences [149] and network neuroscience [232]. Community detection is usually performed by optimizing a quality function that quantifies the goodness of a given partition. Quality functions can be divided into two categories [181, 84]. The first category consists of functions that are based on heuristic definitions of what constitutes a good community, such as modularity [171], normalized cut (spectral clustering) [225, 249] and InfoMap [204]. The second category relies on statistical network modeling [87], such as stochastic blockmodels (SBM) [231], the degree-corrected SBM (DCSBM) [113] or latent space models [95]. In this category, the network is assumed to be generated from a statistical network model, and the communities are detected by maximizing the likelihood.

Since different quality functions define what constitutes a good community differently, community structures detected by different algorithms may vary from each other. Furthermore, as shown in [84], no single method can provide the correct community structure for all types of real-world networks. Understanding why a particular method fails in some networks is important both for finding ways to improve existing algorithms and for deciding which method is more suitable for a given network. To this end, there has been an interest in quantifying the relationship between different quality functions to better understand why a particular quality function fails for certain networks. Moreover, this relationship can provide a way to select hyperparameters, e.g. the resolution parameter in the definition of modularity, in a principled way. Recently, Newman et al. [167] have shown that maximizing modularity is equivalent to maximizing the likelihood function of the DCSBM under the planted-partition model. A similar result is also shown for spectral clustering in [169].
These results reveal that modularity and spectral clustering assume communities to be statistically similar, i.e. the communities have similar size and edge density. The accuracy of community detection deteriorates if this assumption does not hold. This equivalence is also used for determining the resolution parameter of modularity using DCSBM parameters. All these works consider only single-layer networks, and their analysis is not applicable to multilayer networks. Considering the ubiquity of multilayer networks in real life, there is a need to extend this analysis to multilayer networks.

One of the important types of multilayer networks is dynamic networks. Community detection methods developed for single-layer networks are not directly applicable to dynamic networks, since the aim of the latter is not only to partition the nodes at each time point but also to track the changes in the partitions over time [203]. Recently, both heuristic and statistical quality functions have been extended to dynamic networks. For heuristic methods, either evolutionary clustering or multilayer network models have been used. Evolutionary clustering methods rely on the quality functions for single-layer networks, combining a snapshot cost at each time with a temporal cost to obtain community structures that change smoothly over time [46, 78, 132, 126]. The amount of smoothness is controlled by tuning the temporal cost with a forgetting factor, whose value is generally determined through grid search and set to be the same at all time points. In [259], an adaptive forgetting factor is proposed to eliminate the grid search and to obtain a time-varying forgetting factor. Multilayer models [117], on the other hand, extend quality functions for single-layer networks to dynamic networks using the multilayer representation of dynamic networks shown in Figure 1.1. Examples of multilayer models are multislice modularity [153] and temporal normalized cut [228]. These works also require the selection of an interlayer coupling that controls the evolution of community structure. In the second category, statistical models for dynamic networks are proposed by defining a dynamic process describing the evolution of the community structure or the parameters of the statistical model [115, 268, 257, 52]. In [264, 85, 181], SBM and DCSBM are extended to dynamic networks by modeling the evolution of community structure over time as a first-order Markov process. [258] uses a state-space model to characterize the evolution of SBM parameters. In [138], both the parameters and the community structure are allowed to change, and identifiability issues are handled by assuming stable intra-community connectivity over time. Dynamic latent space models have also been developed [213].

Similar to static networks, showing the relationship between different quality functions defined for dynamic community detection can help us refine the methods and understand their shortcomings. Using the relationship between heuristic methods and statistical models, one can also set the different hyperparameters in a more rigorous way. Recently, Pamfil et al. [181] have shown that multislice modularity is equivalent to a dynamic planted-partition DCSBM when the evolution of community structure over time is modeled by a first-order Markov process. This equivalence is used to determine the resolution parameter ($\gamma$) and the interlayer coupling ($\omega$) in multislice modularity through the parameters of the dynamic DCSBM.
Furthermore, this equivalence provides a better understanding of the assumptions and shortcomings of multislice modularity.

In this chapter, we show the equivalence between evolutionary spectral clustering, i.e. preserving cluster membership (PCM), and a novel dynamic DCSBM formulation. This equivalence provides a principled framework for the selection of the forgetting factor in PCM through dynamic DCSBM parameters. Furthermore, the equivalence between these two methods provides an efficient algorithm, i.e. spectral clustering, for the likelihood maximization of the dynamic DCSBM. The contributions of this chapter can be summarized as follows:

• We introduce a new dynamic DCSBM assuming a planted-partition model. Different from previous dynamic DCSBMs, we model the evolution of community structure over time using pairwise Markov Random Fields (MRFs). In the proposed MRF model, the potential functions at the current time depend on the community structure at the previous time point. This new model is referred to as dynamic MRF-DCSBM.

• We show the equivalence between dynamic MRF-DCSBM and evolutionary spectral clustering by deriving a relationship between the log-posterior function of the statistical model and trace maximization in PCM.

• This equivalence is exploited to propose two new dynamic community detection algorithms: online (DSC$_{on}$) and offline (DSC$_{off}$) dynamic spectral clustering.

• The equivalence between dynamic MRF-DCSBM and evolutionary spectral clustering provides a principled way to select the forgetting factor, which determines the amount of tradeoff between accuracy and smoothness in community structure, through the parameters of the dynamic DCSBM. Unlike regular evolutionary spectral clustering, in the proposed algorithms this factor is time dependent and adapts to the changes in community structure.

The remainder of this chapter is organized as follows. In Section 2.2, background on evolutionary spectral clustering, dynamic SBM and MRFs is presented. In Section 2.3, the dynamic MRF-DCSBM is introduced and the equivalence between the log-posterior function and the PCM quality function is derived. The proposed algorithms are derived in Section 2.4. Finally, experimental results and conclusions are given in Sections 2.5 and 2.6, respectively.

2.2 Background

A dynamic network $\mathcal{G}$ is a type of multilayer network where each layer is a single-layer network observed at a time point $t$. $\mathcal{G}$ can be considered as a sequence of single-layer networks, i.e. $\mathcal{G} = \{G^1, \dots, G^T\}$, where $G^t = (V^t, E^t)$ is the network observed at time $t$ with $|V^t| = n^t$ and $|E^t| = m^t$. The adjacency matrices in $\mathcal{G}$ are represented by a sequence $\mathcal{A} = \{\mathbf{A}^1, \dots, \mathbf{A}^T\}$. Similarly, $\mathcal{D} = \{\mathbf{D}^1, \dots, \mathbf{D}^T\}$ is the sequence of degree matrices, where $\mathbf{D}^t$ is the degree matrix of $G^t$. The community structure of a dynamic network $\mathcal{G}$ is the partitioning of its nodes at each time $t$ into $K^t$ communities and is represented by $\mathbf{g} = \{\mathbf{g}^1, \dots, \mathbf{g}^T\}$ and $\mathcal{Z} = \{\mathbf{Z}^1, \dots, \mathbf{Z}^T\}$, where $\mathbf{g}^t$ and $\mathbf{Z}^t$ are defined as in Section 1.2. In the following derivations, we assume $n^t = n$ and $K^t = K$ $\forall t \in \{1, \dots, T\}$; the extension is discussed in Section 2.4.

2.2.1 Evolutionary Spectral Clustering

Evolutionary spectral clustering, i.e. PCM, is developed by extending the association function described in Section 1.2 to dynamic networks. The cost function of PCM is defined as [46]:

$$\mathrm{PCM} := \mathrm{tr}(\mathbf{Z}^{t\top} \mathbf{A}^t \mathbf{Z}^t) + \alpha \, \mathrm{tr}(\mathbf{Z}^{t\top} \mathbf{Z}^{t-1} \mathbf{Z}^{(t-1)\top} \mathbf{Z}^t), \quad (2.1)$$

where $\alpha$ is the forgetting factor and the second term quantifies the difference between the community structures at times $t-1$ and $t$ to ensure the smoothness of community structures across time. In PCM, $\alpha$ is set a priori, empirically. Furthermore, with this formulation, $\alpha$ is time independent, which implies that the smoothness of community structure is the same across time. However, in real-world networks, community structures may vary in a non-stationary manner.
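A minimal sketch of the PCM cost (2.1) for one time step, using hypothetical indicator matrices (the example graph and partitions are illustrative only):

```python
import numpy as np

def indicator(g, K):
    """Binary indicator matrix Z with Z_ir = 1 iff g_i = r."""
    Z = np.zeros((len(g), K))
    Z[np.arange(len(g)), g] = 1.0
    return Z

def pcm_cost(A_t, Z_t, Z_prev, alpha):
    """PCM quality function (2.1): snapshot association plus an
    alpha-weighted agreement with the previous partition."""
    snapshot = np.trace(Z_t.T @ A_t @ Z_t)
    temporal = np.trace(Z_t.T @ Z_prev @ Z_prev.T @ Z_t)
    return snapshot + alpha * temporal

# Hypothetical snapshot: two 3-cliques; the partition is unchanged from t-1.
A_t = np.kron(np.eye(2), np.ones((3, 3))) - np.eye(6)
g_prev = g_t = np.array([0, 0, 0, 1, 1, 1])
print(pcm_cost(A_t, indicator(g_t, 2), indicator(g_prev, 2), alpha=0.5))
```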
2.2.2 Dynamic Stochastic Blockmodeling

Recently, SBM and DCSBM have been extended to dynamic networks by defining a dynamic process to model the evolution of either the community structure or the parameters of the model. In this work, the focus is on models that define a dynamic process on the evolution of community structure [264, 84, 181]. These models can be defined as follows:

Definition 2.1. Dynamic DCSBM is a generative dynamic network model with community structures $\mathbf{g} = \{\mathbf{g}^1, \dots, \mathbf{g}^T\}$ and edge parameter matrices $\vartheta = \{\boldsymbol{\theta}^1, \dots, \boldsymbol{\theta}^T\}$, where

(i) The network at each time $t$ is generated by a DCSBM with connectivity matrix $\boldsymbol{\theta}^t$ and community structure $\mathbf{g}^t$.

(ii) Community assignments follow a first-order Markov process such that:

$$P(\mathbf{g}) = P(\mathbf{g}^T \mid \mathbf{g}^{T-1}) \cdots P(\mathbf{g}^2 \mid \mathbf{g}^1) P(\mathbf{g}^1), \quad (2.2)$$

where the transition probability $P(\mathbf{g}^t \mid \mathbf{g}^{t-1})$ describes the evolution of community structure and $P(\mathbf{g}^1)$ is the prior probability for the first time point.

2.2.3 Pairwise Markov Random Fields

One way to model the joint distribution of a set of random variables is through graphical models, where the nodes of the graph represent random variables and the edges indicate the dependence between them. When the edges are undirected, the corresponding graphical model is a Markov Random Field (MRF) [160]. Let $\mathbf{x} = \{x_1, \dots, x_n\}$ be a set of random variables with joint distribution $P(\mathbf{x})$. An MRF defines $P(\mathbf{x})$ as proportional to the product of non-negative parametric potential functions defined over the maximal cliques of the graph. Instead of maximal cliques, one can also define the potential functions over the edges of a graph [160]. Let $\psi_{ij}(x_i, x_j)$ be the potential function defined over the edges of the MRF graph. The joint distribution is then defined as:

$$P(\mathbf{x}) = \frac{1}{Z} \prod_{i,j : e_{ij} \in E_{MRF}} \psi_{ij}(x_i, x_j), \quad (2.3)$$

where $E_{MRF}$ is the edge set of the MRF graph and $Z$ is the partition function that normalizes the product. This type of MRF is called a pairwise MRF and is widely used because of its simplicity [160]. Pairwise MRFs have been used in community detection [198, 124, 93] by defining the potential functions following the Potts model, where $\psi_{ij}(g_i, g_j) = \exp(J_{ij} \delta_{g_i g_j})$. $J_{ij}$ indicates the belief about nodes $i$ and $j$ being in the same community, i.e. the larger it is, the more likely nodes $i$ and $j$ are in the same community.

2.3 Dynamic MRF-DCSBM and Log-Posterior Formulation

In this section, the dynamic MRF-DCSBM that extends DCSBM to dynamic networks is introduced. Different from previous works, the evolution of community structure is defined using a fully connected pairwise MRF. We then show the equivalence between the log-posterior function of the proposed model and the regularized association function given in (2.1).

2.3.1 Dynamic MRF-DCSBM

Previous works on dynamic DCSBM differ from each other based on how they define the transition probabilities $P(\mathbf{g}^t \mid \mathbf{g}^{t-1})$ in (2.2).
The most popular approach is to define the transition probabilities independently for each node [85, 181], i.e. $P(\mathbf{g}^t \mid \mathbf{g}^{t-1}) = \prod_{i=1}^{n} P(g_i^t \mid g_i^{t-1})$, where

$$P(g_i^t \mid g_i^{t-1}) = p^t \delta_{g_i^t g_i^{t-1}} + \frac{1 - p^t}{K}, \quad (2.4)$$

which assumes that at each time a node either preserves its community membership with copying probability $p^t$ or moves to one of the communities in a uniformly random manner. This model implicitly assumes that there is a one-to-one correspondence between the communities at times $t-1$ and $t$. However, this is hardly the case, as the $r$th community at time $t-1$ is not necessarily the $r$th community at time $t$.

[Figure 2.1: The "label-switching" issue between consecutive time points when the transition probabilities $P(\mathbf{g}^t \mid \mathbf{g}^{t-1})$ are defined as in (2.4).]

To elaborate on this problem, consider Figure 2.1, where the community structure of a dynamic network at two consecutive time points is shown. At time $t$, there are two communities, indicated by blue and red. At time $t+1$, we consider two cases. In case 1, nodes 3 and 6 change their community memberships while the remaining nodes preserve theirs, which means $p^t = 2/3$. In case 2, the community structure is the same as in case 1, even though the community labels are different. In this case, one can say that only two nodes preserve their community memberships, which implies $p^t = 1/3$. Although the community structures in both cases are the same, one obtains two different values of $p^t$ due to label switching. This problem also exists when one uses transition matrices for $P(\mathbf{g}^t \mid \mathbf{g}^{t-1})$, which is discussed in [138] in detail.

To address this problem, in this work $P(\mathbf{g}^t \mid \mathbf{g}^{t-1})$ is modeled with a fully-connected pairwise MRF, where the potential functions are determined based on the community structure at the previous time point. For every node pair, a potential function $\psi_{ij}(g_i^t, g_j^t; \mathbf{g}^{t-1})$ is defined and $P(\mathbf{g}^t \mid \mathbf{g}^{t-1})$ is written as:

$$P(\mathbf{g}^t \mid \mathbf{g}^{t-1}) = \frac{1}{Z} \prod_{i<j}^{n} \psi_{ij}(g_i^t, g_j^t; \mathbf{g}^{t-1}). \quad (2.5)$$

Following previous work that employs pairwise MRFs for community detection, the potential functions are determined by the Potts model, i.e. $\psi_{ij}(g_i^t, g_j^t; \mathbf{g}^{t-1}) = \exp(J_{ij}^t \delta_{g_i^t g_j^t})$, where $J_{ij}^t$ indicates the belief that the two nodes $i$ and $j$ are in the same community at time $t$. We propose to determine $J_{ij}^t$ based on $\mathbf{g}^{t-1}$ as follows:

$$J_{ij}^t = \begin{cases} J_{in}^t, & \text{if } g_i^{t-1} = g_j^{t-1} \\ J_{out}^t, & \text{if } g_i^{t-1} \neq g_j^{t-1} \end{cases}. \quad (2.6)$$

$J_{in}^t$ refers to the belief that two nodes are in the same community at time $t$ given that they were in the same community at time $t-1$. Similarly, $J_{out}^t$ refers to the belief that two nodes are in the same community at time $t$ given that they were in different communities at time $t-1$. In cases where the community structure changes slowly across time, $J_{in}^t$ is expected to be large while $J_{out}^t$ is small. On the other hand, when there is an abrupt change in community structure, $J_{in}^t$ decreases while $J_{out}^t$ increases. Thus, the conditional distribution $P(\mathbf{g}^t \mid \mathbf{g}^{t-1})$ is able to adapt to the dynamics of the community structure across time with a proper selection of $J_{in}^t$ and $J_{out}^t$. The selection of $J_{in}^t$ and $J_{out}^t$ will be discussed in Section 2.4.
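The pairwise beliefs of (2.6) can be assembled into an $n \times n$ matrix directly from the previous partition. The sketch below (illustrative; the values of $J_{in}$ and $J_{out}$ are placeholders) also evaluates the resulting log-potential $\sum_{i<j} J_{ij}^t \delta_{g_i^t g_j^t}$ for a candidate assignment:

```python
import numpy as np

def belief_matrix(g_prev, J_in, J_out):
    """Pairwise couplings of (2.6): J_ij = J_in if nodes i and j shared
    a community at time t-1, and J_out otherwise."""
    same = g_prev[:, None] == g_prev[None, :]
    return np.where(same, J_in, J_out)

g_prev = np.array([0, 0, 1, 1, 1])          # previous partition (hypothetical)
J = belief_matrix(g_prev, J_in=0.8, J_out=0.1)

# Log-potential sum_{i<j} J_ij * delta_{g_i^t g_j^t} for a candidate g_t.
g_t = np.array([0, 0, 1, 1, 0])
same_now = g_t[:, None] == g_t[None, :]
i, j = np.triu_indices(len(g_t), k=1)
print(np.sum(J[i, j] * same_now[i, j]))
```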
2.3.2 Log-posterior Maximization of Dynamic MRF-DCSBM

One can infer the dynamic community structure by fitting the dynamic MRF-DCSBM to an observed dynamic network $\mathcal{G}$. Given the parameters $\vartheta$, $J_{in}^t$, $J_{out}^t$ $\forall t$ and the number of communities $K$, the community structure can be detected by maximizing the posterior probability $P(\mathbf{g} \mid \mathcal{A}; \vartheta, J_{in}, J_{out})$ with respect to $\mathbf{g}$. Based on Definition 2.1 and given the pairwise MRF model for $P(\mathbf{g}^t \mid \mathbf{g}^{t-1})$, the posterior distribution is written as:

$$\begin{aligned} P(\mathbf{g} \mid \mathcal{A}; \vartheta, J_{in}, J_{out}) &= \prod_{t=1}^{T} P(\mathbf{A}^t \mid \mathbf{g}^t, \boldsymbol{\theta}^t) \prod_{t=2}^{T} P(\mathbf{g}^t \mid \mathbf{g}^{t-1}) P(\mathbf{g}^1) \\ &\propto \prod_{t=1}^{T} \prod_{i<j}^{n} \frac{(\lambda_{ij}^t)^{A_{ij}^t} \exp(-\lambda_{ij}^t)}{A_{ij}^t!} \prod_{t=2}^{T} \prod_{i<j}^{n} \exp(J_{ij}^t \delta_{g_i^t g_j^t}), \end{aligned} \quad (2.7)$$

where $\lambda_{ij}^t = d_i^t d_j^t \theta_{g_i^t g_j^t}^t$ and $P(\mathbf{g}^1)$ is dropped in the second line since it is set to be a uniform distribution. Instead of maximizing the posterior probability, one can maximize its logarithm to find the community structure. Let $\mathcal{L}(\mathbf{g})$ be the logarithm of the posterior distribution, which can be written as:

$$\mathcal{L}(\mathbf{g}) \propto \sum_{t=1}^{T} \sum_{i<j}^{n} \left\{ A_{ij}^t \log(\lambda_{ij}^t) - \lambda_{ij}^t \right\} + \sum_{t=2}^{T} \sum_{i<j}^{n} J_{ij}^t \delta_{g_i^t g_j^t}, \quad (2.8)$$

where terms that do not depend on $\mathbf{g}$ are ignored. Assuming a planted partition model, i.e. $\theta_{kl}^t = \theta_{in}^t$ if $k = l$ and $\theta_{kl}^t = \theta_{out}^t$ otherwise, $\lambda_{ij}^t$ and $\log(\lambda_{ij}^t)$ can be written as:

$$\lambda_{ij}^t = d_i^t d_j^t \left\{ (\theta_{in}^t - \theta_{out}^t) \delta_{g_i^t g_j^t} + \theta_{out}^t \right\}, \quad (2.9)$$

$$\log \lambda_{ij}^t = \log(d_i^t d_j^t) + (\log \theta_{in}^t - \log \theta_{out}^t) \delta_{g_i^t g_j^t} + \log \theta_{out}^t. \quad (2.10)$$

Substituting this into the first term of (2.8) and ignoring terms that do not depend on $\mathbf{g}$, the following can be written at each time point $t$ [167]:

$$\sum_{i<j}^{n} \left( A_{ij}^t \log(\lambda_{ij}^t) - \lambda_{ij}^t \right) \propto \sum_{i<j}^{n} (\beta^t A_{ij}^t - \gamma^t d_i^t d_j^t) \delta_{g_i^t g_j^t}, \quad (2.11)$$

where $\beta^t = \log \theta_{in}^t - \log \theta_{out}^t$ and $\gamma^t = \theta_{in}^t - \theta_{out}^t$. It is easy to see that the right-hand side of the above equation is in the form of (1.3), and it can be written in terms of a trace operator. If the $J_{ij}^t$'s are set according to (2.6), then:

$$J_{ij}^t = J_{in}^t \delta_{g_i^{t-1} g_j^{t-1}} + J_{out}^t (1 - \delta_{g_i^{t-1} g_j^{t-1}}). \quad (2.12)$$

Substituting (2.12) and (2.11) into (2.8), the log-posterior can be written as:

$$\mathcal{L}(\mathbf{g}) = \sum_{t=1}^{T} \sum_{i<j}^{n} (\beta^t A_{ij}^t - \gamma^t d_i^t d_j^t) \delta_{g_i^t g_j^t} + \sum_{t=2}^{T} \sum_{i<j}^{n} \left( J_{in}^t \delta_{g_i^{t-1} g_j^{t-1}} + J_{out}^t (1 - \delta_{g_i^{t-1} g_j^{t-1}}) \right) \delta_{g_i^t g_j^t}. \quad (2.13)$$

Theorem 2.1. Given a $K \times K$ matrix of pairwise beliefs $\mathbf{J}^t$, with diagonal entries $J_{in}^t$ and off-diagonal entries $J_{out}^t$, at each time point $t$, and assuming $\mathbf{Z}^{t\top} \mathbf{D}^t \mathbf{Z}^t = \mathbf{I}$ $\forall t \in \{1, \dots, T\}$, the log-posterior of the dynamic MRF-DCSBM can be written as:

$$\mathcal{L}(\mathbf{g}) \propto \sum_{t=1}^{T} \beta^t \mathrm{tr}(\mathbf{Z}^{t\top} \mathbf{A}^t \mathbf{Z}^t) + \sum_{t=2}^{T} \mathrm{tr}(\mathbf{Z}^{t\top} \mathbf{Z}^{t-1} \mathbf{J}^t \mathbf{Z}^{(t-1)\top} \mathbf{Z}^t). \quad (2.14)$$

Proof. To prove the theorem, we need to show that the degree term in (2.13) at any time $t$ is a constant when the constraint $\mathbf{Z}^{t\top} \mathbf{D}^t \mathbf{Z}^t = \mathbf{I}$ is imposed on $\mathbf{Z}^t$. In the following derivations, we ignore the superscript $t$. Let $\kappa_r$ represent the total degree of community $r$. The degree term in (2.13) can then be rewritten as:

$$\gamma \sum_{i<j}^{n} d_i d_j \delta_{g_i g_j} = \frac{\gamma}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} d_i d_j \delta_{g_i g_j} - \frac{\gamma}{2} \sum_{i=1}^{n} d_i^2. \quad (2.15)$$

As the second term does not depend on the community structure, it can be ignored. Since $\delta_{g_i g_j} = \sum_{r=1}^{K} Z_{ir} Z_{jr}$, the first term can be written as:

$$\frac{\gamma}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} d_i d_j \delta_{g_i g_j} = \frac{\gamma}{2} \sum_{r=1}^{K} \sum_{i=1}^{n} \sum_{j=1}^{n} d_i Z_{ir} d_j Z_{jr} = \frac{\gamma}{2} \sum_{r=1}^{K} \kappa_r^2 \quad (2.16)$$

$$= \frac{\gamma}{2} \mathrm{tr}(\mathbf{Z}^\top \mathbf{D} \mathbf{Z} \mathbf{Z}^\top \mathbf{D} \mathbf{Z}), \quad (2.17)$$

where in the last step we used the fact that $\mathbf{Z}^\top \mathbf{D} \mathbf{Z}$ is a $K \times K$ diagonal matrix with entries $(\mathbf{Z}^\top \mathbf{D} \mathbf{Z})_{rr} = \kappa_r$. When the constraint $\mathbf{Z}^\top \mathbf{D} \mathbf{Z} = \mathbf{I}$ is imposed, the right-hand side of (2.16) becomes a constant. □
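A quick numerical check (not part of the thesis) of the identity used in the proof, $\mathrm{tr}(\mathbf{Z}^\top \mathbf{D} \mathbf{Z} \mathbf{Z}^\top \mathbf{D} \mathbf{Z}) = \sum_r \kappa_r^2$, on a random graph with random labels:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 12, 3
g = rng.integers(0, K, size=n)                   # random community labels
Z = np.zeros((n, K)); Z[np.arange(n), g] = 1.0   # indicator matrix
A = np.triu(rng.integers(0, 2, size=(n, n)), 1); A = A + A.T
D = np.diag(A.sum(axis=1))

# kappa_r: total degree of community r, i.e. the diagonal of Z^T D Z.
kappa = np.array([D.diagonal()[g == r].sum() for r in range(K)])
lhs = np.trace(Z.T @ D @ Z @ Z.T @ D @ Z)
assert np.isclose(lhs, np.sum(kappa ** 2))       # identity holds numerically
```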
The final form of the log-posterior in (2.14) is equivalent to the PCM formulation in (2.1) at each time point $t$. The only difference between the two expressions is that, in this case, the forgetting factors are not arbitrary and are determined by $\beta^t$ and $\mathbf{J}^t$, which can be calculated through the parameters of the dynamic DCSBM, as will be shown in the next section. Moreover, the forgetting factors are time-varying, allowing the algorithm to adapt to changes in the community structure. Thus, the selection of the optimal forgetting factor is circumvented through this equivalence. Finally, the effect of $\beta^t$ and $\mathbf{J}^t$ in the quality function (2.14) can be explained as follows. Since $\beta^t$ is equal to $\log(\theta_{in}^t/\theta_{out}^t)$, as $\theta_{in}^t/\theta_{out}^t$ increases, $\beta^t$ gets larger. A large $\theta_{in}^t/\theta_{out}^t$ implies that the communities are well separated from each other; thus, it is desirable to emphasize the first term in (2.14), which is achieved by a large $\beta^t$. On the other hand, when $\theta_{in}^t/\theta_{out}^t$ is small, $\beta^t$ will be small, as $\mathbf{A}^t$ has a less clear community structure. These observations are in line with [181], where the equivalence between multislice modularity and dynamic DCSBM is shown. Similarly, $J_{in}^t$ and $J_{out}^t$ determine the importance of the second term. For example, if most of the nodes preserve their community memberships from time $t-1$ to $t$, $J_{in}^t$ needs to be large, while $J_{out}^t$ needs to be small. However, if there is a large variation in the community structure from time $t-1$ to $t$, $J_{in}^t$ needs to be small to reduce the weight of the second term. In Section 2.4, we describe a methodology to select $J_{in}^t$ and $J_{out}^t$ so that the second term in (2.14) adapts to the evolution of the community structure.

2.4 Dynamic Spectral Clustering

The equivalence shown in (2.14) allows for the development of two spectral clustering type algorithms: online and offline dynamic spectral clustering. In online learning, the communities at each time point are determined given the community structure at the previous time point. This approach is applicable to real-time streaming networks. On the other hand, offline learning identifies the community structure at each time point given the community structures at the previous and next time points, and it is applicable when network data is available for all times.

2.4.1 Algorithms

Online Learning (DSC$_{on}$): In real-time applications, one has network data only up to time point $t$. Given the community structures up to time $t-1$, $\mathbf{g}^t$ can be found by maximizing $\mathcal{L}(\mathbf{g})$ with respect to $\mathbf{Z}^t$, considering only the terms that depend on $\mathbf{Z}^t$.
The corresponding optimization problem is:

$$\max_{\mathbf{Z}^t}\ \mathrm{tr}\big(\mathbf{Z}^{t\top}(\beta^t\mathbf{A}^t + \mathbf{Z}^{t-1}\mathbf{J}^t\mathbf{Z}^{(t-1)\top})\mathbf{Z}^t\big) \quad \text{subject to } \mathbf{Z}^{t\top}\mathbf{D}^t\mathbf{Z}^t = \mathbf{I}, \quad (2.18)$$

where the modified adjacency matrix at each time point can be defined as $\mathbf{A}_{on}^t = \beta^t\mathbf{A}^t + \mathbf{Z}^{t-1}\mathbf{J}^t\mathbf{Z}^{(t-1)\top}$, with the initialization $\mathbf{A}_{on}^1 = \mathbf{A}^1$. The $\mathbf{Z}^t$ that maximizes (2.18) is the matrix whose columns are the $K$ eigenvectors that correspond to the largest $K$ eigenvalues of $(\mathbf{D}^t)^{-0.5}\mathbf{A}_{on}^t(\mathbf{D}^t)^{-0.5}$. $\mathbf{g}^t$ is then found by applying k-means to the rows of this $\mathbf{Z}^t$. Pseudocode for this algorithm is given in Algorithm 2.1.

Offline Learning ($DSC_{off}$) In some applications, e.g., dynamic social networks, one might have access to network data for all time points. In this case, both past and future data can be used to identify the communities at time $t$. Given the community structures at times $t-1$ and $t+1$, $\mathbf{g}^t$ can be found by maximizing $\mathcal{L}(\mathbf{g})$ with respect to $\mathbf{Z}^t$:

$$\max_{\mathbf{Z}^t}\ \beta^t\,\mathrm{tr}\big(\mathbf{Z}^{t\top}\mathbf{A}^t\mathbf{Z}^t\big) + \mathrm{tr}\big(\mathbf{Z}^{t\top}\mathbf{Z}^{t-1}\mathbf{J}^t\mathbf{Z}^{(t-1)\top}\mathbf{Z}^t\big) + \mathrm{tr}\big(\mathbf{Z}^{(t+1)\top}\mathbf{Z}^{t}\mathbf{J}^{t+1}\mathbf{Z}^{t\top}\mathbf{Z}^{t+1}\big) \quad \text{subject to } \mathbf{Z}^{t\top}\mathbf{D}^t\mathbf{Z}^t = \mathbf{I}, \quad (2.19)$$

where the last term can be rewritten as:

$$\mathrm{tr}\big(\mathbf{Z}^{(t+1)\top}\mathbf{Z}^{t}\mathbf{J}^{t+1}\mathbf{Z}^{t\top}\mathbf{Z}^{t+1}\big) = \sum_{i<j}^{n} \big(J_{in}^{t+1}\,\delta_{g_i^t g_j^t} + J_{out}^{t+1}(1 - \delta_{g_i^t g_j^t})\big)\,\delta_{g_i^{t+1} g_j^{t+1}}$$
$$= \sum_{i<j}^{n} \Big[ \big(J_{in}^{t+1} - J_{out}^{t+1}\big)\,\delta_{g_i^{t+1} g_j^{t+1}}\,\delta_{g_i^t g_j^t} + J_{out}^{t+1}\,\delta_{g_i^{t+1} g_j^{t+1}} \Big]$$
$$= \big(J_{in}^{t+1} - J_{out}^{t+1}\big)\,\mathrm{tr}\big(\mathbf{Z}^{t\top}\mathbf{Z}^{t+1}\mathbf{Z}^{(t+1)\top}\mathbf{Z}^{t}\big) + \sum_{i<j}^{n} J_{out}^{t+1}\,\delta_{g_i^{t+1} g_j^{t+1}}.$$

When this term is used in the maximization problem (2.19) for time $t$, the second term in the last line can be ignored since it does not depend on $\mathbf{Z}^t$. Thus, we can replace the last term in (2.19) with $(J_{in}^{t+1} - J_{out}^{t+1})\,\mathrm{tr}(\mathbf{Z}^{t\top}\mathbf{Z}^{t+1}\mathbf{Z}^{(t+1)\top}\mathbf{Z}^{t})$. With this change, $\mathbf{g}^t$ can be found by applying k-means to the $K$ eigenvectors that correspond to the $K$ largest eigenvalues of the matrix $(\mathbf{D}^t)^{-0.5}\mathbf{A}_{off}^t(\mathbf{D}^t)^{-0.5}$, where $\mathbf{A}_{off}^t$ is:

$$\mathbf{A}_{off}^t = \beta^t\mathbf{A}^t + \mathbf{Z}^{t-1}\mathbf{J}^t\mathbf{Z}^{(t-1)\top} + \big(J_{in}^{t+1} - J_{out}^{t+1}\big)\,\mathbf{Z}^{t+1}\mathbf{Z}^{(t+1)\top}. \quad (2.20)$$

For $t = 1$, the second term is excluded when calculating $\mathbf{A}_{off}^1$, and for $t = T$, the last term is excluded. Pseudocode for this algorithm is given in Algorithm 2.2.

Algorithm 2.2 Offline Dynamic Spectral Clustering
Input: A: adjacency matrices of the dynamic network; 𝒦: set of candidate values for the number of communities; MaxIter: number of iterations for parameter estimation
1: function Offline(A, 𝒦, MaxIter)
2:   g, ϑ, {J_in^2, …, J_in^T}, {J_out^2, …, J_out^T} ← Online(A, 𝒦, 1)   ⊲ Initialization
3:   for i ← 1, …, MaxIter do
4:     for t ← 1, …, T do
5:       Construct A_off^t as in (2.20)
6:       g^t ← SpecClus((D^t)^{-0.5} A_off^t (D^t)^{-0.5}, 𝒦)
7:       Estimate θ_in^t, θ_out^t, J_in^t and J_out^t as in (2.21), (2.22), (2.27), and (2.28)
8:     end for
9:   end for
10:   return g, ϑ, {J_in^2, …, J_in^T}, {J_out^2, …, J_out^T}
11: end function
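To make the construction of the modified adjacency matrices concrete, the following sketch builds $\mathbf{A}_{on}^t$ per (2.18) or $\mathbf{A}_{off}^t$ per (2.20) and performs the normalized spectral step; it reuses spec_clus from the previous sketch, and all names are illustrative assumptions rather than the reference implementation:

import numpy as np

def onehot(labels, K):
    """n-vector of community labels -> (n, K) membership matrix Z."""
    return np.eye(K)[labels]

def modified_adjacency(A_t, beta_t, Z_prev=None, J_t=None,
                       Z_next=None, J_in_next=None, J_out_next=None):
    """A_on^t when Z_next is None; A_off^t otherwise, per (2.18) and (2.20)."""
    M = beta_t * A_t
    if Z_prev is not None:                       # temporal term from t-1
        M = M + Z_prev @ J_t @ Z_prev.T
    if Z_next is not None:                       # temporal term from t+1 (offline only)
        M = M + (J_in_next - J_out_next) * (Z_next @ Z_next.T)
    return M

def spectral_step(A_mod, degrees, candidate_K, quality_fn):
    """Apply the D^{-1/2} A D^{-1/2} normalization and cluster with SpecClus."""
    d_isqrt = 1.0 / np.sqrt(degrees)
    A_norm = d_isqrt[:, None] * A_mod * d_isqrt[None, :]
    return spec_clus(A_norm, candidate_K, quality_fn)   # from the earlier sketch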
2.4.2 Parameter Estimation

The proposed algorithms are derived based on the assumption that the parameters of the dynamic DCSBM are known. However, in practice, one needs to estimate the parameters of the model while inferring the community structure. Based on previous works [167, 181, 192], we use an iterative scheme to estimate the intra- and inter-community connectivity parameters, $\theta_{in}^t$ and $\theta_{out}^t$, and the pairwise MRF parameters, $J_{in}^t$ and $J_{out}^t$, for all time points.

At each time point, first, $\theta_{in}^t$ and $\theta_{out}^t$ are initialized such that $\beta^t = 1$, and $J_{in}^t$ and $J_{out}^t$ are respectively set to $s_{in}^t$ and $s_{out}^t$ defined below. Next, the community structure is found with these values. Then, the intra- and inter-community connectivity parameters, $\theta_{in}^t$ and $\theta_{out}^t$, are estimated by maximum likelihood using the detected communities as follows [167]:

$$\theta_{in}^t = \frac{\sum_{i<j}^{n} A_{ij}^t\,\delta_{g_i^t g_j^t}}{\sum_{i<j}^{n} d_i^t d_j^t\,\delta_{g_i^t g_j^t}}, \quad (2.21)$$

$$\theta_{out}^t = \frac{\sum_{i<j}^{n} A_{ij}^t\,(1 - \delta_{g_i^t g_j^t})}{\sum_{i<j}^{n} d_i^t d_j^t\,(1 - \delta_{g_i^t g_j^t})}. \quad (2.22)$$

Similarly, $J_{in}^t$ and $J_{out}^t$ can be estimated using any of the approaches developed to learn the parameters of an MRF. However, the estimation of $J_{in}^t$ and $J_{out}^t$ depends on calculating the partition function of the MRF, which is not an easy task [160]. Instead of relying on these methods, we propose a procedure using the following statistics of the network and community structure to estimate $J_{in}^t$ and $J_{out}^t$:

$$p_{in}^t = \frac{\sum_{i<j}^{n} \delta_{g_i^{t-1} g_j^{t-1}}\,\delta_{g_i^t g_j^t}}{\sum_{i<j}^{n} \delta_{g_i^{t-1} g_j^{t-1}}}, \quad (2.23)$$

$$p_{out}^t = \frac{\sum_{i<j}^{n} \delta_{g_i^t g_j^t}\,\big(1 - \delta_{g_i^{t-1} g_j^{t-1}}\big)}{\sum_{i<j}^{n} \big(1 - \delta_{g_i^{t-1} g_j^{t-1}}\big)}, \quad (2.24)$$

$$s_{in}^t = \frac{\sum_{i<j}^{n} A_{ij}^t\,\delta_{g_i^{t-1} g_j^{t-1}}}{\sum_{i<j}^{n} \delta_{g_i^{t-1} g_j^{t-1}}}, \quad (2.25)$$

$$s_{out}^t = \frac{\sum_{i<j}^{n} A_{ij}^t\,\big(1 - \delta_{g_i^{t-1} g_j^{t-1}}\big)}{\sum_{i<j}^{n} \big(1 - \delta_{g_i^{t-1} g_j^{t-1}}\big)}, \quad (2.26)$$

where $p_{in}^t$ is the probability of node pairs remaining in the same community from time $t-1$ to $t$, and $p_{out}^t$ is the probability of node pairs moving into the same community at time $t$ when they are not in the same community at time $t-1$. $s_{in}^t$ and $s_{out}^t$ quantify the intra- and inter-community sparsity levels of $\mathbf{A}^t$ with respect to the community structure at the previous time point. For any node pair, $J_{in}^t$ and $J_{out}^t$ correspond to the belief that the pair is in the same community at time $t$ given the community structure at $t-1$. Since $p_{in}^t$ and $p_{out}^t$ quantify the ratios of node pairs that stay in or move into the same community from time $t-1$ to $t$, they can be used as indicators of this belief. Moreover, as the pairwise MRF is fully connected and real-world networks are generally sparse, the sparsity terms are used to ensure that the two terms in (2.14) are in the same range. Thus, $J_{in}^t$ and $J_{out}^t$ are defined as follows:

$$J_{in}^t = p_{in}^t\,s_{in}^t, \quad (2.27)$$

$$J_{out}^t = p_{out}^t\,s_{out}^t. \quad (2.28)$$

The community structure is then updated using the estimated values of $\theta_{in}^t$, $\theta_{out}^t$, $J_{in}^t$ and $J_{out}^t$. This process is iterated until convergence, i.e., until either the community structure or the parameters no longer change, or until the maximum number of iterations is reached.
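The estimators (2.21)-(2.28) reduce to a few masked sums over node pairs. A minimal numpy sketch (illustrative names, unweighted or weighted adjacency) is given below:

import numpy as np

def estimate_parameters(A_t, d_t, g_t, g_prev):
    """Estimate theta_in/theta_out per (2.21)-(2.22) and J_in/J_out per (2.23)-(2.28).

    A_t    : (n, n) adjacency matrix at time t
    d_t    : (n,) node degrees at time t
    g_t    : (n,) community labels at time t
    g_prev : (n,) community labels at time t-1
    """
    iu = np.triu_indices(len(g_t), k=1)                   # node pairs with i < j
    same_t = (g_t[:, None] == g_t[None, :])[iu]           # delta_{g_i^t g_j^t}
    same_prev = (g_prev[:, None] == g_prev[None, :])[iu]  # delta_{g_i^{t-1} g_j^{t-1}}
    a = A_t[iu]
    dd = (d_t[:, None] * d_t[None, :])[iu]

    theta_in = a[same_t].sum() / dd[same_t].sum()         # (2.21)
    theta_out = a[~same_t].sum() / dd[~same_t].sum()      # (2.22)

    p_in = (same_prev & same_t).sum() / same_prev.sum()   # (2.23)
    p_out = (~same_prev & same_t).sum() / (~same_prev).sum()  # (2.24)
    s_in = a[same_prev].sum() / same_prev.sum()           # (2.25)
    s_out = a[~same_prev].sum() / (~same_prev).sum()      # (2.26)

    return theta_in, theta_out, p_in * s_in, p_out * s_out  # (2.27), (2.28)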
2.4.3 Number of Communities

Determining the number of communities is an important part of the community detection problem. Different methods, such as the Bayesian Information Criterion (BIC) [209], Integrated Completed Likelihood (ICL) [58] and Minimum Description Length (MDL) [185], have been proposed in the literature. In these approaches, first a range of possible numbers of communities, $\mathcal{K}$, is defined. Next, the number of communities is set as the value that optimizes the given criterion. In this work, we use a quality function based on the linear combination of asymptotic surprise [244] and modularity with the configuration null model [171]. Asymptotic surprise is a heuristic quality function for community detection and is defined as follows:

$$Q_{as} = m\,D_{KL}\!\left( \frac{m_{in}}{m} \,\Big\|\, \frac{M_{in}}{M} \right), \quad (2.29)$$

where $D_{KL}$ is the Kullback-Leibler divergence, $m_{in}$ is the number of intra-community edges, $m$ is the total number of edges, $M_{in}$ is the number of possible intra-community edges and $M$ is the total number of possible edges. Modularity with the configuration null model, on the other hand, compares the number of intra-community edges in an observed network to the number of intra-community edges expected under a configuration null model and is defined as:

$$Q_{cn} = \sum_{i,j}^{n} \left( A_{ij} - \gamma\,\frac{d_i d_j}{2m} \right)\delta_{g_i g_j}. \quad (2.30)$$

Asymptotic surprise has been previously used as a model selection approach [244, 221]; however, it is known that it can overestimate the number of communities. On the other hand, modularity can underestimate the number of communities due to its resolution limit [80]. Therefore, we propose to maximize a linear combination of both quality functions to determine the number of communities as:

$$K^* = \underset{K \in \mathcal{K}}{\mathrm{argmax}}\ Q_{mas} := Q_{as} + Q_{cn}. \quad (2.31)$$
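For an unweighted, undirected graph, the criterion (2.31) can be evaluated as in the following minimal sketch, where the divergence in (2.29) is taken between the Bernoulli rates $m_{in}/m$ and $M_{in}/M$; the function names are illustrative:

import numpy as np

def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    return (q * np.log((q + eps) / (p + eps))
            + (1 - q) * np.log((1 - q + eps) / (1 - p + eps)))

def q_mas(A, labels, gamma=1.0):
    """Q_mas = Q_as + Q_cn per (2.29)-(2.31); A unweighted, labels an int array."""
    n = A.shape[0]
    d = A.sum(axis=1)
    m = d.sum() / 2                                   # total number of edges
    same = labels[:, None] == labels[None, :]

    m_in = np.triu(A * same, k=1).sum()               # intra-community edges
    sizes = np.bincount(labels)
    M_in = (sizes * (sizes - 1) / 2).sum()            # possible intra-community pairs
    M = n * (n - 1) / 2                               # all possible pairs
    q_as = m * binary_kl(m_in / m, M_in / M)          # (2.29)

    q_cn = ((A - gamma * np.outer(d, d) / (2 * m)) * same).sum()  # (2.30)
    return q_as + q_cn

Passing q_mas (with A fixed via a closure or functools.partial) as the quality_fn of the earlier spec_clus sketch implements the selection rule (2.31).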
2.4.4 Extensions

In the previous sections, the numbers of nodes and communities are assumed to be the same at all time points. However, in many real-world dynamic networks, both the number of nodes and the number of communities may change over time. The dynamic MRF-DCSBM can handle changes in the number of nodes in the following manner. Let $j$ be a new node added to the network at time $t$. Since we do not have any information about the community of this node, we set $J_{ij}^t = 0$ for all $i \neq j$ when defining the transition distribution. In the case a node is removed from the network, all information about the removed node is discarded from the transition distribution. These updates to the transition distribution change the construction of $\mathbf{A}_{on}^t$ and $\mathbf{A}_{off}^t$ as follows. When a new node joins the network at time $t$, we add an all-zero row and column to $\mathbf{Z}^{t-1}\mathbf{J}^t\mathbf{Z}^{(t-1)\top}$ corresponding to the new node. When a node leaves the network, the row and column corresponding to that node are removed from $\mathbf{Z}^{t-1}\mathbf{J}^t\mathbf{Z}^{(t-1)\top}$. Similar changes are applied to $(J_{in}^{t+1} - J_{out}^{t+1})\mathbf{Z}^{t+1}\mathbf{Z}^{(t+1)\top}$ when the number of nodes differs between times $t$ and $t+1$. Finally, changes in the number of communities do not affect the proposed model, as our transition distribution is defined based on whether two nodes are in the same community or not, and not on the actual community labels.

2.4.5 Computational Complexity

The computational complexity of the proposed algorithms is governed by the cost of the eigendecomposition, which is $O(n^3)$, where $n$ is the number of nodes. If $I$ is the maximum number of iterations for parameter estimation, the total computational complexity of both algorithms is $O(TIn^3)$, since the eigendecomposition needs to be computed $I$ times at each time point. However, in practice $T$ and $I$ are small compared to $n$; thus, the computational complexity of both algorithms is approximately $O(n^3)$.

2.5 Results

The proposed algorithms¹ are compared to state-of-the-art dynamic community detection methods on both simulated and real networks. We consider dynamic community detection methods developed using heuristically defined quality functions, such as evolutionary spectral clustering (PCM) [46], multislice modularity based generalized Louvain (GL²) [181], PisCES³ [126] and DYNMOGA⁴ [78]. PisCES extends spectral clustering to dynamic networks by smoothing the eigenvectors of the adjacency matrices over time and applying k-means to the smoothed eigenvectors. DYNMOGA [78] uses genetic algorithms to solve a multi-objective optimization problem, whose objective function includes modularity as the snapshot cost and Normalized Mutual Information (NMI) [57] as the temporal cost. We also compare to dynamic SBM based methods, namely DSBM_Xu⁵ [258] and DSBM_Yang⁶ [264]. Among these methods, GL and PisCES learn communities in an offline manner, while the rest are online.

Parameters for the different algorithms are set as follows. For the proposed algorithms, MaxIter is set to 20. For PCM, the forgetting factor is selected from the set {0.1, 0.15, 0.2, 0.25, 0.35, 0.4} as the one that maximizes the normalized association; this range is selected based on prior empirical evidence. The implementation of GL follows [181] and learns the resolution ($\gamma$) and interlayer coupling ($\omega$) parameters using the equivalence between multislice modularity and a variant of the dynamic DCSBM. Furthermore, we use a multi-iteration version of GL, where GL is run until there is no improvement in multislice modularity and each run is initialized with the community structure detected in the previous run. The forgetting factor in PisCES is determined via cross-validation over the set {0.05, 0.1, 0.15, 0.2}, as recommended in [126]. The parameters of the genetic algorithm in DYNMOGA are set as crossover rate = 0.8, mutation rate = 0.2, population size = 200 and number of generations = 200. The parameters of DSBM_Xu and DSBM_Yang are set as recommended in the corresponding papers. Finally, the number of communities for the proposed algorithms and PCM is selected as described in Section 2.4.3. GL, PisCES and DYNMOGA have their own procedures for selecting the number of communities. DSBM_Xu and DSBM_Yang do not describe a procedure for selecting the number of communities; thus, for simulated networks, we give them the true number of communities as an input.

¹ Implementations of both algorithms can be found at https://github.com/abdkarr/DynamicSpectralClustering
² https://github.com/roxpamfil/IterModMax
³ https://www.andrew.cmu.edu/user/davidch/
⁴ http://staff.icar.cnr.it/pizzuti/codes.html
⁵ https://github.com/IdeasLabUT/Dynamic-Stochastic-Block-Model
⁶ https://homepage.cs.uiowa.edu/~tyng/publications.html
⁷ https://github.com/MultilayerGM/MultilayerGM-MATLAB

2.5.1 Simulated Networks

[Figure 2.2: Simulation 1: Average NMI values for the different methods as a function of time. The mixing coefficient is set to (a) 0.4, (b) 0.5 and (c) 0.6.]

Simulated networks are generated following the benchmark model described in [20]. This model will be referred to as the multilayer generative model (MLGM)⁷. This benchmark generates a dynamic network using the dynamic DCSBM described in Definition 2.1. Transition probabilities are set similarly to (2.4), with the only difference being that the probability of nodes moving to other communities is determined by a categorical distribution instead of a uniform distribution. The probabilities of the categorical distribution are drawn from a Dirichlet distribution with parameter $\nu$. Node degrees are drawn from a truncated power law distribution with exponent $k$, minimum degree $d_{min}$ and maximum degree $d_{max}$ to obtain heterogeneous degree distributions. The parameters of the benchmark model are the number of communities $K$, the copying probability $p^t$ and the mixing coefficient $\mu^t$, which indicates the ratio of edges that are set as inter-community edges. Finally, $q_1 \in [0, 1]$ and $q_2 \in [0, \infty)$ control the death and birth rates of communities, respectively. At any time point, each community may disappear with probability $q_1$.
The number of emerging communities is determined by a Poisson distribution with rate $q_2$. The performance of the different algorithms is quantified by Normalized Mutual Information (NMI) [57].

Simulation 1 In this simulation, we evaluate the effect of the mixing coefficient on the performance of the different methods. A dynamic network with $T = 15$ and 128 nodes at each time point is generated with MLGM. Nodes are divided into $K = 4$ communities with $q_1 = 0$ and $q_2 = 0$, so that there are 4 communities at all time points. The copying probability is $p^t = 0.9$ $\forall t$, with $\nu$ set to 100. The parameters of the power-law distribution are set as $k = -2.5$, $d_{min} = 8$ and $d_{max} = 16$. The aforementioned methods are applied to 100 realizations of networks generated using these settings. The number of communities is selected from $\mathcal{K} = \{2, \dots, 10\}$ as the value that maximizes the proposed quality function, $Q_{mas}$. Average NMI as a function of time is reported in Figure 2.2 for three different values of the mixing coefficient, i.e., {0.4, 0.5, 0.6}. It can be seen that the offline learning methods, $DSC_{off}$, GL and PisCES, have higher accuracy than the online approaches for $\mu^t = 0.4$ and $\mu^t = 0.5$, since they use both past and future networks to detect communities at any time point. Among the offline methods, GL performs the best for $\mu^t = 0.4$. $DSC_{off}$ and GL have similar performances for $\mu^t = 0.5$ and perform better than PisCES. For $\mu^t = 0.6$, GL's accuracy drops significantly, while the proposed offline algorithm achieves the best performance. Among the online methods, $DSC_{on}$ shows similar or better performance compared to the others. Methods based on dynamic SBM variants cannot detect the communities accurately except for a low mixing coefficient. The reason for this loss in accuracy is that both DSBM_Xu and DSBM_Yang require the selection of a set of hyperparameters, and their performance depends on the correct estimation of these parameters. The proposed methods, on the other hand, are hyperparameter-free.

Simulation 2 In the previous simulation, the copying probability is constant across time. However, in real-world dynamic networks, the community structure can evolve at different rates across time. To generate such dynamic networks, we set $\mu^t = 0.5$ and consider three different cases with varying $p^t$, as given in Table 2.1. Communities are detected with $\mathcal{K} = \{2, \dots, 10\}$ and the results are shown in Figure 2.3. For Case 1, $DSC_{off}$ and GL have similar NMI values and perform slightly better than PisCES. The performances of all methods decrease between $t = 6$ and $t = 10$, where $p^t = 0.75$. This is expected, as a small copying probability implies that the network is more non-stationary. $DSC_{off}$ performs the best among all methods for these time points, which implies that our method is more robust to changes in community membership. For Case 2, the NMI values of all methods are lower than in the first case, as the copying probability is further reduced. Compared to GL, the proposed offline method maintains its effectiveness, while the former's performance drops significantly. It can also be observed that $DSC_{off}$ detects communities better than PisCES. Similar results are observed for Case 3 in Figure 2.3c, where GL performs well only in the time range where $p^t = 0.9$, while $DSC_{off}$ maintains a high NMI value across all time points. Among the online methods, $DSC_{on}$ performs the best in all cases. In summary, the proposed methods are more stable across time and robust against changes in the copying probability over time.
This ability to adapt to changes in the copying probability also indicates that the method for selecting $J_{in}^t$ and $J_{out}^t$ proposed in Section 2.4.2 is effective.

Table 2.1: Copying probability values for Simulation 2

          From t=2 to t=5   From t=6 to t=10   From t=11 to t=15
Case 1         0.90               0.75                0.90
Case 2         0.75               0.65                0.75
Case 3         0.80               0.90                0.70

[Figure 2.3: Simulation 2: Average NMI values for the different methods as a function of time for (a) Case 1, (b) Case 2 and (c) Case 3. The mixing coefficient is set to 0.5.]

Simulation 3 In most real-world dynamic networks, the number of communities may also change over time. In this simulation, we evaluate the performance of the different algorithms under a changing number of communities, using $q_1$ and $q_2$ to control the death and birth rates of communities. A dynamic network with $T = 15$ and 256 nodes is generated using the MLGM benchmark. The number of communities at the first time point is set to $K = 8$. For subsequent time points, the number of communities is determined by $q_1 = 0.1$ and $q_2 = 1$. The remaining parameters are set as $k = -2.5$, $d_{min} = 8$, $d_{max} = 16$, $p^t = 0.9$ and $\mu^t = 0.5$. With these parameters, on average one new community emerges, one community disappears and 18% of the nodes change their communities at each time point. In Figure 2.4, results are shown for the different methods with $\mathcal{K} = \{2, \dots, 20\}$. We omitted the results for DSBM_Xu and DSBM_Yang, as they do not perform well when the number of communities changes across time. Figure 2.4a shows the average NMI as a function of time. The proposed offline algorithm performs the best compared to the other offline and online methods. Among the online methods, $DSC_{on}$ performs slightly better than PCM, and both methods are better than DYNMOGA. Figure 2.4b illustrates the estimated number of communities for each method along with the true number of communities. It can be seen that GL and DYNMOGA overestimate the number of communities, resulting in low NMI values. On the other hand, the numbers of communities estimated by the proposed methods, PCM and PisCES are very close to the true number. These results indicate that the quality function $Q_{mas}$ proposed in Section 2.4.3 is effective at determining the number of communities.

[Figure 2.4: Results for 100 realizations of the network described in Simulation 3: (a) Average NMI as a function of time; (b) Estimated number of communities. The black dashed line is the true number of communities averaged over 100 realizations.]

Scalability Analysis We compare the scalability of the aforementioned methods with an increasing number of nodes. A dynamic network with $T = 10$ is generated using the MLGM benchmark. The number of nodes is set to $2^m$, where $m$ varies from 6 to 12 in increments of 1. The number of communities is set to $2^{m-5}$, such that the average community size remains the same as the number of nodes increases. The remaining parameters are set as $k = -2.5$, $d_{min} = 8$, $d_{max} = 16$, $p^t = 0.9$, $\mu^t = 0.5$, $q_1 = 0$ and $q_2 = 0$. The average run time for community detection across 10 realizations is reported in Figure 2.5.
The number of communities is assumed to be known, so the run time corresponds only to the time required for community detection and forgetting factor estimation. Results for DSBM_Xu and DSBM_Yang are not reported, as we could not obtain their results in a reasonable time for networks with more than 1000 nodes. It can be observed that the proposed methods have lower computational cost than the other methods for the considered network sizes. Although the proposed methods learn the forgetting factor through an iterative scheme, these results indicate that this learning process does not add substantial computational cost compared to methods that require a priori selection of the forgetting factor. This is due to the fact that the learning algorithm converges quickly.

[Figure 2.5: Computational complexity of the methods with respect to the number of nodes; total run time (sec) versus number of nodes.]

2.5.2 Real World Networks

In this section, the proposed algorithms are applied to real-world dynamic networks and their performances are compared to the aforementioned methods. As the number of communities may change over time, results for DSBM_Xu and DSBM_Yang are not reported, since they do not perform well in such cases. For the first dataset, metadata about the nodes is used as the ground truth community structure and performance is evaluated with NMI. For the remaining datasets, we do not have any information about the ground truth community structure; thus, we compare the detected communities using quality functions developed for community detection. We use three such metrics: modularity (see (2.30)) with the resolution parameter set as in [167]; asymptotic surprise (AS); and conductance [79], which quantifies the ratio of inter-community edges to the total degree, with smaller values indicating better community structure. These metrics are computed for the communities detected by each algorithm at each time point and averaged over time. Finally, the parameters of the methods are set as in the simulations, except that the generation number and population size of DYNMOGA are reduced to 50 due to its computational complexity.

Reality Mining This dynamic network is constructed using data from the MIT Reality Mining project [72]. The data was collected from the cell phones of 94 students and staff at MIT over a year. The cell phones were equipped with Bluetooth sensors which recorded nearby Bluetooth devices every 5 minutes. These recordings are used to construct a dynamic network with 46 time points, each of which corresponds to 1 week. The affiliations of the students and staff are available and used as the ground truth community structure as in [258]; namely, there are 2 communities corresponding to people who work in the MIT Media Lab and first-year business school students.

Table 2.2: Mean NMI values for detected communities of reality mining data

        DSC_on         DSC_off        PCM            GL             DYNMOGA        PisCES
NMI     0.677(0.005)   0.652(0.014)   0.637(0.009)   0.724(0.309)   0.517(0.008)   0.466(0.016)

[Figure 2.6: Similarity between the community structures at consecutive time points for reality mining data, with academic calendar events (fall begins/ends, spring begins, spring break, spring ends) marked.]

Due to randomness in the community detection algorithms, communities are found by running each algorithm 100 times. The number of communities is selected from $\mathcal{K} = \{2, \dots, 10\}$.
The average NMI over time and runs, along with the standard deviation across runs, is reported in Table 2.2. The values that are significantly higher than the rest are given in bold, where significance is determined by a t-test at $\alpha = 0.05$. The highest NMI values are obtained by GL and $DSC_{on}$, followed by $DSC_{off}$. Although the mean NMI value of GL is higher than that of $DSC_{on}$, the standard deviation of the former is high, and thus no significant difference is found between the NMI values of the two methods. The similarities between the community structures at consecutive time points, calculated by NMI, are plotted in Figure 2.6 to show how the community structures detected by $DSC_{on}$ and $DSC_{off}$ change over time. The similarity of the communities detected by GL across time is also reported. It can be seen that there are drops in the similarities of $DSC_{on}$ and $DSC_{off}$ between weeks 20 and 25, around week 33 and after week 40. These drops are expected, since these weeks correspond to winter break, spring break and the end of the school year, respectively [144]. These changes are not detected by GL.

Enron Email Data This dataset is a dynamic email communication network between Enron employees constructed by [258] using the Enron corpus [191], which includes 500,000 emails from 1998 to 2002. A snapshot network is generated for each week by connecting two employees with an edge if they communicated through an email. There are $T = 120$ time points and 184 nodes corresponding to the employees. More details about the dynamic network can be found in [258]. The community structure is found using the aforementioned methods over 100 runs, due to the randomness in the methods' outputs. The number of communities is estimated from the range $\mathcal{K} = \{2, 3, \dots, 20\}$. Table 2.3 reports the mean and standard deviation of conductance, modularity and AS. The values that are significantly better than the rest are given in bold, where significance is determined by a t-test at $\alpha = 0.05$. In terms of conductance and modularity, the best values are obtained by $DSC_{on}$. In terms of asymptotic surprise, DYNMOGA, followed by the proposed offline method, outperforms the rest.

Table 2.3: Conductance, Modularity and AS values of detected communities of Enron e-mail data

           Conductance     Modularity     AS
DSC_on     1.060(0.072)    0.726(0.010)   127.04(2.65)
DSC_off    1.311(0.094)    0.697(0.008)   132.80(2.02)
PCM        1.696(0.165)    0.658(0.015)   120.6(3.34)
GL         4.225(0.446)    0.421(0.008)   91.58(2.22)
DYNMOGA    1.160(0.024)    0.635(0.001)   151.88(0.39)
PisCES     4.648(0.060)    0.424(0.006)   63.79(1.79)

As reported in [258], the structure of the network starts to change after week 89, due to the resignation of some of the CEOs and the federal investigation the company fell under. These changes include an increase in the number of edges and in the amount of communication between the company's CEOs and presidents. To see how these changes affect the community structure, the similarity between the community structures at consecutive time points is plotted in Figure 2.7. It is observed that the similarities increase after week 89 for both of the proposed methods. As the amount of communication between employees increases after week 89, the community structure becomes more stable across time due to the increasing connectivity. This finding is in line with the results in [258].
[Figure 2.7: Similarity between the community structures at consecutive time points for Enron e-mail data. A moving average with a window size of 7 is applied to the similarity values to reduce noise.]

Middle School Network The third real-world network we consider is a dynamic social network between students of a middle school in Utah. The data was collected by [243] over two days between 8:25 a.m. and 3:15 p.m. The interactions between students were obtained with proximity sensors that have a time resolution of 20 seconds. 591 7th and 8th graders participated in the study. A school day consists of 7 class periods and 2 lunch periods, and students switch their classrooms between class periods. A dynamic network with $T = 28$ snapshots is generated for each day as in [221]. Each snapshot corresponds to a 15-minute interval, and two students are connected with an edge if they interacted during this interval. Each of the community detection methods is applied to the constructed dynamic networks of each day for 100 runs. The set of candidate numbers of communities is set as $\mathcal{K} = \{10, 11, \dots, 30\}$ for $DSC_{on}$, $DSC_{off}$ and PCM. The mean and standard deviation of conductance, modularity and AS for all methods are reported in Tables 2.4 and 2.5 for the first and second days, respectively. The values that are significantly better than the others are shown in bold, where statistical significance is determined as before. For the first day, $DSC_{off}$, followed by $DSC_{on}$, gives the best results in terms of conductance and modularity. GL performs the best in terms of asymptotic surprise, followed by the proposed methods. For the second day, the proposed methods achieve the best performance in terms of all metrics.

Table 2.4: Conductance, Modularity and AS values of detected communities of Day 1 of middle school data

           Conductance      Modularity     AS
DSC_on     3.799(0.212)     0.697(0.004)   4264.8(30.15)
DSC_off    3.641(0.260)     0.699(0.004)   4319.6(34.81)
PCM        4.105(0.300)     0.689(0.006)   4200.1(45.14)
GL         6.462(0.151)     0.652(0.001)   4486.4(6.32)
DYNMOGA    11.192(0.429)    0.607(0.004)   4061.7(25.00)
PisCES     11.031(0.183)    0.595(0.008)   2750.0(81.17)

Table 2.5: Conductance, Modularity and AS values of detected communities of Day 2 of middle school data

           Conductance      Modularity     AS
DSC_on     4.067(0.226)     0.686(0.004)   4398.8(31.14)
DSC_off    3.962(0.210)     0.688(0.003)   4470.3(30.52)
PCM        4.300(0.266)     0.677(0.006)   4319.3(51.32)
GL         14.555(1.558)    0.232(0.074)   1042.1(107.60)
DYNMOGA    11.255(0.443)    0.611(0.004)   4191.2(26.23)
PisCES     9.054(0.172)     0.646(0.005)   3272.4(69.66)

[Figure 2.8: Similarity between detected communities at consecutive time points on the first and second days of middle school data.]

The community structure of the middle school network changes substantially during a day, as the students switch their classrooms during breaks between class periods. In Figure 2.8, we plot the average NMI between the detected communities at consecutive time points for both days. As can be seen from the figure, the similarity drops every hour or so, corresponding to break times, which indicates the effectiveness of the proposed methods in tracking changes in the community structure.

DBLP Finally, we consider a dynamic co-authorship network generated from the DBLP database by [10] and studied previously in [264, 125]. The dynamic network is generated from the papers published in 28 conferences over 10 years (1997-2006).
A snapshot network is generated for each year by connecting two authors with an edge if they co-authored a paper during that year. There are 958 authors in total. All of the methods are applied to the generated dynamic network for 20 runs. Candidate values for the number of communities are set as $\mathcal{K} = \{60, 61, \dots, 120\}$ for $DSC_{on}$, $DSC_{off}$ and PCM. The means and standard deviations of conductance, modularity and AS are reported in Table 2.6. For each metric, the values that are significantly better than the others are shown in bold, where significance is determined by a t-test at $\alpha = 0.05$.

Table 2.6: Conductance, Modularity and AS values of detected communities of DBLP co-authorship data

           Conductance      Modularity     AS
DSC_on     2.626(0.328)     0.722(0.018)   4288.7(67.44)
DSC_off    2.470(0.309)     0.729(0.014)   4276.6(66.40)
PCM        3.872(0.884)     0.688(0.022)   4222.3(87.85)
GL         2.874(0.127)     0.675(0.003)   4052.2(27.37)
DYNMOGA    3.245(0.286)     0.716(0.003)   4473.4(19.76)
PisCES     72.573(0.551)    0.544(0.007)   3227.4(82.12)

In terms of conductance and modularity, the best performances are obtained by the two proposed methods. In terms of AS, DYNMOGA, followed by the proposed online method, outperforms the rest.

2.6 Conclusions

In this chapter, we investigated the equivalence between statistical inference and heuristic quality function based community detection methods for dynamic networks. In particular, we proposed a new dynamic MRF-DCSBM that captures the evolution of community memberships. We showed that under the planted partition model, the log-posterior of the MRF-DCSBM is equivalent to the objective function of evolutionary spectral clustering, where the weight of temporal smoothness is time dependent and adapts to the community structure. This equivalence resulted in the derivation of two new community detection methods. The proposed methods are shown to be more accurate at detecting the community structure when the network is noisy, and more robust to non-stationarities in the network structure, compared to state-of-the-art methods. The proposed methods are also shown to have superior performance on various real-world networks, where they are able to track changes in community structure over time.

Future work will consider approaches developed for parameter estimation in MRFs for estimating the parameters $J_{in}^t$ and $J_{out}^t$. Moreover, the implementation of the proposed methods can be sped up by faster eigendecomposition techniques and incremental spectral clustering [175].

CHAPTER 3
COMMUNITY DETECTION IN MULTILAYER NETWORKS

3.1 Introduction

Advances in neuroimaging technologies allow the brain to be modeled as a complex network, where the nodes correspond to different brain units and the edges represent structural or functional connections among the units [37]. In order to characterize the topology and dynamics of brain networks, various descriptive and inferential network measures, such as centrality, degree distribution and small-worldness [27, 156, 140, 18, 17], have been utilized with respect to disease, task, learning, cognitive control, attention and memory [37, 33, 17, 140, 48, 22, 239, 218]. Current network models have been mostly limited to examining a single network instance, either of a subject, a frequency band or a task. However, most neurophysiological recordings, such as the electroencephalogram (EEG), allow one to capture brain dynamics across multiple temporal and spatial scales.
Reducing this rich information into a single network disregards the high amount of dependency that exists between the networks of different subjects, frequency bands or tasks. Thus, a principled mathematical framework to accurately study this multiplicity of brain connectivity is needed.

Recently, multilayer networks have gained attention in network neuroscience [59, 61, 240, 24, 155, 247] due to their ability to represent and study multi-dimensional and multi-scale data. Initial work on modeling the multiplicity of brain connectivity primarily employs multiplex networks, where the meaning of a layer can vary depending on context, such as different modalities, subjects or frequency bands. For example, Battiston et al. [19] introduce a two-layer network combining structural and functional modalities using diffusion tensor imaging (DTI) and functional MRI (fMRI), respectively. Another line of work considers multiplex networks where each layer corresponds to a different subject, to investigate intra- and inter-subject variability of brain connectivity [25]. Finally, multiplex networks where each layer corresponds to the connectivity in a different frequency band have been considered to study the connectivity across multiple frequency bands simultaneously [271, 214, 215, 56, 266]. While this line of work reveals important characteristics of the multiplicity of brain connectivity, it restricts the interlayer edges by using multiplex networks. Recently, this restriction on interlayer edges has been removed by modeling brain connectivity using multilayer networks, where interlayer edges are allowed between any pair of brain regions [65, 35, 240, 36]. For example, magnetoencephalography (MEG) [35, 36, 240] and EEG [162] recordings have been used to construct functional multilayer networks, where each layer corresponds to the links within a frequency band and the interlayer edges correspond to the cross-frequency coupling across frequency bands.

[Figure 3.1: Flowchart of the proposed approach for community detection of multi-frequency EEG networks. For each subject, a multilayer network is constructed from the EEG recordings (intralayer edges from time-frequency surface networks, interlayer edges from the comodulogram, both based on the RID-Rihaczek distribution); multiple runs of multilayer modularity maximization, with parameters selected via surrogate networks, yield a co-clustering matrix per subject, from which the group community structure is obtained. The bottom two panels illustrate multilayer network construction (left) and community detection for each subject (right).]

Topological characteristics of multiplex and multilayer brain networks have been analyzed with various graph theoretic tools, such as hub node identification [61], motif analysis [19] and algebraic connectivity [36]. An important tool in the analysis of graphs is community detection [81]. Communities are defined as groups of nodes that are more strongly connected among themselves than they are to the rest of the network. Various community detection methods have been developed and applied to single-layer brain networks to find communities, which often correspond to specialized functional subnetworks of the brain [143, 232]. Although these methods can be applied to multiplex and multilayer graphs, they do not achieve good performance, as they do not take the heterogeneity of connections across layers into account.
Thus, recent work aims to extend community detection methods to these high-dimensional graphs [133, 194, 62, 189, 43]. However, most of these extensions are limited to multiplex networks, with the exception of the following recent work. Pramanik et al. [189] extend the definition of modularity to multilayer networks, and the proposed multilayer modularity metric is maximized using the Girvan-Newman and Louvain algorithms. However, this approach does not take the resolution limit of modularity into account [189], limiting its practical use. Chen et al. [43], on the other hand, extend the definition of the normalized cut to multilayer networks by constructing a block supra-Laplacian matrix and propose a spectral clustering algorithm based on this supra-Laplacian matrix. Although the method is developed for multilayer networks, it does not take the heterogeneity of interlayer edge weights into account.

In this chapter, we aim to characterize the topological organization of multilayer brain networks through multilayer community detection. In order to achieve this goal, we first construct multi-frequency networks from EEG data, where the intralayer and interlayer edges are quantified by previously published time-frequency phase synchrony [12] and phase amplitude coupling [157] measures, respectively. Thus, the constructed network is a multilayer network with interlayer edges allowed between all brain regions. Next, a new multilayer modularity metric is defined based on a multilayer null model that preserves the layer-wise node degrees while randomizing the remaining characteristics of the network. The proposed modularity is parameterized with a resolution parameter, to handle the resolution limit of modularity, and an interlayer scale parameter, to control the importance of interlayer edges in community formation. The optimal values of these parameters are determined using a surrogate data based procedure. Third, a group community detection method is proposed to find the common community structure for a set of subjects. The method uses subjects' co-clustering matrices obtained from multiple runs of modularity maximization; thus, it is able to address the issue of degeneracy in modularity maximization [88]. Finally, the group-level differences between the two response types during the Flanker task, i.e., error and correct, are evaluated from a multi-frequency network perspective. The proposed approach is outlined in Figure 3.1.

3.2 Multi-frequency EEG Networks

3.2.1 EEG Data

The EEG data was acquired during a cognitive control-related error processing task where the subjects performed a letter version of the speeded-reaction Flanker task [150]. The experimental protocol of this study was approved by the Institutional Review Board (IRB) of Michigan State University (IRB: LEGACY13-144), and the data collection was conducted following the regulations approved by this protocol. Prior to data acquisition, all subjects signed an informed and written consent form. The EEG signals were recorded with a BioSemi ActiveTwo system using a cap with 64 Ag-AgCl electrodes placed at standard locations of the International 10-20 system. The sampling rate of the data was 512 Hz. After using standard artifact rejection algorithms [63], volume conduction was minimized using the Current Source Density (CSD) Toolbox [238]. During recording, each subject was presented with a string of five letters at each trial. Letters could be congruent (e.g., SSSSS) or incongruent stimuli (e.g.,
SSTSS), and the subject was instructed to respond to the center letter with a standard mouse. Each trial started with a flanking stimulus (e.g., SS SS) of 35 ms, followed by the target stimulus (e.g., SSSSS/SSTSS) displayed for about 100 ms. The total display time was 135 ms, followed by a 1200 to 1700 ms inter-trial break. These trials capture the Error-Related Negativity (ERN) after an error response and the Correct-Related Negativity (CRN) after a correct response. For each subject, 480 total trials (each 1 second in duration) were recorded, where the number of error trials varied from 20 to 61 across subjects. For a fair comparison between ERN and CRN, the same number of correct trials was selected randomly. As earlier studies suggested a rise in synchronization related to ERN in the 25-75 ms time window [179], all of the analysis in this paper was conducted for the 25-75 ms time period following the response. For each subject and each response type (error and correct), a multilayer network with four layers is constructed, where the layers correspond to the four EEG frequency bands: $\theta$ (4-7 Hz), $\alpha$ (8-12 Hz), $\beta$ (13-30 Hz) and $\gamma$ (31-100 Hz). In this chapter, we consider data from 20 participants.

3.2.2 Construction of Multilayer EEG Networks

Intralayer Edges For a multilayer brain network where each layer corresponds to a different frequency band, the intralayer edges correspond to functional connectivity and can be quantified using measures of correlation, coherence or phase synchrony. Prior work illustrates the superior performance of the reduced interference Rihaczek (RID-Rihaczek) time-frequency distribution based phase synchrony index, i.e., RID-TFPS, in terms of time and frequency resolution and robustness to noise [12, 11]. This complex time-frequency distribution can be utilized to calculate the phase difference $\phi_{u,v}(t, f)$ between two signals $x_u$ and $x_v$ as:

$$\phi_{u,v}(t, f) = \arg\left( \frac{C_u(t, f)\,C_v^*(t, f)}{|C_u(t, f)||C_v(t, f)|} \right), \quad (3.1)$$

where $C_u(t, f)$ and $C_v(t, f)$ are the complex time-frequency distributions of $x_u$ and $x_v$, respectively. The Phase Locking Value (PLV) quantifies the consistency of the phase differences across trials and is computed as follows [120]:

$$\mathrm{PLV}_{u,v}(t, f) = \frac{1}{K}\left| \sum_{k=1}^{K} e^{j\phi_{u,v}^k(t, f)} \right|, \quad (3.2)$$

where $K$ is the total number of trials and $\phi_{u,v}^k(t, f)$ is the phase difference between $x_u^k$ and $x_v^k$ for trial $k$. After the pairwise PLV values are computed, the average pairwise synchrony within a predefined time window of interest, $W = [t_1, t_2]$, and a chosen frequency band is used as the intralayer edge weight, i.e.,

$$w_{uv}^{hh} = \frac{1}{|W|}\frac{1}{|h|}\sum_{t \in W}\sum_{f \in h} \mathrm{PLV}_{u,v}(t, f), \quad 1 \le u, v \le N,$$

where $N$ is the number of brain regions, $|W|$ is the length of the time interval and $|h|$ is the bandwidth of the particular frequency band $h$.
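As a minimal illustration of how the intralayer weights are computed from (3.1)-(3.2), the following sketch assumes precomputed RID-Rihaczek distributions stored as complex numpy arrays; the function name, array shapes and index arguments are illustrative assumptions, not the toolbox interface used in this work:

import numpy as np

def intralayer_weight(C_u, C_v, t_idx, f_idx):
    """PLV-based intralayer edge weight per (3.1)-(3.2).

    C_u, C_v : (K, T, F) complex time-frequency distributions over K trials
    t_idx    : indices of the time window W
    f_idx    : indices of the frequency band h
    """
    cross = C_u * np.conj(C_v)
    phase_diff = np.angle(cross / (np.abs(C_u) * np.abs(C_v)))  # (3.1), per trial
    plv = np.abs(np.exp(1j * phase_diff).mean(axis=0))          # (3.2): consistency over trials
    return plv[np.ix_(t_idx, f_idx)].mean()                     # average over W and band h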
Interlayer Edges In a multilayer network where the different layers correspond to different frequency bands, the interlayer edges can be quantified through measures of cross-frequency coupling. In particular, phase amplitude coupling (PAC), which computes the modulation of the amplitude/power of a high frequency rhythm by the phase of a slower frequency rhythm, is a commonly used metric [31, 242]. Prior work introduces a RID-Rihaczek time-frequency based PAC measure and illustrates its superior performance with respect to Hilbert transform and wavelet-based methods [157, 158]. To quantify PAC, we first extract the instantaneous amplitude envelope of the high frequency component at node $u$, $a_{f_a}^u(t)$, and the instantaneous low frequency phase component at node $v$, $\phi_{f_p}^v(t)$, using the RID-Rihaczek distribution, where $f_p$ and $f_a$ are frequencies within the $h$th and $k$th frequency bands, respectively. $a_{f_a}^u(t)$ is obtained from the frequency-constrained time marginal of $C_u(t, f)$ as:

$$a_{f_a}^u(t) = \int_{f_{a1}}^{f_{a2}} C_u(t, f)\,df, \quad (3.3)$$

where $[f_{a1}, f_{a2}]$ is the bandwidth around the chosen high frequency. Similarly, the low frequency phase at node $v$ is obtained from $C_v(t, f)$ as:

$$\phi_{f_p}^v(t) = \arg\left( \frac{C_v(t, f_p)}{|C_v(t, f_p)|} \right). \quad (3.4)$$

Once the amplitude and phase components are extracted, PAC is estimated by distributing $a_{f_a}^u(t)$ and $\phi_{f_p}^v(t)$ to a composite vector in the complex plane at each time point and measuring the amplitude-normalized modulation index (MI) [180]:

$$\mathrm{MI}_{u,v}(f_p, f_a, t) = \frac{1}{\sqrt{K}}\,\frac{\Big| \sum_{k=1}^{K} a_{f_a}^{u,k}(t)\, e^{j\phi_{f_p}^{v,k}(t)} \Big|}{\sqrt{\sum_{k=1}^{K} a_{f_a}^{u,k}(t)^2}}. \quad (3.5)$$

The weights of the interlayer edges between nodes $u$ and $v$ are computed as:

$$w_{uv}^{hk} = \frac{1}{|W|}\frac{1}{|h||k|}\sum_{t \in W}\sum_{f_p \in h}\sum_{f_a \in k} \mathrm{MI}_{u,v}(f_p, f_a, t). \quad (3.6)$$
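The modulation index in (3.5) is a short computation once the amplitude and phase components are extracted. A minimal sketch (illustrative names) is given below; averaging its output over $t \in W$, $f_p \in h$ and $f_a \in k$ yields the interlayer weight $w_{uv}^{hk}$ in (3.6):

import numpy as np

def modulation_index(a_fa, phi_fp):
    """Amplitude-normalized modulation index per (3.5) at a single (f_p, f_a, t).

    a_fa   : (K,) high-frequency amplitude envelopes across K trials
    phi_fp : (K,) low-frequency phases across K trials (radians)
    """
    K = len(a_fa)
    composite = np.sum(a_fa * np.exp(1j * phi_fp))   # composite vector in the complex plane
    return np.abs(composite) / (np.sqrt(K) * np.sqrt(np.sum(a_fa ** 2)))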
3.3 Multilayer Modularity

As mentioned in Section 1.2, the modularity function quantifies the quality of a partition by comparing the intra-community edge density to that expected under a null model and is calculated as follows [171]:

$$Q = \sum_{i=1}^{N}\sum_{j=1}^{N} \big( A_{ij} - \gamma P_{ij} \big)\,\delta_{g_i g_j}, \quad (3.7)$$

where $P_{ij}$ is the expected edge weight between nodes $i$ and $j$ under the null model, $g_i$ is the community of node $i$, and $\delta_{g_i g_j} = 1$ if $g_i = g_j$ and 0 otherwise. $\gamma$ is the resolution parameter [199] introduced to overcome the resolution limit of modularity [80]. By tuning $\gamma$, one can change the resolution of the modularity function, with larger $\gamma$ values detecting smaller communities. The choice of $P_{ij}$ depends on the null model, which is a random graph that preserves some properties, e.g., the edge density, of the observed network. Different null models can be used to define $P_{ij}$ depending on the graph under study. For example, in the configuration null model, the degree of each node is the same as that of the observed network, so that the identified community structure is not affected by the heterogeneity of the degree distribution. This choice is based on the fact that nodes with high degrees tend to connect to each other merely because they have a high number of connections, and not necessarily because they are within the same community [113]. To prevent this tendency from biasing community detection, the null model preserves the node degrees. On the other hand, the Erdős–Rényi null model does not make such an assumption and allows the identified community structure to be influenced by the degree distribution.

Based on this insight into the role of null models, we extend the definition of the modularity function to multilayer networks by considering which properties of the observed multilayer network we want to preserve in the null model. In neuronal networks such as multi-frequency brain networks, the edge weights are expected to be heterogeneous across layers [36, 266]. This is due to the fact that, for a given task, usually oscillations across only a subset of frequencies are activated; thus, the edge weights across layers cannot be homogeneous. It is important to take this heterogeneity into account to prevent trivial partitions based on the layer label rather than the true community membership. Therefore, the null model used in the definition of the modularity function should preserve the heterogeneity of edge weights across layers. We define the multilayer configuration null model, which preserves the layer-wise node degrees while randomizing the remaining characteristics of the observed multilayer graph. The expected edge weight between $u^h$ and $v^k$ based on the multilayer configuration null model is then defined as:

$$P_{uv}^{hk} = \frac{s_u^{hk}\,s_v^{kh}}{(1 + \delta_{hk})\,m^{hk}}, \quad (3.8)$$

where $s_u^{hk}$ is the total weight of the edges connecting node $u$ in layer $h$ to the nodes in layer $k$, $m^{hh}$ is the total weight of the intralayer edges in layer $h$, $m^{hk}$ is the total weight of the interlayer edges between layers $h$ and $k$, and $\delta_{hk} = 1$ if $h = k$ and 0 otherwise. The multilayer modularity is then defined as follows:

$$Q = \sum_{h=1}^{L}\sum_{i=1}^{n_h}\sum_{j=1}^{n_h} \big( A_{ij}^{hh} - \gamma P_{ij}^{hh} \big)\,\delta_{g_i^h g_j^h} + \omega \sum_{h=1}^{L}\sum_{k=1}^{L}\sum_{i=1}^{n_h}\sum_{j=1}^{n_k} \big( A_{ij}^{hk} - \gamma P_{ij}^{hk} \big)\,\delta_{g_i^h g_j^k}, \quad (3.9)$$

where $\gamma$ is the resolution parameter and $\omega$ is the scaling parameter that weighs the importance of the interlayer connections. (3.9) can be optimized with greedy algorithms, such as the Louvain algorithm [26], developed for maximizing the single-layer modularity function defined in (1.6). In this chapter, we use the Leiden algorithm, which is an extension of the Louvain algorithm with better performance [246].

3.3.1 Resolution Parameter and Inter-layer Scale Selection

We propose a statistical testing approach that compares the modularity value of the observed multilayer network to that of surrogate networks in order to determine the resolution and interlayer scale parameters in (3.9). Since the multilayer EEG networks are fully connected and weighted, we focus on the randomization technique presented in [8] and extend it to generate multilayer surrogate networks. In particular, we select two edges $e_{uv}^{hk}$ and $e_{st}^{lm}$ and swap their edge weights. Edges are selected such that $h = l$ and $k = m$, which ensures that the heterogeneity of edge weights across layers is preserved in the surrogate network.

Assume that we are given an observed multilayer network $\mathcal{M}$ and $c$ surrogate multilayer networks generated from $\mathcal{M}$ as described above. We perform community detection on the surrogate multilayer networks for a given pair of $(\gamma, \omega)$ values. We then calculate the modularity values of the detected community structures and compute the average modularity, $Q_{surr}$. Next, we perform modularity maximization on $\mathcal{M}$ $c$ times and compute the average of the modularity values of the $c$ community structures, $Q_{obs}$. This process is repeated for different pairs $(\gamma, \omega) \in \Gamma \times \Omega$, where $\Gamma$ and $\Omega$ are the given sets of resolution parameters and interlayer scales over which the optimal parameter values are searched. The pair with the largest difference, $Q_{obs} - Q_{surr}$, is selected as the optimal parameter values.
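A minimal sketch of the surrogate generation step is given below, assuming the multilayer network is stored as a single symmetric supra-adjacency matrix with a layer index per node; the function name and number of swaps are illustrative assumptions:

import numpy as np

def surrogate_multilayer(A, layer_of, n_swaps=10_000, seed=0):
    """Surrogate multilayer network via weight swaps within the same layer pair
    (Section 3.3.1), preserving the heterogeneity of edge weights across layers.

    A        : (N, N) symmetric supra-adjacency matrix
    layer_of : (N,) array giving the layer index of each node
    """
    rng = np.random.default_rng(seed)
    S = A.copy()
    N = S.shape[0]
    for _ in range(n_swaps):
        u, v, s, t = rng.integers(0, N, size=4)
        if u == v or s == t:
            continue
        # swap only if both edges connect the same (unordered) pair of layers
        if {layer_of[u], layer_of[v]} == {layer_of[s], layer_of[t]}:
            S[u, v], S[s, t] = S[s, t], S[u, v]
            S[v, u], S[t, s] = S[t, s], S[v, u]   # keep the matrix symmetric
    return S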
3.3.2 Group Community Detection

Once the community structures of the multilayer networks for a group of subjects are detected, it is often desirable to find a group community structure, which summarizes the shared communities across subjects [24, 70, 121]. In this paper, we propose a group community structure detection method based on multiplex graphs. Given $L$ subjects, for each subject we maximize the modularity function with the optimal $\gamma$ and $\omega$ values $c$ times to obtain $c$ community structures. Since modularity maximization is an NP-hard problem [32], modularity maximization algorithms yield locally optimal results. By running the algorithm multiple times, one can obtain a collection of informative community structures for each subject. From these community structures, for each subject we construct a co-clustering matrix $\mathbf{A}^h$, $h \in \{1, 2, \dots, L\}$, where $A_{uv}^h$ is the number of times nodes $u$ and $v$ are in the same community for subject $h$ across all runs. The resulting $L$ co-clustering matrices can be modeled as the layers of a multiplex graph, where each layer is an undirected, weighted graph corresponding to a subject. The group community structure is then found using Spectral Clustering on Multi-Layer graphs (SC-ML) [67], which finds a common community structure shared by the layers of a multiplex graph. SC-ML applies spectral clustering to a modified Laplacian defined as:

$$\mathbf{L}_{mod} = \sum_{h=1}^{L} \mathbf{L}^h - \alpha \sum_{h=1}^{L} \mathbf{U}^h \mathbf{U}^{h\top}, \quad (3.10)$$

where $\mathbf{L}^h$ is the normalized graph Laplacian of layer $h$, defined as $\mathbf{L}^h = (\mathbf{D}^h)^{-1/2}(\mathbf{D}^h - \mathbf{A}^h)(\mathbf{D}^h)^{-1/2}$, $\mathbf{D}^h$ is the diagonal matrix of node degrees and $\mathbf{U}^h$ is the low-rank subspace embedding of layer $h$. In this work, we set $\alpha = 0.5$, following the guidelines in [67]. Algorithm 3.1 gives the complete procedure for obtaining the group community structure from a given set of multilayer networks.

Algorithm 3.1 Multilayer community detection for a set of subjects' multilayer networks
Input: M = {M^1, …, M^S}: multilayer networks of S subjects; Γ and Ω: search sets for the resolution parameter and inter-layer scale; c: number of times to run multilayer modularity maximization
Output: P_group: group community structure
1: 𝒞 ← {}
2: for s ∈ {1, …, S} do
3:   Q_max ← −∞   ⊲ To store the maximum Q_obs − Q_surr
4:   for (γ, ω) ∈ Γ × Ω do
5:     Q_obs ← 0, Q_surr ← 0, 𝒫 ← {}
6:     for i ∈ {1, …, c} do
7:       P, Q ← MLModularityMaximization(M^s, γ, ω)   ⊲ P is the found community structure and Q is its modularity value
8:       Q_obs ← Q_obs + Q/c
9:       𝒫 ← 𝒫 ∪ {P}
10:       N ← GenerateSurrogateMLNetwork(M^s)
11:       P, Q ← MLModularityMaximization(N, γ, ω)
12:       Q_surr ← Q_surr + Q/c
13:     end for
14:     if Q_obs − Q_surr > Q_max then
15:       Q_max ← Q_obs − Q_surr, 𝒫_opt ← 𝒫
16:     end if
17:   end for
18:   C ← ConstructCoClusteringMatrix(𝒫_opt)
19:   𝒞 ← 𝒞 ∪ {C}
20: end for
21: P_group ← SC-ML(𝒞, K)   ⊲ K is the average of the number of communities detected for each subject
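A compact sketch of the SC-ML step in (3.10) is given below, assuming numpy, scipy and scikit-learn; here the subspace embedding $\mathbf{U}^h$ is taken as the $K$ bottom eigenvectors of each layer Laplacian, and details may differ from the reference implementation of [67]:

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def sc_ml(coclustering_mats, K, alpha=0.5):
    """Spectral clustering on a multiplex graph via the modified Laplacian (3.10)."""
    L_mod = 0.0
    for A in coclustering_mats:
        d = A.sum(axis=1)
        d_isqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        # normalized Laplacian L^h = I - D^{-1/2} A D^{-1/2}
        L_h = np.eye(len(d)) - d_isqrt[:, None] * A * d_isqrt[None, :]
        _, vecs = eigh(L_h)
        U_h = vecs[:, :K]                       # low-rank subspace embedding of layer h
        L_mod = L_mod + L_h - alpha * (U_h @ U_h.T)
    # cluster the rows of the K eigenvectors of L_mod with smallest eigenvalues
    _, vecs = eigh(L_mod)
    return KMeans(n_clusters=K, n_init=10).fit_predict(vecs[:, :K])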
3.4 Results

3.4.1 Optimal resolution and scale parameters

Using the statistical testing approach described in Section 3.3.1, we first study the optimal values of $\gamma$ and $\omega$. For each subject and each response type, 100 surrogate networks are generated and their community structures are found for each $(\gamma, \omega) \in \Gamma \times \Omega$, where $\Gamma = \{\gamma : \gamma = 0.95 + 0.0025n,\ n \in \{0, 1, \dots, 40\}\}$ and $\Omega = \{\omega : \omega = 0.0 + 0.0125n,\ n \in \{0, 1, \dots, 40\}\}$. For each subject, 100 community structures are detected for each $(\gamma, \omega) \in \Gamma \times \Omega$. The modularity values of these community structures are evaluated, and the optimal $\gamma$ and $\omega$ values for each subject are then found from $Q_{obs} - Q_{surr}$.

[Figure 3.2: Selection of the resolution ($\gamma$) and inter-layer scale ($\omega$) parameters: (a) and (b) show the average of $Q_{obs} - Q_{surr}$ across 20 subjects for error and correct responses, respectively; (c) shows the histogram of optimal $\omega$ values for error (top) and correct (bottom) responses across subjects.]

Figures 3.2a and 3.2b show the average of $Q_{obs} - Q_{surr}$ across subjects for error and correct responses, respectively. For both response types, the optimal $\gamma$ is found to be close to 0.99, while the optimal $\omega$ values are observed to be more diverse across subjects, ranging between 0.0-0.2 for error and between 0.0-0.1 for correct. In Figure 3.2c, we plot the histograms of the optimal $\omega$ values across subjects for both response types. This figure shows that the optimal $\omega$ values are non-zero for all subjects except one for the error response. On the other hand, for the correct response, the optimal $\omega$ value is 0 for 7 subjects, while most of the remaining subjects have optimal $\omega$ values close to 0.

3.4.2 Consistency of Community Structures for Error and Correct

After obtaining the optimal community structure for each subject and both response types, the consistency of the community structures across subjects within each response type is assessed. A multiplex graph is constructed where layer $h$ corresponds to the $h$th subject's co-clustering matrix, as described in Section 3.3.2. The distance between any two layers is used to quantify the consistency of the community structures of those two subjects. The Jensen-Shannon (JS) distance for graphs [60], which always lies in $[0, 1]$ and has been shown to be effective in assessing the similarity of graphs based on their community structure [60], is used as the distance measure.

[Figure 3.3: Consistency of the community structure for error and correct responses as measured by JS distance. The average JS distance of each subject with respect to the other subjects is shown; the shaded area is the 95% confidence interval.]

Figure 3.3 shows the average JS distance between each subject and the others for each response type. This plot shows that the average distance of each subject with respect to the other subjects is lower for the error response than for the correct response.

3.4.3 Group Community Structure for Error and Correct Responses

Once the optimal community structures are obtained for each subject and each response type, the group community structure is detected using SC-ML. The number of communities is determined as the average of the number of communities detected for each subject; these values are 5 and 9 for error and correct responses, respectively. Figure 3.4 illustrates the group community structures for error and correct responses for the multi-frequency networks.

[Figure 3.4: Multilayer group community structures for (a) error and (b) correct responses. Each electrode is shown as a circle with 4 quadrants corresponding to the 4 frequency bands; different colors represent different communities. The correspondence of the quadrants to the frequency bands is shown at the upper right corners of (a) and (b).]

For the error response, the group community structure consists of communities that include nodes from multiple layers. The community structures of $\theta$, $\alpha$ and $\beta$ are found to be very similar to each other. On the other hand, the community structure of the $\gamma$ band is different and has one
For the correct response, all communities are within a single layer. Nodes in the 𝜃 band are all assigned to a single community, while the other bands have distinct community structures.

In order to better interpret the multilayer community structure, the community structure of the 𝜃 band is also detected using single-layer modularity (see (3.7)). In particular, for each subject the community structure of the 𝜃 band is detected using single-layer modularity for each 𝛾 ∈ Γ = {𝛾 : 𝛾 = 0.95 + 0.0025𝑛, 𝑛 ∈ {0, 1, . . . , 40}}. The optimal resolution parameter is selected using the surrogate network approach. Using this optimal resolution parameter, the group community structure of the 𝜃 band for a given response type is found using SC-ML. The number of communities is determined as the average number of communities detected for each subject's 𝜃 band. Figure 3.5 shows the group community structure of the 𝜃 band for the error response. We do not consider the community structure of the correct response in the 𝜃 band, since all of its nodes were assigned to a single community by the proposed multilayer modularity, as shown in Figure 3.4b.

Figure 3.5: Community structure of the 𝜃 band functional connectivity network found by maximizing the single-layer modularity function (see (3.7)) for the error response. Each electrode is shown as a circle, where the different colors correspond to different communities.

Comparing Figure 3.5 with Figure 3.4a, it can be seen that there are similarities between the community structures detected by single-layer and multilayer modularity maximization. For instance, the green community in Figure 3.5 is also detected in Figure 3.4a. Similarly, most of the nodes in the purple and red communities in Figure 3.5 are in the same communities in the structure detected by the proposed multilayer modularity.

3.5 Discussion

The study of the community structure in multilayer functional connectivity networks reveals some interesting differences between error and correct responses at both the individual and the group level. First, we observe the different role that inter-layer coupling plays in community formation for error vs. correct responses. At the individual subject level, Figure 3.2 illustrates that while inter-layer connections are not important for the community structure of the correct response, as indicated by the optimal value of the scale parameter 𝜔 being close to 0 for the majority of subjects, they are influential in community formation following an error response. Our prior work comparing PAC between response types supports this observation, as there is significantly higher cross-frequency coupling during error monitoring [157]. This increased cross-frequency coupling is between low-frequency cognitive control signals, which are activated after an error response, and high-frequency oscillations related to motor activity and visual processing in the gamma band [94].
At the group level, the community structures in Figure 3.4 show a community comprised of the frontal-central nodes corresponding to the medial prefrontal cortex (mPFC), e.g. Fz, FCz, FC2, in the 𝜃 and 𝛼 bands, together with parietal-occipital nodes corresponding to the visual cortex, e.g. Pz, POz, Oz, and motor cortices, e.g. C2, C4, C6, in the 𝛾 band during ERN. mPFC is known to play an important role during ERN. In particular, it is thought to detect conflicts and recruit additional resources from other brain areas, including the lateral prefrontal cortices and visual and motor cortices, to coordinate task-relevant large-scale networks and support adaptations of goal-directed behavior [174]. Physiologically, these interactions may occur through local and long-range synchronized oscillation dynamics, particularly in the theta range (4-8 Hz). While this mPFC community structure in the 𝜃 band has been observed in prior work that indicates the role of mPFC during ERN [179], the cross-frequency nature of this community is a new finding made possible by the proposed multilayer model. Our recent work shows that the phase of the 𝜃 band oscillations from the frontal-central regions modulates the amplitude of the 𝛾 band oscillations in the parietal-occipital regions following an error response, supporting this finding [159]. Prior studies from others also hypothesize that error-related negativity initiates medial frontal based top-down control mechanisms to improve performance after an error response [98]. More recently, it has been proposed that low-frequency network oscillations in prefrontal cortex, e.g. theta, guide the expression of motor-related activity in action planning and guide perception-related activity, e.g. gamma, in memory access [200]. Thus, the communities detected are consistent with previous literature reflecting higher theta-gamma coupling in the medial frontal cortex and relating this with error-related negativity.

Another observation that can be made from Figure 3.4a is that the nodes corresponding to the 𝛼 and 𝛽 bands are primarily in the same communities. This is in line with recent work indicating that interlayer connectivity is dominated by one-to-one interactions for alpha-to-beta band networks, while for 𝜃-𝛾 band networks there are additional interlayer connections between distant nodes in addition to the one-to-one connections [240]. The community structure for the correct response is mostly within-layer, indicating the lack of coupling across different frequency bands.

When the group community structure of the 𝜃 band in Figure 3.5 is compared to that of Figure 3.4a, some similarities are observed. As mentioned before, the community consisting of frontal and central electrodes in Figure 3.5 is also found by the proposed multilayer community detection method. Partitioning of the remaining electrodes is also consistent across both figures. In order to quantify the similarity of the community structures of the 𝜃 band shown in Figures 3.4a and 3.5, we use Normalized Mutual Information (NMI) [57]. For Figures 3.4a and 3.5, NMI is found to be 0.60, indicating an agreement between the community structures in the 𝜃 band detected by the single-layer and multilayer modularity maximization methods. This consistency between the community structures across the two definitions of modularity is enabled by the way we define multilayer modularity. Our definition of multilayer modularity takes the heterogeneity of edge weights into account; thus, we are able to resolve the structure within layers.
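Since NMI is computed from the two label assignments alone, this comparison can be reproduced in a few lines. The sketch below uses scikit-learn; the label vectors are made-up placeholders, not the actual detected communities.

    from sklearn.metrics import normalized_mutual_info_score

    # Community labels of the theta-band electrodes under the two partitions;
    # these vectors are illustrative placeholders only.
    single_layer = [0, 0, 1, 1, 2, 2, 2, 1]
    multilayer   = [0, 0, 1, 1, 1, 2, 2, 1]

    # NMI is 1 for identical partitions and near 0 for independent ones.
    print(normalized_mutual_info_score(single_layer, multilayer))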
Finally, Figure 3.3 shows that there is more group-level consistency in terms of topological organization for the error response than for the correct response. This is in line with prior work [179] showing that the organization of the functional connectivity networks for the correct response is similar to pre-stimulus networks. Thus, there is more variation across subjects for the correct response compared to the response-evoked networks following an error response.

3.6 Conclusions

This paper introduced a multilayer model of functional connectivity of the brain. In particular, we provided a data-driven approach to construct multi-frequency connectivity networks where layers correspond to different frequency bands. The resulting networks capture both within- and cross-frequency coupling in a single framework. We then introduced a new definition of modularity for multilayer networks such that the null model preserves the heterogeneity of edge weights across layers. The community detection algorithm resulting from the maximization of this multilayer modularity function is applied to EEG data collected during error monitoring. The results indicate that following an error response, the brain organizes itself to form cross-frequency communities. This cross-frequency community formation is not observed for the correct response, which indicates that the cross-frequency coupling is primarily associated with cognitive control. Moreover, we observed that the community structures detected for the error response were more consistent across subjects than the community structures for the correct response.

Future work will consider the extension of this multilayer model to higher dimensions, e.g. multi-aspect multilayer brain networks such as temporal multi-frequency connectivity networks. Compared to the current work, where each subject's community structure is found separately and then combined through group community detection, future work can use multi-aspect multilayer networks constructed from the subjects' multilayer networks. This approach will allow simultaneous detection of communities of subjects, similar to [25]. Future work will also consider different null models in the definition of modularity, such as the constant Potts model, which is shown to be resolution limit free [245]. Finally, in this work we aimed to find the optimal resolution and inter-layer scale parameters; future work can focus on a multi-scale approach where the aim is to combine community structures from different resolutions and inter-layer scales [104].

CHAPTER 4
LEARNING SIGNED GRAPHS

4.1 Introduction

Gene regulatory networks (GRNs) represent fundamental molecular regulatory interactions among genes that establish and maintain all required biological functions characterizing a certain physiological state of a cell in an organism [152]. Cell type identity in an organism is determined by how active transcription factors interact with a set of cis-regulatory regions in the genome and control the activity of genes by either activation or repression of transcription [75]. Usually, the relationships between these active transcription factors and their target genes characterize GRNs. Due to the inherent causality captured by these meaningful biological interactions in GRNs, genome-wide inference of these networks holds great promise in enhancing the understanding of normal cell physiology and in characterizing the molecular compositions of complex diseases [207, 147]. GRNs can be mathematically characterized as graphs where nodes represent genes and the edges quantify the regulatory relations.
GRN reconstruction attempts to infer this regulatory network from high-throughput data using statistical and computational approaches. Multiple methods encompassing varying mathematical concepts have been proposed during the last decade to infer GRNs using gene expression data from bulk population sequencing technologies, which accumulate expression profiles across all cells in a tissue. These methods can be broadly classified into two groups: the first group infers a static GRN, considering the steady state of gene expression, while the second group uses temporal measurements to capture the expression profiles of the genes in a dynamic process. A thorough evaluation of the static and dynamic models used in bulk GRN reconstruction can be found in [135, 38].

Recent advances in RNA-sequencing technologies have enabled the measurement of gene expression in single cells. This has led to the development of several computational approaches aimed at quantifying the expression of individual cells for cell-type labelling and estimation of cellular lineages. Several algorithms have been developed to arrange cells in a projected temporal order (pseudotime trajectory) based on similarities in their transcriptional states. In parallel, several dynamic models for single cell GRN reconstruction have also been developed taking the estimated pseudotimes into account. Since single cell network reconstruction algorithms try to establish functional relationships between genes across the entire population of cells, it is debatable whether additional knowledge regarding cell state transitions provides any added benefit [45, 190]. In summary, direct application of bulk GRN reconstruction methods may not be adequate for single cell network inference.

The complex nature of single-cell transcriptomics data poses unique challenges in GRN inference. Changes in gene expression due to cell-cell stochastic variation, cell-cycle heterogeneity and high sparsity due to insufficient sequencing depths and capture inefficiency for genes with low expression form some of the unique characteristics of these datasets [183, 4]. Most importantly, the high sparsity/high zero values feature of single cell datasets has garnered a lot of attention, and several statistical methods have been designed to particularly model this phenomenon [114, 76, 201]. Recent research has indicated that these zero values, referred to as "dropouts", most likely result from biological variation and may be indicative of heterogeneity in gene expression for varying cell types [236, 229]. To account for these unique challenges, a variety of algorithms for network reconstruction from scRNAseq data have recently been proposed, but most of these methods fail to outperform network estimation methods developed for bulk data or microarrays [190, 45].

To that end, we propose a network reconstruction algorithm that learns the co-expression between genes using smoothness based GL algorithms. As mentioned in Section 1.3, smoothness based GL was first considered in [69], and different variations of this framework, with constraints on the learned topology and for handling noisy graph signals, were considered in [107, 99, 23, 106, 206]. All of the previous works learn unsigned graphs, with the exception of [141], where a signed graph is learned by employing the signed graph Laplacian defined by [119].
By using the signed Laplacian, [141] aims to learn positive edges between nodes whose signal values are similar and negative edges between nodes whose signal values have opposite signs with similar absolute values. However, this approach is not suitable when graph signals are either all positive- or all negative-valued, as in the case of gene expression data.

Considering the advantages of GL approaches in learning graph topologies that are consistent with the observed signals, in this chapter we propose a novel GL algorithm for the reconstruction of GRNs. In particular, we assume gene expression data obtained from cells are graph signals residing on an unknown graph structure, which corresponds to the GRN. One important characteristic of GRNs is that they are signed graphs, where positive and negative edges correspond to activating and inhibitory regulations between genes. To this end, we propose a novel and computationally efficient signed GL approach, scSGL, that reconstructs the GRN under the assumption that graph signals admit a low-frequency representation over activating edges, while admitting a high-frequency representation over inhibitory edges. Biologically, this modelling implies that two genes connected with an activating edge have similar expressions, while two genes connected with an inhibitory edge have dissimilar expressions.

Figure 4.1: Euclidean distances (left, normalized to [0, 1]) and correlations (right) between expressions of gene pairs in the curated datasets studied in Section 4.3. Values are calculated only for gene pairs that are connected in the ground truth GRNs, and they are reported separately for activating and inhibitory edges. Only inhibitory edges are reported for VSC, since its GRN includes only inhibitory edges.

In Figure 4.1, we show how these assumptions hold for the curated datasets studied in Section 4.3. The figure shows that Euclidean distances between expressions are smaller for gene pairs connected by activating edges than for those connected by inhibitory edges. The figure also reports correlations between expressions, which indicate that the expressions of gene pairs connected with activating and inhibitory edges are positively correlated, i.e. similar, and negatively correlated, i.e. dissimilar, respectively. We also performed a Wilcoxon rank sum test to determine whether the calculated associations for the positive ground truth connections were significantly lower than the associations for the negative ground truth connections for Euclidean distances. We test the null hypothesis 𝐻0: the distributions of both populations are equal, against the alternative hypothesis 𝐻𝑎: the distribution of the negative associations is stochastically greater than the distribution of the positive associations. In the case of correlations, we test 𝐻𝑎: the distribution of the positive associations is stochastically greater than the distribution of the negative associations. The calculated p-values were all less than 0.01, justifying our assumptions for all curated datasets except VSC, which only has negative associations.
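This one-sided test can be reproduced with SciPy's Mann-Whitney U implementation (equivalent to the Wilcoxon rank-sum test); the two samples below are random placeholders for the distances of activating and inhibitory gene pairs, not the actual data.

    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    # Placeholder samples; in the analysis these are Euclidean distances between
    # expressions of gene pairs connected by activating/inhibitory edges.
    dist_activating = rng.normal(0.3, 0.1, size=200)
    dist_inhibitory = rng.normal(0.5, 0.1, size=200)

    # H_a: inhibitory-edge distances are stochastically greater than activating ones.
    stat, pvalue = mannwhitneyu(dist_inhibitory, dist_activating, alternative="greater")
    print(pvalue < 0.01)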
Another important characteristic of scRNAseq data is the high proportion of dropouts. We address this issue by employing kernel functions to map graph signals to a higher dimensional space and assuming low- and high-frequency representations for these high dimensional graph signals. This mapping allows us to use kernels that are appropriate for modelling single cell data structures.

The remainder of the chapter is organized as follows. In Section 4.2, the proposed signed graph learning approach is given. The performance of scSGL on various synthetic and real datasets is reported in Section 4.3. Finally, Section 4.4 includes discussion and concluding remarks.

4.2 Learning Signed Graphs from Graph Signals

4.2.1 Signed Graphs Revisited

In Section 1.1, a signed graph 𝐺 = (𝑉, 𝐸) is defined as a graph whose edges are associated with weights that can be both positive and negative. The edge set 𝐸 can be partitioned into two sets based on the edge signs. Namely, 𝐸 = 𝐸⁺ ∪ 𝐸⁻, where 𝐸⁺ = {𝑒𝑢𝑣 | 𝑒𝑢𝑣 ∈ 𝐸, 𝑤𝑢𝑣 > 0} and 𝐸⁻ = {𝑒𝑢𝑣 | 𝑒𝑢𝑣 ∈ 𝐸, 𝑤𝑢𝑣 < 0}. Using this partitioning, 𝐺 can be considered as a two-layer multiplex network, where the layers are 𝐺⁺ = (𝑉, 𝐸⁺) and 𝐺⁻ = (𝑉, 𝐸⁻). Edge weights of 𝐺⁺ and 𝐺⁻ are determined from 𝐺: edge weights in 𝐺⁺ are 𝑤𝑢𝑣, while those of 𝐺⁻ are |𝑤𝑢𝑣|. Since both layers are now unsigned graphs, we can define their adjacency matrices and combinatorial Laplacian matrices as described in Section 1.1. These matrices are denoted by A⁺, A⁻, L⁺ and L⁻. Finally, any GSP concepts developed for unsigned graphs can also be employed.

4.2.2 Signed Graph Learning

Consider a data matrix X ∈ ℝ^{𝑛×𝑝}, whose columns are observed graph signals over an unknown signed graph 𝐺. In Section 1.3, an unsigned graph is learned with the assumption that the observed graph signals have a low-frequency representation in the graph spectral domain. In order to learn a signed graph 𝐺, one needs to make additional assumptions about the graph signals X. In this chapter, we make the following assumptions:

1. Signal values on nodes connected by positive edges are similar to each other, i.e. the variation over positive edges is small.
2. Signal values on nodes connected by negative edges are dissimilar to each other, i.e. the variation over negative edges is large.

From a GSP perspective, these assumptions correspond to graph signals being low- and high-frequency over positive and negative edges, respectively. Assumption 1 implies that the graph signals have a low-frequency representation in the graph Fourier domain of 𝐺⁺. On the other hand, Assumption 2 implies that the graph signals have a high-frequency representation in the graph Fourier domain of 𝐺⁻. We use (1.10) to quantify how well the graph signals fit these assumptions. Thus, to learn an unknown signed graph, we minimize tr(X⊤L⁺X) with respect to L⁺ while maximizing tr(X⊤L⁻X) with respect to L⁻:

$$\begin{aligned}
\underset{\mathbf{L}^+,\,\mathbf{L}^- \in \mathcal{L}}{\text{minimize}} \quad & \mathrm{tr}(\mathbf{X}^\top \mathbf{L}^+ \mathbf{X}) - \mathrm{tr}(\mathbf{X}^\top \mathbf{L}^- \mathbf{X}) + \alpha_1 \|\mathbf{L}^+\|_F^2 + \alpha_2 \|\mathbf{L}^-\|_F^2 \\
\text{subject to} \quad & \mathrm{tr}(\mathbf{L}^+) = 2n, \;\; \mathrm{tr}(\mathbf{L}^-) = 2n, \;\; (\mathbf{L}^+, \mathbf{L}^-) \in \mathcal{C},
\end{aligned} \qquad (4.1)$$

where the Frobenius norms and the first two constraints are similar to (1.11). L⁺ and L⁻ are constrained to be in the set C = {(L⁺, L⁻) : 𝐿⁺ᵢⱼ = 0 if 𝐿⁻ᵢⱼ ≠ 0 and 𝐿⁻ᵢⱼ = 0 if 𝐿⁺ᵢⱼ ≠ 0} to ensure that they are not both non-zero at the same indices.
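To make the two-layer view concrete, the following sketch (with variable names of our choosing, not from the thesis) splits a signed weight matrix into its positive and negative parts and evaluates the two trace terms of (4.1) for a signal matrix.

    import numpy as np

    def laplacian(adjacency):
        """Combinatorial Laplacian L = D - A of an unsigned weighted graph."""
        return np.diag(adjacency.sum(axis=1)) - adjacency

    # A small signed adjacency matrix W and graph signals X (columns are signals).
    W = np.array([[ 0.0,  0.8, -0.5],
                  [ 0.8,  0.0,  0.0],
                  [-0.5,  0.0,  0.0]])
    X = np.random.default_rng(1).normal(size=(3, 10))

    A_pos, A_neg = np.maximum(W, 0), np.maximum(-W, 0)   # two-layer multiplex view
    L_pos, L_neg = laplacian(A_pos), laplacian(A_neg)

    # tr(X^T L+ X) is small when signals are smooth over positive edges;
    # tr(X^T L- X) is large when signals vary strongly over negative edges.
    print(np.trace(X.T @ L_pos @ X), np.trace(X.T @ L_neg @ X))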
4.2.3 Kernelized Signed Graph Learning

Traditional machine learning and signal processing applications are mostly developed based on linear modelling due to its simplicity. However, real world problems require nonlinear estimation that can detect more complex patterns in the data. For this purpose, kernels are introduced to capture the nonlinearity by mapping signals to a high-dimensional space [96]. Kernels correspond to dot products in a higher dimensional feature space and avoid explicit construction of the feature space, thus providing the simplicity of linear methods in nonlinear estimation. Given data from an input space X and a mapping function 𝜙 : X → H, where H is a Hilbert space, a kernel function can be expressed as an inner product in the corresponding feature space, i.e. 𝜅(x𝑖, x𝑗) = ⟨𝜙(x𝑖), 𝜙(x𝑗)⟩, where 𝜅 : X × X → ℝ is a finitely positive semi-definite kernel function [222]. An explicit representation of the feature map 𝜙 is not necessary, and the dimension of the mapped feature vectors can be high and even infinite.

By using different kernels, the learning algorithm can be augmented to exploit various (nonlinear) associations between input data. This is especially crucial in GRN inference, as shown in [230], where 17 different association measures between gene expressions are compared in terms of their performance in GRN inference and various other tasks on single-cell transcriptomic datasets. In its current form, (4.1) cannot be used directly with different kernels. Thus, the optimization problem in (4.1) is extended using kernels. The first term in (4.1) can be written as $\mathrm{tr}(\mathbf{X}^\top \mathbf{L}^+ \mathbf{X}) = \mathrm{tr}(\mathbf{X}\mathbf{X}^\top \mathbf{L}^+) = \sum_{i,j} \langle \mathbf{X}_{i\cdot}, \mathbf{X}_{j\cdot} \rangle L^+_{ij}$, and the second term can be written similarly. By replacing the dot products with a given kernel function, i.e. 𝜅(X𝑖·, X𝑗·), the problem in (4.1) can be extended to incorporate different kernels as:

$$\begin{aligned}
\underset{\mathbf{L}^+,\,\mathbf{L}^- \in \mathcal{L}}{\text{minimize}} \quad & \mathrm{tr}(\mathbf{K}\mathbf{L}^+) - \mathrm{tr}(\mathbf{K}\mathbf{L}^-) + \alpha_1 \|\mathbf{L}^+\|_F^2 + \alpha_2 \|\mathbf{L}^-\|_F^2 \\
\text{subject to} \quad & \mathrm{tr}(\mathbf{L}^+) = 2n, \;\; \mathrm{tr}(\mathbf{L}^-) = 2n, \;\; (\mathbf{L}^+, \mathbf{L}^-) \in \mathcal{C},
\end{aligned} \qquad (4.2)$$

where K ∈ ℝ^{𝑛×𝑛} is the kernel matrix with 𝐾𝑖𝑗 = 𝜅(X𝑖·, X𝑗·). From a GSP perspective, this modification implies that the graph signals on each node, i.e. X𝑖·, are first mapped to a (higher dimensional) Hilbert space, and the signed graph is learned in this new space. Namely, let 𝚽 ∈ ℝ^{𝑛×𝑝̂} be the matrix constructed by mapping the X𝑖·'s to the Hilbert space H with dimension 𝑝̂, where the rows of 𝚽 are 𝜙(X𝑖·). When learning the unknown signed graph 𝐺 with a kernel, each column of 𝚽 is a graph signal over 𝐺, and these signals are assumed to have low- and high-frequency representations with respect to 𝐺⁺ and 𝐺⁻, respectively.

Extending the signed graph learning problem in (4.1) using kernels brings flexibility: any association metric in [230] can be implemented in this framework if it is a positive semi-definite kernel. In this chapter, we consider three kernels: the correlation coefficient 𝑟, the measure of proportionality 𝜌 [195] and a modification of Kendall's tau (𝜏𝑧𝑖) for zero-inflated non-negative continuous data [188]. These kernels are selected because 𝜌 [195], a measure of association for compositional data, and 𝜏𝑧𝑖, a measure of association for zero-inflated non-negative continuous data [188], are shown to perform consistently better in all learning scenarios investigated in [230]. The strong performance of 𝜌 can be explained on the basis that scRNA-seq captures only a small proportion of the messenger RNA in each cell, and therefore gene expression measurements can be viewed as relative measures of abundance (as in compositional data). On the other hand, 𝜏𝑧𝑖, a modification of Kendall's rank correlation coefficient, is expected to provide less biased estimates of association in the setting of zero-inflated continuous data, a characteristic of single cell transcriptomic datasets [188]. To compare and contrast these two measures, the correlation kernel 𝑟 is additionally investigated, since it is widely used in GRN reconstruction algorithms.
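As an illustration of the kernelization, with the correlation kernel the matrix K in (4.2) is simply the gene-by-gene correlation matrix of the expression data. The sketch below uses random placeholder data; any positive semi-definite association measure (e.g. 𝜌 or 𝜏𝑧𝑖) can be substituted for np.corrcoef and plugged into (4.2) the same way.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.poisson(2.0, size=(50, 300)).astype(float)  # 50 genes x 300 cells (placeholder)

    # Correlation kernel: K_ij = r(X_i., X_j.), a positive semi-definite matrix.
    K = np.corrcoef(X)
    assert K.shape == (50, 50)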
4.2.4 Optimization

The problem in (4.1) is non-convex due to the last constraint, which is known as a complementarity constraint [217]. In [251], it is shown that the alternating direction method of multipliers (ADMM) converges for problems with complementarity constraints under some assumptions. First, we rewrite the problem in vector form. Let k = upper(K), d = diag(K), ℓ⁺ = upper(L⁺) and ℓ⁻ = upper(L⁻). Then, (4.2) can be rewritten as:

$$\begin{aligned}
\underset{\boldsymbol{\ell}^+ \le 0,\, \boldsymbol{\ell}^- \le 0}{\text{minimize}} \quad & \langle 2\mathbf{k} - \mathbf{S}^\top \mathbf{d}, \boldsymbol{\ell}^+ \rangle - \langle 2\mathbf{k} - \mathbf{S}^\top \mathbf{d}, \boldsymbol{\ell}^- \rangle + \alpha_1 \langle (2\mathbf{I} + \mathbf{S}^\top \mathbf{S})\boldsymbol{\ell}^+, \boldsymbol{\ell}^+ \rangle + \alpha_2 \langle (2\mathbf{I} + \mathbf{S}^\top \mathbf{S})\boldsymbol{\ell}^-, \boldsymbol{\ell}^- \rangle \\
\text{subject to} \quad & \mathbf{1}^\top \boldsymbol{\ell}^+ = -n, \;\; \mathbf{1}^\top \boldsymbol{\ell}^- = -n, \;\; \boldsymbol{\ell}^+ \perp \boldsymbol{\ell}^-,
\end{aligned} \qquad (4.3)$$

where S is defined in Section 1.1, the first two terms correspond to the trace terms in (4.2), the last two terms correspond to the Frobenius norm terms of (4.2), and the first two constraints are the same as the first two constraints of (4.2). The last constraint, together with ℓ⁺ ≤ 0 and ℓ⁻ ≤ 0, corresponds to the complementarity constraint. By introducing two slack variables v = ℓ⁺ and w = ℓ⁻, the problem is written in standard ADMM form:

$$\begin{aligned}
\underset{\mathbf{v}, \mathbf{w}, \boldsymbol{\ell}^+, \boldsymbol{\ell}^-}{\text{minimize}} \quad & \imath_S(\mathbf{v}, \mathbf{w}) + h(\boldsymbol{\ell}^+, \boldsymbol{\ell}^-) + \imath_H(\boldsymbol{\ell}^+) + \imath_H(\boldsymbol{\ell}^-) \\
\text{subject to} \quad & \mathbf{v} - \boldsymbol{\ell}^+ = \mathbf{0}, \;\; \mathbf{w} - \boldsymbol{\ell}^- = \mathbf{0},
\end{aligned} \qquad (4.4)$$

where ı𝑆(·, ·) is the indicator function of the complementarity set 𝑆 = {(v, w) : v ≤ 0, w ≤ 0, v ⊥ w}, h(ℓ⁺, ℓ⁻) is the objective function in (4.3), and ı𝐻(·) is the indicator function of the hyperplane 𝐻 = {ℓ : 1⊤ℓ = −𝑛}. The augmented Lagrangian of (4.4) is:

$$\begin{aligned}
\mathcal{L}_\rho(\mathbf{v}, \mathbf{w}, \boldsymbol{\ell}^+, \boldsymbol{\ell}^-, \boldsymbol{\lambda}_1, \boldsymbol{\lambda}_2) = {} & \imath_S(\mathbf{v}, \mathbf{w}) + h(\boldsymbol{\ell}^+, \boldsymbol{\ell}^-) + \imath_H(\boldsymbol{\ell}^+) + \imath_H(\boldsymbol{\ell}^-) \\
& + \boldsymbol{\lambda}_1^\top (\mathbf{v} - \boldsymbol{\ell}^+) + \frac{\rho}{2}\|\mathbf{v} - \boldsymbol{\ell}^+\|_2^2 + \boldsymbol{\lambda}_2^\top (\mathbf{w} - \boldsymbol{\ell}^-) + \frac{\rho}{2}\|\mathbf{w} - \boldsymbol{\ell}^-\|_2^2,
\end{aligned} \qquad (4.5)$$

where 𝜆₁ and 𝜆₂ are the Lagrange multipliers and 𝜌 > 0 is the augmented Lagrangian parameter. The steps of the 𝑘th iteration of ADMM are as follows:

(v, w)-step: The (v, w)-step of ADMM can be found as the projection onto the complementarity set 𝑆:

$$(\mathbf{v}^{k+1}, \mathbf{w}^{k+1}) = \underset{\mathbf{v}, \mathbf{w}}{\mathrm{argmin}} \; \imath_S(\mathbf{v}, \mathbf{w}) + \frac{\rho}{2}\Big\|\mathbf{v} - \boldsymbol{\ell}^{+,k} + \frac{\boldsymbol{\lambda}_1^k}{\rho}\Big\|_2^2 + \frac{\rho}{2}\Big\|\mathbf{w} - \boldsymbol{\ell}^{-,k} + \frac{\boldsymbol{\lambda}_2^k}{\rho}\Big\|_2^2 = \Pi_S(\mathbf{y}), \qquad (4.6)$$

where $\mathbf{y} = [(\boldsymbol{\ell}^{+,k} - \boldsymbol{\lambda}_1^k/\rho)^\top, (\boldsymbol{\ell}^{-,k} - \boldsymbol{\lambda}_2^k/\rho)^\top]^\top$ and Π𝑆(·) is the projection operator onto the set 𝑆.

(ℓ⁺, ℓ⁻)-step: Using the fact that the optimization can be performed separately for ℓ⁺ and ℓ⁻, the ℓ⁺-step can be written as:

$$\begin{aligned}
\boldsymbol{\ell}^{+,k+1} &= \underset{\boldsymbol{\ell}^+}{\mathrm{argmin}} \; \mathbf{z}^\top \boldsymbol{\ell}^+ + \alpha_1 \langle (2\mathbf{I} + \mathbf{S}^\top \mathbf{S})\boldsymbol{\ell}^+, \boldsymbol{\ell}^+ \rangle + \imath_H(\boldsymbol{\ell}^+) + \frac{\rho}{2}\Big\|\mathbf{v}^{k+1} - \boldsymbol{\ell}^+ + \frac{\boldsymbol{\lambda}_1^k}{\rho}\Big\|_2^2 \\
&= \Pi_H\big[((4\alpha_1 + \rho)\mathbf{I} + 2\alpha_1 \mathbf{S}^\top \mathbf{S})^{-1}(\rho \mathbf{v}^{k+1} + \boldsymbol{\lambda}_1^k - \mathbf{z})\big],
\end{aligned} \qquad (4.7)$$

where z = 2k − S⊤d and Π𝐻(·) is the projection operator onto the hyperplane 𝐻. Similarly, the ℓ⁻-step can be written as:

$$\boldsymbol{\ell}^{-,k+1} = \Pi_H\big[((4\alpha_2 + \rho)\mathbf{I} + 2\alpha_2 \mathbf{S}^\top \mathbf{S})^{-1}(\rho \mathbf{w}^{k+1} + \boldsymbol{\lambda}_2^k + \mathbf{z})\big]. \qquad (4.8)$$

Lagrange multiplier updates: The updates of the Lagrange multipliers are:

$$\boldsymbol{\lambda}_1^{k+1} = \boldsymbol{\lambda}_1^k + \rho(\mathbf{v}^{k+1} - \boldsymbol{\ell}^{+,k+1}), \qquad (4.9)$$
$$\boldsymbol{\lambda}_2^{k+1} = \boldsymbol{\lambda}_2^k + \rho(\mathbf{w}^{k+1} - \boldsymbol{\ell}^{-,k+1}). \qquad (4.10)$$
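For illustration, the two projection operators appearing in (4.6)-(4.8) can be written compactly. This is a minimal sketch under our reading of the set definitions, not the scSGL implementation.

    import numpy as np

    def proj_hyperplane(x, n):
        """Project x onto H = { l : 1^T l = -n }."""
        return x - (x.sum() + n) / x.size

    def proj_complementarity(a, b):
        """Project the pair (a, b) onto S = { v <= 0, w <= 0, v ⊥ w }, entrywise."""
        v, w = np.minimum(a, 0), np.minimum(b, 0)
        # Keep whichever of (v_i, 0) or (0, w_i) is closer to (a_i, b_i).
        keep_v = (a - v) ** 2 + b ** 2 <= a ** 2 + (b - w) ** 2
        return np.where(keep_v, v, 0.0), np.where(keep_v, 0.0, w)

    # Small demo: at most one of v_i, w_i is non-zero after projection.
    a = np.array([-1.0, 0.5, -0.2])
    b = np.array([-0.5, -2.0, -0.1])
    print(proj_complementarity(a, b))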
Computational and Storage Complexity: The computational complexity of the optimization procedure described above can be found by determining how many computations are required for each ADMM step. Let 𝑀 = 𝑛(𝑛 − 1)/2, where 𝑛 is the number of nodes. The (v, w)-step can be performed in 𝑂(𝑀), or 𝑂(𝑛²), time. The (ℓ⁺, ℓ⁻)-step requires the inversion of the matrix (4𝛼₁ + 𝜌)I + 2𝛼₁S⊤S, which needs to be calculated only once before the optimization iterations. The inverse matrix has a closed form, which can be found using the Woodbury matrix identity; it has a decomposition of the form A⊤A, where A is a sparse matrix with 𝑂(𝑛²) non-zero entries. Thus, the matrix-vector multiplication of the (ℓ⁺, ℓ⁻)-step can be done in 𝑂(𝑛²) time. Updates of the Lagrange multipliers can also be performed in 𝑂(𝑀), or 𝑂(𝑛²), time. Let 𝐼 be the number of iterations required for the convergence of ADMM. The overall time complexity of scSGL is then 𝑂(𝐼𝑛²). The storage complexity of scSGL is determined by the size of the inverse matrix required in the (ℓ⁺, ℓ⁻)-step. Since this matrix has a decomposition of the form A⊤A, we only need to store A. Thus, the storage complexity of scSGL is 𝑂(𝑛²). Based on the above analysis, the computational and storage complexity of ADMM is quadratic in the number of nodes and is not affected by the number of graph signals. Note that scSGL also requires the construction of the kernel matrix before running the optimization. Since there are already very efficient tools to construct kernel matrices [230], we did not include their complexity in the analysis above. Finally, there are recent works in the GSP literature on scaling GL methods to graphs with millions of nodes [109]. These approaches can be employed to scale scSGL, which we leave as a future pursuit.

4.2.5 Hyperparameter Selection

The optimization problem in (4.2) requires the selection of two regularization parameters, 𝛼₁ and 𝛼₂, which determine the density of the learned graph, i.e. larger values of 𝛼₁ (𝛼₂) result in a denser L⁺ (L⁻). Their values can be set to obtain a graph with the desired positive and negative edge densities. We propose a resampling approach [74] to determine the desired positive and negative edge densities empirically. The approach has the following steps (a code sketch follows the list):

1. We randomly shuffle each column of the data matrix X to generate a surrogate data matrix.
2. Associations between the rows of the surrogate data matrix are calculated with the kernel employed in (4.2).
3. Thresholds 𝜆₁ and 𝜆₂ are selected as the 𝑝th and (100 − 𝑝)th percentiles of the values in the kernel matrix calculated in Step 2.
4. Steps 1-3 are repeated 𝑘 times to construct the empirical distributions of the thresholds 𝜆₁ and 𝜆₂.
5. 𝜆̂₁ and 𝜆̂₂ are selected to be the medians of the empirical distributions constructed in Step 4.
6. The kernel matrix for the original data X is constructed.
7. The number of entries in the kernel matrix that are smaller than 𝜆̂₁ is determined and normalized by the total number of entries in the kernel matrix to obtain the density of L⁻. Similarly, the number of entries in the kernel matrix greater than 𝜆̂₂ is used to determine the density of L⁺.
8. 𝛼₁ and 𝛼₂ are then selected to learn graphs with the estimated graph densities found in Step 7.

For all the datasets analyzed in Section 4.3, we learned the densities of the positive and negative parts by setting 𝑝 = 5.
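A compact sketch of Steps 1-7 above; the function and argument names are ours, and the kernel defaults to the correlation kernel for concreteness.

    import numpy as np

    def edge_densities(X, kernel=np.corrcoef, p=5, k=50, rng=None):
        """Estimate positive/negative edge densities from column-shuffled
        surrogates of the data matrix X (rows are genes, columns are cells)."""
        rng = rng or np.random.default_rng()
        lam1, lam2 = [], []
        for _ in range(k):
            surrogate = np.apply_along_axis(rng.permutation, 0, X)  # shuffle columns
            K_s = kernel(surrogate)
            lam1.append(np.percentile(K_s, p))
            lam2.append(np.percentile(K_s, 100 - p))
        lam1_hat, lam2_hat = np.median(lam1), np.median(lam2)
        K = kernel(X)
        neg_density = np.mean(K < lam1_hat)   # target density of L-
        pos_density = np.mean(K > lam2_hat)   # target density of L+
        return pos_density, neg_density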
4.3 Results

In this section, the performance of scSGL is evaluated and compared to state-of-the-art GRN inference methods on various simulated and experimental scRNAseq datasets. We selected GENIE3 [103], GRNBOOST2 [146], PIDC [41] and PPCOR [116] for comparison, as they are the top performing methods in [190]. GENIE3, GRNBOOST2 and PPCOR were originally developed for bulk analysis, while PIDC was developed for single cell gene expression data. Among these methods, GENIE3 and GRNBOOST2 return fully connected directed networks, while the remaining two infer undirected networks. Finally, only the PPCOR algorithm returns signed graphs.

4.3.1 Performance Metrics

AUPRC Ratio: Given the inherent sparsity of gene networks, we used the area under the precision-recall curve (AUPRC) ratio as the primary evaluation metric. AUPRC values are calculated by comparing the inferred graphs to the ground truth gene regulations. During this calculation, the signs of the learned edges are ignored, as AUPRC is restricted to binary classification. In particular, we first take the absolute values of the edge weights and then compare them to the ground truth edges. Thus, these metrics indicate how well methods detect edges without considering the signs of the inferred edges. Ground truth networks are considered undirected, and self-loops are ignored. Following [190], we define the AUPRC ratio as the ratio of the AUPRC value of a method to the AUPRC of the random estimator.

AUPRC Ratio Activating/Inhibitory: One of our goals is to learn whether the edges are activating or inhibitory. AUPRC as defined above cannot evaluate the sign information. Thus, for curated datasets, whose ground truth gene regulations include signed edge information, we calculate AUPRC for activating and inhibitory edges separately. In particular, for methods that learn signed graphs, we compare the learned positive edges to the activating edges in the ground truth and the learned negative edges to the inhibitory edges in the ground truth. For methods that do not learn signed edges, we evaluate the inferred edges with respect to the ground truth activating and inhibitory edges separately to calculate two AUPRC values.
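Under our reading of this metric, the AUPRC ratio can be computed as follows; auprc_ratio is a helper name of our choosing, and the AUPRC of the random estimator is taken to be the edge density of the ground truth.

    import numpy as np
    from sklearn.metrics import average_precision_score

    def auprc_ratio(W_est, A_true):
        """AUPRC of |edge weights| against binary ground truth edges, divided by
        the AUPRC of a random estimator (the positive rate of the ground truth)."""
        iu = np.triu_indices_from(A_true, k=1)      # undirected, ignore self-loops
        scores, labels = np.abs(W_est[iu]), A_true[iu]
        return average_precision_score(labels, scores) / labels.mean()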
4.3.2 Synthetic Datasets

Curated Datasets From BEELINE: The first simulation datasets we consider are curated from published Boolean models of GRNs [190]. These datasets were generated using the recently proposed single cell GRN simulator BoolODE [190]. BoolODE converts Boolean functions specifying a GRN directly to ODE equations using GeneNetWeaver [216, 134], a widely used method to simulate bulk transcriptomic data from GRNs. These datasets are generated from four literature-curated Boolean models: mammalian cortical area development (mCAD), ventral spinal cord (VSC) development, hematopoietic stem cell (HSC) differentiation and gonadal sex determination (GSD). These models represent different types of graph structures with varying numbers of positive and negative edges, thus serving as good examples for illustrating the robustness of the proposed method in modelling signed graph topologies. BoolODE is used to create ten random simulations of the synthetic gene expression datasets with 2,000 cells for each model. For each dataset, one version with a dropout rate of 50% and another with a rate of 70% are also considered to evaluate the performance of the methods under missing values.

AUPRC ratios are calculated separately for activating and inhibitory edges, and their averages over realizations are reported in Figure 4.2. For most of the datasets, scSGL performs better than the other benchmarking methods in inferring both activating and inhibitory edges. Although there are differences between the performances of the different kernels, scSGL generally performs better than the state-of-the-art methods irrespective of the selected kernel. Comparing the performances of the different kernels, it is observed that 𝜏𝑧𝑖 results in higher AUPRC ratios in GSD, HSC and VSC, while 𝜌 performs better in the mCAD datasets. It is also observed that AUPRC ratios are higher for activating edges than for inhibitory edges. Increasing the dropout ratio causes a drop in the performance of all methods for inferring the activating edges, but not for learning the inhibitory edges. Overall, the best performing kernel is 𝜏𝑧𝑖, which might be because of its robustness to increasing dropout ratio compared to the other kernels.

Figure 4.2: Performance of scSGL and state-of-the-art methods on curated datasets as measured by AUPRC ratio for activating and inhibitory edges. The x-axis indicates the dropout ratio in the dataset.

Parameter Sensitivity Analysis: To mimic the zero-inflated and overly dispersed nature of most scRNAseq datasets, we simulated gene expression data from a multivariate zero-inflated negative binomial (ZINB) distribution for our second simulation. These datasets were then used to conduct a parameter sensitivity analysis for the proposed methods. Given a known graph structure, synthetic datasets are generated from a ZINB distribution by adapting an algorithm developed by [262]. The three parameters of the ZINB distribution, 𝜆, 𝜅 and 𝜔, which control its mean, dispersion and degree of zero-inflation, respectively, were determined from a real scRNAseq dataset to make the simulations mirror the properties of real datasets. The procedure to generate simulated gene expression data is as follows (a sketch of the Gaussian part of this procedure is given after the list):

1. For each simulation setting, we first draw a binary graph 𝐺 from a random graph model with 𝑛 genes.
2. Each edge of 𝐺 is assigned a weight 𝑊𝑖𝑗 such that:
$$W_{ij} = \begin{cases} \mathrm{Unif}(0.3, 0.7) & \text{with probability } 0.5, \\ \mathrm{Unif}(-0.7, -0.3) & \text{otherwise.} \end{cases}$$
3. 𝑝 random samples are drawn from a multivariate Gaussian distribution with precision matrix W. The random samples are used as the columns of the matrix X ∈ ℝ^{𝑛×𝑝}.
4. To mimic the dropout phenomenon present in real single cell datasets, we next introduce additional zeros to the gene expression matrix X. Following [187], the dropout probability for each entry of X is calculated as 𝜋𝑖𝑗 = exp(−𝛼𝑋𝑖𝑗²), where 𝛼 is the exponential decay parameter that controls the dependence between the dropout probability and the gene expression.
5. A binary indicator is next sampled for each entry, 𝜉𝑖𝑗 ∼ Bernoulli(𝜋𝑖𝑗), with 𝜉𝑖𝑗 = 1 indicating that the corresponding entry 𝑋𝑖𝑗 is replaced by 0. The dropout rate of each gene is calculated as $\omega_i = \frac{1}{p}\sum_{j=1}^{p} \xi_{ij}$.
6. Using a modification of the NORTA (Normal to Anything) method [262], we generate samples from a multivariate zero-inflated negative binomial distribution based on the X generated in Step 3, using mean 𝜆, dispersion 𝜅 and zero-inflation parameters 𝜔𝑖.
7. To mirror real scRNA-seq gene expression behavior, the gene expression mean 𝜆 and dispersion 𝜅 are estimated from a real scRNA-seq dataset, Peripheral Blood Mononuclear Cells (PBMC), freely available from 10X Genomics.
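A minimal sketch of Steps 1-5 above (the precision-matrix sampling and the exponential-decay dropout masking). The diagonal loading that makes W a valid precision matrix and the value of the decay parameter are our additions; the NORTA/ZINB transformation of Steps 6-7 is omitted.

    import numpy as np
    import networkx as nx

    rng = np.random.default_rng(3)
    n, p, alpha = 50, 300, 0.1

    # Steps 1-2: draw a random graph and assign signed edge weights.
    G = nx.erdos_renyi_graph(n, 0.1, seed=3)
    W = np.zeros((n, n))
    for u, v in G.edges:
        w = rng.uniform(0.3, 0.7) * rng.choice([1.0, -1.0])
        W[u, v] = W[v, u] = w
    np.fill_diagonal(W, np.abs(W).sum(axis=1) + 0.1)  # loading to make W positive definite

    # Step 3: p samples from a Gaussian with precision matrix W (columns of X).
    X = rng.multivariate_normal(np.zeros(n), np.linalg.inv(W), size=p).T

    # Steps 4-5: exponential-decay dropout probabilities and Bernoulli masking.
    pi = np.exp(-alpha * X**2)
    X[rng.uniform(size=X.shape) < pi] = 0.0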
The ZINB simulator is then used to generate expression data from three different graph models: random networks, networks with a given community structure and networks with hubs. Random networks are generated using the Erdős–Rényi model with the desired edge density. Since the Erdős–Rényi model is not realistic due to its binomial degree distribution, we also consider networks with hubs. These networks are generated using a Barabási–Albert model, whose degree distribution follows a power law. Finally, networks with community structure, also known as modular networks, are generated using a disjoint union of random graphs. To investigate the robustness of scSGL, we simulated datasets from the aforementioned network topologies by varying the following parameters: (i) the number of genes (10, 50, 100 and 250), (ii) the number of cells (100, 300, 500 and 1000) and (iii) the dropout probabilities (0.26-0.36). To account for the inherent randomness of the simulations, 10 independent data replicates were generated for every parameter combination, and the mean AUPRC ratios obtained by averaging over the replicates are reported in Figure 4.3.

Recent investigations of scRNAseq datasets have revealed that dropout rates are primarily driven by a combination of technical and biological factors [75]. Consequently, while mean gene expression and the proportion of zeros are linked, this may vary based on cell type, sex and other biological and technical factors. While investigating the impact of dropout rates on network estimation accuracy, we found a steady decline in the AUPRC ratios of all methods with an increase in the number of zeroes. scSGL, irrespective of the kernel choice, maintained the highest AUPRC ratios across all network topologies. Gene expression in scRNAseq datasets can be interpreted as relative measures of abundance, owing to the datasets being a combination of gene expression derived from several cell types. This could be the reason why proportionality measures perform well [230]. The strong performance of 𝜏𝑧𝑖 can be explained on the basis that it explicitly accounts for excessive zeros, unlike methods such as PIDC that rely on discretizing the data for calculating mutual information. In general, PPCOR has the worst performance among all methods. It should also be noted that the performance of GRNBOOST2 was equivalent to that of scSGL for all network topologies when the sample size was 10 times the number of genes. These results indicate the importance of sample size in accurate network estimation for all of the methods and network topologies considered. Finally, the performance of each of the methods was evaluated by varying the number of genes. All methods had high AUPRC ratios across network topologies when the number of genes was small. While the AUPRC ratios of all the methods declined with an increase in the number of genes, scSGL performed significantly better than most of the benchmarking methods. This dip in performance could be attributed to the fact that all methods learn very dense networks. With an increase in the number of nodes, there is an increasing number of false edges detected by every algorithm.
The performance of scSGL could further be improved with a more biologically informed framework for hyperparameter selection.

Computational Complexity: The methods are compared in terms of their scalability to datasets with a large number of genes. For this purpose, the synthetic data generation process used in the parameter sensitivity analysis is employed to create three datasets with 500, 1000 and 2000 genes. Each dataset is generated from the Barabási–Albert model, includes 1000 cells, and has a dropout ratio of 0.26. Average run times and AUPRC ratios over 10 replicates are reported in Figure 4.4. We report results only for the correlation kernel, as the other kernels have similar performances and run times. It is observed that scSGL runs significantly faster than GENIE3, GRNBOOST2 and PIDC while having superior performance in terms of AUPRC ratio. Although PPCOR runs faster than scSGL, it shows poor performance.

4.3.3 Real Datasets

For real datasets, we consider scRNAseq expressions of human embryonic stem cells (hESC) and mouse embryonic stem cells (mESC), which include 758 and 451 cells, respectively. We inferred GRNs between 500 highly varying genes along with highly varying TFs [190]. The inferred GRNs are compared to three different databases of gene regulations: STRING [237], cell-type specific [50] and nonspecific [130, 83, 92]. AUPRC ratios are reported in Figure 4.5. All methods have performance values close to the random estimator. Except for PPCOR, which has random performance in both datasets and for all databases, the methods have comparable performances, with scSGL showing slightly better performance in hESC, while the benchmarking methods work slightly better in mESC.

To add biological meaning to the estimated networks, we compared them to the reference networks in the STRING database. Several of the identified interactions were also detected by scSGL-𝑟, but with edge confidence less than 0.5 (0.1-0.3) [30, 47]. scSGL-𝜌 and scSGL-𝑟 identified 20 common genes, including Sox2, Sox4, Gata6, Ctnnb1 and Bmp4. scSGL-𝜏𝑧𝑖 identified the fewest genes but successfully retrieved the lineage markers Nanog, Sox2, Sox4, Pou5f1, Ctnnb1, Gata2 and Gata3. All three kernel methods identified the genes Sox4, Ctnnb1, Bmp4 and Gata6. According to the STRING database, the 56 genes identified by scSGL-𝑟 are associated with 839 significantly enriched biological process gene ontology (GO) terms, which include cell differentiation, chromosome separation, specification of animal organ position, mitotic nuclear division and organ formation. Genes identified by scSGL-𝜌 and scSGL-𝜏𝑧𝑖 had similar functional enrichments for biological processes. To demonstrate some of the learned associations in hESC, we plotted the subnetwork of 24 lineage specific marker genes estimated using scSGL [47]. Figure 4.6a shows the presence of activating relationships between key definitive endoderm (DE) markers like Gata6, Gata4 and Eomes, and joint inhibition of the pluripotency markers Pou5f1, Nanog and Sox2. Gata4 and Gata6 have been reported as necessary for the development and function of a number of endoderm-derived tissues and cells [253, 250], and the onset of Gata4 and Gata6 expression has been reported to be coincident with the beginning of endoderm gene expression [77]. In addition, the inhibition of pluripotency markers by the key DE markers indicates progression of the cells towards a DE state.

In the mESC dataset, scSGL-𝜌, scSGL-𝑟 and scSGL-𝜏𝑧𝑖 identified 67, 103 and 55 high confidence STRING interactions, respectively, with an edge confidence greater than 0.5.
The three estimated networks hESC Lineage Marker Genes mESC Genes HAPLN1 LEFTY1 F1 ZFP42 SOX17 POU5 A4 LHX ESR SA GAT 1 1 RB LL4 R 16 ET SO CE IFI 3B V5 1 X2 DN MT MY CN RIF 2 EOME S PMAIP 1 CDC5L ZFP4 ERBB4 NANOG SOX2 GATA2 COL4A2 LECT1 SFPQ GS 2 CT1 C DAB GL MY A6 3 G UL T TA NO BP GA ND GA K10 GAT RY POU5F1 1 P NA HA GNG1 MA A4 KLF4 1 SOX17 PRDM14 A B Figure 4.6: The subnetworks of 24 lineage specific genes in hESC (A) and 19 well known marker genes in mESC (B). We report results of scSGL-𝑟 as it has the highest AUPRC ratio in Figure 4.5. For clarity, only those edges whose absolute edge weight fall into the top 1 percentile are shown. Node sizes are proportional to their degrees. 65 capture interactions regulated by known transcription factors Sox2, Nanog, Klf4, Myc and Sall4 [270]. scSGL-𝑟 identified known relationships between Sox2 and Nanog; Esrrb with Sox2 and Rybp among many others. scSGL-𝜌 identified known relationships between Esrrb and Etv5 and indirect interactions between Sall4 and Rybp regulated by TF Oct4. scSGL-𝜏𝑧𝑖 identified most of the important relationships identified by scSGL-𝑟 along with additional relationships between Sox2, Nanog, and Rif1. According to the STRING database, the 103 genes identified by scSGL-𝑟 are associated with 908 significantly enriched biological process GO terms that include cell fate determination, specification and commitment, mitotic DNA replication and regulation of nodal signalling pathway. Similar to hESC analysis, scSGL, irrespective of the chosen kernel, identified genes with similar functional enrichments for biological processes. To demonstrate some of the learned associations in mESC, we plotted the subnetwork of 19 well known marker genes+TF in mESC differentiation, estimated using scSGL. As can be seen in Figure 4.6b, Nanog, Gata4, Sox2, Sox17, Zfp42 and Lefty1 emerge as some of the hub nodes with high degrees of associations. The learned network also captures vital signed associations between Sox2, Nanog, Sox17, Zfp42 and Gata4. It is well known that Sox2 and Nanog form the core of a transcription factor network that promotes embryonic stem cell pluripotency and self- renewal. Zfp42 is also known to be a direct target of Nanog, which is augmented by Sox2 [226]. In addition, Sox17 together with Gata4 expression reinforce a transcriptional network that antagonizes Nanog expression to initiate differentiation [173]. Finally, to analyze the relation between edges identified by scSGL and benchmarking methods, the intersection between the top 1000 edges is reported as an UpSet plot [123] in Figure 4.7. In both datasets, PPCOR does not have any intersection with other methods probably because of its poor performance reported in Figure 4.5. The remaining 6 methods have an intersection set with cardinality around 40 edges. The same number of common edges is found in the intersection of PIDC, GENIE3, GRNBOOST2, scSGL-𝜏𝑧𝑖 , scSGL-𝑟 and in the intersection of PIDC, GENIE3, scSGL-𝜏𝑧𝑖 , scSGL-𝑟, scSGL-𝜌. These observations hold for both datasets, indicating the reproducibility of the proposed approach across different datasets. Edges identified by 𝜏𝑧𝑖 and 𝑟 have more intersecting edges with benchmarking methods and with each other than those identified by 𝜌, which indicates that the benchmarking methods have more common edges with correlation based association metrics than with proportionality measures. 
scSGL methods have more common edges with PIDC than with GENIE3 and GRNBOOST2, which may be due to the fact that PIDC learns a co-expression GRN similar to scSGL, while GENIE3 and GRNBOOST2 learn directed interactions between genes.

Figure 4.7: UpSet plot showing the intersections between the top 1000 edges identified by scSGL with the 3 kernels and by the benchmarking methods in the hESC and mESC datasets.

4.4 Conclusions

In this chapter, we have introduced a novel network inference algorithm based on GSP. Our proposed algorithm, scSGL, identifies functional relationships between genes by learning the signed adjacency matrix from gene expression data under the assumption that graph signals are similar over positive edges and dissimilar over negative edges. This technique also takes the nonlinearity of gene interactions into account by employing kernel mappings. We applied scSGL to four curated datasets derived from published Boolean models of GRNs and to two real experimental scRNAseq datasets collected during differentiation. To conduct an in-depth analysis of gene co-expression network reconstruction from scRNAseq datasets, we generated simulations from zero-inflated negative binomial distributions. These simulations, generated using different parameter combinations, were used to investigate the robustness of the proposed method to changing cell counts, gene numbers and dropout rates.

For the curated datasets, scSGL consistently obtained higher AUPRC ratios than the benchmarking methods, despite each dataset having a different number of stable cell states. The parameter sensitivity analysis reflected the superior performance of scSGL in estimating networks under varying network topologies. The performance remained consistent even when the gene numbers increased, the dropout rates were high and the sample sizes were low. This indicates the robustness of scSGL in modelling networks under the varying characteristics of scRNA-seq datasets.

The networks estimated from real data using scSGL identified important functional relationships between target genes and transcription factors and exhibited enrichment for appropriate functional processes. We also demonstrated that scSGL attained performance comparable to state-of-the-art methods in the real data experiments, with the performance of all GRN reconstruction methods being close to random. Accuracy evaluation of the predicted networks for the real datasets was done using the cell-type specific, nonspecific and functional networks described in [190]. However, most of the information in these ground truth datasets has been accumulated based on tissue level data, and hence it is not completely appropriate to calculate precision and recall rates from these databases. Although scRNAseq techniques provide significant advantages over bulk data, such as increased sample size with higher depth coverage and the presence of highly distinct cell clusters, they also come laced with multiple sources of technical and biological noise. Moreover, the inability to differentiate between technical and biological noise and the absence of adequate noise modelling techniques further exacerbate the problem [89, 233].
scSGL aims to capture node similarities and dissimilarities based on distances between graph signals. These graph signals exhibit smoothness, which implies that genes tend to be homogeneous within a given node cluster while varying across clusters. This leads to densely connected graphs where the heterogeneity induced by distinct cell sub-populations can be simultaneously curbed. Using single cell data with cell cluster labels, which are easily obtained from single cell clustering algorithms [186], in conjunction with scSGL can aid in identifying functional modules that are associated with a cell type [252]. Integrating pseudotemporal ordering with scSGL can further help in identifying the functional modules associated with differential pathways [261].

Despite the availability of a large number of computational methods, accurate GRN reconstruction still remains an open problem. Most reconstruction methods are based on the assumption that the presence of an edge implies a regulatory relationship. They also have a tendency to establish links between genes regulated by the same regulator. These issues can generate a lot of false positives, and therefore additional sources of data, such as ChIP-seq measurements that help in identifying direct interactions between TFs and target genes, can provide a way to filter out the spurious interactions [1]. Finally, gene regulation has multiple layers beyond direct TF-target interaction, but functional relationships can only be established if these relationships induce persistent changes in transcriptional state. As single cell data sources over multiple modalities continue to become available, it will be interesting to see how the integration of these data types aids GRN reconstruction using scSGL [235].

CHAPTER 5
LEARNING MULTIVIEW SIGNED GRAPHS

5.1 Introduction

As mentioned in the previous chapter, gene expression arises from a network of regulatory interactions between transcription factors, co-factors and signaling molecules [211, 265]. Elucidating the topology of this underlying regulatory network is essential for understanding the mechanisms that govern complex biological processes in human physiology and pathology. A major focus area in clinical research lies in studying the changes in gene coexpression networks across different tissues, cell types/states and conditions. For example, in the extensively studied breast cancer datasets from The Cancer Genome Atlas, there are four main subtypes of breast cancer [165]. The variation between these subtypes holds the key to inferring how genes transcriptionally regulate each other and how their expressions and interactions change across subgroups. In addition, one would expect the gene relationships corresponding to different subtypes to be similar to each other, since they originate in the same tissue, but also to possess crucial differences, since they are in different stages of disease progression [55, 122, 91]. Thus, instead of estimating a single network for all the subtypes, constructing class-specific graphical models for different conditions provides a more robust and deeper understanding of group-specific characteristics.

Recent advances in RNA sequencing have made it possible to profile the gene expression of individual cells. Dozens of algorithms have been proposed for the reconstruction of gene regulatory networks from scRNA-seq datasets [45, 190].
Most of these algorithms, however, estimate a single gene regulatory network, assuming the data samples to be identically and independently distributed, hence ignoring the presence of natural subgroups that may exist within the data. Given a grouped dataset, one could apply these algorithms to estimate a network for each subgroup separately; but this procedure of independent group-wise network estimation fails to model the shared structure between the subgroups, eventually leading to information loss. Therefore, there is a pressing need to develop joint graph estimation models that allow information borrowing across subgroups while retaining subgroup-specific heterogeneity.

Multiple algorithms have been proposed for the joint estimation of networks from high dimensional data. Most of these methods assume that the data has a Gaussian distribution. The seminal paper [90] paved the way for penalized estimation of multiple Gaussian graphical models and demonstrated the use of lasso based penalty functions for better estimation across multiple groups. Later, [55] proposed the fused graphical lasso and group graphical lasso penalties for better estimation. These methods, however, are not directly applicable to single cell datasets. Despite many advantages, scRNA-seq datasets are undermined by a series of technical limitations, such as dropouts and a high level of noise, which render void the assumption of Gaussianity [75, 44, 4].

A few methods have been proposed for the joint estimation of multiple networks from scRNA-seq datasets. [154] developed PIPER, a penalized local Poisson graphical model [7] for the joint estimation of multiple networks in scRNA-seq datasets. One of the main limitations of PIPER is that the Poisson distribution has a single parameter characterizing both the mean and the variance. Single cell datasets would be better characterized by a negative binomial distribution, which has a separate dispersion parameter, or a zero-inflated negative binomial distribution, which can account for the excessive zeroes. To account for the non-Gaussian nature of scRNA-seq datasets, [255] proposed a modification of the joint Gaussian copula graphical model based on the Gaussian copula transformation proposed in [128]. To facilitate the estimation of Kendall's 𝜏 correlation matrix in the presence of dropouts, they propose a modified Kendall's 𝜏 metric that only utilizes the completely observed values and excludes the missing values. [66] proposed a three step hybrid joint estimation strategy that relies on (a) integrated application of a Bayesian zero-inflated Poisson based model imputation strategy and the single cell imputation technique McImpute [105, 148], (b) data Gaussianization [127] and eventually (c) joint estimation of a Gaussian graphical model [55]. Contrary to [154], the last two approaches estimate graphical models for continuous data and rely on a data transformation step to make the data continuous.

In this chapter, we focus on GSP based GL for the joint inference of multiple GRNs, where gene expressions from cells are considered as graph signals on the unknown GRNs. Since GSP based GL methods employ an explicit representation of graph signals in the graph frequency domain, they have more flexibility in modeling signals compared to previous network inference methods, such as the statistical models reviewed above for GRN inference.
However, existing GSP based GL algorithms [69, 107, 219, 163] have two important shortcomings for multiple GRN learning. First, they cannot learn signed graphs, which are a more suitable model for GRNs as they include activating and inhibitory edges. Second, with the exception of [163], they can only learn a single graph. Thus, they are not applicable to the joint inference of multiple GRNs. This chapter presents a multiple signed graph learning algorithm (scMSGL) for joint inference of GRNs from multiple classes (conditions/disease states). Based on the method developed in Chapter 4, scMSGL learns multiple GRNs by deriving an optimization problem from three assumptions: (i) expressions of genes connected with activating edges are similar to each other, (ii) expressions of genes connected with inhibitory edges are dissimilar to each other, and (iii) GRNs corresponding to the different datasets are related to each other. Thus, scMSGL optimizes the total variation of graph signals to learn signed graphs while ensuring that the learned signed graphs are similar to each other through regularization with respect to a learned signed consensus graph. The proposed method has several advantages over existing approaches. First, it performs joint GRN inference taking advantage of the shared information across datasets while not making any specific parametric assumptions about the data. Second, during application to single cell data, scMSGL is kernelized as in Chapter 4 to take the structure of scRNA-seq data into account. For instance, it can employ proportionality measures to reflect relative rather than absolute abundance, or zero-inflated Kendall's tau to handle dropouts [230]. Finally, the proposed method learns an additional consensus graph, which captures the common structure across all graphs.

5.2 Methods

Let {X^i}_{i=1}^N be a given set consisting of N matrices, where X^i ∈ ℝ^{n×p_i} is a data matrix constructed from p_i graph signals defined on an unknown signed graph G^i = (V, E^i, W^i) with |V| = n. It is assumed that the E^i's and the associated edge weights are different but similar to each other. Based on this assumption, when learning the G^i's, one can obtain better performance by borrowing information across graphs. For example, when analyzing scRNA-seq expressions from different disease states/conditions, the datasets generated from the varying groups are generally assumed to share a common gene co-expression structure. Thus, jointly learning cell-type specific graphs can improve inference by allowing information sharing across cell types. To this end, we propose an optimization problem (scMSGL) that learns the G^i's simultaneously. In the proposed approach, the learned G^i's are regularized to be close to a consensus graph G, which is also learned by combining information from the G^i's. Thus, the proposed formulation ensures that information is shared across graphs when learning the G^i's. Furthermore, the structure of G reflects the common connections shared across the G^i's, whose inference may be beneficial if one is interested in learning the common gene co-expression structure over the different cell-type/disease-stage subgroups. Let L^{i,+} and L^{i,-} be the Laplacian matrices of the positive and negative parts of G^i, respectively. Similarly, define L^+ and L^- for the consensus graph G. Let 𝓛^+ = {L^{1,+}, ..., L^{N,+}, L^+} and 𝓛^- = {L^{1,-}, ..., L^{N,-}, L^-}.
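The positive and negative parts above are ordinary unsigned graphs obtained by separating the signs of the edge weights. As a minimal illustration (a sketch, not the reference scMSGL implementation), the following splits a signed adjacency matrix into its positive and negative parts and forms the corresponding Laplacians:

```python
import numpy as np

def signed_laplacians(W):
    """Split a signed adjacency matrix W (symmetric, zero diagonal) into the
    Laplacians of its positive and negative parts."""
    W_pos = np.maximum(W, 0.0)           # activating (positive) edges
    W_neg = np.maximum(-W, 0.0)          # magnitudes of inhibitory (negative) edges
    L_pos = np.diag(W_pos.sum(axis=1)) - W_pos
    L_neg = np.diag(W_neg.sum(axis=1)) - W_neg
    return L_pos, L_neg
```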
The optimization problem for jointly learning the G^i's and G is then:

$$
\begin{aligned}
\underset{\mathcal{L}^{+},\,\mathcal{L}^{-}}{\text{minimize}} \quad & \sum_{s \in \{+,-\}} \sum_{i=1}^{N} \Big[ \operatorname{tr}(\mathbf{K}^{i,s} \mathbf{L}^{i,s}) + \alpha_s \|\mathbf{L}^{i,s}\|_F^2 + \beta_s \|\mathbf{L}^{i,s} - \mathbf{L}^{s}\|_{F,\mathit{off}}^2 \Big] + \gamma_{+} \|\mathbf{L}^{+}\|_{1,\mathit{off}} + \gamma_{-} \|\mathbf{L}^{-}\|_{1,\mathit{off}} \\
\text{subject to} \quad & \mathbf{L}^{i,s} \in \mathcal{L},\ \operatorname{tr}(\mathbf{L}^{i,s}) = 2n,\ \forall i,\ \forall s \in \{+,-\}, \\
& (\mathbf{L}^{i,+}, \mathbf{L}^{i,-}) \in \mathcal{C}\ \forall i, \qquad \mathbf{L}^{+}, \mathbf{L}^{-} \in \mathcal{L}, \qquad (\mathbf{L}^{+}, \mathbf{L}^{-}) \in \mathcal{C},
\end{aligned}
\tag{5.1}
$$

where K^{i,+} = K^i, K^{i,-} = -K^i, and K^i is a kernel matrix constructed from X^i as described in Section 4.2. ‖·‖_{F,off} and ‖·‖_{1,off} are the Frobenius norm and the ℓ₁-norm of the off-diagonal entries, respectively. The first term in the summation measures the smoothness and non-smoothness of X^i over G^{i,+} and G^{i,-}, respectively. The second term controls the density of the learned G^{i,+} (G^{i,-}) such that for larger values of α₊ (α₋) we learn denser graphs. The third term ensures that G^{i,+} (G^{i,-}) is close to the positive (negative) part of the consensus graph G, with β₊ (β₋) controlling how close they should be. The last term is a regularizer that controls the sparsity of the positive and negative parts of G, with larger values of γ₊ and γ₋ resulting in a sparser consensus graph. Finally, the constraints are the same as in (4.1).

5.2.1 Optimization

The problem in (5.1) can be written in a vectorized form, where one learns the upper triangular parts of the Laplacian matrices. Let k^{i,s} = upper(K^{i,s}), d^{i,s} = diag(K^{i,s}), ℓ^{i,s} = upper(L^{i,s}) and ℓ^s = upper(L^s) for s ∈ {+, -}. Also, let 𝓛_v^+ = {ℓ^{1,+}, ..., ℓ^{N,+}, ℓ^+} and 𝓛_v^- = {ℓ^{1,-}, ..., ℓ^{N,-}, ℓ^-}. The vectorized form of (5.1) is:

$$
\begin{aligned}
\underset{\mathcal{L}_v^{+},\,\mathcal{L}_v^{-}}{\text{minimize}} \quad & \sum_{s \in \{+,-\}} \sum_{i=1}^{N} \Big[ \langle \mathbf{k}^{i,s} - \mathbf{S}^{\top}\mathbf{d}^{i,s}, \boldsymbol{\ell}^{i,s} \rangle + \alpha_s \|\mathbf{S}\boldsymbol{\ell}^{i,s}\|_2^2 + 2\alpha_s \|\boldsymbol{\ell}^{i,s}\|_2^2 + \beta_s \|\boldsymbol{\ell}^{i,s} - \boldsymbol{\ell}^{s}\|_2^2 \Big] + \gamma_{+} \|\boldsymbol{\ell}^{+}\|_1 + \gamma_{-} \|\boldsymbol{\ell}^{-}\|_1 \\
\text{subject to} \quad & \mathbf{1}^{\top}\boldsymbol{\ell}^{i,+} = -n,\ \mathbf{1}^{\top}\boldsymbol{\ell}^{i,-} = -n,\ \boldsymbol{\ell}^{i,+} \leq 0,\ \boldsymbol{\ell}^{i,-} \leq 0,\ \boldsymbol{\ell}^{i,+} \perp \boldsymbol{\ell}^{i,-}\ \forall i, \\
& \boldsymbol{\ell}^{+} \leq 0,\ \boldsymbol{\ell}^{-} \leq 0,\ \boldsymbol{\ell}^{+} \perp \boldsymbol{\ell}^{-},
\end{aligned}
\tag{5.2}
$$

where S is defined in Section 1.1. The first term in the summation corresponds to the first term in (5.1), and the correspondence between the remaining terms and those in (5.1) can be deduced from the hyperparameters. The first two constraints correspond to the trace constraints in (5.1). The constraint ℓ^{i,+} ⊥ ℓ^{i,-}, together with ℓ^{i,+} ≤ 0 and ℓ^{i,-} ≤ 0, is a complementarity constraint [217] and corresponds to (L^{i,+}, L^{i,-}) ∈ 𝒞 in (5.1). The problem in (5.2) is non-convex due to the complementarity constraints. However, ADMM has been shown to be convergent for problems with complementarity constraints under some assumptions [251]. To write the problem in standard ADMM form, introduce auxiliary variables v^i = ℓ^{i,+} and w^i = ℓ^{i,-} for all i. Similarly, introduce v = ℓ^+ and w = ℓ^-. Also, let 𝒱 = {v^1, ..., v^N, v} and 𝒲 = {w^1, ..., w^N, w}. Then, the problem in its standard ADMM form is:

$$
\begin{aligned}
\underset{\mathcal{L}_v^{+},\,\mathcal{L}_v^{-},\,\mathcal{V},\,\mathcal{W}}{\text{minimize}} \quad & \sum_{i=1}^{N} \imath_S(\mathbf{v}^i, \mathbf{w}^i) + \sum_{s \in \{+,-\}} \sum_{i=1}^{N} \Big[ f(\boldsymbol{\ell}^{i,s}, \boldsymbol{\ell}^{s}) + \imath_H(\boldsymbol{\ell}^{i,s}) \Big] + \imath_S(\mathbf{v}, \mathbf{w}) + \gamma_{+} \|\boldsymbol{\ell}^{+}\|_1 + \gamma_{-} \|\boldsymbol{\ell}^{-}\|_1 \\
\text{subject to} \quad & \mathbf{v}^i = \boldsymbol{\ell}^{i,+},\ \mathbf{w}^i = \boldsymbol{\ell}^{i,-}\ \forall i, \quad \mathbf{v} = \boldsymbol{\ell}^{+},\ \text{and}\ \mathbf{w} = \boldsymbol{\ell}^{-},
\end{aligned}
\tag{5.3}
$$

where f(ℓ^{i,s}, ℓ^s) = ⟨k^{i,s} - S^⊤d^{i,s}, ℓ^{i,s}⟩ + α_s‖Sℓ^{i,s}‖₂² + 2α_s‖ℓ^{i,s}‖₂² + β_s‖ℓ^{i,s} - ℓ^s‖₂², ι_S(·, ·) is the indicator function of the complementarity set S = {(v, w) : v ≤ 0, w ≤ 0, v ⊥ w}, and ι_H(·) is the indicator function of the hyperplane H = {ℓ : 1^⊤ℓ = -n}.
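In the ADMM steps below, the subproblems involving ι_S reduce to Euclidean projections onto the complementarity set S, which have a simple closed form: for each coordinate pair, keep whichever nonpositive part is larger in magnitude and zero out the other. A minimal sketch (illustrative only):

```python
import numpy as np

def project_complementarity(a, b):
    """Project each coordinate pair (a_j, b_j) onto
    S = {(v, w) : v <= 0, w <= 0, v * w = 0}.
    Keeping the coordinate whose negative part is larger in magnitude
    minimizes the Euclidean distance to S."""
    v = np.minimum(a, 0.0)
    w = np.minimum(b, 0.0)
    keep_v = v**2 >= w**2                 # ties broken in favor of v
    return np.where(keep_v, v, 0.0), np.where(keep_v, 0.0, w)
```

For example, in step (5.5) below, the update of (v^i, w^i) amounts to applying this projection to (ℓ^{i,+} - λ^{i,+}/ρ, ℓ^{i,-} - λ^{i,-}/ρ), with the values taken from the previous iteration.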
The augmented Lagrangian can then be written as:

$$
\begin{aligned}
L_{\rho}(\mathcal{L}_v^{+}, \mathcal{L}_v^{-}, \mathcal{V}, \mathcal{W}) = {} & \sum_{i=1}^{N} \imath_S(\mathbf{v}^i, \mathbf{w}^i) + \sum_{s \in \{+,-\}} \sum_{i=1}^{N} \Big[ f(\boldsymbol{\ell}^{i,s}, \boldsymbol{\ell}^{s}) + \imath_H(\boldsymbol{\ell}^{i,s}) \Big] \\
& + \sum_{i=1}^{N} \Big[ \boldsymbol{\lambda}^{i,+\top}(\mathbf{v}^i - \boldsymbol{\ell}^{i,+}) + \frac{\rho}{2}\|\mathbf{v}^i - \boldsymbol{\ell}^{i,+}\|_2^2 + \boldsymbol{\lambda}^{i,-\top}(\mathbf{w}^i - \boldsymbol{\ell}^{i,-}) + \frac{\rho}{2}\|\mathbf{w}^i - \boldsymbol{\ell}^{i,-}\|_2^2 \Big] \\
& + \imath_S(\mathbf{v}, \mathbf{w}) + \gamma_{+}\|\boldsymbol{\ell}^{+}\|_1 + \gamma_{-}\|\boldsymbol{\ell}^{-}\|_1 \\
& + \boldsymbol{\lambda}^{+\top}(\mathbf{v} - \boldsymbol{\ell}^{+}) + \frac{\rho}{2}\|\mathbf{v} - \boldsymbol{\ell}^{+}\|_2^2 + \boldsymbol{\lambda}^{-\top}(\mathbf{w} - \boldsymbol{\ell}^{-}) + \frac{\rho}{2}\|\mathbf{w} - \boldsymbol{\ell}^{-}\|_2^2,
\end{aligned}
\tag{5.4}
$$

where ρ is the parameter of the augmented Lagrangian and λ^{i,+}, λ^{i,-}, λ^+ and λ^- are the Lagrange multipliers. Using the augmented Lagrangian, the ADMM steps at the kth iteration are then found as follows:

$$
(\widehat{\mathcal{V}}, \widehat{\mathcal{W}}) = \underset{\mathcal{V},\, \mathcal{W}}{\operatorname{argmin}}\ L_{\rho}(\widehat{\widehat{\mathcal{L}}}{}_v^{+}, \widehat{\widehat{\mathcal{L}}}{}_v^{-}, \mathcal{V}, \mathcal{W}), \tag{5.5}
$$

$$
(\widehat{\mathcal{L}}_v^{+}, \widehat{\mathcal{L}}_v^{-}) = \underset{\mathcal{L}_v^{+},\, \mathcal{L}_v^{-}}{\operatorname{argmin}}\ L_{\rho}(\mathcal{L}_v^{+}, \mathcal{L}_v^{-}, \widehat{\mathcal{V}}, \widehat{\mathcal{W}}), \tag{5.6}
$$

$$
\widehat{\boldsymbol{\lambda}}{}^{i,+} = \widehat{\widehat{\boldsymbol{\lambda}}}{}^{i,+} + \rho(\widehat{\mathbf{v}}^i - \widehat{\boldsymbol{\ell}}{}^{i,+}),\ \forall i, \tag{5.7}
$$

$$
\widehat{\boldsymbol{\lambda}}{}^{i,-} = \widehat{\widehat{\boldsymbol{\lambda}}}{}^{i,-} + \rho(\widehat{\mathbf{w}}^i - \widehat{\boldsymbol{\ell}}{}^{i,-}),\ \forall i, \tag{5.8}
$$

$$
\widehat{\boldsymbol{\lambda}}{}^{+} = \widehat{\widehat{\boldsymbol{\lambda}}}{}^{+} + \rho(\widehat{\mathbf{v}} - \widehat{\boldsymbol{\ell}}{}^{+}), \tag{5.9}
$$

$$
\widehat{\boldsymbol{\lambda}}{}^{-} = \widehat{\widehat{\boldsymbol{\lambda}}}{}^{-} + \rho(\widehat{\mathbf{w}} - \widehat{\boldsymbol{\ell}}{}^{-}), \tag{5.10}
$$

where a single hat and a double hat represent the values of the variables at the kth and (k-1)th iterations, respectively. To solve (5.5), we use the fact that it can be solved for each (v^i, w^i) pair (and (v, w)) separately. This separation leads to a set of optimization problems, all of which can be solved by projection onto the complementarity set S. The problem in (5.6) is separable across 𝓛_v^+ and 𝓛_v^-, leading to two optimization problems, both of which can be solved with Block Coordinate Descent (BCD) [224].

5.2.2 Hyperparameter Selection Procedure

scMSGL requires the selection of six hyperparameters, three of which control the properties of the positive parts of the learned graphs while the remaining three control the negative parts. As mentioned above, α₊ (α₋) and γ₊ (γ₋) control the edge density of the positive (negative) parts of the learned G^i's and G, respectively. β₊ (β₋) controls how similar the learned G^{i,+}'s (G^{i,-}'s) are to the consensus graph. We select these hyperparameters similarly to the procedure suggested in [55], where hyperparameter selection is guided to learn graphs with desired properties. As an alternative to other model selection approaches, such as cross-validation or the Bayesian information criterion, this approach can achieve a model that is interpretable and plausible in practice. Thus, we tune the hyperparameters such that the obtained graphs have a desired edge density and view similarity. In particular, assume that one wants the densities of positive and negative edges in the learned G^i's and G to be d₊ and d₋, respectively. Furthermore, assume that the pairwise similarity between G^{i,+} and G^{j,+}, ∀i ≠ j, is desired to be c₊, where the similarity is quantified by the correlation coefficient. Similarly, let c₋ be the desired similarity for the negative edges of the graphs. Once d₊, d₋, c₊, c₋ are fixed, we select the six hyperparameters accordingly. The values of d₊, d₋, c₊, and c₋ are selected based on prior knowledge about the datasets under study, as detailed in the Results section.
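Since increasing α_s monotonically increases the density of the learned graphs, density targets such as d₊ and d₋ can be met with a simple one-dimensional search. A hedged sketch of such a search, assuming a black-box routine `solve(alpha)` that runs the corresponding learning subproblem and returns a learned adjacency matrix (the routine name and search bounds are illustrative, not part of scMSGL):

```python
import numpy as np

def tune_alpha(solve, d_target, lo=1e-3, hi=1e3, n_iter=30, tol=1e-3):
    """Bisection-style search for the alpha giving a desired edge density.
    solve(alpha) -> learned adjacency matrix; density is assumed to be
    monotonically increasing in alpha."""
    n = solve(lo).shape[0]
    n_pairs = n * (n - 1) / 2
    mid = np.sqrt(lo * hi)
    for _ in range(n_iter):
        mid = np.sqrt(lo * hi)                # geometric midpoint
        W = solve(mid)
        density = np.count_nonzero(np.triu(W, 1)) / n_pairs
        if abs(density - d_target) < tol:
            break
        lo, hi = (mid, hi) if density < d_target else (lo, mid)
    return mid
```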
5.3 Results

The performance of scMSGL is evaluated on both simulated data and two real scRNA-seq datasets. For simulated data, learned graphs are compared to ground truth networks to quantify the performance of scMSGL. Simulated data are used to benchmark scMSGL against scSGL and three GRN inference algorithms, GENIE3, GRNBOOST2 and PIDC, whose details are given in Chapter 4. These methods and scSGL can only learn a single graph from each dataset at a time. Therefore, they are applied to each X^i separately and the learned graphs are compared to the ground truth G^i's. In addition, we benchmark against the Joint Graphical Lasso with fused lasso penalty (JGL-Fused) [55], which learns multiple related Gaussian graphical models, and the Joint Gene Networks with scRNA-seq data (JGNsc) algorithm [66], which jointly learns the graphs for multiple classes of single cell data. As a performance metric, we employ a signed version of the area under the precision-recall curve (AUPRC) ratio, which measures how well a method infers activating, inhibitory and non-existing edges. Given the ground truth GRN G and the output of a GRN inference algorithm Ĝ, let G⁺ and G⁻ be the activating and inhibitory edges in the ground truth GRN, and Ĝ⁺ and Ĝ⁻ be the activating and inhibitory edges in the inferred network. We compare Ĝ⁺ to G⁺ with AUPRC to measure how well the algorithm finds the activating edges. Similarly, we compare Ĝ⁻ to G⁻ to measure the performance on inhibitory edges. Let AUPRC⁺ and AUPRC⁻ represent these values. We calculate the signed AUPRC ratio as follows:

$$
\mathrm{Signed\ AUPRC\ Ratio} = \frac{1}{2} \left( \frac{\mathrm{AUPRC}^{+}}{\mathrm{AUPRC}^{+}_{\mathrm{random}}} + \frac{\mathrm{AUPRC}^{-}}{\mathrm{AUPRC}^{-}_{\mathrm{random}}} \right),
\tag{5.11}
$$

where AUPRC⁺_random and AUPRC⁻_random are the performance measures of a random estimator. Finally, note that if an algorithm infers an unsigned GRN, we use the inferred GRN for both Ĝ⁺ and Ĝ⁻.
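A minimal sketch of (5.11), assuming the random-estimator baseline is taken as the prevalence of the corresponding edge class (a common convention; the reference code may compute it differently) and that edge scores are read off the upper triangle of the signed weight matrices:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def signed_auprc_ratio(W_true, W_est):
    """Signed AUPRC ratio of (5.11). W_true: ground truth signed adjacency;
    W_est: inferred signed adjacency (for unsigned methods, use the same
    scores for both signs)."""
    iu = np.triu_indices_from(W_true, k=1)
    w_true, w_est = W_true[iu], W_est[iu]
    ratio = 0.0
    for sign in (+1.0, -1.0):
        y = (sign * w_true > 0).astype(int)        # activating or inhibitory edges
        scores = np.maximum(sign * w_est, 0.0)     # confidence for this edge class
        auprc = average_precision_score(y, scores)
        auprc_random = y.mean()                    # prevalence = AUPRC of a random estimator
        ratio += 0.5 * auprc / auprc_random
    return ratio
```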
5.3.1 Selected Hyperparameter Values

Hyperparameters of scMSGL are set as described in Section 5.2.2 with d₊ = d₋ = d and c₊ = c₋ = c. We used the BEELINE [190] pipeline to run GENIE3, GRNBOOST2 and PIDC. GENIE3 and GRNBOOST2 employ random forest and gradient boosting regressors, respectively, and the hyperparameters of these regressors are set to the default values used in the GENIE3 and GRNBOOST2 toolboxes. PIDC uses mutual information to learn gene regulations and requires a discretizer and an estimator for probability distribution estimation; we used the discretizer and estimator recommended by the PIDC toolbox. scSGL requires α₊ and α₋, which are determined the same way as the α_s's of scMSGL, i.e., they are set to values such that the learned graphs have the desired edge densities d₊ = d₋ = d. JGL-Fused requires two parameters, λ₁ and λ₂, which are analogous to the scMSGL parameters α_s and β_s, respectively. Therefore, they are set the same way, i.e., we choose λ₁ and λ₂ such that the learned graphs have a desired edge density of 2d¹ and a view similarity of c₊ = c₋ = c. Finally, JGNsc consists of three steps: imputation, Gaussian transformation and GRN inference with the JGL-Fused method. The hyperparameters of the first two steps are set to the default values provided in the JGNsc toolbox, and λ₁ and λ₂ of the JGL-Fused step are set as described above². For all datasets, we use c = 0.5. For simulated data, since the benchmarking GRN inference methods (GENIE3, GRNBOOST2 and PIDC) learn fully connected graphs, we set d = 0.4 for a fair comparison. For real data, we set d = 0.1 for ease of analysis.

¹ JGL-Fused does not allow the edge densities of the negative and positive parts of the learned graph to be controlled separately; therefore, we learned a graph with edge density equal to 2d, which is the edge density of scMSGL if the edge signs are not considered.
² JGNsc [66] recommends using the Akaike information criterion for the selection of λ₁ and λ₂. In our analysis, we found that this selection technique does not perform well and its time complexity was high.

5.3.2 Simulated Data

Data Generation: To validate the performance of scMSGL, we simulate gene expression data from a multivariate zero-inflated negative binomial (ZINB) distribution. Namely, given a known graph structure, we generate synthetic datasets using the algorithm described in Section 4.3.2. Two graph structures are considered for creating the baseline graph G with n = 100 genes: random graphs following the Erdős–Rényi (ER) model with an edge density of 0.1 and hub graphs following the Barabási–Albert (BA) model, whose degree distribution follows a power law. We then convert G to a signed graph by randomly selecting half of the edges and assigning a negative sign to them while assigning a positive sign to the other half. Next, we generate N = 5 individual networks {G^i}_{i=1}^N by adding $0.9 \times \binom{n}{2} \times \eta$ new edges to the baseline graph G. Half of the added edges are set as negative edges, while the other half are set as positive. The ZINB simulator is then used to generate datasets {X^i}_{i=1}^N from the underlying graphs {G^i}_{i=1}^N. The three parameters of the ZINB distribution, λ, k and ω, which control its mean, dispersion and degree of zero-inflation, respectively, were determined using a real scRNA-seq dataset [101]. Each simulation is repeated 10 times and the average performance over the 10 realizations is reported.
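For intuition, the sketch below shows only the marginal ZINB sampling step; the full simulator of Section 4.3.2 additionally couples the genes according to the underlying graph, which is not reproduced here. Parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def zinb_sample(lam, k, omega, size):
    """Zero-inflated negative binomial draws with mean lam, dispersion k and
    zero-inflation probability omega. NB(k, p) with p = k / (k + lam) has mean lam."""
    counts = rng.negative_binomial(k, k / (k + lam), size=size)
    counts[rng.random(size) < omega] = 0     # excess zeroes (dropouts)
    return counts

# e.g., a 100-gene x 400-cell expression matrix
X = zinb_sample(lam=2.0, k=0.5, omega=0.26, size=(100, 400))
```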
Sensitivity to the Number of Cells: We first study the performance of the methods with varying numbers of cells, when the dropout ratio is set to 0.26 and η = 0.1, i.e., 90% of the edges are common across views; the correlation kernel is used for both scSGL and scMSGL. From the left panel of Figure 5.1, it can be seen that, for the different cell numbers, scMSGL has higher AUPRC ratios than the methods that learn from a single dataset. This indicates that the proposed method incorporates valuable information across views, which improves the performance. As expected, the performance of all methods improves with an increasing number of cells. These observations hold for both random graph models.

Sensitivity to Dropout Ratio: In the second analysis, we evaluate the performance of the different methods with increasing dropout ratio while fixing the number of cells to 400 and η = 0.1. Results are shown in the middle panel of Figure 5.1 for both random graph models, with the correlation kernel used for scSGL and scMSGL. Similar to the cell sensitivity analysis, scMSGL performs better than all other methods irrespective of which graph model is used to generate the datasets. Except for PIDC, the AUPRC ratios of all methods drop with increasing dropout ratio, as expected. The performance of PIDC mostly remains the same; since PIDC performs poorly at all dropout levels, this result does not imply robustness against dropouts.

Figure 5.1: Performance of different methods (GENIE3, GRNBOOST2, PIDC, scSGL-r, JGL-Fused, JGNsc, scMSGL-r) on various datasets quantified by AUPRC ratio. All datasets have 100 genes. The left panel reports results for varying numbers of cells (50, 100, 300, 500), the middle panel for varying dropout ratios (0.10, 0.16, 0.26, 0.34), and the right panel for varying degrees of view similarity (90%, 83%, 76%, 66%), measured by the percentage of common edges across views in the ground truth graphs. The top plot shows the results for the Erdős–Rényi model and the bottom plot the results for the Barabási–Albert model.

Sensitivity to View Similarity: Next, we study the effect of view similarity on the performance of the algorithms. Datasets are generated with varying η values while fixing the number of cells to 400 and the dropout ratio to 0.26. Results are reported in the right panel of Figure 5.1, where the correlation kernel is employed for scSGL and scMSGL. When the view similarity is 90%, the best performing algorithm is scMSGL, while for lower view similarity values JGL-Fused performs slightly better than scMSGL. The reason JGL-Fused performs better than scMSGL for smaller view similarity values could be the difference in the regularization terms used to impose similarity across views: JGL-Fused uses an ℓ₁-norm penalty, while we employ a squared Frobenius norm, which is more susceptible to outliers than the fused lasso and can degrade the performance. The performance of the single-view algorithms is not affected by changes in view similarity, as they learn each view independently. On the other hand, there is a drop in the performance of all joint graph learning methods with decreasing view similarity. This is expected behaviour, since these methods assume that the views are dependent.

Kernel Comparison: The formulation of scMSGL allows us to use various kernels. Therefore, we study how the performance changes with respect to the kernel type. Datasets are created using the BA model and results are shown in Figure 5.2 for varying numbers of cells, dropout ratios and view similarities. The best performing kernel is τ_zi, followed by the correlation kernel.

Figure 5.2: Performance of scMSGL without any kernel (first row) and with different kernels (scMSGL-r, scMSGL-ρ, scMSGL-τ_zi) on datasets generated from the BA model and studied in Figure 5.1.
When Figures 5.1 and 5.2 are compared, scMSGL has higher AUPRC ratios than the single-view approaches and JGNsc irrespective of the kernel choice. The changes in the performance of τ_zi and ρ with varying cell numbers, dropout ratios and view similarities are very similar to those of the correlation kernel. Finally, to better understand the effect of kernels, the performance of scMSGL without any kernel, i.e., K^i = X^i (X^i)^⊤, is also reported. Figure 5.2 shows that all kernels have significantly higher performance compared to when no kernel is used, which indicates the importance of kernel usage in GRN inference.
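As an illustration of two of the kernels being compared, the sketch below builds the no-kernel Gram matrix and the correlation kernel from a gene-by-cell expression matrix; the zero-inflated Kendall's τ kernel of Chapter 4 is omitted here for brevity:

```python
import numpy as np

def gram_kernel(X):
    """No-kernel baseline: K = X X^T for a genes-by-cells matrix X."""
    return X @ X.T

def correlation_kernel(X):
    """Correlation kernel: K_ij is the Pearson correlation between genes i and j."""
    return np.corrcoef(X)
```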
Time Complexity Comparison: We compare the different methods based on their run times. We generated datasets using the BA model with varying numbers of cells and genes. Table 5.1 reports the run times of scSGL, scMSGL, JGL-Fused and JGNsc in seconds. Run times of GENIE3, GRNBOOST2 and PIDC are not reported, as they were shown to have higher time complexity than scSGL in [112]. Reported run times correspond to one run without hyperparameter search; the run time of scSGL is the total run time to infer all views. In the first dataset, the number of genes, dropout ratio and η are fixed to 100, 0.26, and 0.1, respectively, and the number of cells varies. Results for this dataset indicate that scMSGL is faster than the joint graph learning methods JGL-Fused and JGNsc. JGL-Fused also uses an ADMM based optimization; however, it needs a singular value decomposition at each ADMM iteration. scMSGL does not need this expensive operation; thus, it runs much faster than JGL-Fused. scSGL is faster than scMSGL, which is expected, as the scMSGL optimization takes longer to converge due to the added regularization terms and the consensus graph learning. Finally, all methods except JGNsc are observed to run faster with an increasing number of cells, since the inference problem becomes easier with a higher number of cells, which makes the iterative optimization procedures used by all methods converge faster. JGNsc runs slower with an increasing number of cells, as its imputation step needs to handle a larger data matrix. In the second dataset, with an increasing number of genes, the number of cells, dropout ratio and η are fixed to 500, 0.26, and 0.1, respectively. As before, scMSGL is faster than the joint graph learning methods and slower than scSGL. Increasing the number of genes is observed to increase the run time of all methods, as it makes the problem harder.

Table 5.1: Run time of scMSGL and benchmarking methods in seconds with respect to the number of cells and genes. All methods were run on the same computing cluster with compute nodes of similar compute power. Run times of JGL-Fused and JGNsc for 500 genes are not reported, as we were not able to run them within a reasonable time limit (4 hours).

                 Number of Cells                     Number of Genes
Method       50       100      300      500      50       100      300       500
scSGL-r      1.10     0.54     0.35     0.38     0.15     0.37     5.88      26.68
JGL-Fused    175.64   117.65   95.95    98.03    10.02    95.98    1703.66   -
JGNsc        196.76   160.37   233.06   373.15   130.13   373.06   2541.45   -
scMSGL-r     14.00    12.39    10.00    8.51     0.35     3.89     110.49    304.71

5.3.3 Analysis of scRNA-seq data from mouse embryonic stem cell differentiation

Central to the differentiation process and many other cellular processes is the expression of the right combination of genes or modules of genes. Accurate characterization of the co-expression networks for progenitor and multiple cell types can help in understanding the cascade of cellular state transitions [139]. In this section, we study the differentiation process of mouse embryonic stem cells (mESC) using single cell RNA sequencing datasets [118]. This data was generated using a high-throughput droplet-microfluidic approach and was primarily used to study differentiation in mESC before and after leukemia inhibitory factor (LIF) withdrawal. Since LIF maintains the pluripotency of mESC, LIF withdrawal is considered to initiate the differentiation process. The dataset contains cells sampled from 4 states (or natural subgroups): before LIF withdrawal (day 0) and after the withdrawal (days 2, 4 and 7). The subgroups contain 933, 303, 683 and 798 cells, respectively. This dataset has been previously analyzed using joint graphical estimation in [154, 255], and similar to these works we only consider the 72 stem cell markers in our application [193]³. We first estimated the subgroup specific and the consensus graphs. Based on the results on simulated data, we employ the zero-inflated Kendall's tau kernel. Next, we calculate the signed node degrees of each gene, i.e., $D^{+}_{ii} = \sum_{j=1}^{n} W^{+}_{ij}$ and $D^{-}_{ii} = \sum_{j=1}^{n} W^{-}_{ij}$, from the learned graphs G⁺ and G⁻. We then consider the genes with the top signed degrees as hub genes, whose signed degrees are reported in Figure 5.3.

³ The dataset was downloaded from the GEO database [73] (with ID GSE65525). For the preprocessing steps, please refer to the "Data Analysis" subsection of the "Experimental Procedures" section in [118]. The only preprocessing we performed was log-transformation to make the count data continuous.

Figure 5.3: Genes with the highest node degrees across days (panels: Day 0, Day 2, Day 4, Day 7). Orange and blue bars indicate that the degree is calculated using activating and inhibitory edges, respectively. Only genes whose activating or inhibitory degree is among the top 15 in any view are shown.
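The signed node degrees underlying Figure 5.3 are simple row sums of the positive and negative parts of a learned signed adjacency matrix; a minimal sketch of this computation and of the hub-gene selection:

```python
import numpy as np

def signed_degrees(W):
    """Activating and inhibitory degrees from a signed adjacency matrix W."""
    d_pos = np.maximum(W, 0.0).sum(axis=1)     # activating degree
    d_neg = np.maximum(-W, 0.0).sum(axis=1)    # inhibitory degree (magnitude)
    return d_pos, d_neg

def hub_genes(W, genes, top=15):
    """Genes whose activating or inhibitory degree is among the top `top`."""
    d_pos, d_neg = signed_degrees(W)
    hubs = set(np.argsort(d_pos)[-top:]) | set(np.argsort(d_neg)[-top:])
    return [genes[i] for i in sorted(hubs)]
```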
The result confirms the importance of the regulator genes NANOG, SOX2, POU5F1, ZFP42 and UTF1 in the early stages of differentiation. NANOG has been reported to maintain pluripotency by inhibiting genes that activate differentiation to lineages associated with extraembryonic endoderm [40, 145]. Figure 5.3 clearly shows that the number of inhibitory relationships associated with NANOG decreases as the ES cells proceed to a matured state. POU5F1 and SOX2 also exhibit a higher number of inhibitory relationships in the first few days. SOX2, NANOG and POU5F1 are known to play a fundamental role in the self-renewal and pluripotency of mouse embryonic stem cells [270]. Reduction in the expression of NANOG has been shown to be correlated with the induction of the gene GATA4, which initiates differentiation of pluripotent cells [100]; accordingly, GATA4 has been correctly identified as a hub gene in Days 2 and 4. Collectively, these results confirm the fundamental roles of SOX2, NANOG and POU5F1 in the pluripotency stage and show how an eventual reduction in their expression initiates differentiation.

5.3.4 Analysis of scRNA-seq data from medulloblastoma

Medulloblastoma (MB) is a highly malignant cerebellar tumor mostly affecting young children [176]. Several studies have been done to pinpoint the genetic drivers in each of the four distinct tumor subgroups: WNT-pathway-activated, SHH-pathway-activated, and the less-well-characterized Group 3 and Group 4 [176]. Among these subgroups, Group 3 and Group 4 tumors account for the majority of MB diagnoses, with Group 3 MB having a metastatic diagnosis rate of approximately 50%. The transcription factors (TFs) MYC and OTX2 have commonly been identified as key oncogenic TFs in Group 3 and Group 4 tumorigenesis. [66] used the joint single cell network algorithm to study the roles of MYC and OTX2 utilizing the MB scRNA-seq dataset (GSE119926) of [101]⁴. Using the same selected samples from a subset of 17 individuals, grouped into three subsets (Group 3, Group 4 and an intermediate cell type), we estimate the joint gene regulatory network for the three groups over ∼750 genes, most of which are enzyme-related genes from the mammalian metabolic enzyme database [51]. Bulk profiling studies of MB cells have consistently observed overlapping transcriptional and epigenetic signatures in Group 3 and Group 4 tumors, suggesting shared developmental origins [177, 101]. Based on this, we hypothesize that a joint analysis of the different MB cell types would better capture the local functional interactions of MYC and OTX2 across different tumor subtypes and would eventually help in delineating their global role in regulating metabolic processes in MB cells. Subgroup specific networks, along with the consensus graph, were estimated with the zero-inflated Kendall's tau kernel.

Table 5.2: Node degree of MYC in the learned graphs.

                    Group 3   Intermediate   Group 4
Total Degree        5.436     3.334          4.180
Avg. Edge Weight    0.077     0.037          0.039

Table 5.2 shows that the average edge weight for the MYC network is considerably higher for Group 3 compared to Group 4 and the intermediate subgroup. Figure 5.4 further shows that the Group 3 MYC network has stronger edge connections and higher density compared to the intermediate group. In Group 4, almost all the connections become activating except for Aldh3a2 and Eno2, which were found to be strongly downregulated in all the tumor subgroups, confirming their role in cancer resistance [151, 42]. This varying network structure over the subgroups confirms the major role MYC plays in the initiation, maintenance, and progression of Group 3 tumors [205]. Figure 5.4 also shows that OTX2 has a denser network for Group 4 MB cells in comparison to the other groups. In Group 3 MB cells, OTX2's connections to the metabolic genes are very distinct from MYC's. In addition, scMSGL detected relationships between OTX2 and the metabolic genes PAICS and PPAT in Group 3 tumors. These genes, related to the human purine biosynthesis pathways, have been previously reported to be induced by MYC [129]. This confirms that OTX2 functionally cooperates with MYC to regulate gene expression in medulloblastoma [205, 131].
Broadly, these results suggest that MYC and OTX2 play significant roles in the transcriptional regulation of the metabolic genes, and that the mechanisms underlying MYC- and OTX2-mediated MB maintenance and progression likely vary across the different subgroups of MB cells.

⁴ A detailed overview of the MB scRNA-seq data generation and processing can be found in the Methods section of [101] under the subsection "Human scRNA-seq data generation and processing". We log-transformed the obtained datasets to make the count data continuous.

Figure 5.4: Connections of the MYC (top) and OTX2 (bottom) genes in each subgroup (panels: Group 3, Intermediate, Group 4). Edge widths are proportional to connection weights. Orange and blue edge colors indicate that the connection is activating and inhibitory, respectively. Only the top one third of the connections in all views of the multiview graph are shown.

5.4 Conclusion

In this chapter, we presented scMSGL for the joint inference of multiple GRNs from scRNA-seq datasets having multiple classes. scMSGL learns functional relationships between genes across multiple related classes of single cell gene expression datasets under the assumption that there exists a shared structure across classes. The main novelty of our work lies in the formulation of a highly efficient optimization framework that extends the signed graph learning approach [112] to high dimensional datasets with multiple classes. The kernelization trick embedded within the algorithm renders it capable of handling sparse and noisy features that are expected to exhibit highly non-linear relationships. Furthermore, the estimation of the consensus graph may help in understanding the joint structure existing within the multiple classes. Using simulation studies, we demonstrated the superior performance of scMSGL over single view learning and existing joint learning methods for the ER and BA graph models. In addition, performance was assessed by varying a number of simulation parameters, such as dropout levels, cell numbers and view similarity, and scMSGL demonstrated superior performance in all scenarios.
Applying scMSGL to the mESC dataset, we robustly identified previously reported regulatory markers as the hub genes for the different days and captured the progression of the differentiation process by analyzing the changes in hubs over the days. For the medulloblastoma data, scMSGL efficiently captured the significant roles that the key oncology markers MYC and OTX2 play in the transcriptional regulation of metabolic genes. There are various aspects of the proposed method that can be considered for improvement as future work. One challenge in implementing scMSGL is how to select the kernel function. This challenge can be addressed by combining information from multiple kernels during learning. An open problem in the graph learning literature is hyperparameter selection, which is also a limitation of the proposed method. The current work selects the hyperparameters by searching for values that result in graphs with desired properties. Future work can improve the accuracy of the learned graphs through better hyperparameter selection and multi-kernel strategies. The computational complexity of scMSGL is quadratic with respect to the number of genes (similar to scSGL) and linear in the number of views. Therefore, its application to datasets with a very large number of genes is not feasible. However, recent developments in GSP that scale GL to large-scale problems [108] can be exploited to scale scMSGL. Finally, additional sources of data that help in identifying direct interactions between TFs and target genes can provide a way to filter out false positives. The current availability of single-cell epigenomic datasets has made it easier to further explore the regulatory relationships between TFs and genes. Single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq), for example, allows the identification of DNA regulatory elements within accessible genomic DNA regions in single cells, hence enabling the identification of direct regulations in GRNs. Integration of multiomics profiles within the framework of scMSGL could be an interesting avenue for future research.

CHAPTER 6
SIMULTANEOUS GRAPH SIGNAL CLUSTERING AND GRAPH LEARNING

6.1 Introduction

In many modern data science applications, relationships between entities, such as features or data samples, are well described with a graph structure. While many real-world data are intrinsically graph-structured, e.g., social and traffic networks, there is still a large number of applications where the graph topology is not readily available. For instance, gene regulations in biological applications or neuronal connections in the brain are not known. In these applications, the graphs need to be learned, since they reveal the relational structure and may assist in a variety of learning tasks. Graph learning (GL) deals with the inference of a topological structure among entities from a set of observations on these entities, i.e., graph signals. Methodologies to learn a graph from data include naive methods such as k-nearest neighbor (k-NN) graphs, probabilistic graphical models [14, 54, 142, 102] and, more recently, GSP [137, 68] and graph neural networks (GNNs) [269, 256, 34]. While probabilistic graphical models assume normality of the data, which does not hold for most real-world data, GSP based GL methods define observations on a collection of nodes as graph signals and fall into two categories.
The first category assumes graph signals are outcomes of diffusion processes on graphs and reconstructs a graph from signals according to the diffusion model [241, 182, 220, 219]. The second category of methods promotes the smoothness of graph signals, quantified by the Laplacian quadratic form or, more generally, via total variation [69, 107, 23]. GNN-based methods, on the other hand, typically require a large volume of training data, and the learned connectivity is often less explainable compared to probabilistic graphical models and GSP methods. Most of the work on GSP based GL has focused on the case where all data points follow the same relational model described by a single graph. However, in practice, the data may come from multiple graphs, i.e., multiview graphs. Examples of this setup include gene regulatory networks, where regulations vary across different cell types, and social networks, where a set of users has varying interactions across different social media platforms. In this chapter, we address the problem of multiple graph learning from a heterogeneous set of graph signals, where each cluster is associated with a different graph structure. To this end, we propose the GRASCale algorithm for simultaneous GRAph Signal Clustering And graph LEarning. Previous works that perform the same task employ only the relations of the graph signals to the graphs associated with the clusters for cluster assignment. However, a clustering algorithm can also benefit from side information in the form of pairwise relations between the graph signals. For instance, in a recommendation system, when clustering graph signals, e.g., ratings for items generated by a set of users, connections among the users can be used to inform the clustering algorithm. Therefore, we formulate GRASCale¹ with the following key contributions:

• We propose a new framework which extends conventional spectral clustering such that both the signals' pairwise similarity and their smoothness with respect to the underlying graph structure are taken into account.

• The proposed methodology can learn the graph structures for mixed (heterogeneous) graph data.

• An efficient prox-linear block coordinate descent (BCD) with an improved consensus clustering based initialization is introduced for optimization.

The overall framework is depicted in Figure 6.1.

Figure 6.1: Overview of the proposed approach: pairwise similarity between graph signals (G^c) and smoothness of the graph signals with respect to the graphs associated with each cluster (G^s's) are used jointly in spectral clustering while simultaneously learning the G^s's.

¹ Codes are available at the following GitHub repository: https://github.com/SPLab-aviyente/GRASCale

6.1.1 Related Work

Most of the existing work on GL considers simple data, where all data points follow the same model defined by only one graph. In recent work, the GSP community has addressed the problem of learning multiple graphs from heterogeneous data in two different settings: (i) multiple views of the same data and (ii) heterogeneous data with possibly unknown cluster information. The first class of methods, also known as joint inference of multiple graphs [164], considers the setting where multiple related networks, each with a subset of observations, are available. In this setting, the membership of the signals to the graphs is known and the graphs are closely related to each other.
This problem setting has been most widely studied for inferring the topology of dynamic networks [110, 263, 13, 212]. Assuming that the variation is smooth across time, the problem is reduced to learning multiple closely related graphs, regularized with a term that promotes changes between consecutive graphs to be small in some pre-specified norm. More recently, the problem of joint inference of multiple graphs from the observed graph signals has been formulated under the assumption of graph stationarity [164]. In this formulation, the signals are assumed to be stationary, and pairwise similarity between all graphs is used to regularize the optimization. The second class of methods focuses on the case where the data is heterogeneous and each subgroup has its own graph structure. This problem has been addressed in both the supervised and unsupervised settings. The supervised setting, also known as the multi-category GL problem, assumes that the number of classes and the signals that belong to each class are known a priori [208, 111]. In this case, the goal is to learn multiple graphs, each associated with a class of signals, such that the representation of signals within a class and the discrimination of signals in different classes are both taken into consideration. In the unsupervised setting, the number of clusters is known but the membership of the different graph signals is not. In this case, the goal is to simultaneously cluster the data and learn the representative graph for each cluster [9, 136]. In [136], graph signals are modelled by a graph Laplacian mixture model (GLMM), which extends the factor analysis model of [69] to jointly model the smooth graph signals and identify the clusters through a Gaussian mixture model (GMM). This model assumes that the number of clusters is known a priori and that the distribution of the data is Gaussian. The model is fitted to data through the expectation-maximization algorithm for simultaneous graph inference and clustering. On the other hand, [9] proposes K-Graphs, an extension of k-means clustering where the graph signals are assigned to the clusters based on their smoothness over each cluster's representative graph. Once the signals are clustered, the representative graphs are updated with graph learning algorithms. Both the GLMM and K-Graphs algorithms assign a graph signal to a cluster based only on the smoothness of the signal with respect to the graph associated with that cluster and do not explicitly take the pairwise relationships between the graph signals into account.

6.2 Method

6.2.1 Graph Signal Clustering with Regularized Graph Cut

Assume we are given a dataset X = {x_i}_{i=1}^p, where x_i ∈ ℝⁿ is a graph signal over a graph G^s ∈ 𝒢 = {G¹, ..., G^k}. All graphs in 𝒢 are defined over the same vertex set V with |V| = n and have their own edge sets E^s, i.e., G^s = (V, E^s, W^s), ∀G^s ∈ 𝒢. Let the partitioning of the graph signals in X be defined as C = {C₁, ..., C_k}, where C_s includes all of the graph signals defined over G^s. In this chapter, it is assumed that the partitioning of the graph signals, C, is not known a priori. The problem of learning C can be considered as a clustering problem. Let G^c = (V^c, E^c, W^c) be the graph that represents the similarity between the elements of X, where V^c is the node set with |V^c| = p. Node v^c_i ∈ V^c corresponds to x_i, and w^c_ij is the similarity between x_i and x_j.
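G^c can be built with any similarity graph construction; as a concrete example, the binary k-nearest neighbor graph used in Section 6.3 (with 5 neighbors) can be obtained as follows (a sketch, not the reference implementation):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_similarity_graph(X, n_neighbors=5):
    """Binary k-NN graph over the p graph signals (columns of the n x p matrix X)."""
    A = kneighbors_graph(X.T, n_neighbors=n_neighbors, mode="connectivity")
    W = np.asarray((A + A.T).todense() > 0, dtype=float)   # symmetrize
    np.fill_diagonal(W, 0.0)
    L_c = np.diag(W.sum(axis=1)) - W                       # Laplacian of G^c
    return W, L_c
```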
C can then be learned by applying spectral clustering to G^c. However, spectral clustering as formulated in (1.4) does not use the fact that the x_i's are graph signals. One can improve the clustering by incorporating information from the graphs in 𝒢. Therefore, we propose a regularized graph cut (regcut) by assuming that the graph signals are smooth over the graphs they are defined on:

$$
\operatorname{regcut}(C) = \sum_{i,j=1}^{p} W^{c}_{ij} (1 - \delta_{g_i g_j}) + \alpha \sum_{s=1}^{k} \sum_{i=1}^{p} \delta_{g_i s}\, \mathbf{x}_i^{\top} \mathbf{L}^{s} \mathbf{x}_i,
\tag{6.1}
$$

where $\mathbf{x}_i^{\top} \mathbf{L}^{s} \mathbf{x}_i$ is the smoothness of x_i over G^s as defined in Section 1.3. By regularizing the graph cut with smoothness, we ensure that if x_i is assigned to the sth cluster, it is smooth with respect to G^s. As in Section 1.2.1, we relax Z to take on real values and obtain the following optimization problem:

$$
\underset{\mathbf{Z} \in \mathcal{D}}{\text{minimize}} \quad \operatorname{tr}(\mathbf{Z}^{\top} \mathbf{L}^{c} \mathbf{Z}) + \alpha \sum_{s=1}^{k} \operatorname{tr}\big(\operatorname{diag}(\mathbf{Z}_{\cdot s}) \mathbf{X}^{\top} \mathbf{L}^{s} \mathbf{X}\big),
\tag{6.2}
$$

where X is the data matrix with X_{·i} = x_i and Z is constrained as in Section 1.2.1.

6.2.2 Joint Graph Signal Clustering and Graph Learning

For the optimization problem in (6.2), one needs to know G^c and the graphs in 𝒢. Since these graphs are generally not available, they need to be learned. G^c can be learned from X using GL methods or more classical approaches such as k-nearest neighbor graphs. However, for the graphs in 𝒢, we cannot use these approaches, as we do not know the partitioning of the graph signals. Thus, the graphs in 𝒢 must be learned simultaneously with the clustering. Therefore, we extend (6.2) with GL:

$$
\underset{\mathbf{Z}, \mathbf{L}^{1}, \dots, \mathbf{L}^{k}}{\text{minimize}} \quad \operatorname{tr}(\mathbf{Z}^{\top} \mathbf{L}^{c} \mathbf{Z}) + \sum_{s=1}^{k} \Big[ \alpha_1 \operatorname{tr}\big(\operatorname{diag}(\mathbf{Z}_{\cdot s}) \mathbf{X}^{\top} \mathbf{L}^{s} \mathbf{X}\big) + (\mathbf{Z}_{\cdot s}^{\top} \mathbf{1})\, \alpha_2 \|\mathbf{L}^{s}\|_F^2 \Big]
\tag{6.3}
$$

$$
\text{subject to} \quad \mathbf{Z} \in \mathcal{D}, \quad \mathbf{L}^{s} \in \mathcal{L},\ \operatorname{tr}(\mathbf{L}^{s}) = 2n\ \ \forall s \in \{1, \dots, k\},
\tag{6.4}
$$

where each L^s is learned by assuming that the graph signals in the sth cluster are smooth over G^s. As in (1.11), the Frobenius norm controls the sparsity of the learned graphs such that large values of α₂ result in denser graphs. However, in this setting we weigh this sparsity term with Z_{·s}^⊤1, which corresponds to the number of signals in cluster s, to ensure that the sparsity levels of the learned graphs are similar for a given α₂. As the value of the smoothness term increases with the number of signals in the cluster, multiplying the sparsity term with Z_{·s}^⊤1 ensures that the relative importance of the sparsity term with respect to the smoothness term remains similar across s. Finally, we set 𝒟 = {Z ∈ ℝ^{p×k} | Z ≥ 0, Z1 = 1}.

6.2.3 Optimization

The problem in (6.3) is a multi-convex problem, i.e., it is convex in each variable separately but non-convex when all variables are considered together. Therefore, we employ block coordinate descent (BCD) to solve (6.3) [224]. At each iteration of BCD, the problem is solved cyclically over each variable while fixing the remaining variables. When solving with respect to a variable, we perform inexact minimization with a prox-linear update, as it results in easy-to-solve problems with fast convergence when extrapolation is used [260]. Before applying BCD, we first vectorize (6.3), where we learn the upper triangular part of L^s. Let ℓ^s ∈ ℝ^m be the upper triangular part of L^s, where m = n(n-1)/2. Define the operator mt(·) with mt(ℓ^s) = L^s. Then, (6.3) can be rewritten as:

$$
\underset{\mathbf{Z}, \boldsymbol{\ell}^{1}, \dots, \boldsymbol{\ell}^{k}}{\text{minimize}} \quad \operatorname{tr}(\mathbf{Z}^{\top} \mathbf{L}^{c} \mathbf{Z}) + \sum_{s=1}^{k} \Big[ \alpha_1 \operatorname{tr}\big(\operatorname{diag}(\mathbf{Z}_{\cdot s}) \mathbf{X}^{\top} \operatorname{mt}(\boldsymbol{\ell}^{s}) \mathbf{X}\big) + (\mathbf{Z}_{\cdot s}^{\top} \mathbf{1})\, \alpha_2 \big( 2\langle \boldsymbol{\ell}^{s}, \boldsymbol{\ell}^{s} \rangle + \langle \mathbf{S}\boldsymbol{\ell}^{s}, \mathbf{S}\boldsymbol{\ell}^{s} \rangle \big) \Big]
\tag{6.5}
$$

$$
\text{subject to} \quad \mathbf{Z} \geq 0,\ \mathbf{Z}\mathbf{1} = \mathbf{1}, \quad \boldsymbol{\ell}^{s} \leq 0,\ \mathbf{1}^{\top}\boldsymbol{\ell}^{s} = -n\ \ \forall s,
\tag{6.6}
$$

where S is defined in Section 1.1.
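For reference, a sketch of the summation operator S under the convention assumed here: (Sx)_v sums the upper-triangular entries of x incident to node v, so that with ℓ ≤ 0 the node degrees are d = -Sℓ and ‖L^s‖_F² = ‖Sℓ^s‖₂² + 2‖ℓ^s‖₂², matching the terms in (6.5). The precise definition is in Section 1.1; this construction is the standard one in smoothness-based GL:

```python
import numpy as np

def build_S(n):
    """S in R^{n x m}, m = n(n-1)/2: (S x)_v sums the entries of the
    upper-triangular vector x that are incident to node v."""
    m = n * (n - 1) // 2
    S = np.zeros((n, m))
    idx = 0
    for i in range(n):
        for j in range(i + 1, n):
            S[i, idx] = 1.0
            S[j, idx] = 1.0
            idx += 1
    return S
```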
The prox-linear updates at the tth iteration of BCD can then be found as follows:

$$
\mathbf{Z}^{(t+1)} = \underset{\mathbf{Z} \geq 0,\ \mathbf{Z}\mathbf{1} = \mathbf{1}}{\operatorname{argmin}}\ \langle \widehat{\mathbf{G}}_{Z}^{(t)}, \mathbf{Z} - \widehat{\mathbf{Z}}^{(t)} \rangle + \frac{\lambda_Z}{2} \|\mathbf{Z} - \widehat{\mathbf{Z}}^{(t)}\|_F^2,
\tag{6.7}
$$

$$
\boldsymbol{\ell}^{s\,(t+1)} = \underset{\boldsymbol{\ell}^{s} \leq 0,\ \mathbf{1}^{\top}\boldsymbol{\ell}^{s} = -n}{\operatorname{argmin}}\ \langle \widehat{\mathbf{g}}_{s}^{(t)}, \boldsymbol{\ell}^{s} - \widehat{\boldsymbol{\ell}}^{s\,(t)} \rangle + \frac{\lambda_s}{2} \|\boldsymbol{\ell}^{s} - \widehat{\boldsymbol{\ell}}^{s\,(t)}\|_2^2,
\tag{6.8}
$$

where $\widehat{\mathbf{G}}_{Z}^{(t)}$ is the gradient of the objective function in (6.5) with respect to Z evaluated at $\widehat{\mathbf{Z}}^{(t)}$, $\widehat{\mathbf{g}}_{s}^{(t)}$ is the gradient with respect to ℓ^s evaluated at $\widehat{\boldsymbol{\ell}}^{s\,(t)}$, and:

$$
\widehat{\mathbf{Z}}^{(t)} = \mathbf{Z}^{(t-1)} + w(\mathbf{Z}^{(t-1)} - \mathbf{Z}^{(t-2)}),
\tag{6.9}
$$

$$
\widehat{\boldsymbol{\ell}}^{s\,(t)} = \boldsymbol{\ell}^{s\,(t-1)} + w(\boldsymbol{\ell}^{s\,(t-1)} - \boldsymbol{\ell}^{s\,(t-2)}),
\tag{6.10}
$$

where 0 ≤ w ≤ 1 is the extrapolation parameter. Finally, λ_Z and λ_s are step sizes and can be set to the Lipschitz constants of the gradient of the objective function in (6.5) with respect to Z and ℓ^s. The solutions of both (6.7) and (6.8) are projections onto a simplex. In particular, (6.8) can be rewritten as:

$$
\boldsymbol{\ell}^{s\,(t+1)} = \underset{\boldsymbol{\ell}^{s}}{\operatorname{argmin}}\ \Big\| \boldsymbol{\ell}^{s} - \widehat{\boldsymbol{\ell}}^{s\,(t)} + \frac{1}{\lambda_s} \widehat{\mathbf{g}}_{s}^{(t)} \Big\|_2^2 \quad \text{subject to} \quad \boldsymbol{\ell}^{s} \leq 0,\ \mathbf{1}^{\top}\boldsymbol{\ell}^{s} = -n,
\tag{6.11}
$$

whose solution is the projection of $\widehat{\boldsymbol{\ell}}^{s\,(t)} - \frac{1}{\lambda_s}\widehat{\mathbf{g}}_{s}^{(t)}$ onto the negative simplex, which can be performed efficiently using the algorithm described in [71]. To solve (6.7), we rewrite it as:

$$
\mathbf{Z}^{(t+1)} = \underset{\mathbf{Z}}{\operatorname{argmin}}\ \Big\| \mathbf{Z} - \widehat{\mathbf{Z}}^{(t)} + \frac{1}{\lambda_Z} \widehat{\mathbf{G}}_{Z}^{(t)} \Big\|_F^2 \quad \text{subject to} \quad \mathbf{Z} \geq 0,\ \mathbf{Z}\mathbf{1} = \mathbf{1},
\tag{6.12}
$$

which can be solved separately with respect to the rows of Z. Let $\mathbf{A} = \widehat{\mathbf{Z}}^{(t)} - \frac{1}{\lambda_Z}\widehat{\mathbf{G}}_{Z}^{(t)}$; then the subproblem of (6.12) with respect to the ith row of Z is:

$$
\mathbf{Z}_{i\cdot}^{(t+1)} = \underset{\mathbf{Z}_{i\cdot}}{\operatorname{argmin}}\ \|\mathbf{Z}_{i\cdot} - \mathbf{A}_{i\cdot}\|_2^2 \quad \text{subject to} \quad \mathbf{Z}_{i\cdot} \geq 0,\ \mathbf{Z}_{i\cdot}^{\top}\mathbf{1} = 1,
\tag{6.13}
$$

whose solution is the projection of A_{i·} onto the positive simplex, which can also be performed efficiently using the algorithm described in [71]. The overall optimization procedure is given in Algorithm 6.1.

Algorithm 6.1 GS Clustering with Simultaneous GL
Input: X, L^c, α₁, α₂, k and max_iter
Set t ← 1
Initialize Z^(t), Z^(t-1), ℓ^{s(t)} and ℓ^{s(t-1)}
repeat
    Update ℓ^{s(t+1)} with (6.8) for s ∈ {1, ..., k}
    Update Z^(t+1) with (6.7)
    Set t ← t + 1
until convergence or t ≥ max_iter
Output: Z^(t), L^{1(t)}, ..., L^{k(t)}

[260] show that BCD with prox-linear updates converges for multi-convex problems when the objective function consists of smooth and separable non-smooth terms. The problem in (6.5) satisfies these assumptions; thus, Algorithm 6.1 is guaranteed to converge.
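Both projections can be computed with the standard sort-based simplex projection (the algorithm of [71]); a minimal sketch, where the negative-simplex case of (6.11) follows by negation:

```python
import numpy as np

def project_simplex(x, s=1.0):
    """Euclidean projection of x onto {z : z >= 0, sum(z) = s} via the
    sort-and-threshold algorithm."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, x.size + 1)
    rho = np.nonzero(u - (css - s) / ks > 0)[0][-1]
    theta = (css[rho] - s) / (rho + 1.0)
    return np.maximum(x - theta, 0.0)

def project_negative_simplex(x, n):
    """Projection onto {l : l <= 0, 1^T l = -n}, used in (6.11)."""
    return -project_simplex(-x, s=float(n))
```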
6.2.4 Initialization

BCD type algorithms may converge to poor local minima [224]. To overcome this problem, one can run the algorithm multiple times and keep the solution with the smallest objective value. One can also initialize the algorithm at a better point such that it converges to a solution with a lower objective value. In this section, we describe a procedure to select better initializations for the proposed BCD algorithm. Consider the set 𝒵 = {Z¹, ..., Z^b}, obtained by running Algorithm 6.1 b times. Each Z^i indicates a possible partitioning of the graph signals. One can obtain a better clustering by combining information from all Z^i's using consensus clustering [234], an ensemble learning method for combining multiple clusterings. We follow the consensus clustering procedure described in [121], where the consensus clustering Z⁰ is found from an association matrix A whose entries A_ij are equal to the number of times graph signals x_i and x_j are assigned to the same cluster in 𝒵. This association matrix can be used as the input to spectral clustering to find Z⁰. Once Z⁰ is found, we rerun Algorithm 6.1 one more time, where Z is initialized at Z⁰ (the rest of the variables are initialized randomly). The clustering and the learned graphs obtained from this run are used as the final result. This initialization procedure is given in Algorithm 6.2. In our experiments, we set b = 9, and we set the maximum number of iterations for each run to a small number, e.g., 100, since even sub-optimal solutions can result in a good consensus clustering.

Algorithm 6.2 Initialization Procedure
Input: b
Initialize 𝒵 as an empty set
for i ≤ b do
    Run Algorithm 6.1 and add the learned Z to 𝒵
end for
Find Z⁰ by applying consensus clustering to 𝒵
Run Algorithm 6.1 with the initial point set to Z⁰
Output: Solutions of the last run
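A minimal sketch of this consensus step, assuming hard cluster labels are extracted from each run (e.g., by taking the row-wise argmax of Z) and using scikit-learn's spectral clustering on the association matrix:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def consensus_clustering(labels_list, k):
    """labels_list: length-b list of label vectors, one per run of Algorithm 6.1.
    Builds the association matrix A (co-clustering counts) and clusters it."""
    p = labels_list[0].size
    A = np.zeros((p, p))
    for z in labels_list:
        A += (z[:, None] == z[None, :]).astype(float)   # +1 whenever i, j are co-clustered
    sc = SpectralClustering(n_clusters=k, affinity="precomputed")
    return sc.fit_predict(A)
```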
6.2.5 Hyperparameter Selection

The proposed method requires the selection of three hyperparameters: the number of clusters k, α₁ and α₂. In the literature, various methods have been proposed to determine the number of clusters in spectral clustering. These methods generally define a quality function Q and find the number of clusters as the value that optimizes Q. Possible choices of Q are the eigengap [249], modularity [168], the Bayesian information criterion (BIC) [209] and the integrated completed likelihood (ICL) [58]. α₂ controls the sparsity level of the learned graphs such that larger values of α₂ result in denser graphs. We set it to a value that results in graphs with a pre-determined sparsity level. This approach is similar to previous graph construction schemes, such as k-NN graphs, where one wants to construct a graph with each node having at least k neighbors. The selection of α₁ is explained in detail through the parameter sensitivity analysis in Section 6.3.1.

6.3 Results

In this section, the performance of GRASCale is evaluated on synthetic and real datasets and is compared to various state-of-the-art clustering and graph learning algorithms. We compare methods based on the quality of the resulting clustering as well as the accuracy of the learned graphs associated with each cluster. For the first comparison, we consider normalized spectral clustering (SC), GLMM and K-Graphs. For the latter comparison, GL (see Section 1.3.1), GLMM and K-Graphs are considered. As mentioned in Section 1.2.1, SC clusters signals based only on their pairwise similarities. Thus, by comparing GRASCale to SC, we can illustrate the benefits of considering graph signal smoothness. GLMM and K-Graphs perform simultaneous graph signal clustering and graph learning similar to the proposed method; however, they rely only on the smoothness of the signals with respect to the graphs associated with each cluster. By comparing GRASCale against them, we can illustrate the benefits of incorporating pairwise similarities. Finally, when applying GL, we assume the partitioning of the signals is known; thus, the performance of GL provides an upper bound for the performance of GRASCale in the graph learning task. We used the formulation of [107] for implementing GL.

Parameter Selection: SC, GLMM, K-Graphs and GRASCale require the number of clusters k as an input. We provided the ground truth k as an input to all methods. GL, GLMM and K-Graphs require a hyperparameter that controls the sparsity of the learned graphs, similar to α₂ in (6.3). For all methods, we set this hyperparameter to a value that results in graphs with sparsity levels between 0.1 and 0.15². GLMM and K-Graphs are based on alternating minimization, which causes their results to vary across runs. Therefore, we run each algorithm 10 times and report the average performance. For GRASCale, we set b = 9 as mentioned in Section 6.2.4; thus, each algorithm is run 10 times. Finally, SC is applied to a binary k-nearest neighbor graph with the number of neighbors set to 5. The same graph is used as L^c for the proposed method.

² Real-world graphs are generally sparse, so it is desirable to learn sparse graphs; therefore, we learn graphs in this range of sparsity levels. In our experiments, we observed that smaller sparsity levels can result in disconnected graphs, so we did not consider smaller values.

Performance Metrics: Normalized mutual information (NMI) [57] is used to quantify the performance of clustering. For the graph learning task, the F1-score is used to quantify how close the learned graphs are to the ground truth graphs. We measure the F1-score for all s and report the average.

6.3.1 Synthetic Data

Data Generation: Given a graph G with Laplacian L = VΛV^⊤, we can generate a graph signal x that is smooth with respect to G by filtering a given signal x₀ with a low-pass graph filter [69, 107]. Mathematically, this is equivalent to x = h(L)x₀, where h(L) = Vh(Λ)V^⊤ is a low-pass graph filter. Based on this, we generate the synthetic data as follows. We first generate k graphs 𝒢 = {G¹, ..., G^k} based on a random graph model, such as the Erdős–Rényi (ER) [86] or Barabási–Albert (BA) [5] models, where each G^s has n nodes. For each G^s, we generate p_s smooth graph signals as described above with $h(\boldsymbol{\Lambda}) = \sqrt{\boldsymbol{\Lambda}^{\dagger}}$ and x₀ ∼ 𝒫, where † is the pseudo-inverse operator and 𝒫 is a probability distribution to be determined. The graph signals are then used to construct data matrices X^s ∈ ℝ^{n×p_s}, from which we build X = [X¹, ..., X^k] ∈ ℝ^{n×p}, where p = p₁ + ··· + p_k. White Gaussian noise with variance equal to 10% of the signal power is added to the data matrix. Finally, we generate 20 different realizations of each dataset in all experiments and report the average performance across realizations.
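A sketch of this generation step, assuming 𝒫 = N(0, I) as in Experiment 1 below:

```python
import numpy as np

def smooth_signals(L, p, rng):
    """Generate p smooth graph signals x = h(L) x0 with h(Lambda) = sqrt(Lambda^+)
    and x0 ~ N(0, I), plus noise at 10% of the signal power."""
    lam, V = np.linalg.eigh(L)
    h = np.zeros_like(lam)
    mask = lam > 1e-9
    h[mask] = 1.0 / np.sqrt(lam[mask])          # square root of the pseudo-inverse
    X0 = rng.standard_normal((L.shape[0], p))
    X = V @ (h[:, None] * (V.T @ X0))
    noise = rng.standard_normal(X.shape) * np.sqrt(0.1 * X.var())
    return X + noise
```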
Experiment 1: In this experiment, we generate signals from G = {𝐺1, 𝐺2, 𝐺3}, where each 𝐺𝑠 is generated by swapping the edges of a given graph 𝐺 a total of ⌈𝑚𝐺 × 𝑝𝑒𝑟𝑡⌉ times; 𝑚𝐺 is the number of edges in 𝐺 and 𝑝𝑒𝑟𝑡 > 0 controls the amount of perturbation. Smaller values of 𝑝𝑒𝑟𝑡 cause the graphs in G to be highly correlated, which makes clustering the graph signals generated from these graphs a harder task. We generated 𝐺 with 50 nodes from two random graph models: ER with edge probability 𝑝ER = 0.1 and BA with 𝑚BA = 3. We generated X as described above with P = N(0, I). In Figure 6.2, we report the results when the cluster sizes are equal, i.e., 𝑝𝑠 = 200 for all 𝑠. It can be observed that the clustering performance of all methods increases with the amount of perturbation. This is due to the fact that as the perturbation level increases, the different clusters become more distinct. GRASCale performs better than GLMM and K-Graphs for both ER and BA models. SC performs very poorly since the signals are generated independently of each other; the pairwise similarities between signals in the same cluster are therefore weak, resulting in low NMI values for SC. In terms of graph learning, GL performs best, as expected, since it assumes the cluster membership of the signals is known a priori. There is a slight improvement in the graph learning performance of GLMM, K-Graphs and GRASCale as the perturbation level increases, and their performance converges to that of GL. The graph learning performance of GLMM, K-Graphs and the proposed method at small perturbation levels may seem counter-intuitive given their low NMI values. However, the graphs in G are highly correlated for small perturbations, so graph signals in a given cluster carry information about the other graphs as well. The methods can therefore still perform well for graph inference even when the clusters are not accurately identified.

[Figure 6.2: Results for Experiment 1 when cluster sizes are equal. The upper row illustrates the graph learning performance (F1) and the bottom row the clustering performance (NMI) as functions of the perturbation level; left and right columns correspond to the ER and BA graph models, respectively.]

Figure 6.3 illustrates the results for the same simulation setting when there is heterogeneity in cluster sizes, i.e., 𝑝1 = 300, 𝑝2 = 200, and 𝑝3 = 100. The results are very similar to those in Figure 6.2, with a slight drop in the performance of all algorithms across all perturbation levels and graph models.

[Figure 6.3: Results for Experiment 1 when cluster sizes are different. The upper row shows graph learning performance and the bottom row clustering performance; left and right columns correspond to the ER and BA graph models, respectively.]

Experiment 2: In the previous experiment, signals were generated independently and thus have no explicitly imposed pairwise relations. In this experiment, we generate graph signals that have pairwise relations and are also smooth with respect to the graphs associated with the clusters. To this end, we first generate a data matrix Y ∈ R^{𝑛×𝑝} with 𝑛 = 50 and 𝑝 = 600. Rows of Y are generated by filtering a signal y ∈ R^𝑝 through a low-pass graph filter defined on the signal similarity graph 𝐺𝑐, which has 𝑝 nodes, with y ∼ N(0, I). If there is an edge between nodes 𝑣𝑖 and 𝑣𝑗 in 𝐺𝑐, columns Y·𝑖 and Y·𝑗 will be similar to each other. We construct 𝐺𝑐 from a planted partition model [49] whose nodes are partitioned into three equal-sized clusters: 𝐶1 = {𝑣1, . . . , 𝑣200}, 𝐶2 = {𝑣201, . . . , 𝑣400}, and 𝐶3 = {𝑣401, . . . , 𝑣600}. The planted partition model has two parameters, 𝑝𝑖𝑛 and 𝑝𝑜𝑢𝑡, which determine the intra- and inter-cluster connectivity, respectively. We set 𝑝𝑖𝑛 = 0.05(1 − 𝜇) and 𝑝𝑜𝑢𝑡 = 0.05𝜇, where 𝜇 > 0 is the mixing coefficient; larger values of 𝜇 make the clusters less distinguishable. For the low-pass filter, we used a heat kernel ℎ(𝚲𝑐) = exp(−5𝚲𝑐), where 𝚲𝑐 is the eigenvalue matrix of the Laplacian of 𝐺𝑐 [107]. We generated the graphs in G as in the first experiment with 𝑝𝑒𝑟𝑡 set to 2. Once Y and G are generated, the columns of Y in 𝐶𝑠 are filtered by the graph filter corresponding to 𝐺𝑠 ∈ G for all 𝑠 to construct X.
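A sketch of this two-stage construction, under the stated parameters (𝑛 = 50, 𝑝 = 600, three clusters of 200 nodes, heat kernel exp(−5𝚲𝑐)), is given below. It stops after generating Y, with the cluster-wise filtering by the 𝐺𝑠 filters indicated in a comment; variable names are illustrative.

import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n, p, mu = 50, 600, 0.10                      # mu is the mixing coefficient

# Signal similarity graph G_c: planted partition model with three clusters of 200.
G_c = nx.planted_partition_graph(l=3, k=200, p_in=0.05 * (1 - mu),
                                 p_out=0.05 * mu, seed=0)
L_c = nx.laplacian_matrix(G_c).toarray().astype(float)
w, V = np.linalg.eigh(L_c)

# Heat kernel filter h(Lambda_c) = exp(-5 Lambda_c) on G_c (p x p, symmetric).
H = V @ np.diag(np.exp(-5.0 * w)) @ V.T

# Each of the n rows of Y is a filtered white-noise signal y in R^p, so columns
# of Y that are adjacent in G_c become similar.
Y = rng.standard_normal((n, p)) @ H

# To obtain X, the columns of Y in cluster C_s would then be filtered by the
# low-pass filter of G_s (generated as in Experiment 1), one cluster at a time.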
[Figure 6.4: Results for Experiment 2. The upper row shows graph learning performance and the bottom row clustering performance as functions of the mixing coefficient; left and right columns correspond to the ER and BA graph models, respectively.]

Figure 6.4 shows the performance of the different algorithms. With the introduction of pairwise similarity within clusters, the performance of SC improves significantly. However, its NMI value is still lower than that of GRASCale, since the latter benefits from both the pairwise relations and the smoothness of the graph signals. GLMM and K-Graphs have lower performance than the proposed method, as these methods employ only the smoothness of the graph signals. Increasing the mixing coefficient causes a decrease in the performance of all methods, as larger values of 𝜇 result in less distinguishable clusters. The decrease in NMI values for SC and GRASCale with increasing 𝜇 follows a similar trend, which indicates that the proposed method indeed uses the pairwise relations between signals. For the graph learning task, the F1-score of the proposed method is higher than those of GLMM and K-Graphs and is very close to that of GL, owing to its high clustering performance.

Parameter Sensitivity: We study the sensitivity of GRASCale to the selection of 𝛼1 and 𝛼2 on a dataset from Experiment 2, generated from the BA graph model with 𝜇 set to 0.25; the ground truth graph in this dataset has a density of around 0.12. We apply our algorithm with varying 𝛼1 and 𝛼2 values and report the performance in Figure 6.5, where the densities of the learned graphs are used for the 𝑥-axis rather than the values of 𝛼2. Figure 6.5 shows that the density of the learned graphs is important for performance. In particular, low-density graphs perform poorly in terms of both F1 and NMI, as they are very sparse and do not contain enough information. Similarly, high densities also result in low performance, since the learned graphs include many false positive edges. Finally, the figure shows that the proposed method is not sensitive to the value of 𝛼1 as long as the learned graphs have a reasonable density: there is a large range of 𝛼1 values over which the F1-score and NMI are stable. Based on this observation, we set 𝛼1 = 10 in all of our data analysis without any fine-tuning.

[Figure 6.5: Sensitivity of F1 and NMI values to varying values of 𝛼1 and the density of the learned graphs. The left panel shows the graph learning performance and the right panel the clustering performance.]

6.3.2 Real Data

In this section, the proposed method is applied to a real-world data clustering problem, where the aim is to cluster the digits of the MNIST dataset while learning a graph for each digit. More specifically, we selected 400 images for each of the digits 0, 1, 2 and 3. After vectorizing each image, we obtain a data matrix of size 400 × 1600, where the rows and columns correspond to pixels and images, respectively. SC, GLMM, K-Graphs and GRASCale are applied to the constructed data matrix and the clustering performance is reported in Figure 6.6. The best performing method is GRASCale, followed by SC, while GLMM and K-Graphs have significantly lower performance. These results indicate that using the pairwise similarities of the signals together with their smoothness improves the clustering performance.

[Figure 6.6: Clustering performance (NMI) for the MNIST dataset: SC, GLMM, K-Graphs and GRASCale.]
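A minimal sketch of this data matrix construction is shown below. It assumes MNIST is fetched through scikit-learn and that the 400 pixels per image come from reducing the 28 × 28 images to 20 × 20 (here by center-cropping); the thesis does not spell out this preprocessing, so both choices are assumptions made for illustration.

import numpy as np
from sklearn.datasets import fetch_openml

# Fetching via OpenML is an assumption; the thesis does not state how MNIST was loaded.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
images, labels = mnist.data, mnist.target.astype(int)

cols = []
for digit in (0, 1, 2, 3):
    imgs = images[labels == digit][:400]             # 400 images per digit
    imgs = imgs.reshape(-1, 28, 28)[:, 4:24, 4:24]   # assumed center-crop to 20 x 20
    cols.append(imgs.reshape(-1, 400).T)             # vectorize: pixels become rows

X = np.hstack(cols)                                  # 400 pixels x 1600 images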
As mentioned in [136], learning a graph for each cluster can be helpful for the interpretability of clustering. By analyzing the graph structure learned for each cluster, one can deduce why a set of graph signals is assigned to the same cluster, which leads to explainable data science [202]. In Figure 6.7, we plot the graphs learned for each digit by GRASCale. It can be seen that the method learns very interpretable graph structures: the learned graphs for digits 0, 2 and 3 closely resemble the digits themselves. Although the graph found for digit 1 has a meaningful structure, it is noisier than the other graphs. This is due to the large variation across samples in how digit 1 is written. Thus, while we tend to cluster digits based on their numerical values, there may also be a clustering within each digit based on writing style.

[Figure 6.7: Graph structures learned for each digit by the proposed method. Points correspond to pixels, while lines indicate the inferred edges between pixels. Only the top 300 edges are shown.]

6.4 Conclusions

In this chapter, we presented GRASCale for simultaneous graph signal clustering and graph learning. Compared to previous methods developed for the same task, GRASCale uses two types of information: the pairwise relations between graph signals and their smoothness with respect to the graphs associated with the clusters. Our results on synthetic and real datasets indicate that incorporating these complementary pieces of information within the same framework improves clustering and graph learning performance significantly. In the presented formulation of GRASCale, we assumed that L𝑐 is constructed a priori; however, this graph can also be learned along with the clusters and their associated graphs. In future work, we will consider this extension of jointly learning L𝑐 along with the individual graphs L𝑠.

CHAPTER 7
CONCLUSIONS

Community detection and graph learning are two important problems in network science and graph signal processing. The former deals with the topological analysis of graphs to identify their mesoscale organization, while graph learning aims to infer the interactions between nodes of a graph from data when the graph topology is not known a priori. Existing community detection and graph learning methods are mostly limited to single-layer graphs and cannot handle multilayer graphs efficiently. In this thesis, we aimed to fill this gap by proposing multiple community detection and graph learning methods for various types of multilayer graphs.

Dynamic networks are a type of multilayer network in which layers correspond to different time points and interlayer edges are only allowed between consecutive time points. Existing community detection methods for dynamic networks identify the community structure of each time point while regularizing the identified community structures to change smoothly across time.
However, it is not known how to set the regularization parameter that determines how smoothly the community structure changes across time. In Chapter 2, we answered this question based on recent theoretical developments that explain community detection algorithms using statistical models. In particular, we proposed a new dynamic stochastic blockmodel which models the community changes across time with a Markov random field. Fitting the proposed model to an observed dynamic network was then shown to be equivalent to evolutionary spectral clustering under some assumptions. This equivalence was employed to determine the regularization parameter of evolutionary spectral clustering and to propose two novel spectral clustering based algorithms for dynamic networks. The performance of the proposed algorithms was investigated using simulated and real data, and it was observed that they outperform existing dynamic community detection methods.

The human brain operates at different frequency bands, and the functional connectivity between brain regions differs at each band. Recent developments aim to study these functional networks simultaneously through multilayer network modeling, where each layer corresponds to a frequency band. However, existing work is mostly limited to multiplex networks, where interlayer edges are only allowed between nodes that represent the same brain region. In Chapter 3, we addressed this shortcoming by proposing a multilayer community detection algorithm especially tailored for multi-frequency brain networks. First, phase synchrony and phase amplitude coupling measures were used to construct a multilayer EEG network in which interlayer edges are allowed between any two brain regions. Next, we proposed a multilayer modularity metric to detect communities in the constructed networks. An important characteristic of multi-frequency brain networks is the heterogeneity in edge weights across frequency bands, which can bias community detection methods toward partitioning nodes based on layers rather than the true community structure. Therefore, the proposed modularity metric was developed based on a new null model which preserves this heterogeneity. We parameterized the proposed metric to handle the resolution limit of modularity and to control the importance of interlayer edges. Finally, a new method that addresses the degeneracy of modularity maximization was proposed to identify the group community structure of a set of subjects. The proposed approach was applied to EEG data collected during a study of error monitoring in the human brain; the results revealed important differences in brain organization following error and correct responses.

Regulatory interactions between genes can be studied with networks, where nodes and edges correspond to genes and their regulatory relations, respectively. An important characteristic of gene regulatory networks (GRNs) is that they are signed graphs, where edge signs represent activating and inhibitory regulations. Existing GSP based graph learning methods cannot be used to infer GRNs, since they are restricted to learning only unsigned graph topologies. Therefore, in Chapter 4, we proposed a GSP based signed graph learning approach, which models a signed graph as a two-layer multiplex network where one layer corresponds to the positive edges and the other to the negative edges.
We then devised an optimization problem to learn each layer based on the assumption that graph signals are smooth over the positive layer and non-smooth over the negative layer. The optimization problem was further kernelized to handle various characteristics of observed graph signals, such as missing values or non-linearity, and was solved with an efficient ADMM based optimization procedure. We employed the proposed signed graph learning method to identify GRNs from single cell gene expression data. The method was benchmarked against state-of-the-art GRN inference methods on simulated and real data and was shown to outperform them in terms of accuracy and computational time complexity.

Given multiple datasets, each of which includes graph signals defined on a different signed graph, we can apply the method presented in Chapter 4 to each dataset separately to learn multiple signed graphs. When the signed graphs are assumed to be related, however, this approach is suboptimal since it does not impose any shared structure on the learned graphs. Therefore, in Chapter 5, we extended the signed graph learning approach of Chapter 4 to learn multiple related signed graphs. Namely, multiple signed graphs were learned simultaneously by solving an optimization problem that assumes smoothness and non-smoothness of the datasets as in Chapter 4. Furthermore, we imposed a shared structure on the learned signed graphs through a regularization term that ensures the learned graphs are similar to a consensus graph. Our optimization procedure also learns the consensus graph, which represents the shared structure of the learned signed graphs. We employed the method for the inference of multiple related GRNs from single cell datasets generated from multiple treatment conditions or disease states. Results on simulated data showed that the proposed approach performs better than methods that learn a single graph at a time and than previous joint multiple GRN reconstruction algorithms. Real data analysis revealed that the method learns signed graphs that are in line with previous biological findings.

Existing work on multiple unsigned graph learning assumes that we are given multiple datasets, each of which includes graph signals defined on a different graph. However, there are applications where we are given a single heterogeneous dataset, which consists of graph signals from multiple clusters, with each cluster's graph signals defined on its own graph. In such cases, the aim is to jointly cluster the graph signals and infer the graphs associated with the clusters. In Chapter 6, we proposed an algorithm for this task. Compared to existing work, the novelty of the method is that it partitions the graph signals based not only on their smoothness with respect to the graphs associated with the clusters but also on their pairwise similarities. The method is developed by extending graph cut based clustering, and it also learns the representative graph of each cluster using the smoothness of the graph signals. Results on simulated and real data indicate the effectiveness of the proposed method compared to existing algorithms.

7.1 Future Work

In this section, we present some future research directions that address the shortcomings of the algorithms presented in this thesis.
Community Detection in Multiplex Networks: The dynamic community detection methods proposed in Chapter 2 were developed by showing the equivalence between evolutionary spectral clustering and statistical modeling. The same approach can be followed to develop a multiplex community detection algorithm. Existing work on community detection in multiplex networks identifies the community structure of each layer while regularizing the structures based on assumptions such as the existence of a set of nodes that belong to the same community across all layers. Such assumptions can be used to propose new statistical models for multiplex networks. We can then ask under which conditions fitting these models to an observed multiplex network is equivalent to existing multiplex community detection algorithms. As in Chapter 2, proving this equivalence can pave the way for developing novel multiplex community detection algorithms and for addressing the shortcomings of existing ones.

Multi-aspect Multilayer Community Detection: An important task in brain network analysis is to study networks from multiple subjects simultaneously, which helps one understand the characteristics that are shared and those that differ across subjects. In Chapter 3, we did this by finding the multilayer community structure of each subject independently, with the shared community structure across subjects found by group community detection. However, this approach is suboptimal, as the subjects' multilayer networks are processed independently. This shortcoming can be handled with a multi-aspect multilayer approach, which extends multilayer networks to multiple dimensions. In a multi-aspect multilayer network, the layer set is the product of sets of elementary layers, i.e., L = L1 × L2 × · · · × L𝑑, where L𝑖 is a set of elementary layers [117]. In our case, L1 is the set of frequency bands and L2 is the set of subjects; thus, each layer of the multi-aspect multilayer network captures the interactions of one subject at one frequency band. Future work can develop new community detection algorithms for multi-aspect multilayer networks.

Multiple Signed Graph Learning from Heterogeneous Datasets: The multiple signed graph learning method of Chapter 5 is designed for cases where multiple datasets are available. However, some problems involve only a single heterogeneous dataset, which needs to be clustered while learning a signed graph for each cluster. For example, in cell type-specific GRN inference, a single scRNA-seq dataset is often used, and the goal is to identify cell types and learn a GRN for each type. In Chapter 6, we proposed an approach that simultaneously performs clustering and learns unsigned graphs. Future work can extend this approach to signed graphs, where graph signals are clustered while a signed graph is learned for each cluster.

BIBLIOGRAPHY

[1] Sara Aibar et al. “SCENIC: single-cell regulatory network inference and clustering”. In: Nature Methods 14.11 (2017), pp. 1083–1086.
[2] Christopher Aicher, Abigail Z Jacobs, and Aaron Clauset. “Learning latent block structure in weighted networks”. In: Journal of Complex Networks 3.2 (2015), pp. 221–248.
[3] Edo M Airoldi et al. “Mixed membership stochastic blockmodels”. In: Advances in Neural Information Processing Systems 21 (2008).
[4] Kyle Akers and TM Murali. “Gene regulatory network inference in single-cell biology”. In: Current Opinion in Systems Biology 26 (2021), pp. 87–97.
[5] Réka Albert and Albert-László Barabási. “Statistical mechanics of complex networks”. In: Reviews of Modern Physics 74.1 (2002), p. 47.
[6] Alberto Aleta and Yamir Moreno. “Multilayer networks in a nutshell”. In: Annual Review of Condensed Matter Physics 10 (2019), pp. 45–62.
[7] Genevera I Allen and Zhandong Liu. “A local Poisson graphical model for inferring networks from sequencing data”. In: IEEE Transactions on NanoBioscience 12.3 (2013), pp. 189–198.
[8] Gerrit Ansmann and Klaus Lehnertz. “Constrained randomization of weighted networks”. In: Physical Review E 84.2 (2011), p. 026103.
[9] Hesam Araghi, Mohammad Sabbaqi, and Massoud Babaie-Zadeh. “𝐾-Graphs: An Algorithm for Graph Signal Clustering and Multiple Graph Learning”. In: IEEE Signal Processing Letters 26.10 (2019), pp. 1486–1490.
[10] Sitaram Asur, Srinivasan Parthasarathy, and Duygu Ucar. “An event-based framework for characterizing the evolutionary behavior of interaction graphs”. In: ACM Transactions on Knowledge Discovery from Data (TKDD) 3.4 (2009), pp. 1–36.
[11] Selin Aviyente and Ali Yener Mutlu. “A time-frequency-based approach to phase and phase synchrony estimation”. In: IEEE Transactions on Signal Processing 59.7 (2011), pp. 3086–3098.
[12] Selin Aviyente et al. “A phase synchrony measure for quantifying dynamic functional integration in the brain”. In: Human Brain Mapping 32.1 (2011), pp. 80–93.
[13] Brian Baingana and Georgios B Giannakis. “Tracking switched dynamic network topologies from information cascades”. In: IEEE Transactions on Signal Processing 65.4 (2016), pp. 985–997.
[14] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d’Aspremont. “Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data”. In: The Journal of Machine Learning Research 9 (2008), pp. 485–516.
[15] Albert-László Barabási and Réka Albert. “Emergence of scaling in random networks”. In: Science 286.5439 (1999), pp. 509–512.
[16] Albert-László Barabási and Eric Bonabeau. “Scale-free networks”. In: Scientific American 288.5 (2003), pp. 60–69.
[17] Danielle S Bassett and Olaf Sporns. “Network neuroscience”. In: Nature Neuroscience 20.3 (2017), pp. 353–364.
[18] Danielle S Bassett et al. “Learning-induced autonomy of sensorimotor systems”. In: Nature Neuroscience 18.5 (2015), pp. 744–751.
[19] Federico Battiston et al. “Multilayer motif analysis of brain networks”. In: Chaos: An Interdisciplinary Journal of Nonlinear Science 27.4 (2017), p. 047404.
[20] Marya Bazzi et al. “A framework for the construction of generative models for mesoscale structure in multilayer networks”. In: Physical Review Research 2.2 (2020), p. 023100.
[21] Marya Bazzi et al. “Community detection in temporal multilayer networks, with an application to correlation networks”. In: Multiscale Modeling & Simulation 14.1 (2016), pp. 1–41.
[22] Andrea Berger and MI Posner. “Pathologies of brain attentional networks”. In: Neuroscience & Biobehavioral Reviews 24.1 (2000), pp. 3–5.
[23] Peter Berger, Gabor Hannak, and Gerald Matz. “Efficient graph learning from noisy and incomplete data”. In: IEEE Transactions on Signal and Information Processing over Networks 6 (2020), pp. 105–119.
[24] Richard F Betzel and Danielle S Bassett. “Multi-scale brain networks”. In: NeuroImage 160 (2017), pp. 73–83.
[25] Richard F Betzel et al. “The community structure of functional brain networks exhibits scale-specific patterns of inter- and intra-subject variability”. In: NeuroImage 202 (2019), p. 115990.
[26] Vincent D Blondel et al. “Fast unfolding of communities in large networks”. In: Journal of Statistical Mechanics: Theory and Experiment 2008.10 (2008), P10008.
[27] Stefano Boccaletti et al. “Complex networks: Structure and dynamics”. In: Physics Reports 424.4-5 (2006), pp. 175–308.
[28] Stefano Boccaletti et al. “The structure and dynamics of multilayer networks”. In: Physics Reports 544.1 (2014), pp. 1–122.
[29] Stephen P Borgatti. “Centrality and network flow”. In: Social Networks 27.1 (2005), pp. 55–71.
[30] DA Brafman et al. “Regulation of endodermal differentiation of human embryonic stem cells through integrin-ECM interactions”. In: Cell Death & Differentiation 20.3 (2013), pp. 369–381.
[31] Anatol Bragin et al. “Gamma (40-100 Hz) oscillation in the hippocampus of the behaving rat”. In: Journal of Neuroscience 15.1 (1995), pp. 47–60.
[32] Ulrik Brandes et al. “On modularity clustering”. In: IEEE Transactions on Knowledge and Data Engineering 20.2 (2007), pp. 172–188.
[33] Urs Braun et al. “Dynamic reconfiguration of frontal brain networks during executive cognition in humans”. In: Proceedings of the National Academy of Sciences 112.37 (2015), pp. 11678–11683.
[34] Michael M Bronstein et al. “Geometric deep learning: going beyond Euclidean data”. In: IEEE Signal Processing Magazine 34.4 (2017), pp. 18–42.
[35] Matthew J Brookes et al. “A multi-layer network approach to MEG connectivity analysis”. In: NeuroImage 132 (2016), pp. 425–438.
[36] Javier M Buldú and Mason A Porter. “Frequency-based brain networks: From a multiplex framework to a full multilayer description”. In: Network Neuroscience 2.4 (2018), pp. 418–441.
[37] Ed Bullmore and Olaf Sporns. “Complex brain networks: graph theoretical analysis of structural and functional systems”. In: Nature Reviews Neuroscience 10.3 (2009), pp. 186–198.
[38] Lian En Chai et al. “A review on the computational approaches for gene regulatory network construction”. In: Computers in Biology and Medicine 48 (2014), pp. 55–65.
[39] Tanmoy Chakraborty et al. “Metrics for community analysis: A survey”. In: ACM Computing Surveys (CSUR) 50.4 (2017), pp. 1–37.
[40] Ian Chambers et al. “Functional expression cloning of Nanog, a pluripotency sustaining factor in embryonic stem cells”. In: Cell 113.5 (2003), pp. 643–655.
[41] Thalia E Chan, Michael PH Stumpf, and Ann C Babtie. “Gene regulatory network inference from single-cell data using multivariate information measures”. In: Cell Systems 5.3 (2017), pp. 251–267.
[42] Peter Mu-Hsin Chang et al. “Transcriptome analysis and prognosis of ALDH isoforms in human cancer”. In: Scientific Reports 8.1 (2018), pp. 1–10.
[43] Chuan Chen, Michael Ng, and Shuqin Zhang. “Block spectral clustering for multiple graphs with inter-relation”. In: Network Modeling Analysis in Health Informatics and Bioinformatics 6.1 (2017), pp. 1–22.
[44] Geng Chen, Baitang Ning, and Tieliu Shi. “Single-cell RNA-seq technologies and related computational data analysis”. In: Frontiers in Genetics 10 (2019), p. 317.
[45] Shuonan Chen and Jessica C Mar. “Evaluating methods of inferring gene regulatory networks highlights their lack of performance for single cell gene expression data”. In: BMC Bioinformatics 19.1 (2018), pp. 1–21.
[46] Yun Chi et al. “Evolutionary spectral clustering by incorporating temporal smoothness”. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007, pp. 153–162.
[47] Li-Fang Chu et al. “Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm”. In: Genome Biology 17.1 (2016), pp. 1–20.
[48] Michael W Cole and Walter Schneider. “The cognitive control network: integrated cortical regions with dissociable functions”. In: NeuroImage 37.1 (2007), pp. 343–360.
[49] Anne Condon and Richard M Karp. “Algorithms for graph partitioning on the planted partition model”. In: Random Structures & Algorithms 18.2 (2001), pp. 116–140.
[50] ENCODE Project Consortium et al. “An integrated encyclopedia of DNA elements in the human genome”. In: Nature 489.7414 (2012), p. 57.
[51] Callan C Corcoran et al. “From 20th century metabolic wall charts to 21st century systems biology: database of mammalian metabolic enzymes”. In: American Journal of Physiology-Renal Physiology 312.3 (2017), F533–F542.
[52] Marco Corneli, Pierre Latouche, and Fabrice Rossi. “Exact ICL maximization in a non-stationary temporal extension of the stochastic block model for dynamic networks”. In: Neurocomputing 192 (2016), pp. 81–91.
[53] L da F Costa et al. “Characterization of complex networks: A survey of measurements”. In: Advances in Physics 56.1 (2007), pp. 167–242.
[54] Alexandre d’Aspremont, Onureena Banerjee, and Laurent El Ghaoui. “First-order methods for sparse covariance selection”. In: SIAM Journal on Matrix Analysis and Applications 30.1 (2008), pp. 56–66.
[55] Patrick Danaher, Pei Wang, and Daniela M Witten. “The joint graphical lasso for inverse covariance estimation across multiple classes”. In: Journal of the Royal Statistical Society. Series B, Statistical Methodology 76.2 (2014), p. 373.
[56] Weidong Dang et al. “Rhythm-dependent multilayer brain network for the detection of driving fatigue”. In: IEEE Journal of Biomedical and Health Informatics (2020).
[57] Leon Danon et al. “Comparing community structure identification”. In: Journal of Statistical Mechanics: Theory and Experiment 2005.09 (2005), P09008.
[58] J-J Daudin, Franck Picard, and Stéphane Robin. “A mixture model for random graphs”. In: Statistics and Computing 18.2 (2008), pp. 173–183.
[59] Manlio De Domenico. “Multilayer modeling and analysis of human brain networks”. In: GigaScience 6.5 (2017), pp. 1–8.
[60] Manlio De Domenico and Jacob Biamonte. “Spectral entropies as information-theoretic tools for complex network comparison”. In: Physical Review X 6.4 (2016), p. 041062.
[61] Manlio De Domenico, Shuntaro Sasai, and Alex Arenas. “Mapping multiplex hubs in human functional brain networks”. In: Frontiers in Neuroscience 10 (2016), p. 326.
[62] Manlio De Domenico et al. “Identifying modular flows on multilayer networks reveals highly overlapping organization in interconnected systems”. In: Physical Review X 5.1 (2015), p. 011027.
[63] Arnaud Delorme and Scott Makeig. “EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis”. In: Journal of Neuroscience Methods 134.1 (2004), pp. 9–21.
[64] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. “Weighted graph cuts without eigenvectors: a multilevel approach”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 29.11 (2007), pp. 1944–1957.
[65] Stavros I Dimitriadis. “Assessing The Repeatability of Multi-Frequency Multi-Layer Brain Network Topologies Across Alternative Researchers Choice Paths”. In: bioRxiv (2021).
[66] Meichen Dong et al. “Joint gene network construction by single-cell RNA sequencing data”. In: Biometrics (2022).
[67] Xiaowen Dong et al. “Clustering on multi-layer graphs via subspace analysis on Grassmann manifolds”. In: IEEE Transactions on Signal Processing 62.4 (2013), pp. 905–918.
[68] Xiaowen Dong et al. “Learning graphs from data: A signal representation perspective”. In: IEEE Signal Processing Magazine 36.3 (2019), pp. 44–63.
[69] Xiaowen Dong et al. “Learning Laplacian matrix in smooth graph signal representations”. In: IEEE Transactions on Signal Processing 64.23 (2016), pp. 6160–6173.
[70] Karl W Doron, Danielle S Bassett, and Michael S Gazzaniga. “Dynamic network structure of interhemispheric coordination”. In: Proceedings of the National Academy of Sciences 109.46 (2012), pp. 18661–18668.
[71] John Duchi et al. “Efficient projections onto the ℓ1-ball for learning in high dimensions”. In: Proceedings of the 25th International Conference on Machine Learning. 2008, pp. 272–279.
[72] Nathan Eagle, Alex Pentland, and David Lazer. “Inferring friendship network structure by using mobile phone data”. In: Proceedings of the National Academy of Sciences 106.36 (2009), pp. 15274–15278.
[73] Ron Edgar, Michael Domrachev, and Alex E Lash. “Gene Expression Omnibus: NCBI gene expression and hybridization array data repository”. In: Nucleic Acids Research 30.1 (2002), pp. 207–210.
[74] Bradley Efron and Robert Tibshirani. “Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy”. In: Statistical Science (1986), pp. 54–75.
[75] Mark WEJ Fiers et al. “Mapping gene regulatory networks from single-cell omics data”. In: Briefings in Functional Genomics 17.4 (2018), pp. 246–254.
[76] Greg Finak et al. “MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data”. In: Genome Biology 16.1 (2015), pp. 1–13.
[77] JB Fisher et al. “GATA6 is essential for endoderm formation from human pluripotent stem cells”. In: Biology Open 6.7 (2017), pp. 1084–1095.
[78] Francesco Folino and Clara Pizzuti. “An evolutionary multiobjective approach for community discovery in dynamic networks”. In: IEEE Transactions on Knowledge and Data Engineering 26.8 (2013), pp. 1838–1852.
[79] Santo Fortunato. “Community detection in graphs”. In: Physics Reports 486.3-5 (2010), pp. 75–174.
[80] Santo Fortunato and Marc Barthelemy. “Resolution limit in community detection”. In: Proceedings of the National Academy of Sciences 104.1 (2007), pp. 36–41.
[81] Santo Fortunato and Darko Hric. “Community detection in networks: A user guide”. In: Physics Reports 659 (2016), pp. 1–44.
[82] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. “Sparse inverse covariance estimation with the graphical lasso”. In: Biostatistics 9.3 (2008), pp. 432–441.
[83] Luz Garcia-Alonso et al. “Benchmark and integration of resources for the estimation of human transcription factor activities”. In: Genome Research 29.8 (2019), pp. 1363–1375.
[84] Amir Ghasemian, Homa Hosseinmardi, and Aaron Clauset. “Evaluating overfit and underfit in models of network community structure”. In: IEEE Transactions on Knowledge and Data Engineering 32.9 (2019), pp. 1722–1735.
[85] Amir Ghasemian et al. “Detectability thresholds and optimal algorithms for community structure in dynamic networks”. In: Physical Review X 6.3 (2016), p. 031005.
[86] Edgar N Gilbert. “Random graphs”. In: The Annals of Mathematical Statistics 30.4 (1959), pp. 1141–1144.
[87] Anna Goldenberg et al. “A survey of statistical network models”. In: Foundations and Trends® in Machine Learning 2.2 (2010), pp. 129–233.
[88] Benjamin H Good, Yves-Alexandre De Montjoye, and Aaron Clauset. “Performance of modularity maximization in practical contexts”. In: Physical Review E 81.4 (2010), p. 046106.
[89] Dominic Grün, Lennart Kester, and Alexander Van Oudenaarden. “Validation of noise models for single-cell transcriptomics”. In: Nature Methods 11.6 (2014), pp. 637–640.
[90] Jian Guo et al. “Joint estimation of multiple graphical models”. In: Biometrika 98.1 (2011), pp. 1–15.
[91] Min Jin Ha, Veerabhadran Baladandayuthapani, and Kim-Anh Do. “DINGO: differential network analysis in genomics”. In: Bioinformatics 31.21 (2015), pp. 3413–3420.
[92] Heonjong Han et al. “TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions”. In: Nucleic Acids Research 46.D1 (2018), pp. D380–D386.
[93] Dongxiao He et al. “A network-specific Markov random field approach to community detection”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. 1. 2018.
[94] Randolph F Helfrich and Robert T Knight. “Oscillatory dynamics of prefrontal cognitive control”. In: Trends in Cognitive Sciences 20.12 (2016), pp. 916–930.
[95] Peter D Hoff, Adrian E Raftery, and Mark S Handcock. “Latent space approaches to social network analysis”. In: Journal of the American Statistical Association 97.460 (2002), pp. 1090–1098.
[96] Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. “Kernel methods in machine learning”. In: The Annals of Statistics (2008), pp. 1171–1220.
[97] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. “Stochastic blockmodels: First steps”. In: Social Networks 5.2 (1983), pp. 109–137.
[98] Clay B Holroyd and Michael GH Coles. “The neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity.” In: Psychological Review 109.4 (2002), p. 679.
[99] Junhui Hou et al. “Robust Laplacian matrix learning for smooth graph signals”. In: 2016 IEEE International Conference on Image Processing (ICIP). IEEE. 2016, pp. 1878–1882.
[100] Shelley R Hough et al. “Differentiation of mouse embryonic stem cells after RNA interference-mediated silencing of OCT4 and Nanog”. In: Stem Cells 24.6 (2006), pp. 1467–1475.
[101] Volker Hovestadt et al. “Resolving medulloblastoma cellular architecture by single-cell genomics”. In: Nature 572.7767 (2019), pp. 74–79.
[102] Cho-Jui Hsieh et al. “Sparse inverse covariance matrix estimation using quadratic approximation”. In: Advances in Neural Information Processing Systems 24 (2011).
[103] Vân Anh Huynh-Thu et al. “Inferring regulatory networks from expression data using tree-based methods”. In: PLoS ONE 5.9 (2010), pp. 1–10.
[104] Lucas GS Jeub, Olaf Sporns, and Santo Fortunato. “Multiresolution consensus clustering in networks”. In: Scientific Reports 8.1 (2018), pp. 1–16.
[105] Bochao Jia et al. “Learning gene regulatory networks from next generation sequencing data”. In: Biometrics 73.4 (2017), pp. 1221–1230.
[106] Sai Kiran Kadambari and Sundeep Prabhakar Chepuri. “Learning product graphs from multidomain signals”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 5665–5669.
[107] Vassilis Kalofolias. “How to learn a graph from smooth signals”. In: Artificial Intelligence and Statistics. PMLR. 2016, pp. 920–929.
[108] Vassilis Kalofolias and Nathanaël Perraudin. “Large Scale Graph Learning From Smooth Signals”. In: International Conference on Learning Representations. 2018.
[109] Vassilis Kalofolias and Nathanaël Perraudin. “Large scale graph learning from smooth signals”. In: arXiv preprint arXiv:1710.05654 (2017).
[110] Vassilis Kalofolias et al. “Learning time varying graphs”. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 2826–2830.
[111] Jiun-Yu Kao et al. “Disc-glasso: Discriminative graph learning with sparsity regularization”. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 2956–2960.
[112] Abdullah Karaaslanli et al. “scSGL: kernelized signed graph learning for single-cell gene regulatory network inference”. In: Bioinformatics 38.11 (2022), pp. 3011–3019.
[113] Brian Karrer and Mark EJ Newman. “Stochastic blockmodels and community structure in networks”. In: Physical Review E 83.1 (2011), p. 016107.
[114] Peter V Kharchenko, Lev Silberstein, and David T Scadden. “Bayesian approach to single-cell differential expression analysis”. In: Nature Methods 11.7 (2014), pp. 740–742.
[115] Bomin Kim et al. “A review of dynamic network models with latent variables”. In: Statistics Surveys 12 (2018), p. 105.
[116] Seongho Kim. “ppcor: an R package for a fast calculation to semi-partial correlation coefficients”. In: Communications for Statistical Applications and Methods 22.6 (2015), p. 665.
[117] Mikko Kivelä et al. “Multilayer networks”. In: Journal of Complex Networks 2.3 (2014), pp. 203–271.
[118] Allon M Klein et al. “Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells”. In: Cell 161.5 (2015), pp. 1187–1201.
[119] Jérôme Kunegis et al. “Spectral analysis of signed graphs for clustering, prediction and visualization”. In: Proceedings of the 2010 SIAM International Conference on Data Mining. SIAM. 2010, pp. 559–570.
[120] Jean-Philippe Lachaux et al. “Measuring phase synchrony in brain signals”. In: Human Brain Mapping 8.4 (1999), pp. 194–208.
[121] Andrea Lancichinetti and Santo Fortunato. “Consensus Clustering in Complex Networks”. In: Scientific Reports 2.1 (Mar. 2012), p. 336. issn: 2045-2322. doi: 10.1038/srep00336.
[122] Wonyul Lee and Yufeng Liu. “Joint estimation of multiple precision matrices with common structures”. In: The Journal of Machine Learning Research 16.1 (2015), pp. 1035–1062.
[123] Alexander Lex et al. “UpSet: visualization of intersecting sets”. In: IEEE Transactions on Visualization and Computer Graphics 20.12 (2014), pp. 1983–1992.
[124] Hui-Jia Li et al. “Community structure detection based on Potts model and network’s spectral characterization”. In: Europhysics Letters 97.4 (2012), p. 48005.
[125] Yu-Ru Lin et al. “Facetnet: a framework for analyzing communities and their evolutions in dynamic networks”. In: Proceedings of the 17th International Conference on World Wide Web. 2008, pp. 685–694.
[126] Fuchen Liu et al. “Global spectral clustering in dynamic networks”. In: Proceedings of the National Academy of Sciences 115.5 (2018), pp. 927–932.
[127] Han Liu, John Lafferty, and Larry Wasserman. “The nonparanormal: Semiparametric estimation of high dimensional undirected graphs.” In: Journal of Machine Learning Research 10.10 (2009).
[128] Han Liu et al. “High-dimensional semiparametric Gaussian copula graphical models”. In: The Annals of Statistics 40.4 (2012), pp. 2293–2326.
[129] Yen-Chun Liu et al. “Global regulation of nucleotide biosynthetic genes by c-Myc”. In: PLoS ONE 3.7 (2008), e2722.
[130] Zhi-Ping Liu et al. “RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse”. In: Database 2015 (2015).
[131] Yining Lu et al. “OTX2 expression contributes to proliferation and progression in Myc-amplified medulloblastoma”. In: American Journal of Cancer Research 7.3 (2017), p. 647.
[132] Xiaoke Ma and Di Dong. “Evolutionary nonnegative matrix factorization algorithms for community detection in dynamic networks”. In: IEEE Transactions on Knowledge and Data Engineering 29.5 (2017), pp. 1045–1058.
[133] Matteo Magnani et al. “Community detection in multiplex networks”. In: ACM Computing Surveys (CSUR) 54.3 (2021), pp. 1–35.
[134] Daniel Marbach et al. “Generating realistic in silico gene networks for performance assessment of reverse engineering methods”. In: Journal of Computational Biology 16.2 (2009), pp. 229–239.
[135] Daniel Marbach et al. “Wisdom of crowds for robust gene network inference”. In: Nature Methods 9.8 (2012), pp. 796–804.
[136] Hermina Petric Maretic and Pascal Frossard. “Graph Laplacian mixture model”. In: IEEE Transactions on Signal and Information Processing over Networks 6 (2020), pp. 261–270.
[137] Gonzalo Mateos et al. “Connecting the dots: Identifying network structure via graph signal processing”. In: IEEE Signal Processing Magazine 36.3 (2019), pp. 16–43.
[138] Catherine Matias and Vincent Miele. “Statistical clustering of temporal networks through a dynamic stochastic block model”. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology) 79.4 (2017), pp. 1119–1141.
[139] Hirotaka Matsumoto et al. “SCODE: an efficient regulatory network inference algorithm from single-cell RNA-Seq during differentiation”. In: Bioinformatics 33.15 (2017), pp. 2314–2321.
[140] Marcelo G Mattar, Richard F Betzel, and Danielle S Bassett. “The flexible brain”. In: Brain 139.8 (2016), pp. 2110–2112.
[141] Gerald Matz and Thomas Dittrich. “Learning signed graphs from data”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 5570–5574.
[142] Rahul Mazumder and Trevor Hastie. “The graphical lasso: New insights and alternatives”. In: Electronic Journal of Statistics 6 (2012), p. 2125.
[143] David Meunier et al. “Hierarchical modularity in human brain functional networks”. In: Frontiers in Neuroinformatics 3 (2009), p. 37.
[144] MIT Academic Calendar 2004-2005. https://web.archive.org/web/20051104091633/http://web.mit.edu/registrar/www/calendar0405.html. Accessed: 2020-12-21.
[145] Kaoru Mitsui et al. “The homeoprotein Nanog is required for maintenance of pluripotency in mouse epiblast and ES cells”. In: Cell 113.5 (2003), pp. 631–642.
[146] Thomas Moerman et al. “GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks”. In: Bioinformatics 35.12 (2019), pp. 2159–2161.
[147] Victoria Moignard et al. “Decoding the regulatory network of early blood development from single-cell gene expression measurements”. In: Nature Biotechnology 33.3 (2015), pp. 269–276.
[148] Aanchal Mongia, Debarka Sengupta, and Angshul Majumdar. “McImpute: matrix completion based imputation for single cell RNA-seq data”. In: Frontiers in Genetics 10 (2019), p. 9.
[149] James Moody and Douglas R White. “Structural cohesion and embeddedness: A hierarchical concept of social groups”. In: American Sociological Review (2003), pp. 103–127.
[150] Tim P Moran, Danielle Taylor, and Jason S Moser. “Sex moderates the relationship between worry and performance monitoring brain activity in undergraduates”. In: International Journal of Psychophysiology 85.2 (2012), pp. 188–194.
[151] Jan S Moreb et al. “RNAi-mediated knockdown of aldehyde dehydrogenase class-1A1 and class-3A1 is specific and reveals that each contributes equally to the resistance against 4-hydroperoxycyclophosphamide”. In: Cancer Chemotherapy and Pharmacology 59.1 (2007), pp. 127–136.
[152] Naomi Moris, Cristina Pina, and Alfonso Martinez Arias. “Transition states and cell fate decisions in epigenetic landscapes”. In: Nature Reviews Genetics 17.11 (2016), pp. 693–703.
[153] Peter J Mucha et al. “Community structure in time-dependent, multiscale, and multiplex networks”. In: Science 328.5980 (2010), pp. 876–878.
[154] Sumit Mukherjee et al. “Identifying progressive gene network perturbation from single-cell RNA-seq data”. In: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE. 2018, pp. 5034–5040.
[155] Sarah Feldt Muldoon and Danielle S Bassett. “Network and multilayer network approaches to understanding human brain dynamics”. In: Philosophy of Science 83.5 (2016), pp. 710–720.
[156] Sarah Feldt Muldoon, Eric W Bridgeford, and Danielle S Bassett. “Small-world propensity and weighted brain networks”. In: Scientific Reports 6 (2016), p. 22057.
[157] T. T. K. Munia and S. Aviyente. “Time-frequency based phase-amplitude coupling measure for neuronal oscillations”. In: Scientific Reports 9.1 (2019), pp. 1–15.
[158] Tamanna Tabassum Khan Munia and Selin Aviyente. “Comparison of Wavelet and RID-Rihaczek Based Methods for Phase-Amplitude Coupling”. In: IEEE Signal Processing Letters 26.12 (2019), pp. 1897–1901.
[159] Tamanna TK Munia and Selin Aviyente. “Multivariate analysis of bivariate phase-amplitude coupling in EEG data using tensor robust PCA”. In: IEEE Transactions on Neural Systems and Rehabilitation Engineering 29 (2021), pp. 1268–1279.
[160] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.
[161] Seth Myers and Jure Leskovec. “On the convexity of latent social network inference”. In: Advances in Neural Information Processing Systems 23 (2010).
[162] Antonino Naro et al. “Multiplex and multilayer network EEG analyses: a novel strategy in the differential diagnosis of patients with chronic disorders of consciousness”. In: International Journal of Neural Systems 31.02 (2021), p. 2050052.
[163] Madeline Navarro et al. “Joint Inference of Multiple Graphs from Matrix Polynomials”. In: Journal of Machine Learning Research 23.76 (2022), pp. 1–35.
[164] Madeline Navarro et al. “Joint inference of multiple graphs from matrix polynomials”. In: arXiv preprint arXiv:2010.08120 (2020).
[165] The Cancer Genome Atlas Network. “Comprehensive molecular portraits of human breast tumours”. In: Nature 490.7418 (2012), pp. 61–70.
[166] Mark Newman. Networks. Oxford University Press, 2018.
[167] Mark EJ Newman. “Equivalence between modularity optimization and maximum likelihood methods for community detection”. In: Physical Review E 94.5 (2016), p. 052315.
[168] Mark EJ Newman. “Modularity and community structure in networks”. In: Proceedings of the National Academy of Sciences 103.23 (2006), pp. 8577–8582.
[169] Mark EJ Newman. “Spectral methods for community detection and graph partitioning”. In: Physical Review E 88.4 (2013), p. 042822.
[170] Mark EJ Newman and Aaron Clauset. “Structure and inference in annotated networks”. In: Nature Communications 7.1 (2016), p. 11863.
[171] Mark EJ Newman and Michelle Girvan. “Finding and evaluating community structure in networks”. In: Physical Review E 69.2 (2004), p. 026113.
[172] Mark EJ Newman and Duncan J Watts. “Renormalization group analysis of the small-world network model”. In: Physics Letters A 263.4-6 (1999), pp. 341–346.
[173] Kathy K Niakan et al. “Sox17 promotes differentiation in mouse embryonic stem cells by directly regulating extraembryonic gene expression and indirectly antagonizing self-renewal”. In: Genes & Development 24.3 (2010), pp. 312–326.
[174] Roland Nigbur et al. “Theta dynamics reveal domain-specific control over stimulus and response conflict”. In: Journal of Cognitive Neuroscience 24.5 (2012), pp. 1264–1274.
[175] Huazhong Ning et al. “Incremental spectral clustering by efficiently updating the eigen-system”. In: Pattern Recognition 43.1 (2010), pp. 113–127.
[176] Paul A Northcott et al. “Medulloblastoma”. In: Nature Reviews Disease Primers 5.1 (2019), pp. 1–20.
[177] Paul A Northcott et al. “Medulloblastoma comprises four distinct molecular variants”. In: Journal of Clinical Oncology 29.11 (2011), p. 1408.
[178] Antonio Ortega et al. “Graph signal processing: Overview, challenges, and applications”. In: Proceedings of the IEEE 106.5 (2018), pp. 808–828.
[179] Alp Ozdemir et al. “Hierarchical spectral consensus clustering for group analysis of functional brain networks”. In: IEEE Transactions on Biomedical Engineering 62.9 (2015), pp. 2158–2169.
[180] Tolga Esat Özkurt and Alfons Schnitzler. “A critical note on the definition of phase–amplitude cross-frequency coupling”. In: Journal of Neuroscience Methods 201.2 (2011), pp. 438–443.
[181] A Roxana Pamfil et al. “Relating modularity maximization and stochastic block models in multilayer networks”. In: SIAM Journal on Mathematics of Data Science 1.4 (2019), pp. 667–698.
[182] Bastien Pasdeloup et al. “Characterization and inference of graph diffusion processes from observations of stationary signals”. In: IEEE Transactions on Signal and Information Processing over Networks 4.3 (2017), pp. 481–496.
[183] Lucrezia Patruno et al. “A review of computational strategies for denoising and imputation of single-cell transcriptomic data”. In: Briefings in Bioinformatics 22.4 (2021), bbaa222.
[184] Tiago P Peixoto. “Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models”. In: Physical Review E 89.1 (2014), p. 012804.
[185] Tiago P Peixoto. “Parsimonious module inference in large networks”. In: Physical Review Letters 110.14 (2013), p. 148701.
[186] Raphael Petegrosso, Zhuliu Li, and Rui Kuang. “Machine learning and statistical methods for clustering single-cell RNA-sequencing data”. In: Briefings in Bioinformatics 21.4 (2020), pp. 1209–1223.
[187] Emma Pierson and Christopher Yau. “ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis”. In: Genome Biology 16.1 (2015), pp. 1–10.
[188] Ronald S Pimentel, Magdalena Niewiadomska-Bugaj, and Jung-Chao Wang. “Association of zero-inflated continuous variables”. In: Statistics & Probability Letters 96 (2015), pp. 61–67.
[189] Soumajit Pramanik et al. “Discovering community structure in multilayer networks”. In: 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE. 2017, pp. 611–620.
[190] Aditya Pratapa et al. “Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data”. In: Nature Methods 17.2 (2020), pp. 147–154.
[191] Carey E Priebe et al. “Scan statistics on Enron graphs”. In: Computational & Mathematical Organization Theory 11 (2005), pp. 229–247.
[192] Liudmila Prokhorenkova and Alexey Tikhonov. “Community detection through likelihood optimization: in search of a sound model”. In: The World Wide Web Conference. 2019, pp. 1498–1508.
[193] Laralynne M Przybyla and Joel Voldman. “Attenuation of extrinsic signaling reveals the importance of matrix remodeling on maintenance of embryonic stem cell self-renewal”. In: Proceedings of the National Academy of Sciences 109.3 (2012), pp. 835–840.
[194] Maria Grazia Puxeddu, Manuela Petti, and Laura Astolfi. “A comprehensive analysis of multilayer community detection algorithms for application to EEG-based brain networks”. In: Frontiers in Systems Neuroscience 15 (2021), p. 624183.
[195] Thomas P Quinn et al. “propr: an R-package for identifying proportionally abundant features using compositional data analysis”. In: Scientific Reports 7.1 (2017), pp. 1–9.
[196] Michael G Rabbat, Mark J Coates, and Robert D Nowak. “Multiple-source Internet tomography”. In: IEEE Journal on Selected Areas in Communications 24.12 (2006), pp. 2221–2234.
[197] P Krishna Reddy et al. “A graph based approach to extract a neighborhood customer community for collaborative filtering”. In: International Workshop on Databases in Networked Information Systems. Springer. 2002, pp. 188–200.
[198] Jörg Reichardt and Stefan Bornholdt. “Detecting fuzzy community structures in complex networks with a Potts model”. In: Physical Review Letters 93.21 (2004), p. 218701.
[199] Jörg Reichardt and Stefan Bornholdt. “Statistical mechanics of community detection”. In: Physical Review E 74.1 (2006), p. 016110.
[200] Justin Riddle, Amber McFerren, and Flavio Frohlich. “Causal role of cross-frequency coupling in distinct components of cognitive control”. In: Progress in Neurobiology 202 (2021), p. 102033.
[201] Davide Risso et al. “A general and flexible method for signal extraction from single-cell RNA-seq data”. In: Nature Communications 9.1 (2018), pp. 1–17.
[202] Ribana Roscher et al. “Explainable machine learning for scientific insights and discoveries”. In: IEEE Access 8 (2020), pp. 42200–42216.
[203] Giulio Rossetti and Rémy Cazabet. “Community discovery in dynamic networks: a survey”. In: ACM Computing Surveys (CSUR) 51.2 (2018), pp. 1–37.
[204] Martin Rosvall and Carl T Bergstrom. “Maps of random walks on complex networks reveal community structure”. In: Proceedings of the National Academy of Sciences 105.4 (2008), pp. 1118–1123.
[205] Martine F Roussel and Giles W Robinson. “Role of MYC in Medulloblastoma”. In: Cold Spring Harbor Perspectives in Medicine 3.11 (2013), a014308.
[206] Liu Rui et al. “Simultaneous low-rank component and graph estimation for high-dimensional graph signals: Application to brain imaging”. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 4134–4138.
[207] Assieh Saadatpour et al. “Characterizing heterogeneity in leukemic cells using single-cell gene expression analysis”. In: Genome Biology 15.12 (2014), pp. 1–13.
[208] Seyed Saman Saboksayr, Gonzalo Mateos, and Mujdat Cetin. “Online discriminative graph learning from multi-class smooth signals”. In: Signal Processing 186 (2021), p. 108101.
[209] D Franco Saldana, Yi Yu, and Yang Feng. “How many communities are there?” In: Journal of Computational and Graphical Statistics 26.1 (2017), pp. 171–181.
[210] Aliaksei Sandryhaila and Jose MF Moura. “Discrete signal processing on graphs: Frequency analysis”. In: IEEE Transactions on Signal Processing 62.12 (2014), pp. 3042–3054.
[211] Guido Sanguinetti and Vân Anh Huynh-Thu. Gene regulatory networks. Springer, 2019.
[212] Stefania Sardellitti, Sergio Barbarossa, and Paolo Di Lorenzo. “Enabling prediction via multi-layer graph inference and sampling”. In: 2019 13th International Conference on Sampling Theory and Applications (SampTA). IEEE. 2019, pp. 1–4.
[213] Purnamrita Sarkar and Andrew W Moore. “Dynamic social network analysis using latent space models”. In: Advances in Neural Information Processing Systems 18 (2006), p. 1145.
[214] Shuntaro Sasai et al. “Frequency-specific network topologies in the resting human brain”. In: Frontiers in Human Neuroscience 8 (2014), p. 1022.
[215] Shuntaro Sasai et al. “Frequency-specific task modulation of human brain functional networks: A fast fMRI study”. In: NeuroImage 224 (2021), p. 117375.
[216] Thomas Schaffter, Daniel Marbach, and Dario Floreano. “GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods”. In: Bioinformatics 27.16 (2011), pp. 2263–2270.
[217] Holger Scheel and Stefan Scholtes. “Mathematical programs with complementarity constraints: Stationarity, optimality, and sensitivity”. In: Mathematics of Operations Research 25.1 (2000), pp. 1–22.
[218] William W Seeley et al. “Neurodegenerative diseases target large-scale human brain networks”. In: Neuron 62.1 (2009), pp. 42–52.
[219] Santiago Segarra et al. “Network topology inference from spectral templates”. In: IEEE Transactions on Signal and Information Processing over Networks 3.3 (2017), pp. 467–483.
[220] Rasoul Shafipour et al. “Identifying the topology of undirected networks from diffused non-stationary graph signals”. In: IEEE Open Journal of Signal Processing 2 (2021), pp. 171–189.
[221] Esraa Al-sharoa, Mahmood A Al-khassaweneh, and Selin Aviyente. “Detecting and tracking community structure in temporal networks: A low-rank + sparse estimation based evolutionary clustering approach”. In: IEEE Transactions on Signal and Information Processing over Networks 5.4 (2019), pp. 723–738.
[222] John Shawe-Taylor, Nello Cristianini, et al. Kernel methods for pattern analysis. Cambridge University Press, 2004.
[223] Jonathan Richard Shewchuk. “Allow Me to Introduce Spectral and Isoperimetric Graph Partitioning”. In: (Apr. 2016), p. 69.
[224] Hao-Jun Michael Shi et al. “A primer on coordinate descent algorithms”. In: arXiv preprint arXiv:1610.00040 (2016).
[225] Jianbo Shi and Jitendra Malik. “Normalized cuts and image segmentation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 22.8 (2000), pp. 888–905.
[226] Wenjing Shi et al. “Regulation of the pluripotency marker Rex-1 by Nanog and Sox2”. In: Journal of Biological Chemistry 281.33 (2006), pp. 23319–23325.
[227] David I Shuman et al. “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains”. In: IEEE Signal Processing Magazine 30.3 (2013), pp. 83–98.
[228] Arlei Silva, Ambuj Singh, and Ananthram Swami. “Spectral algorithms for temporal graph cuts”. In: Proceedings of the 2018 World Wide Web Conference. 2018, pp. 519–528.
[229] Justin D Silverman et al. “Naught all zeros in sequence count data are the same”. In: Computational and Structural Biotechnology Journal 18 (2020), p. 2789.
[230] Michael A Skinnider, Jordan W Squair, and Leonard J Foster. “Evaluating measures of association for single-cell transcriptomics”. In: Nature Methods 16.5 (2019), pp. 381–386.
“Evaluating measures of association for single-cell transcriptomics”. In: Nature methods 16.5 (2019), pp. 381–386. [231] Tom AB Snijders and Krzysztof Nowicki. “Estimation and prediction for stochastic blockmodels for graphs with latent block structure”. In: Journal of classification 14.1 (1997), pp. 75–100. [232] Olaf Sporns and Richard F Betzel. “Modular brain networks”. In: Annual review of psychology 67 (2016), pp. 613–640. [233] Oliver Stegle, Sarah A Teichmann, and John C Marioni. “Computational and analytical challenges in single-cell transcriptomics”. In: Nature Reviews Genetics 16.3 (2015), pp. 133–145. [234] Alexander Strehl and Joydeep Ghosh. “Cluster ensembles—a knowledge reuse framework for combining multiple partitions”. In: Journal of machine learning research 3.Dec (2002), pp. 583– 617. [235] Tim Stuart et al. “Comprehensive integration of single-cell data”. In: Cell 177.7 (2019), pp. 1888– 1902. [236] Valentine Svensson. “Droplet scRNA-seq is not zero-inflated”. In: Nature Biotechnology 38.2 (2020), pp. 147–150. [237] Damian Szklarczyk et al. “The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets”. In: Nucleic acids research 49.D1 (2021), pp. D605–D612. [238] Craig E Tenke and Jürgen Kayser. “Generator localization by current source density (CSD): impli- cations of volume conduction and field closure at intracranial and scalp resolutions”. In: Clinical neurophysiology 123.12 (2012), pp. 2328–2345. 118 [239] Alessandro Tessitore et al. “Default-mode network connectivity in cognitively unimpaired patients with Parkinson disease”. In: Neurology 79.23 (2012), pp. 2226–2232. [240] Prejaas Tewarie et al. “Integrating cross-frequency and within band functional networks in resting- state MEG: a multi-layer network approach”. In: Neuroimage 142 (2016), pp. 324–336. [241] Dorina Thanou et al. “Learning heat diffusion graphs”. In: IEEE Transactions on Signal and Information Processing over Networks 3.3 (2017), pp. 484–499. [242] Adriano BL Tort et al. “Measuring phase-amplitude coupling between neuronal oscillations of different frequencies”. In: Journal of neurophysiology 104.2 (2010), pp. 1195–1210. [243] Damon JA Toth et al. “The role of heterogeneity in contact timing and duration in network models of influenza spread in schools”. In: Journal of The Royal Society Interface 12.108 (2015), p. 20150279. [244] Vincent A Traag, Rodrigo Aldecoa, and J-C Delvenne. “Detecting communities using asymptotical surprise”. In: Physical review e 92.2 (2015), p. 022816. [245] Vincent A Traag, Paul Van Dooren, and Yurii Nesterov. “Narrow scope for resolution-limit-free community detection”. In: Physical Review E 84.1 (2011), p. 016114. [246] Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. “From Louvain to Leiden: guaranteeing well-connected communities”. In: Scientific reports 9.1 (2019), pp. 1–12. [247] Michael Vaiana and Sarah Feldt Muldoon. “Multilayer brain networks”. In: Journal of Nonlinear Science (2018), pp. 1–23. [248] Michael Vaiana and Sarah Feldt Muldoon. “Multilayer brain networks”. In: Journal of Nonlinear Science 30.5 (2020), pp. 2147–2169. [249] Ulrike Von Luxburg. “A tutorial on spectral clustering”. In: Statistics and computing 17 (2007), pp. 395–416. [250] Emily M Walker, Cayla A Thompson, and Michele A Battle. “GATA4 and GATA6 regulate intestinal epithelial cytodifferentiation during development”. In: Developmental biology 392.2 (2014), pp. 283–294. 
[251] Yu Wang, Wotao Yin, and Jinshan Zeng. “Global convergence of ADMM in nonconvex nonsmooth optimization”. In: Journal of Scientific Computing 78.1 (2019), pp. 29–63. [252] Zhaoning Wang et al. “Cell-Type-Specific Gene Regulatory Networks Underlying Murine Neonatal Heart Regeneration at Single-Cell Resolution”. In: Cell reports 33.10 (2020), p. 108472. [253] Alistair J Watt et al. “Development of the mammalian liver and ventral pancreas is dependent on GATA4”. In: BMC developmental biology 7.1 (2007), pp. 1–11. [254] Duncan J Watts and Steven H Strogatz. “Collective dynamics of ‘small-world’networks”. In: nature 393.6684 (1998), pp. 440–442. 119 [255] Nuosi Wu et al. “Joint learning of multiple gene networks from single-cell gene expression data”. In: Computational and structural biotechnology journal 18 (2020), pp. 2583–2595. [256] Zonghan Wu et al. “A comprehensive survey on graph neural networks”. In: IEEE transactions on neural networks and learning systems 32.1 (2020), pp. 4–24. [257] Kevin Xu. “Stochastic block transition models for dynamic networks”. In: Artificial Intelligence and Statistics. PMLR. 2015, pp. 1079–1087. [258] Kevin S Xu and Alfred O Hero. “Dynamic stochastic blockmodels for time-evolving social networks”. In: IEEE Journal of Selected Topics in Signal Processing 8.4 (2014), pp. 552–562. [259] Kevin S Xu, Mark Kliger, and Alfred O Hero III. “Adaptive evolutionary clustering”. In: Data Mining and Knowledge Discovery 28 (2014), pp. 304–336. [260] Yangyang Xu and Wotao Yin. “A Block Coordinate Descent Method for Regularized Multiconvex Optimization with Applications to Nonnegative Tensor Factorization and Completion”. In: SIAM Journal on Imaging Sciences 6.3 (Sept. 2013), pp. 1758–1789. issn: 1936-4954. doi: 10.1137/ 120887795. [261] Zhigang Xue et al. “Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing”. In: Nature 500.7464 (2013), pp. 593–597. [262] Inbal Yahav and Galit Shmueli. “On generating multivariate Poisson data in management science applications”. In: Applied Stochastic Models in Business and Industry 28.1 (2012), pp. 91–102. [263] Koki Yamada, Yuichi Tanaka, and Antonio Ortega. “Time-varying graph learning based on sparse- ness of temporal variation”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 5411–5415. [264] Tianbao Yang et al. “Detecting communities and their evolutions in dynamic social networks—a Bayesian approach”. In: Machine learning 82 (2011), pp. 157–189. [265] Wencheng Yin et al. “Emergence of co-expression in gene regulatory networks”. In: PloS one 16.4 (2021), e0247671. [266] Meichen Yu et al. “Selective impairment of hippocampus and posterior hub areas in Alzheimer’s disease: an MEG-based multiplex network study”. In: Brain 140.5 (2017), pp. 1466–1485. [267] R. Zass and A. Shashua. “A Unifying Approach to Hard and Probabilistic Clustering”. In: Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1. Beijing, China: IEEE, Oct. 2005, 294–301 Vol. 1. isbn: 978-0-7695-2334-7. doi: 10.1109/ICCV.2005.27. [268] Xiao Zhang, Cristopher Moore, and Mark EJ Newman. “Random graph models for dynamic networks”. In: The European Physical Journal B 90 (2017), pp. 1–14. [269] Ziwei Zhang, Peng Cui, and Wenwu Zhu. “Deep learning on graphs: A survey”. In: IEEE Transactions on Knowledge and Data Engineering (2020). 120 [270] Qing Zhou et al. “A gene regulatory network in mouse embryonic stem cells”. 
In: Proceedings of the National Academy of Sciences 104.42 (2007), pp. 16438–16443. [271] Hongliang Zou and Jian Yang. “Multi-frequency dynamic weighted functional connectivity networks for schizophrenia diagnosis”. In: Applied Magnetic Resonance 50.7 (2019), pp. 847–859. 121