LEARNING WITH STRUCTURES

By

Yang Zhou

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2010

ABSTRACT

LEARNING WITH STRUCTURES

By Yang Zhou

In this dissertation we discuss learning with structures, which appears frequently in both machine learning theory and applications. We first review existing structure learning algorithms, and then study several problems of particular interest. The first problem is the structure learning of dynamic systems: we investigate using dynamic Bayesian networks to reconstruct functional cortical networks from the spike trains of neurons. Next we study structure learning through matrix factorization, which has been a popular research area in recent years. We propose an efficient non-negative matrix factorization algorithm that derives not only the membership assignments to the clusters but also the interaction strengths among the clusters. Following that, we study hierarchical and grouped structures in regularization. We propose a novel regularizer, the exclusive lasso, which introduces competition among the variables within a group and thus results in sparse solutions. Finally, we study the sparse structure in a novel problem of online feature selection, and propose an online learning algorithm that only needs to sense a small number of attributes before a reliable decision can be made.

ACKNOWLEDGMENTS

First and foremost I want to thank my advisor Dr. Rong Jin. It has been an honor to be his Ph.D. student. I appreciate all his contributions of time, ideas, and funding that made my Ph.D. experience productive and stimulating. The joy and enthusiasm he has for his research were contagious and motivational for me, even during tough times in the Ph.D. pursuit. I am also thankful for the excellent example he has provided as a successful researcher. I would like to thank Dr. Christina Chan, who served as my co-advisor.
She advised me in my biostatistics-related research projects. Together with Dr. Jin, she helped me a great deal during the most difficult time of my Ph.D. study. I am deeply impressed by her energy and passion for academic research. I also thank the other two committee members, Dr. Pang-Ning Tan and Dr. Sarat Dass, for their critical comments, which enabled me to notice the weaknesses of my dissertation and make the necessary improvements. The research related to reconstructing gene regulatory networks in this dissertation was a collaborative result with the Chan group. I am thankful to Zheng Li, Xuerui Yang, Linxia Zhang, Shireesh Srivastava and Xuewei Wang for their help and cooperation. The research related to reconstructing cortical networks was the result of a productive collaboration with Dr. Karim G. Oweiss and his student Dr. Seif Eldawlatly. I am thankful to my previous advisor Dr. Juyang Weng for his support and valuable suggestions. Special thanks to my labmates, Zhengping Ji, Wei Tong, Tianbao Yang, Yi Liu, Jinfeng Yi, Fengjie Li, Hamed Valizadegan, Mehrdad Mahdavi, Ming Wu, Chen Zhang, Feilong Chen and Nan Zhang. It has been a pleasure discussing various problems with them, which gave me enlightenment and inspiration for my research. I also used to hang out with them in my spare time, which provided a lot of fun during the long and sometimes monotonous life in the East Lansing area. Many people on the faculty and staff of the Computer Science and Engineering department at MSU assisted and encouraged me in various ways during my studies. I am especially grateful to Prof. George Stockman, Prof. Anil Jain, Dr. Eric Torng, Linda Moore and Norma Teague. My most sincere and heartfelt thanks go to my wife for her unlimited and unconditional encouragement, support and love. Learning to love you and to receive your love makes me a better person. You have my everlasting love.

TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES

1 Introduction

2 Learning with Structures: a Comprehensive Review
  2.1 Structure Learning of Probabilistic Graphical Models
    2.1.1 Preliminaries
    2.1.2 Graphical Models
    2.1.3 Network Topology
    2.1.4 Structure Learning of Graphical Models
  2.2 Constraint-based Algorithms
  2.3 Score-based Algorithms
    2.3.1 Score Metrics
    2.3.2 Search for the Optimal Structure
  2.4 Regression-based Structure Learning
    2.4.1 Regression Model
    2.4.2 Structure Learning through Regression
  2.5 Clustering Based Structure Learning
  2.6 Matrix Factorization Based Structure Learning
  2.7 Regularization Based Structure Learning
  2.8 Hybrid Methods

3 Inferring Functional Cortical Networks Using Dynamic Bayesian Networks
  3.1 Introduction
  3.2 Dynamic System Reconstruction
  3.3 HMM and DBN
    3.3.1 Discrete Markov Process
    3.3.2 Hidden Markov Model
    3.3.3 Basic Usages of HMM
    3.3.4 Limitations of HMM
    3.3.5 Dynamic Bayesian Network
  3.4 Experiments
    3.4.1 Population Model
    3.4.2 Granger Causality
    3.4.3 Experimental Settings
    3.4.4 Experiments with Excitatory Connections
    3.4.5 Experiments with Inhibitory Connections
    3.4.6 Experiments with Non-linear Connections
  3.5 Conclusion

4 Using Knowledge Driven Matrix Factorization to Reconstruct Modular Gene Regulatory Network
  4.1 Introduction
  4.2 Matrix Factorization
    4.2.1 Non-negative Matrix Factorization and Extensions
    4.2.2 Maximum-Margin Matrix Factorization
    4.2.3 Limitations with Existing Matrix Factorization Methods
  4.3 A Framework for Knowledge Driven Matrix Factorization (KMF)
    4.3.1 Matrix Reconstruction Error
    4.3.2 Consistency with the Prior Knowledge from GO
    4.3.3 Gene Module Network with Hierarchical Scale-free Structure
    4.3.4 Matrix Factorization Framework for Hierarchical Module Network Reconstruction
  4.4 Solving the Constrained Matrix Factorization
    4.4.1 Determining Parameters α and β
  4.5 Experimental Results and Discussion
    4.5.1 Datasets
    4.5.2 Evaluation of KMF on Yeast Cell Cycle Data
    4.5.3 Application to Identifying Gene Modules and Modular Network in Liver Cells
  4.6 Conclusion

5 Exclusive Lasso for Multi-task Feature Selection
  5.1 Grouped and Hierarchical Regularization
    5.1.1 Group Lasso
    5.1.2 Hierarchical Penalization
    5.1.3 Composite Absolute Penalties
    5.1.4 Limitations with Existing Group Regularizers
  5.2 Exclusive Lasso
    5.2.1 Multi-task Feature Selection
    5.2.2 Multiple Kernel Learning
    5.2.3 Multi-Task Learning with Linear Classifiers
    5.2.4 Understanding the Exclusive Lasso Regularizer
    5.2.5 Multi-Task Learning with Kernel Classifiers
    5.2.6 Algorithm
  5.3 Experiments
    5.3.1 Evaluation
    5.3.2 Results
  5.4 Conclusion

6 Sparse Online Feature Selection
  6.1 Introduction
  6.2 Related Work
  6.3 A Naive Approach
  6.4 L1 Projection for Online Feature Selection
  6.5 Experiments
    6.5.1 Datasets
    6.5.2 Experimental Results
  6.6 Conclusion

7 Conclusion

A Notations
B Pearson Correlation
C Bregman Divergence
D Kullback-Leibler Divergence
E Mutual Information

Bibliography

LIST OF TABLES

4.1 Variations of NMF algorithms with different divergence measures
4.2 PWF1 measure of the experimental results on the Yeast cell cycle dataset
5.1 Variations of the CAP regularizer
5.2 Information of the Yahoo data collection
6.1 Information of the 20 Newsgroup and Reuters datasets

LIST OF FIGURES

2.1 Ising model
2.2 A Bayesian network for detecting credit-card fraud
2.3 Illustration of the L1 and L2 regularization
2.4 Hierarchical clustering
3.1 The performance of using DBN to infer neuron connections at fixed latencies
3.2 Network topology with inhibitory feedback
4.1 Visualization of the clustering results generated by KMF on yeast dataset
4.2 An example of the functional modules identified by KMF
4.3 Connections between energy production modules uncovered by KMF
5.1 Admissible sets for various regularizers
5.2 AUC of exclusive lasso on the Yahoo data collections
6.1 Loss functions
6.2 Cumulative accuracy of Algorithm 6.2
6.3 Offline accuracy of Algorithm 6.2
C.1 Bregman divergence

CHAPTER 1

Introduction

In this dissertation we discuss learning with structures, which appears frequently in both machine learning theory and applications. Structures can be found in many areas of study. In biology, the gene regulatory network is a collection of DNA segments in a cell. These segments interact with each other indirectly through their RNA and protein expression products, thereby governing the rates at which genes are transcribed into mRNA.
In systems neuroscience, numerous functional neuroimaging studies suggest that cortical regions selectively couple to each other. These interconnected regions work in concert to mediate perception, learning, and sensory and motor processing. It is essential to reconstruct this neural network in order to understand its internal mechanism. A large amount of research has been devoted to structure learning, and many models and algorithms have been developed, including, for example, Bayesian networks, Gaussian graphical models, Markov random fields, and group lasso regularization. Despite the significant efforts made in learning with structures, a number of challenges remain to be addressed:

• In many applications we are interested in the temporal effects in dynamic structures. Note that the term "dynamic" means that we are modeling a dynamic system, not that the structure changes over time. For example, in functional neural networks, presynaptic and postsynaptic neurons are connected by channels capable of passing electrical current, so that electrical spiking (firing) in the presynaptic neuron influences, either excitatorily or inhibitorily, the spiking of the postsynaptic neuron. Although this kind of dynamic structure is found in many applications, learning with dynamic structures remains comparatively understudied.

• Many of the structures we study are large-scale, leading to computational challenges for existing algorithms. For example, a gene regulatory network may consist of hundreds, thousands or even more DNA segments, and in order to study the mechanism of cortical networks we may have millions of neurons at our disposal. However, most existing algorithms are designed to handle only small or medium scale datasets. For example, Bayesian network learning algorithms can usually learn structures with at most a few hundred nodes.
The computational complexity grows exponentially with the number of nodes, which hinders us from exploring the structures of large sets of variables.

• Although structure information often appears explicitly, as with the conditional dependencies in a Bayesian network, which can be intuitively visualized as a directed graph, it is sometimes implicit and thus difficult to explore. Many algorithms, including group lasso and Composite Absolute Penalties (CAP), have been proposed to exploit the information embedded in group structures. However, these algorithms are restricted to discovering positive correlations among the variables within groups. Specifically, they assume that if a few variables in a group are important, then most of the variables in the same group should also be important. In many real-world applications, however, we make the opposite observation: variables in the same group exhibit negative correlations by competing with each other. For instance, in visual object recognition, the signature visual patterns of different objects tend to be negatively correlated, i.e., visual patterns valuable for recognizing one object tend to be less useful for other objects. How to infer this kind of negative correlation in implicit group structures remains an open question.

• The sparse structure is attractive in many applications; one such application is online learning. Most existing online learning algorithms maintain at least one weight for every feature. Although some algorithms have recently been proposed to introduce sparsity into the model, they lack a hard constraint on the number of non-zero features, and thus need access to the full information of each sample in every iteration. On the other hand, a sparse representation of the online learning model has many advantages. For example, the time and space complexity are significantly reduced, and the constraints on sparsity may also help avoid overfitting.
In some applications the acquisition of samples is costly, and thus the number of features available is limited by our budget.

This dissertation is devoted to tackling these challenges in learning with structures. Specifically, it presents research on the following topics: (i) application of dynamic structure learning to reconstructing cortical networks; (ii) algorithms for structure learning that can efficiently handle large scale datasets; (iii) a new kind of regularization mechanism for explicitly exploring negative correlations in implicit group structures; (iv) new online feature selection algorithms with hard sparsity constraints on the number of features.

The rest of this dissertation is organized as follows. Chapter 2 gives a comprehensive review of existing structure learning algorithms. In particular, structure learning of probabilistic graphical models receives special attention, since it is representative of many structure learning problems and has been well studied, with many models and algorithms developed. In Chapter 3, we discuss dynamic structure learning from time-series or sequential data. We give a detailed introduction to Hidden Markov Models; as a typical time-state model, the HMM has been widely used in various engineering fields for several decades, and its extension naturally leads to dynamic Bayesian networks. We study the application of dynamic Bayesian networks to the reconstruction of functional cortical networks. In Chapter 4, we study structure learning through matrix factorization, which has been a popular research area in recent years. An efficient non-negative matrix factorization algorithm is proposed. Unlike traditional matrix factorization methods for clustering, the proposed algorithm not only derives the membership assignments to the clusters, but also computes the strength of interactions among the clusters.
The proposed algorithm is applied to the reconstruction of gene regulatory networks, and it shows superior performance in our experiments compared to state-of-the-art algorithms. In Chapter 5 we study the implicit and indirect structures that result from regularization. We are particularly interested in grouped and hierarchical regularization, and we propose a novel exclusive lasso regularizer. Unlike the group lasso, which assumes positive correlations within groups (namely, if one variable in a group is believed to be important, then all variables in the group should be important), the proposed exclusive lasso regularizer introduces competition among the variables in a group, resulting in a sparse solution in which the most prominent variable in a group stands out and the others diminish. We apply this regularization to multi-task learning in our experiments, where it shows superior performance compared with other state-of-the-art algorithms. In Chapter 6, we further investigate the sparse structure in online learning, specifically in a novel problem of online feature selection. Most online learning studies assume that the learner has full access to all input features. However, in many real-world applications it is expensive, either computationally or financially, to acquire and use all the input attributes. In this case it is desirable to develop online learning algorithms that only need to sense a small number of attributes before a reliable decision can be made. We make a first step towards solving this problem, and develop theories and algorithms for sparse online feature selection. Specifically, we develop general algorithms for sparse online feature selection and examine their theoretical properties, such as upper and lower bounds on the regret. We evaluate the proposed algorithms through extensive experiments on benchmark datasets.
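To make the contrast between the group lasso and the exclusive lasso concrete, the following sketch compares the two penalties on toy weight vectors. It assumes the commonly stated forms: the group lasso penalty is the sum of per-group ℓ2 norms, and the exclusive lasso penalty is the sum of per-group squared ℓ1 norms (the precise formulation is developed in Chapter 5); the group layout and weight values here are made up purely for illustration.

```python
import math

def group_lasso(w, groups):
    """Group lasso penalty: sum over groups of the l2 norm of each group."""
    return sum(math.sqrt(sum(w[j] ** 2 for j in g)) for g in groups)

def exclusive_lasso(w, groups):
    """Exclusive lasso penalty (assumed form): sum over groups of the
    squared l1 norm of each group."""
    return sum(sum(abs(w[j]) for j in g) ** 2 for g in groups)

# Two groups of two features each; the same total weight mass, placed
# either across groups (one variable per group) or within one group.
groups = [[0, 1], [2, 3]]
w_across = [1.0, 0.0, 1.0, 0.0]
w_within = [1.0, 1.0, 0.0, 0.0]

print(group_lasso(w_across, groups), group_lasso(w_within, groups))
# 2.0 vs ~1.414: group lasso favors concentrating mass in fewer groups
print(exclusive_lasso(w_across, groups), exclusive_lasso(w_within, groups))
# 2.0 vs 4.0: exclusive lasso favors one prominent variable per group
```

The numbers illustrate the opposite biases: the group lasso rewards putting all the mass into a single group, while the exclusive lasso rewards spreading mass across groups with competition inside each group.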
CHAPTER 2

Learning with Structures: a Comprehensive Review

In this chapter we review the existing models and algorithms for learning with structures. An important family of structures are the probabilistic graphical models, which combine graph theory and probability theory to provide multivariate statistical modeling. Many graphical models have been developed so far, e.g., Bayesian networks, Gaussian graphical models, and Markov random fields, among others. The structure learning of probabilistic graphical models has been well studied, and many algorithms have been proposed. We will review the two most popular approaches, constraint-based algorithms and score-based algorithms, as well as other structure learning algorithms for different models. These algorithms can roughly be categorized into regression based approaches, matrix factorization based approaches, group regularization based approaches, and hybrid methods.

2.1 Structure Learning of Probabilistic Graphical Models

Probabilistic graphical models combine graph theory and probability theory to provide multivariate statistical modeling. They offer a unified description of uncertainty, through probability, and of complexity, through the graphical model. In particular, graphical models provide the following useful properties:

• Graphical models provide a simple and intuitive interpretation of the structures of probabilistic models. Conversely, they can be used to design and motivate new models.

• Graphical models provide additional insights into the properties of the model, including its conditional independence properties.

• The complex computations required to perform inference and learning in sophisticated models can be expressed in terms of graphical manipulations, in which the underlying mathematical expressions are carried along implicitly.
Graphical models have been applied in a large number of fields, including bioinformatics, social science, control theory, image processing, and marketing analysis, among others. However, structure learning for graphical models remains an open challenge, since one must cope with a combinatorial search over the space of all possible structures.

2.1.1 Preliminaries

We first define a set of notations which will be used in this chapter. We represent a graph as G = (V, E), where V = {vi} is the set of nodes in the graph and each node corresponds to a random variable xi ∈ X. E = {(vi, vj) : i ≠ j} is the set of edges. In a directed graph, if there is an edge Ei,j from vi to vj, then vi is a parent of node vj and vj is a child of node vi. A directed graph with no cycles is called a directed acyclic graph (DAG). The numbers of nodes and edges in a graph are denoted by |V| and |E| respectively, and π(i) denotes the set of all parents of node vi. U = {x1, · · · , xn} denotes the finite set of discrete random variables, where each variable xi may take on values from a finite domain. Val(xi) denotes the set of values that variable xi may attain, and |xi| = |Val(xi)| denotes the cardinality of this set. In a probabilistic graphical network, the Markov blanket ∂vi [Pearl, 1988] of a node vi is defined as the set of nodes each of which has an edge to vi, i.e., all vj such that (vi, vj) ∈ E. The Markov assumption states that in a probabilistic graphical network, every set of nodes in the network is conditionally independent of vi when conditioned on its Markov blanket ∂vi.

Figure 2.1: An Ising model with 9 nodes.
Formally, for distinct nodes vi and vk,

P(vi | ∂vi, vk) = P(vi | ∂vi)

The Markov blanket of a node gives a localized probabilistic interpretation of the node, since it identifies all the variables that shield the node from the rest of the network; in other words, the Markov blanket of a node is the only information necessary to predict the behavior of that node. A DAG G is an I-map of a distribution P if all the Markov assumptions implied by G are satisfied by P.

Theorem 2.1.1 (Factorization Theorem). If G is an I-map of P, then

P(x1, · · · , xn) = ∏i P(xi | xπ(i))

According to this theorem, when G is sparse we can represent P compactly, with the number of parameters needed linear in the number of variables. The theorem also holds in the reverse direction: if P factorizes according to G, then G is an I-map of P. The set X is d-separated from set Y given set Z if all paths from a node in X to a node in Y are blocked given Z.

Graphical models can essentially be divided into two groups: directed graphical models and undirected graphical models.

2.1.2 Graphical Models

In this section we review some of the most commonly used graphical models.

Markov Random Field

A Markov Random Field (MRF) is defined as a pair M = (G, Φ). Here G = (V, E) is an undirected graph, where V = {Vi} is the set of nodes, each corresponding to a random variable in X, and E = {(Vi, Vj) : i ≠ j} is the set of undirected edges. The existence of an edge {u, v} indicates a dependency between the random variables u and v. Φ is a set of potential functions (also called factors or clique potentials) associated with the maximal cliques in the graph G. Each potential function φc(·) has as its domain some clique c in G, and maps possible joint assignments (to the elements of c) to non-negative real values. A maximal clique of a graph is a fully connected subgraph that cannot be further extended. We use C to represent the set of maximal cliques in the graph.
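The factorization theorem can be checked numerically on a toy DAG. The sketch below uses a three-node network x1 → x3 ← x2 with made-up conditional probability tables (all numbers are illustrative, not from the dissertation), and verifies that the product of local conditionals P(x1) P(x2) P(x3 | x1, x2) defines a proper joint distribution.

```python
import itertools

# A toy DAG over three binary variables: x1 -> x3 <- x2.
# Parent sets: pi(1) = {}, pi(2) = {}, pi(3) = {1, 2}.
p1 = {0: 0.7, 1: 0.3}                   # P(x1)
p2 = {0: 0.6, 1: 0.4}                   # P(x2)
p3 = {(0, 0): {0: 0.9, 1: 0.1},         # P(x3 | x1, x2)
      (0, 1): {0: 0.5, 1: 0.5},
      (1, 0): {0: 0.4, 1: 0.6},
      (1, 1): {0: 0.2, 1: 0.8}}

def joint(x1, x2, x3):
    """Factorization theorem: P(x1, x2, x3) = P(x1) P(x2) P(x3 | x1, x2)."""
    return p1[x1] * p2[x2] * p3[(x1, x2)][x3]

# A valid factorization must define a proper distribution over all 2^3
# configurations, so the probabilities must sum to 1.
total = sum(joint(*x) for x in itertools.product([0, 1], repeat=3))
print(round(total, 10))  # 1.0
```

Note the compactness the theorem promises: the factored form stores 2 + 2 + 8 local entries instead of the 2^3 entries of the full joint table, and the gap grows rapidly with the number of variables when the graph is sparse.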
φc is the potential function for a maximal clique c ∈ C. The joint probability of a configuration x of the variables V is the normalized product of the potential functions over all the maximal cliques in G:

P(x) = ∏c∈C φc(xc) / Σx′ ∏c∈C φc(x′c)

where xc represents the current configuration of the variables in the maximal clique c, and x′c represents any possible configuration of the variables in the maximal clique c. In practice, a Markov network is often conveniently expressed as a log-linear model, given by

P(x) = exp(Σc∈C wc φc(xc)) / Σx∈X exp(Σc∈C wc φc(xc))

In the above equation, the φc are feature functions from some subset of X to real values, and the wc are weights to be determined from training samples. A log-linear model can provide a more compact representation of a distribution, especially when the variables have large domains. This representation is also convenient for analysis because its negative log likelihood is convex. However, evaluating the likelihood of a model, or its gradient, requires inference in the model, which is generally computationally intractable due to the difficulty of calculating the partition function.

The Ising model is a special case of the Markov Random Field. It originates from statistical physics, where each node represents the spin of a particle. In an Ising model the graph is a grid, so each edge is a clique. Each node in the Ising model takes binary values {0, 1}. The parameters are θi, representing the external field on particle i, and θij, representing the attraction between particles i and j, with θij = 0 if i and j are not adjacent. The probability distribution is

p(x | θ) = exp( Σi<j θij xi xj + Σi θi xi − A(θ) )

where A(θ) is the log partition function that normalizes the distribution.
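Because the partition function couples all configurations, it can be evaluated exactly only for small models. As a minimal sketch (a three-node chain rather than a grid, with made-up θ values), the Ising distribution above can be computed by brute-force enumeration:

```python
import itertools
import math

# A tiny 3-node chain Ising model with x_i in {0, 1}; all theta values
# are illustrative. theta_ij = 0 for non-adjacent pairs, so only the
# two chain edges appear below.
theta_node = [0.2, -0.1, 0.3]             # external fields theta_i
theta_edge = {(0, 1): 1.0, (1, 2): -0.5}  # couplings theta_ij

def score(x):
    """Exponent of the Ising model: sum_{i<j} theta_ij x_i x_j + sum_i theta_i x_i."""
    s = sum(t * x[i] * x[j] for (i, j), t in theta_edge.items())
    s += sum(t * xi for t, xi in zip(theta_node, x))
    return s

# Partition function by enumerating all 2^3 configurations. This is
# exactly the computation that becomes intractable for large |V|.
Z = sum(math.exp(score(x)) for x in itertools.product([0, 1], repeat=3))

def prob(x):
    return math.exp(score(x)) / Z

total = sum(prob(x) for x in itertools.product([0, 1], repeat=3))
print(round(total, 10))  # 1.0
```

Replacing the enumeration of Z with a tractable approximation (or exploiting the grid structure) is precisely where the inference difficulty discussed above arises.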