GRAPH BASED METHODS FOR PATTERN MINING

By

H. D. K. Moonesinghe

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

2007

ABSTRACT

GRAPH BASED METHODS FOR PATTERN MINING

By

H. D. K. Moonesinghe

Pattern mining is the automatic extraction of novel and potentially useful knowledge from large databases. It plays a major role in many applications, such as intrusion detection, business intelligence, and bio-medical applications. This thesis explores two important pattern mining tasks, namely, frequent pattern mining and anomalous pattern mining. We illustrate the limitations of existing pattern mining algorithms and develop a class of algorithms inspired by graph theoretic principles to overcome these limitations.

First, we demonstrate characteristics of existing frequent pattern mining algorithms that prohibit them from scaling up to very large databases. We introduce a novel data structure known as a PrefixGraph to compress crucial information about the database in order to facilitate fast enumeration of frequent patterns. An efficient pattern search and pruning algorithm called PGMiner was also developed using principles derived from network flow analysis.

Second, we investigate a class of anomalies that are difficult to distinguish from normal observations, resulting in high false alarm rates in many anomaly detection algorithms. To address this problem, we introduce OutRank, a stochastic graph based algorithm for unsupervised anomaly detection. In this method, a graph is constructed using two approaches: one based on the similarity between objects and the other using shared neighbors of the object. The heart of this approach uses a Markov chain random walk model to determine the anomaly score of an object.

Recent years have also witnessed the proliferation of graph-based data, spurred by advances in application areas such as bioinformatics, chem-informatics, Web 2.0, and sensor networks. In this thesis, we systematically develop an anomaly detection framework to detect anomalies in graph-based data. First, frequent subgraph patterns are extracted from the data. We investigate two methods for utilizing the frequent subgraph patterns in graph anomaly detection. The first approach, gRank, is an extension of OutRank to graph-based data. It uses a substructure-based similarity measure to create a stochastic graph and then performs a random walk to determine the anomaly score of a graph.
The second approach, called gEntropy, creates features based on the frequent subgraph patterns and builds a probability model for the graphs using the maximum entropy principle. Anomalous graphs are detected based on the probability that such a graph is generated by the model. Finally, we extend the maximum entropy approach to supervised anomaly detection and show that it produces significant improvements over the unsupervised case.

Graph based methods have been applied in several areas such as machine learning and networking. In this thesis, we show how graph based methods can be applied successfully in the pattern mining area. More specifically, we show how graph theoretic approaches can be utilized to improve the efficiency and the effectiveness of frequent itemset mining and anomaly detection. With the advances in technology, a huge amount of data is accumulated every day and database sizes have grown to Terabyte and Petabyte scale. We believe the algorithms developed in this thesis are a step forward towards making pattern mining more effective for such large and complex data sets.

To Parents
For their love and affection

ACKNOWLEDGEMENTS

Many people have contributed to this work directly or indirectly. It is an honor to have come across so many wonderful people.

I express deep gratitude to my advisor Dr. Pang-Ning Tan, for allowing me to pursue this research and helping me along every step of the way. This thesis would not have been in its present form without his generous support, motivation, and guidance. I highly appreciate his periodic feedback and inspirational communication that kept me pointed in the right direction to achieve this goal.

I am grateful to the late Dr. Moon-Jung Chung for being my academic advisor in the early days of my degree. He was a very caring, understanding, great teacher who exposed me to the exciting field of data mining.

I am thankful to the members of my thesis committee: Dr. Anthony S. Wojcik, Dr. Abdol-Hossein Esfahanian, and Dr. Shinhan Shiu. Communication with them has always brought me cheerful spirits and inspiration.

I would like to thank the faculty of the Department of Computer Science and Engineering at Michigan State University (MSU) for providing an excellent learning, research, and teaching experience. In particular, I would like to thank Dr. George Stockman, Dr. Bill Punch, Dr. John Weng, and former faculty member Dr. Jaejin Lee for their support and encouragement.

I am highly indebted to the Department of Computer Science and Engineering at MSU for providing me with an opportunity to pursue graduate studies at this university. Also, I am grateful to both of the past graduate directors, Dr. Wojcik and Dr. Esfahanian, and the present graduate director, Dr. Eric Torng, for approving continued financial support for my entire Ph.D. education.

I would like to thank all the staff members of the Department of Computer Science and Engineering for their assistance in administrative tasks. I want to thank Ms. Linda Moore for her patience, care, and advice during this period. Many thanks to Ms. Kim Thompson for proof-reading this entire document on such short notice. I would also like to thank Mr. Mark McCullen for providing valuable teaching tips and resources that helped me to carry out my teaching duties efficiently, and for allowing me to spend more time on my research. Thanks also go to my students from the Computer Organization and Architecture class for a good time.
Many thanks to members of the Data Mining research group (Jerry Scripps, Samah Fodeh, Hamed Valizadegan, and Haibin Cheng) and the members of my previous research group (Yuan Zhang and Ming Wu) for their help and a good time.

I am grateful to all of my past teachers at the University of Colombo for providing me with a sound foundation on all aspects of computer science, which helped me immensely in achieving my career goals.

I would like to thank all of my friends both in the U.S. and Sri Lanka. Although I cannot name them all here, I appreciate the support they provided and the wonderful time we had during this period. A special word of thanks goes to my friend Varuni for her support, patience, and understanding in this period.

These acknowledgements would not be complete without an expression of gratitude to my parents for their continuous support and encouragement. Their love accompanies me wherever I go.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Frequent Pattern Mining and its Challenges
1.2 Anomalous Pattern Mining and its Challenges
1.3 Graph-based Approaches for Pattern Mining
1.4 Contributions
1.5 Organization of the Thesis
2 Background
2.1 Graphs: Basic Concepts
2.2 Frequent Pattern Mining
2.2.1 Frequent Itemset Mining
2.2.2 Frequent Subgraph Mining
2.2.3 Related Work
2.3 Mining Anomalous Patterns
2.3.1 Basic Concepts in Anomaly Detection
2.3.2 Related Work
3 PGMiner: Mining Frequent Closed Itemsets
3.1 Issues in Frequent Closed Itemset Mining
3.2 PrefixGraph Representation
3.2.1 Preliminaries
3.2.2 PrefixGraph Construction
3.2.3 Analysis of PrefixGraph Structure
3.3 Frequent Closed Itemset Mining
3.3.1 Intra-Node Closed Itemset Mining
3.3.2 Inter-Node Pruning
3.4 Mining Algorithm
3.4.1 Implementation Techniques
3.4.2 Memory Management
3.5 Experimental Evaluation
3.5.1 Evaluating Environment
3.5.2 Performance Comparisons
3.5.3 Memory Usage
3.5.4 Scalability
3.5.5 Effectiveness of the Flow Based Pruning
3.6 Summary
4 OutRank: Mining Anomalous Data
4.1 Anomaly Detection and its Issues
4.2 Modeling Anomalies Using a Graph
4.2.1 Graph Representation
4.2.2 Markov Chain Model
4.3 Anomaly Detection Algorithms
4.3.1 OutRank-a: Using Object Similarity
4.3.2 OutRank-b: Using Shared Neighbors
4.4 Experimental Evaluation
4.4.1 Comparison with Other Approaches
4.4.2 Effect of the Percentage of Outliers
4.4.3 Effect of the Shared Neighbor Approach
4.4.4 Choice of Similarity Measures
4.4.5 Effect of Threshold on the Quality of Solution
4.5 Discussion
5 Mining Anomalous Graphs
5.1 Mining Graph Based Data
5.2 Anomaly Detection in Graphs and its Issues
5.3 Extending OutRank for Graphs
5.3.1 gRank: Graph Ranking
5.3.2 Similarity Measures for Graphs
5.4 gEntropy: Maximum Entropy Based Unsupervised Anomaly Detection
5.4.1 Maximum Entropy Model
5.4.2 Parameter Estimation
5.4.3 Metropolis Sampling Approach
5.4.4 Feature Vector Generation
5.5 Maximum Entropy Based Supervised Anomaly Detection
5.5.1 Maximum Entropy Model in Supervised Setting
5.5.2 Parameter Estimation
5.5.3 gEntropySuper: Classifying Anomalous Graphs
5.6 Experimental Evaluation
5.6.1 Datasets
5.6.2 Evaluating Environment
5.6.3 Evaluation of the Unsupervised Frameworks
5.6.4 Evaluation of the Supervised Frameworks
5.7 Discussion
6 Conclusions
6.1 Summary of the Thesis
6.1.1 Mining Frequent Itemsets
6.1.2 Mining Anomalous Patterns
6.2 Future Research
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Sample database
Table 3.1. Characteristics of various condensed representations
Table 3.2. Sample database
Table 3.3. Characteristics of the databases
Table 3.4. Evaluation of the global closedness techniques
Table 4.1. Outlier rank for sample 2-D dataset
Table 4.2. Characteristics of the datasets
Table 4.3. Experimental results
Table 4.4. Performance comparison with different similarity measures
Table 5.1. Characteristics of the datasets
Table 5.2. Experimental results
Table 5.3. Characteristics of the datasets
Table 5.4. Experimental results (Accuracy and F-Measure)
Table 5.5. Experimental results with maximal subgraphs

LIST OF FIGURES

Figure 1.1. Execution time and memory usage for large databases (K=1000)
Figure 1.2. Results of applying LOF anomaly detection algorithm on 2D-Data
Figure 2.1 Sample graph database and frequent graphs
Figure 3.1. Different compressed data representations
Figure 3.2. PrefixGraph construction: a running example
Figure 3.3.
PrefixGraph representation of the sample database
Figure 3.4. Size of bit vector databases at each node
Figure 3.5. Itemset enumeration tree (search space) of a node
Figure 3.6. Transaction flow network for the sample database
Figure 3.7. Execution time (in seconds) for Medical
Figure 3.8. Execution time (in seconds) for WebView2
Figure 3.9. Execution time (in seconds) for Kosarak
Figure 3.10. Execution time (in seconds) for Chess
Figure 3.11. Execution time (in seconds) for WebDocs
Figure 3.12. Execution time (in seconds) for T20I8D500K
Figure 3.13. Execution time (in seconds) for Pumsb
Figure 3.14. Execution time (in seconds) for T40I10D100K
Figure 3.15. Execution time (in seconds) for T100I20D100K
Figure 3.16. Amount of memory (in MB) required for T40I10D100K
Figure 3.17. Amount of memory (in MB) required for Pumsb
Figure 3.18. Amount of memory (in MB) required for Kosarak
Figure 3.19. Execution time versus number of transactions (K=1000)
Figure 3.20. Memory usage of algorithms for large databases (K=1000)
Figure 4.1. Outlier detection with random walk
Figure 4.2. Sample 2-D data set
Figure 4.3. Similarity based on the number of shared neighbors
Figure 4.4. Precision while varying the % of outliers in Led7
Figure 4.5. False alarm rate while varying the % of outliers in Led7
Figure 4.6. Precision while varying the % of outliers in KDD
Figure 4.7. False alarm rate while varying the % of outliers in KDD
Figure 4.8. Precision while varying the % of outliers in Diabetic
Figure 4.9. False alarm rate while varying the % of outliers in Diabetic
Figure 4.10. Precision of algorithms on optical dataset
Figure 4.11. Connectivity of objects in Austra dataset
Figure 4.12. Connectivity of objects in Led7 dataset
Figure 4.13. Precision for different threshold values in Led7 dataset
Figure 4.14. Precision for different threshold values in KDD dataset
Figure 5.1. AIDS antiviral screening data and its interesting fragments
Figure 5.2.
Compression based graph similarity
Figure 5.3. Maximal common frequent subgraph of the two graphs
Figure 5.4. Frequent subgraph lattice

1 Introduction

With the growing scale and complexity of data generated in a variety of scientific, engineering, and commercial applications, a key challenge is to automatically process and analyze data to discover potentially useful information. Traditional data analysis tools are insufficient because they are mostly designed for small-scale problems. This has led to considerable interest in developing data mining technology for extracting knowledge buried in the wealth of data. Data mining, as defined by Fayyad et al. [FPS96], is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large databases. It blends traditional data analysis tools with sophisticated algorithms for handling a massive amount of data and analyzing non-traditional complex data types such as sequences and graphs [TSK06].

There are two general classes of techniques in data mining: one aimed at model building and the other aimed at pattern mining. In model building, the objective is to construct a global representation that summarizes the data or describes the underlying process in which the data is generated. Examples of models include decision trees, support vector machines, ARIMA time series models, and Gaussian mixture models. Model building, especially for classification and regression tasks, has been predominantly driven by advances in areas such as statistical learning, machine learning, and pattern recognition.

In contrast, patterns are local structures hidden in the data involving only a subset of the objects or attributes [H98]. There are two notable types of patterns: frequent patterns and anomalous patterns. A frequent pattern is a subset of attributes or a substructure of complex objects that occurs frequently in the data. Frequent patterns have been used to identify products that are frequently sold together for up-sell and cross-sell promotions [HCH+98], to find motifs in biological sequences [MS99], to discover chemical compound substructures that can be used in drug design [DKN+05], and to detect bugs in software programs [LYY+05]. Meanwhile, an anomalous pattern is an object whose characteristics are considerably different than the rest of the data. Anomalous patterns are useful for a variety of applications such as intrusion detection, fraud detection, and failure detection in complex systems. Other types of data mining patterns include trends, change points, subgroups, emerging patterns, etc.

The scope of this thesis focuses on the mining of frequent patterns and anomalous patterns. The limitations of current approaches will be illustrated, which motivate the need to develop new ways for mining such patterns. Furthermore, the theoretical underpinning of the methods developed in this thesis is based on concepts from graph theory.

1.1 Frequent Pattern Mining and its Challenges

Frequent pattern mining was originally developed for market basket analysis to discover items (products) that customers frequently buy together. These patterns are also known as frequent itemsets in the data mining literature.
Each itemset is associated with a quantitative measure called support, which is defined as the fraction of total transactions that contain the given itemset. Support is a useful measure because it can be used to eliminate spurious patterns that occur simply by chance. Furthermore, it has certain desirable properties that can be exploited to improve the efficiency of the frequent pattern discovery process. We can formally define the frequent pattern mining problem as follows:

DEFINITION 1.1 (Frequent Pattern Mining) Frequent pattern mining is the task of finding all patterns in the database that satisfy a given minimum support threshold.

In addition to the computational challenges of exploring a massive database to find interesting patterns, the large number of redundant or correlated patterns that can be generated is also a concern. Redundant patterns exist because many of the frequent itemsets generated from transaction data have the same support as their supersets. Discarding such redundant patterns not only helps to reduce the number of generated patterns, it also improves the efficiency of frequent pattern mining algorithms. In the context of binary transaction data, the non-redundant patterns are known as frequent closed itemsets [PBT+99a]. Frequent closed itemsets provide a compact yet lossless representation of the frequent itemsets; i.e., it is possible to derive the complete set of frequent itemsets along with their support based on the frequent closed itemsets alone. In particular, the number of frequent closed itemsets can be orders of magnitude fewer than the number of frequent itemsets. The difference is even more pronounced for dense databases, where the frequent itemsets tend to be long and enumerating all of them is simply infeasible due to their exponential number.

Despite these advantages, generating frequent closed itemsets from large transaction databases is computationally expensive. For example, Figure 1.1 shows the relative performance of two state-of-the-art frequent closed itemset mining algorithms¹ when applied to a database whose size is increased from 100,000 transactions to 5 million transactions. This experiment was conducted on a 2.8 GHz Pentium 4 machine with 4 GB memory. Since both of these algorithms store the entire database in memory using a condensed data representation, they broke down when the database size exceeded a million transactions. A more detailed analysis of the scalability issue is presented in Chapter 3.

¹ Algorithm FPclose has been shown to be the fastest algorithm based on the FIMI data mining competition.

[Figure 1.1. Execution time and memory usage for large databases (K=1000): both plots compare FPclose and DCI-Close as the number of transactions grows from 100K to 5000K; the memory plot's vertical axis runs from 512 to 4096 MB.]

In summary, although there have been numerous algorithms developed for frequent closed itemset mining, these algorithms have characteristics that prohibit them from scaling up to very large databases. Developing efficient and scalable frequent closed itemset mining algorithms is therefore an important research problem that needs to be addressed. Furthermore, any significant improvement to frequent closed itemset mining algorithms will have an impact on other frequent pattern mining problems, such as constraint based mining [HLN99], maximal pattern mining [GZ03], and top-K pattern mining [HWL+02], to name a few.
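To make the lossless property concrete, here is a minimal Python sketch of the recovery step. The closed-itemset table is a hypothetical miner output used purely for illustration; it is not data from this thesis.

```python
# Hypothetical output of a closed-itemset miner: {closed itemset: support}.
closed = {frozenset("bd"): 5, frozenset("abd"): 4, frozenset("abcd"): 3}

def support_from_closed(X):
    # The support of any frequent itemset X equals the support of its
    # smallest closed superset, i.e., the maximum support over all
    # closed supersets of X.
    sups = [s for C, s in closed.items() if frozenset(X) <= C]
    return max(sups) if sups else 0  # no closed superset: X is infrequent

print(support_from_closed("ab"))   # 4 (smallest closed superset is {a,b,d})
print(support_from_closed("abc"))  # 3, recovered although {a,b,c} is not stored
```

The rule works because every frequent itemset has a closed superset with exactly its support, namely its closure (the largest superset occurring in the same transactions).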
In addition to frequent itemsets in transaction databases, frequent pattern mining has also been adapted to sequences and graphs. As part of this thesis, we will also illustrate the application of frequent pattern mining to graph anomaly detection.

1.2 Anomalous Pattern Mining and its Challenges

Anomalous patterns or outliers are observations that deviate significantly from the majority of the observations in a dataset. Over the years, many outlier detection algorithms have been developed, including statistical-based [Esk00][Lew94], depth-based [JKN98][PS88], distance-based [BS03][JTH01][KNT00][RRS00], and density-based [BKN+00]. While these approaches have proven to be quite effective in some domains, they have several fundamental limitations.

First, existing algorithms typically assume that the anomalous patterns are randomly distributed in the feature space. These algorithms are therefore ineffective when applied to data sets containing small clusters of anomalies, which are often found in many practical applications. For example, attacks in intrusion detection are not isolated random events; they are clustered if they belong to similar classes of intrusions (e.g., variants of a worm). Figure 1.2 shows a synthetic two-dimensional data set, which consists of two clusters of normal observations (labeled as C1 and C2) and five clusters of anomalies (labeled as O1 through O5). A density-based anomaly detection algorithm called LOF [BKN+00] was applied to the data set. The data points identified as anomalies by the algorithm are labeled as "+". Although LOF shows a much higher detection rate compared to traditional distance-based anomaly detection algorithms, Figure 1.2 clearly demonstrates the limitations of the algorithm when applied to data sets containing clusters of anomalies. This is because LOF defines an anomalous pattern based on the density of its predefined neighborhood. If the neighborhood contains fewer points than a minimum threshold, then the observation is declared an anomaly. For data sets with small clusters of anomalies, defining the appropriate neighborhood size and minimum number of points in a neighborhood is a non-trivial task, especially when the density of the outlying cluster is similar to the density of the normal cluster. Therefore, new data mining algorithms are needed to handle such types of anomalies.

[Figure 1.2. Results of applying the LOF anomaly detection algorithm on 2D data: two normal clusters C1 and C2 and five anomaly clusters O1 through O5; points flagged by LOF are marked "+".]

Second, anomalous pattern detection in complex data is another situation where many existing algorithms fail. For instance, in real-world graph-based datasets such as AIDS antiviral screening data, users are interested in anomalous substructures that correspond to HIV inhibitors, as they help to discover new drugs. Although some of the existing approaches [DKW+05] attempted to address this type of data, their detection rate is still very low even in supervised mode.

As anomaly detection is an essential data mining task, developing efficient algorithms that can achieve significant improvement in detection rate and a lower false alarm rate across these domains is an important research direction and can help numerous applications including intrusion detection, credit card fraud detection, and pharmaceutical research.
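The clustered-anomaly failure mode described above is easy to reproduce. The sketch below applies scikit-learn's LocalOutlierFactor to synthetic data loosely modeled on Figure 1.2; the library call, data layout, and parameter values are illustrative assumptions, not the experimental setup used in this thesis.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Two large normal clusters plus one small, tight cluster of anomalies.
c1 = rng.normal(loc=[2.0, 3.0], scale=0.4, size=(200, 2))
c2 = rng.normal(loc=[5.0, 3.0], scale=0.4, size=(200, 2))
o1 = rng.normal(loc=[3.5, 5.0], scale=0.05, size=(10, 2))  # anomaly cluster
X = np.vstack([c1, c2, o1])

# With a neighborhood smaller than the anomaly cluster, each anomaly's
# nearest neighbors lie inside its own tight cluster, so its local
# density looks normal and LOF tends not to flag it.
labels = LocalOutlierFactor(n_neighbors=5).fit_predict(X)  # -1 = outlier
print("clustered anomalies flagged:", int((labels[-10:] == -1).sum()), "of 10")
```

Raising n_neighbors above the anomaly-cluster size can recover these points, but choosing the right value is exactly the non-trivial tuning problem described above.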
1.3 Graph-based Approaches for Pattern Mining

As discussed in the preceding sections, there are still many problems with current pattern mining algorithms. In order to provide efficient and robust pattern mining technology to potential users, it is necessary to make significant advances in these pattern mining methods. In this work, our objective is to use a graph based approach for mining such patterns by addressing these existing challenges.

Graph theory, which is a relatively mature field compared to data mining, began in the 18th century with the invention of Euler's solution for the Königsberg Bridge Problem. Graphs provide a flexible way for modeling complex relationships in data and have been successfully applied in various fields such as networking [ZCM97], image processing [HIK06], pattern recognition [PB06], and machine learning [SWH+05][SHR06][HCL07]. For example, probabilistic graphical models such as Bayesian networks, Hidden Markov Models, and Markov random fields are widely used model building approaches in machine learning and pattern recognition. These models are developed as a confluence of ideas from graph theory and probability theory. The attractive feature of graph-based methods is that, once the problem at hand has been abstracted to a relational structure, techniques from graph theory, statistics, and probability theory can be used for purposes of analysis. Graph based methods such as flow theory and random walk on graphs are important techniques that are already used in data mining applications. For instance, the PageRank algorithm used by the Google search engine employs the random walk model to calculate the authoritativeness of web pages. Surprisingly, even though graph-based methods have been widely used for model building, there have been very few works on applying these ideas to pattern mining.

This thesis attempts to address the following question: "Can we improve the effectiveness of pattern mining tasks by incorporating graph theoretic concepts into the mining process?" Specifically, we attempt to answer this question by examining the applicability of the following graph based concepts:

Flow network: In graph theory, a flow network is a directed graph with each edge receiving a flow, where the flow must satisfy a set of constraints such as the capacity of the edges, skew symmetry, and flow conservation. A flow network can be used to model many real-world systems in which a medium propagates through a network of nodes. By transforming a real-world problem into a flow network, it allows us to view the problem from a different viewpoint and enables new theories satisfying the constraints to be developed. This thesis utilizes ideas from flow network theory to develop effective strategies for pruning the search space of the closed itemset mining task, thereby improving both the efficiency and scalability of the mining algorithm.

Random walk on a graph: Random walk is a special case of a Markov chain process that enjoys a property called time-symmetry or reversibility. The basic properties of a random walk are determined by the transition probability matrix associated with the graph that provides the probability of moving from one node to its neighbor. This thesis investigates the effectiveness of using the random walk approach to detect anomalous patterns, particularly for data sets containing small clusters of anomalies. The random walk approach can also be extended to finding anomalies in graph-based data.
Markov Random Fields: A Markov Random Field (MRF) is a generalization of Markov chains to an undirected graphical model, where the vertices correspond to random variables, and edges represent dependencies between them. Random fields are very attractive from a theoretical standpoint because they allow us to define a probability model based on the dependencies of the graph. A probability distribution over the fields, which corresponds to the Gibbs distribution, is also consistent with the Maximum Entropy framework. A desirable property of this framework is that it allows us to build flexible probability models based on features derived from the data.

1.4 Contributions

In this work, we develop a class of novel graph based pattern mining techniques. The main contributions of this work are:

1. We develop a graph-based approach for mining frequent closed itemsets. In particular, we introduce a novel data representation, called PrefixGraph, containing variable length bit vectors to compress the database. Also, using efficient pruning strategies derived from network flow analysis, we develop a novel algorithm called PGMiner to discover the frequent closed itemsets. Our frequent closed itemset mining algorithm is highly scalable and space preserving for very large databases.

2. In order to discover anomalous patterns, we present a random walk based algorithm called OutRank. Our analysis shows that the technique outperforms traditional density-based and distance-based approaches in terms of higher detection rates and lower false alarm rates.

3. We systematically develop a graph based framework for mining anomalous patterns in graph datasets. We investigate two different unsupervised approaches. The first approach, called gRank, is an extension of the previous OutRank algorithm for graph datasets. This method, which is based on substructure similarity, uses the random-walk model for anomaly detection. The second approach, called gEntropy, is a probabilistic anomaly detection approach based on the maximum entropy principle. Each graph is assigned a probability value representing the likelihood it is generated from a probability model induced from the graph data. The smaller the probability, the more likely the graph is an anomaly. Finally, we apply the maximum entropy approach to the supervised anomaly detection task. The proposed algorithm, called gEntropySuper, uses a frequent subgraph mining algorithm to extract substructure based descriptors and then applies the maximum entropy principle to build a classification model from the frequent subgraphs.

1.5 Organization of the Thesis

The rest of the thesis is organized as follows: Chapter 2 illustrates basic concepts, problem definitions, and related work in frequent itemset mining and anomalous pattern mining. Chapter 3 introduces a graph based approach for frequent closed itemset mining. Chapter 4 presents our random walk based anomalous pattern mining algorithm. Chapter 5 discusses our proposed frameworks for mining anomalous patterns in graph datasets. Finally, Chapter 6 concludes with a summary of the thesis contributions and suggestions for future research.

2 Background

In this chapter, we illustrate the concepts and terminology used throughout this thesis. More specifically, we discuss preliminaries and definitions of graphs, frequent itemset mining, frequent subgraph mining, and anomalous pattern mining. Further, we discuss related research in both frequent pattern mining and anomalous pattern mining.
2.1 Graphs: Basic Concepts

A graph is a mathematical abstraction useful in solving many kinds of problems across several domains. More formally, a graph G is a pair (V, E), where V is a finite set of objects called vertices and E is a set of 2-element subsets of V called edges. Also, V is called a vertex set whose elements are called vertices or nodes, and E is a collection of edges, where an edge is a pair (u, v) with u, v ∈ V.

If (u, v) is an edge of a graph G, then we say that u and v are adjacent in G. The edge (x, x) is called a self-loop. A walk in a graph is a sequence of vertices and edges such that the vertices and edges adjacent in the sequence are incident. A path is a walk in which no vertex is repeated. We denote path(u, v) as all the edges composed by the walk from vertex u to v or vice versa, where u, v ∈ V. A cycle is a path that begins and ends on the same vertex. A graph with no cycles is called an acyclic graph. Furthermore, a graph is connected if there is a path between every pair of vertices in the graph.

The size of a graph is defined as the number of edges that the graph contains. In a directed graph, edges are ordered pairs, connecting a source vertex to a target vertex. Moreover, edge (u, v) is an out-edge of vertex u and an in-edge of vertex v in such a graph. The number of out-edges of a vertex is its out-degree and the number of in-edges is its in-degree. In an undirected graph, edges are unordered pairs and connect the two vertices in both directions. The number of edges incident to a vertex is called its degree. A tree is a special type of a connected graph that has no loops. Also, a complete graph is a graph where every pair of distinct vertices is adjacent. A graph is called a weighted graph if each edge has a weight assigned to it.

Let G1 and G2 be two graphs and let f be a function from the vertex set of G1 to the vertex set of G2. Suppose that the following conditions hold: f is one-to-one and onto, and f(v) is adjacent to f(w) in G2 if and only if v is adjacent to w in G1. Then we say that the function f is an isomorphism and that the two graphs G1 and G2 are isomorphic. We say that a graph G1 is a subgraph of a graph G2 if G1 is isomorphic to a graph, all of whose vertices and edges are in G2. Note that by this definition a graph is always a subgraph of itself. Two subgraphs G1 and G2 are said to be edge disjoint if they do not have any edge in common.

A flow network is a directed graph G = (V, E) with a source vertex s and a sink vertex t. Each edge has a capacity function c. Also, there is a flow function f defined over every vertex pair, which satisfies the following constraints:

1. Capacity constraint: f(u, v) ≤ c(u, v) for all (u, v) ∈ V × V
2. Skew symmetry: f(u, v) = −f(v, u) for all (u, v) ∈ V × V
3. Flow conservation: Σ_{v ∈ V} f(u, v) = 0 for all u ∈ V − {s, t}

The flow of the network is the net flow entering the sink vertex t (which is equal to the net flow leaving the source vertex s): |f| = Σ_{u ∈ V} f(u, t) = Σ_{v ∈ V} f(s, v).
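A minimal sketch checking these three constraints and computing |f| on a toy network follows; all vertex names, capacities, and flow values are illustrative only.

```python
# Toy flow network with source "s" and sink "t".
V = {"s", "a", "b", "t"}
c = {("s", "a"): 3, ("a", "t"): 2, ("s", "b"): 1, ("b", "t"): 2}  # capacities
f = {("s", "a"): 2, ("a", "t"): 2, ("s", "b"): 1, ("b", "t"): 1}  # a flow

def net(u, v):
    # Net flow from u to v; storing one direction keeps skew symmetry,
    # f(u,v) = -f(v,u), satisfied by construction.
    return f.get((u, v), 0) - f.get((v, u), 0)

# 1. Capacity constraint: f(u,v) <= c(u,v) on every edge.
assert all(f[e] <= c.get(e, 0) for e in f)
# 3. Flow conservation at every vertex other than the source and sink.
assert all(sum(net(u, v) for v in V) == 0 for u in V - {"s", "t"})
# |f|: net flow entering t equals net flow leaving s.
print(sum(net(u, "t") for u in V), sum(net("s", v) for v in V))  # 3 3
```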
2.2 Frequent Pattern Mining

In this section, we describe the concept of frequent itemset mining, frequent subgraph mining, and related research.

2.2.1 Frequent Itemset Mining

In this section, we present the basic terminology used in frequent itemset mining. Let I = {i1, i2, ..., im} be a set of m distinct items. An itemset X is considered as a non-empty subset of items; i.e., X ⊆ I. Moreover, an itemset with k items is called a k-itemset. A transaction t = (tid, X) is a tuple, where tid is the transaction identifier and X ⊆ I is the itemset. A transaction t = (tid, X) is said to contain an itemset Y if Y ⊆ X. A transaction database D = {t1, t2, t3, ..., tN} is the set of all transactions.

The support of an itemset X, denoted as σ(X), is defined as the number of transactions that contain X. Mathematically, σ(X) can be stated as follows:

σ(X) = |{ t_i | X ⊆ t_i ∧ t_i ∈ D }|

DEFINITION 2.1 (Frequent Itemset) An itemset X is called frequent if σ(X) ≥ ξ, where ξ is a user specified minimum support threshold.

Given a database D and a minimum support threshold ξ, the problem of mining for frequent itemsets is to find all itemsets that pass the support threshold.

In particular, the number of frequent itemsets that can be generated from a given database can possibly be very large. In order to address this, a more compact representation of frequent itemsets, called closed itemsets, has been proposed.

DEFINITION 2.2 (Frequent Closed Itemset) An itemset X is called a frequent closed itemset if σ(X) ≥ ξ, where ξ is a user specified minimum support threshold, and if there exists no proper superset Y ⊃ X with σ(X) = σ(Y).

Similar to the problem of frequent itemset mining, the problem of mining for frequent closed itemsets is to find all closed itemsets that pass the support threshold.
The complete set of association rules can be mined in two steps: (1) mining the set of frequent itemsets using 4‘, (2) derive the association rules on these itemsets using 41. Here we illustrate this with an example. EXAMPLE 2.2: Consider the sample database D shown in Table 2.1. For frequent itemset {abcd}, bd is a subset with support 5 and the confidence of rule R1: bd=>ac is dabcd)/o(A9 = 3/5=60%. Notice that association rules R1: ac=>bd and R2: abc=>d, generated from {abcd} have confidence 100%. 2.2.2 Frequent Subgraph Mining In this section, we present the basic terminology used in frequent subgraph mining. Let g = (V, E) be a graph, where V is a finite set of objects called vertices (or nodes) and E is a set of 2-element subsets of V called edges. A labeled graph is represented by a 4-tuple g = (V, E, L, l), where L is a set of labels, and a label function I: Vu E —> L maps a vertex or an edge to a label. 15 DEFINITION 2.4 (Frequent subgraph) a subgraph g is said to be frequent if the occur- rence of g (i.e. o(g)) in a given graph database G = {gil i= 0..n} is greater than or equal to a user specified minimum support threshold 5. Given a database G and a minimum support threshold 4’, the problem of frequent sub- graph mining is to find all frequent subgraphs in the graph database G. 3 9.90 a o e 610° 0 o oo o a) 6 06 0 0 0 o .0 o. (A) Graph Dataset (B) Frequent Graphs Figure 2.1 Sample graph database and frequent graphs EXAMPLE 2.2: Consider the sample database {G1, G2, G3}, shown in Figure 2.1 (A). With minimum support threshold 5 =2, frequent subgraphs f1 :2, f2z3, f3:3, f4:2, 15:2, and f6:2, where mzn denotes subgraph m with support count n, become frequent as shown by Figure 2.1 (B). Here we have six frequent subgraphs of size 1 to 3. During the frequent subgraph generation, anti-monotone property of frequent patterns called the Apriori Principle can be used to reduce the subgraph enumeration search space. 16 2.2.3 Related Work In this section, we first describe the related research in frequent itemset mining, and then describe research in frequent subgraph mining. The frequent itemset mining problem was first introduced by Agrawal et al. in [AS94b]. Since then, a number of efficient algorithms for mining fi'equent itemsets have been proposed [I-IPYOO, AIS93, OLP+03]. A popular algorithm is the Apriori algorithm [AS94b]. However, one of the major drawbacks of this algorithm is that it requires mak- ing several passes over the database. As an alternative, tree projection algorithms [HPY00, AAP+98], where transactions in the database are projected to a lexicographic tree, were proposed. FP- Tree [HPYOO] is one such algorithm, which creates a compact tree structure and applies partitioned-based, divide and conquer method for mining. Several concepts have been proposed in the past to address the issue of redundancy in frequent itemsets. The concept of frequent closed itemsets was introduced by Pasquier et al. [PBT+99a]. A level wise algorithm called A-Close [PBT+99b] was developed by Pas- quier et al., and it employs a breadth first strategy for mining fi'equent closed itemsets. Since A-Close needs to scan the database several times, its performance is quite poor compared to other algorithms on large databases. Algorithms such as CLOSET [PHMOO] and CLOSE T + [WHPO3] are based on the FP-Tree structure. These algorithms. use a min- ing technique based on recursive conditional projection of the FP—Trees. 
Non-closed itemsets are pruned using subset checking with the previously mined result set. However, this requires all the closed itemsets previously discovered to be present in memory. Hence the memory usage of these algorithms depends on the number of frequent closed itemsets. CLOSE T + introduces a new duplicate itemset detection technique—upward checking that is suitable for handling sparse data sets more efficiently. FPclose [GZO3] is another algorithm that is based on the FP-Tree structure and uses arrays to efficiently traverse the FP-Tree structure, thereby gaining significant time savings compared to 17 CLOSET and CLOSE T +. FPclose also uses multiple conditional FP-trees for checking the closure of itemsets and achieves better speedup compared to CLOSE T +, which uses a global tree structure for this task. The CHARM algorithm, which was proposed by Zaki in [ZH02], uses a vertical TID representation. For dense databases, it uses a dual itemset-tidlist search tree and adopts the Difilset technique [ZG03] to store the itemset-tidlist at each node of the tree. Similar to CHARM, CloseMiner [SSMOS] also uses frequent closed TID sets. Several alternative algorithms employ the vertical bit vector representation to store the database in a con- densed manner. MAFIA [BCGOI], and DCI-Close [LOP06a] are key algorithms in this category. Although they show good performance when mining smaller databases, the length of the bit vector is quite high for large databases. Also, for sparse databases the bit vectors contain mostly zeros and these algorithms spend a lot of time intersecting these _ regions. To reduce these costs, MAFIA uses compressed vertical bit map structures. Also, DCI-Close adopts many optimization techniques to reduce the number of bit wise inter- sections. DCI-Close uses closure operation for generating closed itemsets. This technique completely enumerates closed itemsets without duplication and needs no previously mined closed itemsets. Frequent itemset mining algorithms such as Apriori, FPgrowth, DCI_Close, CHARM, and FPclose work well when the main memory Of the computer can hold the entire data- base. When the support threshold is low or the database is very large, main memory may not be able to hold the entire representation of the database. A few studies have addressed this problem and proposed several approaches for mining frequent itemsets out of core, using the secondary memory. One such early approach is the partitioning [SON95], where the database is partitioned to many small databases and frequent itemsets are mined from each such small database. The main problem of this approach is that when the data structure of the candidate itemsets is larger, disk resident data structures are re- quired and therefore significant disk U0 is required to access them. To address this issue, 18 Grahne et a1 [GZO4] proposed a recursive projection based partitioning for FPgrowth al— gorithm. In this approach, they recursively project the database and create FP-trees until it fits in main memory. Then, these memory resident FP-trees can be mined using an in- core algorithm. 
This approach suffers from two major problems: first, they blindly pro- jects the database to a F P-tree and if it does not fit in memory, the entire FP-tree con- structed so far needs to be deleted and new set of FP-Trees needs to be built starting fiom the next level itemsets; second, when the FP-tree for level-1 itemset does not fit in mem- ory, all combinations of frequent-2 itemsets (level 2) are needed to be projected and this is very costly. Recently, fast disk based frequent itemset mining algorithm was proposed by Buehrer et al [BPGO6] for FPgrowth approach using the 64 bit computing capabilities available. The problem of mining frequent closed itemsets out of core is even more challenging, since the closedness of an itemset cannot be decided on the basis of knowledge available in a single data partition. Lucchess et al. [LOP06b] addressed this issue by proposing the first out of core closed itemset mining algorithm. In that they proposed a merging strat- egy to derive the global closed itemsets from the local closed itemsets mined in each par- tition. However, their partitioning method of bit vectors is not scalable and incurred many I/O cost for larger databases. There have been several parallel fi'equent itemset mining algorithms in the literature [ZELOI], [SK98a], [AS96], [PCY95], [CHN+96], [HKKOO], [CHX98], [ZPO+98], [CX98], [SK96]. Most such parallel algorithms are Apriori based [AS94b]. Agrawal et al. [AS96] presents three different parallel versions of the Apriori algorithm on distributed memory environment. The count distribution algorithm replicates the generation of the candidate set and is a straightforward parallelization of Apriori. The other two algorithms (data distribution and hybrid) partition the candidate set among processors. Park et al. [PCY95] uses a similar approach by replicating the candidate set on all processors. Sev- eral parallel mining algorithms were presented in [SK98a] for generalized association l9 rules, and they addressed the problem of skewed transactional data. Cheung et al. [CX98] also addressed the effect of data skewness in their FPM (Fast Parallel Mining) algorithm, which is based on the count distribution approach. A parallel tree projection based algo- rithm, called MLFPT, based on FP-Tree algorithm is presented in [ZELOl] for a shared memory environment. In frequent subgraph mining, the goal is to develop algorithms to discover frequently occurring subgraphs in the graph database. Although there are many efficient and scal- able frequent pattern mining algorithms exist for itemset mining and sequence mining, developing efficient and scalable algorithms for subgraph mining is particularly challeng- ing because subgraph isomorphism, which is a computationally expensive operation, plays a key role throughout the mining process. Despite that, several efficient subgraph mining algorithms such as FSG [KKOI], gSpan [YH02], FFSM [HWP03], Gaston [NK04], SPIN [HWP04], and CloseGraph [YH03] are available and can be used in many practical situations. In this thesis, we use FSG algorithm [KKOl] to generate frequent subgraph patterns, which are subsequently used to build anomaly detection model. F SG takes a graph data- base and a minimum support 5 and generates all connected subgraphs that occur in at least 5% of the graphs. FSG follows an Apriori style level-by-level approach (breadth first search) to generate sub graph patterns. 
It starts by enumerating frequent subgraphs consisting of one edge and proceeds to generate larger subgraphs by joining them. At each level, sub graphs are grown by adding one edge at a time. Once a subgraph is gener- ated, its occurrence in the graph database is computed to determine whether it is frequent. FSG uses a number of optimization techniques to join subgraphs efficiently, and to com- pute the frequency of the subgraphs. For more information, readers should refer to [KKOl]. 20 2.3 Mining Anomalous Patterns In this section, we present the basic concepts used in mining anomalous patterns. We dis- cuss anomaly detection in two types of input data: data that is described by feature vec- tors and data that has graph based representation. We also discuss the related research. 2.3.1 Basic Concepts in Anomaly Detection In anomalous pattern mining, the goal is to find objects that have surprising or unusual occurrence from the majority of objects. Such objects are referred to as anomalies or out- liers. Although it remains difficult to give a clear definition of what an outlier is, an often quoted definition of an outlier by Hawkins [Haw80] can be stated as follows. DEFINITION 2.5 (Outlier) An outlier is an observation that deviates so much fi'om other observations as to arouse suspicion that it was generated by a different mechanism. Discovering outlier objects can be performed in a supervised setting or an unsuper- vised setting. Supervised outlier detection schemes require a training set containing both anomalies and normal objects and use a training mechanism to learn the discriminating features of the database. In situations where training data is not available or rare, outlying objects must be detected by examining the entire dataset. Each object is then considered as an outlier or normal object based on the degree to which that particular object is anomalous. This type of scenario is known as unsupervised learning. Unsupervised learn— ing has both advantages and disadvantages over supervised learning. The major disadvan- tage is the lack of our known domain lmowledge about the problem for the learning algo- rithm. Nevertheless, this lack of known knowledge can turn out to be an advantage, as it lets the learning algorithm look for patterns that have not previously been considered. In outlier detection this helps to detect unknown anomalies, which is a key challenge for any outlier detection scheme, regardless of the learning method. 21 To measure the quality of the outlier detection scheme, several metrics can be used. Precision, recall and false alarm rate are the widely used metrics. TP Precision = —— TP + FP TP Recall = m FP FP+TN where T P (TN) is the number of true positives (negatives) and FF (FN) is the number of False Alarm = false positives (negatives). In the context of outlier detection, precision or the detection rate is the number of ob- jects detected as outliers from all the objects available. Recall measures the fraction of outlier objects detected. False alarm rate is the fraction of normal objects predicted as outlier objects. Obviously, a good outlier detection scheme must have higher preci- sion/recall and lower false alarm rate. Precision and recall can be combined into one metric called F -measure, which can be defined as: 2TP 2TP + FP + FN F -measure = In fact, F -measure is the harmonic mean of precision and recall. 2.3.2 Related Work In this section, we discuss related research in anomaly detection. 
2.3.2 Related Work

In this section, we discuss related research in anomaly detection. First, we consider data objects described by feature vectors; i.e., each instance of the data is represented by a list of features or a record (tabular data). Outlier detection for such data has been extensively studied in the past. These existing approaches can be classified into several categories: statistical based, depth based, clustering based, distance based, and density based.

In the statistical-based approach, a standard distribution model (e.g., normal) is assumed, and objects that deviate from the model are flagged as outliers [Esk00][Lew94]. One problem with this approach is that most distribution models apply directly to the feature space and are univariate, which makes this approach unsuitable even for moderately high-dimensional data. Also, for an arbitrary dataset, identifying a model that fits it well can be difficult and expensive.

The depth-based approach is another category, based on computational geometry [PS88][JKN98]. Objects are organized in convex hull layers, so that outliers are more likely to occur in the outer regions [PS88]. However, these approaches suffer from the curse of dimensionality. In the clustering approach, outliers are detected as a by-product of clustering algorithms. But these algorithms are not primarily designed for outlier detection and therefore show a low detection rate.

The distance-based approach was originally proposed by Knorr et al. [KNT00]. In this approach, an object is considered an outlier if at least a fraction r of the objects in the database lie farther than a distance q from it. This notion is extended in [JTH01] by considering outliers as the top-n data points whose distances to their k-th nearest neighbor are the largest. This approach is not suitable when the dataset has both dense and sparse regions, which is known as the local density problem.

The density-based approach was proposed by Breunig et al. [BKN+00]. It considers the local density of the object's neighborhood and computes a Local Outlier Factor (LOF) for each object. The neighborhood is defined by the distance to the k-th nearest neighbor, and objects with high LOF values are considered outliers. Although the density based approach does not suffer from the local density problem, selecting the right threshold k is non-trivial, and this approach is very sensitive to the choice of k.

In the above approaches, outlier detection has been done for general data, where each object is represented as a set of features. Anomaly detection has also been extended to objects that are represented by graphs. Very few unsupervised anomaly detection algorithms have been developed for graph based data. One such state of the art approach is the compression based anomaly detection scheme [NC03]. Its general idea is as follows: Subdue, an algorithm that discovers repetitive graph patterns, is run in multiple iterations on the graph dataset, and after each run the graph dataset is compressed using the best substructure discovered. The best substructure is determined using the Minimum Description Length (MDL) principle [CH00]. Here, every instance of the substructure is replaced by a single new vertex representing it. Then an anomaly score is computed for each graph in the dataset based on the percentage of the graph that is compressed away. A major advantage of this method is that, because of the compression of the graphs, subsequent mining becomes much faster and the graphs are easy to manipulate. Outlier detection has also been extended to time sequences of graphs in [IK04], where the edge weights vary over time and form a time dependent adjacency matrix.
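As an illustration of the distance-based scheme of [JTH01] described above, the following Python sketch scores each object by the distance to its k-th nearest neighbor and reports the top-n objects (a naive O(n^2) version for clarity; the function name is ours):

    import numpy as np

    def knn_distance_outliers(X, k=3, top_n=5):
        """Rank objects by distance to their k-th nearest neighbor
        (larger distance = more outlying)."""
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(-1))    # pairwise Euclidean distances
        np.fill_diagonal(dist, np.inf)         # ignore self-distance
        kth = np.sort(dist, axis=1)[:, k - 1]  # distance to the k-th nearest neighbor
        return np.argsort(-kth)[:top_n], kth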
3 PGMiner: Mining Frequent Closed Itemsets

In this chapter, we introduce a graph based approach for closed itemset mining. The main aspects of this work are summarized below:

• We present a novel PrefixGraph representation of a transaction database.
• We develop an algorithm called PGMiner (PrefixGraph Miner) that uses pruning strategies developed from network flow analysis.
• We perform extensive experiments to compare the performance of PGMiner against other existing algorithms on a variety of real and synthetic data sets.

The rest of the chapter is organized as follows: Section 3.1 discusses issues in the existing state-of-the-art frequent closed itemset mining algorithms. Section 3.2 introduces the PrefixGraph representation. The closed itemset enumeration approach is discussed in Section 3.3. In Section 3.4, we present our PGMiner algorithm. Experimental results are reported in Section 3.5.

3.1 Issues in Frequent Closed Itemset Mining

Numerous algorithms have been developed to improve the efficiency of the closed itemset mining task [PBT+99b][PHM00][ZH02][WHP03][GZ03][SSM05][LOP06a]. There are two typical strategies adopted by these algorithms: (1) an effective pruning strategy to reduce the combinatorial search space of candidate itemsets and (2) a compressed data representation to facilitate in-core processing of the itemsets. Item merging and sub-itemset pruning are two of the most commonly used strategies employed by current algorithms [WHP03]. These strategies ensure that many of the non-closed itemsets will not be examined when searching for frequent closed itemsets, thereby reducing the runtime of the algorithms.

Apart from the pruning strategies used, having a condensed representation of the database is also vital to achieving good efficiency. Existing algorithms such as FPclose [GZ03] and CLOSET+ [WHP03] construct a frequent pattern tree (FP-tree) structure [HPY00] to encode the relevant itemsets and frequency information. In this representation, each transaction in the database is represented as a path from the root of the FP-tree. Compression is achieved by merging prefix paths of transactions that share the same items. A vertical database representation is another popular strategy, in which each item is associated with a column of values indicating the transactions that contain the item. For example, algorithms such as CHARM [ZH02] use vertical tid-lists as their data representation, while MAFIA [BCG01] and DCI-Close [LOP06a] use vertical bit vectors. Figure 3.1 shows the difference between the data compression techniques used to represent the database given in Table 3.2.

Although an FP-tree often provides a compact representation of the data, our analysis shows that there are situations where the storage requirements of the tree may exceed even the database size, especially when the support threshold is low or when the database is very sparse. This is because each tree node encodes not only the item label, but also the support count and pointers to the parent, sibling, and child nodes. As shown in Table 3.1, the size of the initial FP-tree exceeds the database size for databases such as Chess and Kosarak. Algorithms that use the FP-tree structure must also recursively build smaller subtrees and traverse multiple branches of the trees to collect frequency information during the mining process. The overhead of reconstructing and traversing the FP-trees may degrade the overall performance of the algorithm [GZ03].
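To make the contrast between these representations concrete, the following Python sketch builds vertical tid-lists and vertical bit vectors for a toy transaction database and counts the support of an itemset by intersection (illustrative only; real implementations such as CHARM or DCI-Close use carefully engineered data layouts):

    from functools import reduce

    db = [{'a', 'b', 'd'}, {'a', 'c'}, {'a', 'b', 'e'}, {'b', 'd'}]
    items = sorted(set().union(*db))

    # Vertical tid-lists: item -> set of transaction ids containing the item
    tidlists = {i: {t for t, trans in enumerate(db) if i in trans} for i in items}

    # Vertical bit vectors: item -> one bit per transaction
    bitvecs = {i: sum(1 << t for t in tidlists[i]) for i in items}

    def support(itemset):
        # Intersecting bit vectors keeps the transactions containing all items
        common = reduce(lambda x, y: x & y, (bitvecs[i] for i in itemset))
        return bin(common).count('1')

    print(support({'a', 'b'}))  # -> 2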
[Figure 3.1. Different compressed data representations: (a) FP-tree, (b) vertical tid-lists, (c) vertical bit vectors]

Table 3.1 shows that the memory requirements for storing vertical bit vectors are generally less than those for the FP-tree and vertical tid-lists, with the exception of the Kosarak data set. Vertical bit vectors also allow fast support counting using simple bitwise AND operations. However, when the database is large and sparse, the handling of long bit vectors is quite inefficient, since there are many zeros in the vectors.

Table 3.1. Characteristics of various condensed representations

    Database   Database    Min       FP-Tree      FP-Tree    Bit Vector   Tid-list    PrefixGraph
    Name       Size        Support   Num Nodes    Size       Size         Size        Size
    Chess      474.4 Kb    25%       31,812       621 Kb     19.9 Kb      435.3 Kb    239.6 Kb
    Pumsb      14.0 MB     45%       183,349      3.5 MB     377.2 Kb     8.58 MB     4.28 MB
    WebDocs    1150.4 MB   10%       50,313,644   959.6 MB   52.8 MB      296.5 MB    111.6 MB
    Kosarak    36.8 MB     0.08%     3,425,391    65.3 MB    189.3 MB     26.1 MB     9.2 MB

Regardless of the representation, current closed itemset mining algorithms must compare the support of a candidate itemset against the support of its supersets. To perform this task more efficiently, many algorithms such as CLOSET+ [WHP03], FPclose [GZ03], and CHARM [ZH02] store their intermediate result set, which contains all the closed itemsets that have been mined so far, in another data structure (FP-tree, hash table, etc.). Such a storage method is feasible as long as the number of closed itemsets is small. When the number of closed itemsets is large, it consumes considerable memory to store the itemsets and considerable time to search them. In fact, our analysis shows that in some cases the amount of memory occupied by the result set is several orders of magnitude larger than the size of the initial FP-tree or vertical database representation. For example, in the Chess database with a 25% support threshold, storing the result set takes up to 102 MB, even though the size of the initial FP-tree is only 621 Kb! The cost of searching the result set (to determine whether a candidate itemset is closed) can also be very expensive. Our analysis on the Chess database shows that FPclose spends about 70% of its overall computation time searching the result set.

In summary, both the FP-tree and the vertical representations have their own strengths and limitations. A major limitation of the existing approaches is their poor scalability to large databases. In order to address these limitations, we introduce a novel representation called PrefixGraph, which leverages some of the positive aspects of existing representations. From the FP-tree, it borrows the idea of projecting a database onto different nodes of a graph, but without the extra cost of traversing multiple branches of a tree to collect frequency information. A PrefixGraph also uses bit vectors to encode the database projection at each node. However, the length of its bit vectors is considerably shorter than that used by existing vertical bit vector representations. We will discuss in more detail how the graph is constructed in the next section. From Table 3.1, note that the size of the PrefixGraph structure is moderate compared to the other representations. For the Kosarak data set, it yields the most compact representation.

3.2 PrefixGraph Representation
3.2.1 Preliminaries

A PrefixGraph consists of a set of nodes and a set of directed edges connecting pairs of nodes. Any item in the database that satisfies the support threshold is represented as a node in the PrefixGraph. Each node is also associated with a projected bit vector database (see Figure 3.3 for an overview). Before illustrating the PrefixGraph structure further, we give some useful definitions. Note that here, items in a given itemset are assumed to be sorted according to some total order, <. We use the notation x < y to indicate that x precedes y according to the total order.

DEFINITION 3.1 (Prefix 2-Item) At a node k, an item i is called a prefix 2-item of k if i < k and the 2-itemset {i, k} is frequent.

In the following, the notation (m:n) denotes an item m with support count n. Once the nodes are identified, we order them in descending order of support count, as shown in Figure 3.2(a). Next, we scan the database again to identify the prefix 2-items for each node. For example, the frequent 2-itemsets for nodes a, b, d, e, and c are (ab, ad, ac, ae), (ba, bd, be, bc), (da, db, de), (ea, eb, ed), and (ca, cb), respectively. The prefix 2-items and their corresponding support counts for these nodes are {}, {a:3}, {a:3, b:2}, {a:2, b:2, d:3}, and {a:2, b:3}, respectively. For each node, we store its set of prefix 2-items in a header table, as shown in Figure 3.2(b).

The next stage of the graph construction is to store the transactions as bits in the projected bit vector databases of the nodes. We scan the database, and for each transaction, infrequent items are removed and the remaining items are sorted in frequency descending order. Let T be the resulting itemset. Now, for each item k in T, we select the corresponding node k and compare its prefix 2-items against the prefix itemset of T. If there is a match, the matching items are stored as bits in the projected bit vector database of node k.

[Figure 3.2. PrefixGraph construction, a running example: (a) nodes ordered by descending support; (b) header tables of prefix 2-items; (c)-(d) the bit vector databases after storing the first two transactions]

For example, consider the first transaction of the sample database. After removing the infrequent items, the remaining items are T = {a, b, d, c}. Since the transaction has 4 frequent items, we need to consider the nodes a, b, d, and c. Node a has no prefix 2-items, and therefore nothing is stored. For node b, item a of the transaction matches with its prefix 2-item (i.e., a), and therefore bit <1> is stored in its projected bit vector DB. For node d, items {a, b} of T match with its prefix 2-items, and therefore bits <11> are stored in the projected bit vector DB. Similarly for node c, the bits for items {a, b} of the transaction are stored in the projected bit vector DB. Figures 3.2(c) and 3.2(d) show the PrefixGraph after storing the first two transactions. When storing a transaction such as {d, e} at node e, we need to store bits <001> in node e, since only item d of the transaction matches the prefix 2-items of node e.

In the PrefixGraph structure, suffix links are created based on the transactions. For each item k in the transaction T, its suffix node m is the item that immediately follows k in T; that is, m ∈ T, k < m, and m < n for every other item n ∈ T with k < n. A suffix link is then created from node k to node m. For example, consider the transaction {a, b, d, c}. For node b we select the suffix node d (out of d and c) and add a suffix link from b to d.
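The construction steps described above can be reproduced with a few lines of Python. The sketch below (with our own helper names and a toy database, not the sample database of Table 3.2, and without the bin layout discussed next) orders the frequent items, computes each node's prefix 2-items, and prints the projected bits and suffix link for one transaction:

    from collections import Counter
    from itertools import combinations

    db = [['a','b','d','c'], ['a','d','e'], ['a','b','e'], ['a','b','d','e'], ['b','c']]
    minsup = 2

    sup = Counter(i for t in db for i in set(t))
    freq = {i for i, c in sup.items() if c >= minsup}
    order = sorted(freq, key=lambda i: (-sup[i], i))   # nodes, by descending support
    rank = {i: r for r, i in enumerate(order)}

    pair = Counter()
    for t in db:
        s = sorted(freq & set(t), key=rank.get)
        pair.update(combinations(s, 2))

    # Header tables: prefix 2-items of node k are items i < k with {i, k} frequent
    header = {k: [i for i in order[:rank[k]] if pair[(i, k)] >= minsup] for k in order}

    def project(trans):
        s = sorted(freq & set(trans), key=rank.get)
        for pos, k in enumerate(s):
            bits = [int(i in s) for i in header[k]]            # matching items as bits
            suffix = s[pos + 1] if pos + 1 < len(s) else None  # suffix link target
            print(k, bits, '->', suffix)

    project(['a', 'b', 'd', 'c'])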
Figure 3.3 shows the complete PrefixGraph structure after storing all the transactions in the sample database.

[Figure 3.3. PrefixGraph representation of the sample database, showing the header tables (prefix 2-items), the suffix links, and the per-node bit vector databases]

Instead of explicitly creating links, the links are incorporated directly into the projected bit vector database. More specifically, we can group the bit vectors of the transactions that have the same suffix link together and store them contiguously in the projected bit vector database of the node. For this purpose, the projected bit vector database of each node is partitioned into bins, and the set of bit vectors in each bin corresponds to a suffix link. For example, transactions {a, b, d, e} and {a, d, e} both have the same suffix link (de) at node d and can thus be stored together in a bin. If a transaction has no suffix link beyond a given node, it is stored in an additional bin called the terminating bin. For example, the transaction {a, d, e} is stored in the terminating bin of node e. All bins must be arranged contiguously, so that the (item-wise) intersection of bit vectors can be performed quickly as one large chunk of words. Also, we need to keep track of the starting location of each bin in order to identify the suffix links. A summary of the PrefixGraph construction procedure is given in Algorithm 1.

Algorithm 1 (PrefixGraph Construction)
Input: A transaction database D and a support threshold ξ
Output: PrefixGraph structure
Method:
1: Scan the database D and find the frequent 1-itemsets (nodes) and their supports.
2: Sort the nodes in descending order of support.
3: Find the frequent 2-itemsets for each node and create the header tables.
4: For each transaction T:
   a. Sort the frequent items in T in descending order of their support.
   b. For each item k in T, select node k and match the prefix 2-items of node k with the items in T; if there is a match, store the matching items as a bit vector in the bit vector database of node k.

3.2.3 Analysis of PrefixGraph Structure

The PrefixGraph construction algorithm requires three scans of the database. The first two scans are necessary to find the frequent 1-itemsets and 2-itemsets, while the third scan is needed to construct the projected bit vector databases.

PROPOSITION 3.1: The size of the projected bit vector database of a node is bounded by the support count of the node times the number of prefix 2-items of that node.

PROOF. Let m be the number of prefix 2-items of a node k. Since the number of transactions that contain item k is equal to the support count σ(k), the size of the projected bit vector database of node k is at most ⌈m × σ(k) / 8⌉ bytes. But in a projected bit vector database, projected transactions starting with item k are not stored as bit vectors at that node. Therefore, the size of the projected bit vector database at node k is ≤ ⌈m × σ(k) / 8⌉ bytes. ∎

Proposition 3.1 shows an important benefit of the PrefixGraph structure, as we use bit vector intersections during mining. According to Proposition 3.1, the length of a bit vector is at most equal to the support count of the node. Since the support count of an item is usually much smaller than the database size, we will have shorter bit vectors and accordingly faster bit vector intersections.
The size of the projected bit vector database at each node varies as shown in Figure 3.4 for the Chess and T40I10D100K databases with minimum support thresholds 30% and 0.10%, respectively. Except for the nodes in the middle, all other nodes have small projected bit vector databases, which facilitates faster mining of itemsets.

Now let us theoretically analyze the space complexity of the PrefixGraph. Let I be the number of items in the database and let N be the total number of transactions. The algorithm for PrefixGraph construction has three database scans. The first database scan requires O(I) memory space to store the itemset vector for computing the frequent items (nodes). Suppose |F1| is the number of frequent items (of size 1) discovered. Then the second database scan requires O(|F1|²/2) to store the upper triangular matrix for the frequent 2-itemset computation. The third scan constructs the PrefixGraph in memory. It has |F1| nodes, and each node has a header table and a projected database attached to it. Let m be the maximum number of prefix 2-items of a node and let s be the maximum support of a node. Now the header tables require m|F1| memory space. Also, all of the |F1| projected databases require |F1|(m·s/8) memory space (Proposition 3.1). Since we reclaim the memory used for the frequent 2-itemset computation (i.e., the matrix) before building the projected databases, only the maximum size of these two steps should be considered in the space requirement. Therefore, the total space complexity of the PrefixGraph is O(I + m|F1| + max(|F1|²/2, |F1|·m·s/8)).

We also analyze the time complexity of the PrefixGraph construction algorithm. Let |t| be the maximum size of a transaction t in the database. The first scan of the database takes O(N|t|) to construct the frequent 1-itemsets. The second scan requires us to consider all combinations of size-2 items in t; i.e., it takes O(N·C(|t|, 2)) time for all N transactions. The third scan requires each transaction to be sorted and stored in the projected databases of the nodes. Using quicksort to sort a transaction takes |t| log |t| time. In the worst case, a transaction may be stored at all |F1| nodes, at a cost of |F1||t|. Therefore this step takes O(N|t|(log |t| + |F1|)) for all N transactions. The total time complexity is then the sum of the complexities of these three steps.

3.3 Frequent Closed Itemset Mining

In this section, we study how to efficiently mine frequent closed itemsets from the PrefixGraph structure. The algorithm proceeds in two phases: first, we find the frequent closed itemsets for each node (these are known as local closed itemsets). We then check whether the local closed itemsets are also globally closed using various inter-node pruning techniques. Here we give the formal definitions of local and global closed itemsets.

[Figure 3.4. Size of the projected bit vector database at each node, for the Chess and T40I10D100K databases]

DEFINITION 3.6 (Local Closed Itemset) An itemset X derived under node n is defined as locally closed if there is no itemset Y (⊃ X) derived under the same node n with σ(Y) = σ(X).

DEFINITION 3.7 (Global Closed Itemset) An itemset X derived under node n is defined as globally closed if there is no itemset Y (⊃ X) derived under any node k (k ∈ set of all nodes) with σ(Y) = σ(X).
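Definitions 3.6 and 3.7 translate directly into code. Given the supports of the itemsets enumerated under each node, the following Python sketch (illustrative; the naming is ours) tests closedness by scanning for proper supersets with equal support:

    def is_locally_closed(X, node_itemsets):
        """node_itemsets: dict mapping frozenset -> support, all derived under one node."""
        sx = node_itemsets[X]
        return not any(Y > X and s == sx for Y, s in node_itemsets.items())

    def is_globally_closed(X, sx, all_nodes):
        """all_nodes: iterable of per-node dicts (Definition 3.7 quantifies over all nodes)."""
        return not any(Y > X and s == sx
                       for node in all_nodes for Y, s in node.items())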
3.3.1 Intra-Node Closed Itemset Mining

In intra-node closed itemset mining, we mine the locally closed itemsets from the projected bit vector database of each node. As shown in the next proposition, itemsets that are not locally closed are guaranteed not to be globally closed. Such itemsets can therefore be excluded from further consideration.

PROPOSITION 3.2: For any given itemset X derived under node n, if X is not locally closed then it is not globally closed.

PROOF. Since X is not locally closed, ∃Y derived under node n s.t. X ⊂ Y and σ(Y) = σ(X). Therefore, by Definition 3.7, X cannot be globally closed. ∎

In general, to generate the frequent itemsets of a node, the bit vectors of all distinct pairs of itemsets are intersected and the cardinality of the resulting bit vector is checked. This is carried out recursively in a depth first manner until all the itemsets are enumerated. For example, in the itemset enumeration tree given in Figure 3.5, for itemset {a} we generate all of its combinations ({ab}, {ac}, {ad}). Then, starting from {ab}, ({abc}, {abd}) are generated. If the support of {ab} is identical to the support of one of its immediate supersets, then {ab} is marked as not closed. Note that any itemset {i1, i2, ..., ik} generated under a node n must have its node label appended as {i1, i2, ..., ik, n}. We have omitted item n from the set enumeration tree in Figure 3.5 for brevity. Similar to several past algorithms [BCG01, LOP06a, PHM00, ZH02], we also use two additional pruning techniques to rapidly determine the local closedness of a frequent itemset once it is generated.

PROPOSITION 3.3 (sub-itemset pruning): For a frequent itemset X and an already found closed itemset Y, if X ⊂ Y and σ(X) = σ(Y), then X and all of X's descendant itemsets in the set enumeration tree are not closed.

PROOF. Let X and Y be frequent itemsets in the set enumeration tree s.t. X ⊂ Y and σ(X) = σ(Y). Since Y has already been enumerated according to the set enumeration order and found to be closed, Y can generate the itemset Y ∪ X. Since σ(X) = σ(Y), tid-set(X) = tid-set(Y), and (Y ∪ X) ⊃ X, itemset X is not closed. Similarly, any descendant itemset Xi of X can be generated by Y according to the set enumeration order and is therefore not closed. ∎

PROPOSITION 3.4 (item merging): For a frequent itemset X and an already found frequent itemset Y, if tid-set(X) ⊆ tid-set(Y) and Y ⊄ X, then X and all of X's descendant itemsets in the set enumeration tree are not closed.

PROOF. Let Y be a frequent itemset already enumerated and let X be a frequent itemset in the set enumeration tree s.t. Y ⊄ X. According to the itemset enumeration order, Y's subtree must contain Y ∪ X, which has already been enumerated. So, if tid-set(X) ⊆ tid-set(Y), then σ(Y ∪ X) = σ(X). Since (Y ∪ X) ⊃ X, X is not closed by the definition of a closed itemset. Further, any descendant itemset Xi of X can be enumerated by Y as Y ∪ Xi with the same support count. Thus X's descendant itemsets are also not closed. ∎

[Figure 3.5. Itemset enumeration tree (search space) of a node, traversed in a depth first manner: level 1 {a}, ...; level 2 {ab}, {ac}, {ad}, {bc}, ...; level 3 {abc}, {abd}, {acd}, {bcd}; level 4 {abcd}]
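To illustrate the enumeration order and the effect of Proposition 3.3, the following Python sketch performs the depth first enumeration over a node's projected (vertical) database and discards prefixes that have an immediate superset with equal support. It is a simplified sketch of the general idea, not PGMiner's implementation: it checks only extensions within the tail, so a complete miner would additionally apply the item merging test of Proposition 3.4 against previously mined itemsets.

    MINSUP = 2

    def dfs(prefix, tids, tail, vertical, results):
        """Depth first set enumeration; vertical maps item -> set of transaction ids."""
        has_equal_support_ext = False
        for i, item in enumerate(tail):
            t = tids & vertical[item]
            if len(t) < MINSUP:
                continue
            if len(t) == len(tids):
                has_equal_support_ext = True  # an immediate superset has equal support
            dfs(prefix | {item}, t, tail[i + 1:], vertical, results)
        if prefix and not has_equal_support_ext:
            results[frozenset(prefix)] = len(tids)

    vertical = {'a': {0, 1, 2}, 'b': {0, 2, 3}, 'c': {1, 3}, 'd': {0, 3}}
    results = {}
    dfs(set(), {0, 1, 2, 3}, list('abcd'), vertical, results)
    print(results)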
A direct implementation of sub-itemset pruning requires storing a possibly large set of closed itemsets and performing subset checking to determine whether an itemset is set-included in a superset. To reduce these overheads, we limit the applicability of this pruning strategy to itemsets between two successive levels of the depth first search space. For example, itemset {ab} at level 2 generates its level-3 itemsets ({abc}, {abd}). Based on Proposition 3.3, if {abc} and {ac} have identical support counts, we can prune {ac} and its subtree.

Applying the item merging proposition to an itemset X requires that we perform subset checking of X's bitmap against the bitmaps of all the processed (i.e., already mined) local itemsets from level 1 down to the parent level of itemset X. For example, if the itemset is {bcd}, we need to check whether its bitmap is a subset of the bitmap of a. If it is a subset of bitmap(a), then {bcd} is not closed according to Proposition 3.4. In our vertical bit vector representation, such bitmap subset checking can be performed very fast.

The main advantage of local closed itemset mining is that itemset generation and support counting are very fast, since the projected database contains short bit vectors. Unlike FPclose and CLOSET+, the information needed for support counting is locally available, and there is no need to traverse any other nodes. Application of both propositions ensures that we generate exactly the complete set of local closed itemsets for each node. In the next section, we develop an efficient flow based pruning strategy to verify whether the local closed itemsets are also globally closed.

3.3.2 Inter-Node Pruning

In order to develop inter-node closed itemset pruning, we view the PrefixGraph structure as a network with transactions flowing through the nodes. The problem of discovering a global closed itemset can therefore be mapped to a network flow problem. Let us first analyze the suffix links of the PrefixGraph structure. For a node n in the PrefixGraph G, we define the out-neighborhood and in-neighborhood of n by N⁺(n) = {m ∈ V(G) | (n, m) ∈ E(G)} and N⁻(n) = {m ∈ V(G) | (m, n) ∈ E(G)}, respectively (here V(G) and E(G) are the sets of nodes and edges, respectively). For an edge (n, m) of the PrefixGraph G, f(n, m) is the flow along the edge, taken to be the set of transactions that flow through the edge (n, m). Furthermore, we have 0 ≤ |f(n, m)| ≤ σ({n, m}), where |f(n, m)| denotes the number of transactions.

Based on this, for any node n, its outgoing flow can be defined as OutF(n) = ∪_{m ∈ N⁺(n)} f(n, m) and its incoming flow as InF(n) = ∪_{m ∈ N⁻(n)} f(m, n). More specifically, we denote by f_X(n, m) all the transactions containing itemset X that flow from node n to m. Then, for a given itemset X derived under node n, the outgoing flow of X can be defined as OutF_X(n) = ∪_{m ∈ N⁺(n)} f_X(n, m). Similarly, the incoming flow of itemset X derived under node n can be defined as InF_X(n) = ∪_{m ∈ N⁻(n)} f_X(m, n). Here we give some properties of this flow based representation.

POSTULATE 3.1: Given an itemset X derived under node n, where |X| ≥ 2, the following properties hold:
i.   σ(X) = |InF_X(n)|
ii.  OutF_X(n) ⊆ InF_X(n)
iii. ∀m: OutF_X(n) ⊇ InF_Xm(m), where n < m

COROLLARY 3.1: An itemset X derived under node n is not globally closed if and only if there exists a node m such that InF_X(n) = InF_Xm(m).

THEOREM 3.3: For any locally closed itemset X derived under node n, if |InF_X(n)| − |OutF_X(n)| > 0, then X is a globally closed itemset.

PROOF. To construct the proof by contradiction, assume that X is not globally closed but |InF_X(n)| > |OutF_X(n)|. By Corollary 3.1, ∃m: InF_X(n) = InF_Xm(m). From the third property of Postulate 3.1, |OutF_X(n)| ≥ |InF_Xm(m)|. Putting these together, it follows that |InF_X(n)| ≤ |OutF_X(n)|, which contradicts our initial assumption. Thus, X must be globally closed. ∎

According to this theorem, if the bitmap of a local closed itemset X derived under node n has at least one transaction that terminates at node n (i.e., a transaction that does not flow to other nodes), then X is globally closed. In our PrefixGraph structure, all we need to do is examine the bits in the terminating bin of the corresponding itemset's bitmap. If at least one bit is '1' in the terminating bin of the itemset, then that itemset is globally closed. This is a very fast operation that requires checking only the itemset's own bit vector to determine global closedness.
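In code, the two fast flow-based tests take the following shape. The sketch below assumes each node's projected database is laid out as contiguous bins, one per suffix link plus a terminating bin, and that the itemset's bitmap has already been computed by intersection (the bin layout and all names here are ours, for illustration):

    def fast_closedness_test(bitmap, bins):
        """bitmap: list of 0/1 bits for the itemset, laid out bin by bin.
        bins: list of (start, end, suffix_node); suffix_node is None for the
        terminating bin. Returns 'closed', 'not_closed', or 'unknown'."""
        live = []  # suffix links that still carry transactions of this itemset
        for start, end, suffix in bins:
            if any(bitmap[start:end]):
                if suffix is None:
                    return 'closed'      # Theorem 3.3: a transaction terminates here
                live.append(suffix)
        if len(live) == 1:
            return 'not_closed'          # Theorem 3.4: all flow goes to a single node
        return 'unknown'                 # fall back to visiting the suffix nodes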
THEOREM 3.4: For any itemset X derived under node n, if
i.   X is locally closed, and
ii.  InF_X(n) = OutF_X(n), and
iii. X has exactly one suffix link, to node m,
then X is not a globally closed itemset.

PROOF. Let InF_X(n) = OutF_X(n). Since X has exactly one suffix link to a node m, OutF_X(n) = InF_Xm(m). Putting these together, we obtain InF_X(n) = InF_Xm(m), which according to Corollary 3.1 means that X is not globally closed. ∎

Theorem 3.4 states that if all of the transactions that belong to itemset X flow to exactly one other node, then X is not closed. In the PrefixGraph representation, once an itemset is generated, its links can be analyzed by checking the bins of its bit vector. Based on the number of links, we can decide whether the itemset is not closed.

For the remaining local closed itemsets, for which the previous theorems are inapplicable, we need to test whether they are globally closed. In order to determine the global closedness of a local closed itemset, we would have to visit every suffix node and compare the support of its corresponding superset, which is a very expensive operation. The following theorem reduces the number of such nodes that need to be visited.

THEOREM 3.5: Let X be any itemset derived under node n and let t be the Farthest-Node of n w.r.t. itemset X. Then for any itemset X' ⊃ X derived under node m, n < m < t, σ(X') ≠ σ(X).

PROOF. Let adj_X(n) = (n1, n2, ..., nm, t) be the set of nodes to which itemset X can have an outgoing flow, with n < n1 < n2 < ... < t.

[Figure 3.16. Amount of memory (in MB) required for T40I10D100K (CHARM, FPclose, DCI-Close, PGMiner)]

[Figure 3.17. Amount of memory (in MB) required for Pumsb (CHARM, FPclose, DCI-Close, PGMiner)]

[Figure 3.18. Amount of memory (in MB) required for Kosarak (FPclose, DCI-Close, PGMiner)]

3.5.4 Scalability

We have also measured the execution time of all the algorithms while increasing the number of transactions gradually. We use the T50I10DxK data set, where x is varied from 25,000 transactions (DB size 6.9 MB) to 50 million transactions (DB size 13.9 GB), with a minimum support threshold of 0.1%. When experimenting with these databases we used a server (2 GHz) with 4 GB of memory, since these databases are of gigabyte size. The execution times of all the algorithms are shown in Figure 3.19. The experimental results revealed that CHARM, FPclose, and DCI-Close could not go beyond 1 million transactions (1000K) on this database set. FPclose and DCI-Close crashed on the T50I10D5000K dataset, and CHARM did not finish even after 2 hours. Analysis of memory usage for these algorithms revealed that they consume a large amount of memory. On the 5000K dataset, the FPclose algorithm fails because it consumed all the available memory and was killed by the system when trying to allocate more; see Figure 3.20.
[Figure 3.19. Execution time versus number of transactions (K = 1000)]

[Figure 3.20. Memory usage of the algorithms for large databases (K = 1000)]

The memory consumption of DCI-Close is remarkably high even for the 1000K dataset, and it was also killed by the system when trying to allocate a larger block in the 5000K case. Note that PGMiner was able to reach 50 million transactions easily, showing remarkably low memory usage. As shown in Figure 3.19, PGMiner exhibits impressive scalability when mining larger databases.

3.5.5 Effectiveness of the Flow Based Pruning

In our algorithm, when a local closed itemset is discovered, we first apply Theorem 3.2, and if it cannot determine the closedness of the itemset, we apply Theorem 3.3. Table 3.4 shows the percentage of itemsets resolved by each of these theorems. For example, in the WebDocs dataset we were able to determine 94.1% of the total local closed itemsets as either globally closed or not by using Theorem 3.2. Of the remaining itemsets (i.e., 5.9%), 85.4% were resolved by Theorem 3.3. Table 3.4 clearly shows that both Theorem 3.2 and Theorem 3.3 are capable of detecting the global closedness of many of the local closed itemsets of a database. Moreover, these two techniques can be implemented easily, and this is one of the key factors in achieving the fast performance of our algorithm.

Table 3.4. Evaluation of the global closedness techniques

    Data Set (min. sup.)      Theorem 3.2    Theorem 3.3
    Chess (30%)               91.6%          8.0%
    WebDocs (10%)             94.1%          85.4%
    Pumsb (45%)               91.9%          30.5%
    Kosarak (0.08%)           65.5%          68.1%
    T20I8D500K (0.01%)        91.7%          26.4%
    T40I10D100K (0.1%)        67.6%          40.3%

3.6 Summary

In this chapter, we introduced a novel data representation called PrefixGraph, which leverages some of the positive aspects of existing representations. From the FP-tree, it borrows the idea of projecting a database onto different nodes of a graph, but without the extra cost of traversing multiple branches of a tree to collect frequency information. A PrefixGraph also uses bit vectors to encode the database projection at each node. However, the length of its bit vectors is considerably shorter than that used by existing vertical bit vector representations.

The size of the PrefixGraph structure is quite moderate, and its memory requirements do not grow as rapidly as those of other algorithms. Our proposed algorithm, called PGMiner, employs several effective itemset pruning strategies derived from network flow analysis. These strategies can be adapted to other existing algorithms (such as CLOSET [PHM00]) that use projected databases to prune their non-closed itemsets. Furthermore, PGMiner outperforms FPclose [GZ03], DCI-Close [LOP06a], and CHARM [ZH02], three state-of-the-art closed itemset mining algorithms, by an order of magnitude in both time and memory requirements.

4 OutRank: Mining Anomalous Data

This chapter explores the use of stochastic graph-based methods for anomalous pattern mining.
The main contributions of this work are as follows:

• We investigate the effectiveness of the random-walk approach for anomaly detection and show that our proposed framework, called OutRank (Outlier Ranking), is capable of detecting small clusters of outliers, which are hard to detect with the existing approaches, as discussed in Section 1.2.
• We investigate two different approaches for constructing the graph based representation of objects upon which the random walk model is applied.
• We also analyze different choices of similarity measures for the random walk model and compare the performance with existing anomaly detection methods.

The remainder of this chapter is organized as follows. In Section 4.1, we discuss issues in the existing anomaly detection methods and introduce our solution. Section 4.2 presents our proposed anomaly detection model. Section 4.3 presents several anomaly detection algorithms. We perform an extensive performance evaluation in Section 4.4.

4.1 Anomaly Detection and its Issues

Anomalies (or outliers) are aberrant observations whose characteristics deviate significantly from the majority of the data. Anomaly detection has huge potential benefits in a variety of applications (e.g., computer intrusions, surveillance and auditing, and failures in mechanical structures, to name a few). Many innovative anomaly detection algorithms have been developed, including statistical-based [Esk00][Lew94], depth-based [JKN98][PS88], distance-based [BS03][JTH01][KNT00][RRS00], and density-based [BKN+00] methods.

These approaches, however, focus mostly on the efficiency of anomaly detection rather than the quality of the solution. Therefore, when they are applied to real-world applications across many domains, they show high false alarm rates. For instance, in intrusion detection [KV03], small clusters of outliers often correspond to interesting events such as denial-of-service or worm attacks. Although existing density-based algorithms show a higher detection rate than distance-based algorithms on datasets with varying densities, they can be less effective at identifying small clusters of outliers. This is because these algorithms consider the density of a predefined neighborhood for anomaly detection, and in some cases small clusters of outliers with density similar to that of normal patterns cannot be distinguished.

This chapter explores the use of random walk models as an alternative to previously used anomalous pattern mining algorithms. The heart of this approach is to represent the underlying dataset as a weighted undirected graph, where each node represents an object and each (weighted) edge represents the similarity between objects. By transforming the adjacency matrix of the graph into transition probabilities, we model the problem as a Markov chain process and find the dominant eigenvector of the transition probability matrix. The values in the eigenvector are then used to determine the outlierness of each object.

The random-walk model is designed to find nodes that are most "central" to the graph. To illustrate, consider graph-A shown in the top left panel of Figure 4.1, which consists of 4 nodes connected to a central node labeled node 1. Upon applying the random walk model to the transition matrix constructed from graph-A, the probability score of each node is plotted in the right hand panel of Figure 4.1. Clearly, node 1 has the highest score compared to the remaining nodes.
To illustrate the effect of outliers on the random walk model, consider graph-B shown in Figure 4.1, which is obtained by removing the edges between nodes (3,5) and nodes (4,5) of graph-A. Node 5 can be considered an outlier (anomaly) of the graph. As can be seen from the probability score distribution for graph-B, the random walk model assigns the lowest score to the outlying node.

[Figure 4.1. Outlier detection with random walk: graph-A (a central node 1 connected to nodes 2-5), graph-B (edges (3,5) and (4,5) removed), and the probability score of each node for both graphs]

A major advantage of our random walk approach is that it can effectively capture not only outlying objects scattered uniformly, but also small clusters of outliers. This is because the random walk based model defines the outlierness of an object with respect to the entire graph of objects; i.e., it views outlierness from a global perspective. In contrast, existing methods consider a neighborhood to define outlierness, as discussed earlier. Nevertheless, one potential challenge of the random walk approach is determining the neighborhood graph from which the outliers can be detected. In the next section, we will show how objects can be modeled as a graph for the random walk approach.

4.2 Modeling Anomalies Using a Graph

In this section, we develop our framework to discover anomalous objects in a database. Most anomaly detection schemes adopt Hawkins' definition [Haw80] of outliers and thus assume that outliers are isolated points far away from other normal points. As such, these outliers can be easily detected by existing distance or density based algorithms. However, in this work we focus on outliers that might be concentrated in certain regions, thus forming small clusters of outliers.

We take a graph based approach to solve this problem. Here we model the objects in the database as a graph, where each node represents an object and each edge represents a similarity between objects. Each edge is also assigned a weight, which is equal to the similarity between the nodes of the corresponding edge. There are two major issues that need to be addressed: first, how to determine the link structure of the graph based on the similarity of nodes; second, how to discover the outlying objects using this graph model. The following sections describe these issues in detail.

4.2.1 Graph Representation

In order to determine the link structure of the graph, we compute the similarity between every pair of objects. Let X = (x_1, x_2, ..., x_d) and Y = (y_1, y_2, ..., y_d) be the vector representations of any two objects drawn from a d-dimensional space R^d. While there are many possible choices of similarity measures, we experiment with the following metrics:

Cosine Similarity. The similarity between X and Y is defined as follows:

    cosine_similarity(X, Y) = 0   if X = Y
                            = (Σ_{k=1}^{d} x_k y_k) / (√(Σ_{k=1}^{d} x_k²) · √(Σ_{k=1}^{d} y_k²))   otherwise    (4.1)

RBF Kernel. The similarity between X and Y is defined as follows (where σ is a user specified kernel width parameter):

    rbf_similarity(X, Y) = 0   if X = Y
                         = (1 / (√(2π) σ)) · exp(−‖X − Y‖² / (2σ²))   otherwise    (4.2)

Note that the similarity between an object and itself is set to zero to avoid self loops in the underlying graph representation. Such loops are ignored since they are common to every node, and therefore not very useful for distinguishing normal objects from outliers. The relationship between all pairs of objects in the database is represented by the n×n similarity matrix Sim, where n is the number of objects. We use the similarity matrix as the adjacency matrix of the graph. In this graph representation, each node corresponds to an object in the database. Two nodes X and Y are connected by an edge if their similarity is greater than zero, and the weight of the edge is taken as Sim(X, Y).
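A direct NumPy rendering of this construction might look as follows (a sketch under the definitions above; the RBF normalizing constant is kept for fidelity to Equation (4.2), although it does not affect the resulting ranking):

    import numpy as np

    def similarity_matrix(X, kind='cosine', sigma=1.0):
        """X: n x d data matrix. Returns the n x n similarity matrix with zero diagonal."""
        if kind == 'cosine':
            norms = np.linalg.norm(X, axis=1, keepdims=True)
            S = (X @ X.T) / (norms * norms.T)          # Equation (4.1)
        else:
            sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
            S = np.exp(-sq / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)  # Eq. (4.2)
        np.fill_diagonal(S, 0.0)                        # no self loops
        return S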
4.2.2 Markov Chain Model

Based on the graph representation, we model the problem of outlier detection as a Markov chain process. The Markov chain modeled here corresponds to a random walk on a graph defined by the link structure of the nodes. We hypothesize that, under this representation, if an object has low connectivity to the other objects in the graph, then it is more likely to be an outlier.

Connectivity is determined in terms of the weighted votes given by the other nodes in the graph. Nodes with higher connectivity convey votes with more weight than nodes with lower connectivity. The weight of the vote from any node is scaled by the number of nodes adjacent to the source node. The connectivity of a node is computed iteratively using the following expression.

DEFINITION 4.1 (Connectivity) The connectivity c(u) of node u is defined as:

    c_t(u) = a                                     if t = 0
           = Σ_{v ∈ adj(u)} c_{t−1}(v) / |v|       otherwise    (4.3)

where a is the initial value, t is the iteration step, adj(u) is the set of nodes linked to node u, and |v| denotes the degree of node v.

Given n nodes v_1, v_2, ..., v_n, we can initially assign each node a uniform connectivity value (e.g., c_0(v_i) = 1/n, 1 ≤ i ≤ n) and recursively apply Equation (4.3) to refine its value, taking into account the modified connectivity values computed for its neighboring nodes. This iterative procedure is known as the power method and is often used to find the dominant eigenvector of a stochastic matrix. Upon convergence, Equation (4.3) can be written in matrix notation as follows:

    c = Sᵀ c    (4.4)

where S is the transition matrix and c is the stationary distribution representing the connectivity value of each object in the dataset. For a general transition matrix, neither the existence nor the uniqueness of a stationary distribution is guaranteed unless the transition matrix is irreducible and aperiodic. These properties follow from the well-known Perron-Frobenius theorem [IM76].

The transition matrix S of our Markov model is obtained by normalizing the rows of the similarity matrix Sim defined earlier:

    S[i, j] = Sim[i, j] / Σ_{k=1}^{n} Sim[i, k]    (4.5)

This normalization ensures that the elements of each row of the transition matrix sum to 1, which is an essential property of a stochastic matrix. It is also assumed that the transition probabilities in S do not change over time. In general, the transition matrix S computed from data might not be irreducible or aperiodic. To ensure convergence, instead of using Equation (4.4), we may compute the steady state distribution of the following modified matrix equation:

    c = (d/n)·1 + (1 − d) Sᵀ c    (4.6)

where S is the row normalized transition matrix, d is known as the damping factor, and 1 is the unit column vector [1 1 ... 1]ᵀ. For a proof of convergence of this equation, readers should refer to [S98]. Intuitively, the modification can be viewed as allowing the random walker to transit to any node in the graph with probability d, even if that node is not adjacent to the currently visited node.
As an example, consider the 2-dimensional data with 11 objects shown in Figure 4.2. Clearly, objects 1 and 2 are outliers, while the rest of the objects are normal.

    Object    x      y
    1         4.0    2.0
    2         4.5    1.5
    3         2.0    4.0
    4         2.0    4.5
    5         2.0    5.0
    6         2.5    4.0
    7         2.5    4.5
    8         2.5    5.0
    9         3.0    4.0
    10        3.0    4.5
    11        3.0    5.0

[Figure 4.2. Sample 2-D data set]

Using uniform probabilities as the initial connectivity vector and applying Equation (4.6), the connectivity vector converges to its stationary distribution after 112 iterations (where d is chosen to be 0.1). The final connectivity values and the rank of each object are shown in Table 4.1. Note that object 1 and object 2 are correctly identified as the most outlying objects.

Table 4.1. Outlier rank for the sample 2-D dataset

    Object    Connectivity    Rank
    1         0.0835          2
    2         0.0764          1
    3         0.0930          5
    4         0.0922          4
    5         0.0914          3
    6         0.0940          9
    7         0.0936          7
    8         0.0930          6
    9         0.0942          10
    10        0.0942          11
    11        0.0939          8

4.3 Anomaly Detection Algorithms

This section describes our proposed algorithms based on the above random walk model for outlier detection. Two variants of the OutRank algorithm are presented.

4.3.1 OutRank-a: Using Object Similarity

In the OutRank-a algorithm, we form the transition matrix of the Markov chain model using the similarity matrix discussed earlier. We then use the power method to compute the stationary distribution of its connectivity vector. The connectivity vector is initialized to a uniform probability distribution (1/n, where n is the total number of objects), and the damping factor is set to 0.1. The pseudo code for the OutRank-a algorithm is shown below.

Algorithm 3 (OutRank-a)
Input: Similarity matrix Sim_{n×n} with n objects, error tolerance ε.
Output: Outlier ranks c.
Method:
1:  for i = 1 to n do                       // form the transition matrix S
2:      let totSim = 0.0
3:      for j = 1 to n do totSim = totSim + Sim[i][j] end
4:      for j = 1 to n do S[i][j] = Sim[i][j] / totSim end
5:  end
6:  let d = 0.1                             // damping factor
7:  let t = 0
8:  let c_0 = (1/n)·1                       // arbitrary assignment, 1 = column vector
9:  repeat
10:     c_{t+1} = (d/n)·1 + (1 − d) Sᵀ c_t
11:     δ = ‖c_{t+1} − c_t‖
12:     t = t + 1
13: until (δ < ε)
14: return c
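A compact NumPy version of Algorithm 3 is given below for reference (an illustrative sketch, not the implementation used in our experiments; the convergence test is assumed to use the L1 norm):

    import numpy as np

    def outrank_a(sim, d=0.1, eps=1e-8, max_iter=1000):
        """Random-walk outlier ranking: low stationary connectivity = more outlying."""
        n = sim.shape[0]
        S = sim / sim.sum(axis=1, keepdims=True)   # row-normalized transition matrix
        c = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            c_new = d / n + (1 - d) * S.T @ c      # Equation (4.6)
            if np.abs(c_new - c).sum() < eps:      # L1 convergence test
                return c_new
            c = c_new
        return c

    # Objects can then be ranked in increasing order of connectivity:
    # ranks = np.argsort(outrank_a(sim))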
For example, suppose frequent subgraph f, has a super graph f’ found to be pre- sent in g’. Now if there is another frequent subgraphf; that happens to have the same f ’as its super graph, there is no need to check it again, and also all the super graphs of f2 can be safely pruned. This makes this proposition highly effective at reducing the amount of isomorphism check. Based on this discussion, we present the algorithm for generating a graph and a feature vector below: 106 # CHI-1". my." ' . I A Algorithm 10 (GenNextGraph) Input: Graph g, frequent feature vector FVg, operator op Output: Generated graph g’and corresponding feature vector FVg: Method: 1 let FVg = FVg 2 if op = ADD_EDGE or op = ADD_VERTEX 3 select edge (v1,v2) to be added // v2 is a new vertex in op ADD_ VERTEX 4 modify g by adding (v1,v2) to form g’ 5: for each graph f e FVg 6‘ select immediate super graph f ’in subgraph lattice L 7 prune f ’using Propositions 5.1-5 .4 8: if not pruned check if f ’is isomorphic to g’ //check presence off’ 9: update FVg’to reflect the presence off’ in g’ 10: else if 0 = DEL_EDGE 11: select edge e e g s.t. g will not become disconnected when e is removed 12: if edge e exists 13: modify g by removing e to form g’ 14: for each graph f 6 W8 15: check if f is isomorphic to g’ //check presence off 16: update FVg ’to reflect the presence off’ in g’ 17: else let g’= g // op DEL_EDGE is not successfid 18: return g’and the feature vector FV,: Algorithm GenNextGraph is used to generate a new graph g ’fiom the current graph g by applying the operators as described in the previous section. Also, the feature vector is up- dated for g ’ to reflect the presence of frequent subgraphs. In the case of operator DEL_EDGE, edges are deleted in such a way so that the resultant graph does not become disconnected. If such an edge is not present, then the original graph g is not modified. 5.5 Maximum Entropy Based Supervised Anomaly De- tection Over the years, several innovative supervised classification algorithms for graph-based data have been developed [DKN+05][WKO6][HGW04][KMMO4][LYY+05] [BB02]. Most of the algorithms are based on the underlying assumption that the intrinsic proper- ties of a graph are rendered by its underlying substructures (nodes, edges, paths, strongly connected components, trees, subgraphs, etc). These substructure-based algorithms iden- 107 V‘hnfih. .-. tify the important components present in each graph and subsequently use them to dis- criminate graphs from different classes. Most of the early works in graph-based classifi- cation have focused on the use of heuristic based search techniques to discover such dis- criminative components present in the graph database [BB02]. More recently, however, fi‘equent sub graph based approach was proposed in which frequent sub graph mining al- gorithms are employed to generate all the subgraphs that occur a sufficiently large num- ber of times in the graph database and to construct feature vectors based on the extracted subgraphs [DKN+05]. By transforming each graph into its corresponding feature vector, we can subsequently apply any of the existing classification algorithms to build a classi- fication model for the graph database. Among the state of the art classification algorithms includes support vector machines (SVM) [J 99], Adaboost [SS99] [FHTOO] [W05], and maximum entropy model. The SVM approach has been widely applied to the classification of graph objects. Cai et a1. use SVM to classify protein sequences [CWS+03]. Dobson et a1. 
5.5 Maximum Entropy Based Supervised Anomaly Detection

Over the years, several innovative supervised classification algorithms for graph-based data have been developed [DKN+05][WK06][HGW04][KMM04][LYY+05][BB02]. Most of these algorithms are based on the underlying assumption that the intrinsic properties of a graph are rendered by its underlying substructures (nodes, edges, paths, strongly connected components, trees, subgraphs, etc.). These substructure-based algorithms identify the important components present in each graph and subsequently use them to discriminate graphs of different classes. Most of the early work in graph-based classification focused on the use of heuristic search techniques to discover such discriminative components in the graph database [BB02]. More recently, however, a frequent subgraph based approach was proposed in which frequent subgraph mining algorithms are employed to generate all the subgraphs that occur a sufficiently large number of times in the graph database and to construct feature vectors based on the extracted subgraphs [DKN+05]. By transforming each graph into its corresponding feature vector, we can subsequently apply any existing classification algorithm to build a classification model for the graph database.

The state of the art classification algorithms include support vector machines (SVM) [J99], AdaBoost [SS99][FHT00][W05], and the maximum entropy model. The SVM approach has been widely applied to the classification of graph objects. Cai et al. use SVM to classify protein sequences [CWS+03]. Dobson et al. applied SVM to distinguish enzymes from non-enzyme proteins [DD03]. Much of the recent work on applying SVM to graph-based applications focuses on how to build efficient and valid kernel functions for graphs. Most of these approaches construct a feature space by decomposing a graph into its underlying subgraphs and counting the number of occurrences of these subgraphs. Watkins [W99] shows that the scores produced by certain dynamic alignment algorithms can be considered valid kernel functions. Kashima et al. [KTI03] use the counts of paths produced by random walks on a graph to generate the kernel. Borgwardt et al. [BOS+05] construct the kernel by combining similarity measures based on different data types for different sources of information, including sequential, structural, and chemical.

In this section, we present a principled approach for building a global classification model from the local patterns using the maximum entropy principle. The idea behind this approach is to learn the conditional probability distribution of the class of a graph given its underlying features (i.e., frequent subgraphs), using the supports of the features as constraints imposed on the probability model. Using an iterative technique known as the improved iterative scaling algorithm [B97], the parameters of the probability model can be estimated from the data. While the idea of using the maximum entropy principle to construct a global model from its underlying local patterns is not new [MS99], to the best of our knowledge this approach has not been applied to graph-based data.

5.5.1 Maximum Entropy Model in a Supervised Setting

In Section 5.4.1, we discussed how maximum entropy can be used in an unsupervised setting. Here we discuss the supervised case, where each graph is assigned a class label y chosen from a set of discrete labels Y. In this work, we assume that the classes are binary, even though the approach can be generalized to multi-class problems using one-versus-one, one-versus-all, or error correcting output coding (ECOC) methods.

Our objective is to find a global probability model P(y | X_g) using the maximum entropy principle. To apply the maximum entropy principle, we first define a set of features constructed from the frequent subgraphs. We then create a database of binary transactions, where each transaction corresponds to a graph g ∈ G. Each transaction also contains a set of binary features, also called items, each of which is defined as follows:

    f_g(X_i, y) = 1   if X_i is present in g and g has class label y
                = 0   otherwise    (5.12)

As previously noted, the maximum entropy principle seeks the probability model P* that maximizes the following objective function:

    P* = max_P − Σ_{g ∈ G} P(y | X_g) log P(y | X_g)    (5.13)

subject to the following constraints:

    Σ_g P(y | X_g) f_g(X_i, y) = s_i    (5.14)

where s_i corresponds to the support of the subgraph X_i in the database G. Note that Equation (5.14) states that the expected value of every feature is constrained to be identical to the support of the corresponding subgraph. It can be shown that the probability model that maximizes Equation (5.13) subject to the linear constraints in Equation (5.14) has the following exponential form:

    P*(y | X_g) = (1 / Z(X_g)) exp[ Σ_i λ_i f_g(X_i, y) ]    (5.15)

where Z(X_g) is a normalization factor and the λ_i are the parameters to be estimated.
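Concretely, given the frequent subgraphs mined by FSG, the model of Equation (5.15) can be evaluated as follows (an illustrative sketch; `contains` abstracts the subgraph isomorphism test and is an assumed helper, not something defined in this thesis):

    import numpy as np

    def feature_vector(g, subgraphs, contains):
        """Binary indicators of Equation (5.12): which frequent subgraphs occur in g."""
        return np.array([1.0 if contains(Xi, g) else 0.0 for Xi in subgraphs])

    def predict_proba(x, lam):
        """P*(y | X_g) of Equation (5.15); lam[y] holds the weights lambda_i for class y."""
        scores = lam @ x             # one score per class
        scores -= scores.max()       # numerical stability
        e = np.exp(scores)
        return e / e.sum()           # dividing by e.sum() realizes 1 / Z(X_g)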
5.5.2 Parameter Estimation

Let Λ = {λ_1, λ_2, ..., λ_N} be the set of parameters to be estimated. The Λ vector can be interpreted as the significance, or weight, of the corresponding features and can be estimated using the maximum likelihood approach. Specifically, the likelihood function for the training data G is:

    L(Λ) = Π_{g ∈ G} P*(y | X_g)^{p̃(X_g, y)}    (5.16)

where p̃(X, y) is the empirical count of (X, y) in the training data G.

The conditional maximum likelihood model above is solved using the Improved Iterative Scaling (IIS) algorithm [BPP96]. IIS can be designed to prevent overfitting in the resulting model P*(y | X) by incorporating a Gaussian prior. Here, instead of maximizing the likelihood function, we maximize the penalized function:

    L'(Λ) = [ Π_{g ∈ G} P*(y | X_g)^{p̃(X_g, y)} ] · Π_i exp(−λ_i² / (2σ_i²))
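For completeness, a small sketch of parameter estimation is given below. Instead of IIS, it uses plain gradient ascent on the penalized log-likelihood, which optimizes the same objective (a common substitute; the learning rate, prior variance, and iteration count are illustrative choices, not values from this thesis):

    import numpy as np

    def train_maxent(X, y, n_classes=2, lr=0.5, sigma2=10.0, epochs=500):
        """Gradient ascent on the Gaussian-prior penalized log-likelihood.
        X: m x N binary feature matrix (one row per graph); y: m class labels."""
        m, N = X.shape
        lam = np.zeros((n_classes, N))
        for _ in range(epochs):
            scores = X @ lam.T                            # m x n_classes
            scores -= scores.max(axis=1, keepdims=True)   # numerical stability
            P = np.exp(scores)
            P /= P.sum(axis=1, keepdims=True)             # P*(y | X_g), Equation (5.15)
            grad = np.empty_like(lam)
            for c in range(n_classes):
                # observed minus expected feature counts for class c
                grad[c] = X[y == c].sum(axis=0) - (P[:, [c]] * X).sum(axis=0)
            lam += lr * (grad - lam / sigma2) / m         # the prior shrinks the weights
        return lam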