This is to certify that the thesis entitled OPTIMIZATIONS AND APPLICATIONS OF TRIE-TREE BASED FREQUENT PATTERN MINING presented by Stuart King has been accepted towards fulfillment of the requirements for the M.S. degree in Computer Science.

Major Professor's Signature    Date

MSU is an Affirmative Action/Equal Opportunity Institution

OPTIMIZATIONS AND APPLICATIONS OF TRIE-TREE BASED FREQUENT PATTERN MINING

By Stuart King

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTERS OF COMPUTER SCIENCE

Department of Computer Science and Engineering

2006

ABSTRACT

OPTIMIZATIONS AND APPLICATIONS OF TRIE-TREE BASED FREQUENT PATTERN MINING

By Stuart King

Frequent pattern mining is an active area of research in a field of artificial intelligence called knowledge discovery. With it, we are teaching computers to discover hitherto undiscovered information about ourselves and the world around us. Traditionally, discovering patterns has been a very resource-intensive endeavor, costing both a great deal of memory and processing time. In this research, we introduce a mechanism to geometrically reduce memory requirements. This allows frequent pattern mining to be applied to new domains. To prove this, a new client application in PC Internet security is developed that takes advantage of this optimization. Additionally, research is presented on a CPU optimization for degenerate cases in the application. In combination, this thesis presents a road map for advancing frequent pattern mining in a new direction.

Copyright by Stuart King 2006

Acknowledgements

Masters theses in Computer Science are relatively rare. However, writing one gives a masters student an opportunity not just to learn science, but to contribute to the scientific community in a very real way. Computer Science is such a young field that, with the right encouragement and guidance, even minimal training can be leveraged to produce significant contributions. This work does exactly that, but it was not done alone. Any thesis takes a team to make it happen, and this one is no exception. I wish to recognize and thank several key individuals on that team who helped make this thesis possible. First and foremost is Dr. Sakti Pramanik, who helped drive my desire to contribute to the scientific community and offered both perspective and direction as he led my thesis committee. In addition, Dr. Pang-Ning Tan provided encouragement, enlightenment, and the ever-important critical challenges needed to produce strong research. Any work is always stronger when you have a partner and sounding board to bounce ideas off of. Li Jian and I had countless sessions reviewing both good and bad ideas, and his efforts helped immensely. Another seed of help came from a simple challenge made by Dr. Matt Mutka.
The challenge was to “make something practical”, and Chapter 5 is the answer to that challenge. There are many others, more than can be named here, but I wanted to make a special thank you to Linda Moore who helped me navigate the thesis process. Her experience and enlightenment helped the entire team to produce this work. Thanks to one and all. iv TABLE OF CONTENTS LIST OF TABLES ........................................................................................................... vii LIST OF FIGURES ........................................................................................................ viii CHAPTER 1: Introduction ............................................................................................... 1 PART ONE: FREQUENT PATTERN MINING CHAPTER 2: Definitions and Practices ........................................................................ 3 2.1 What is “Frequent Pattern Mining” ........................................................................ 3 2.2 Metric Datasets and Dataset parameters ................................................................. 5 2.3 Trie-Tree data structure ........................................................................................... 9 CHAPTER 3: Previous Work ........................................................................................ 10 3.1 SimpleMining: Is Not Easy To Do ....................................................................... 10 3.2 Largeltemsets and Apriori: The First Practical Algorithms ................................. 10 3.3 Trie-Tree Approaches ........................................................................................... 14 PART TWO: OPTIMIZING MINING CHAPTER 4: Using Compressed Tries in Frequent Pattern Mining ...................... 20 4. 1 Introduction ........................................................................................................... 20 4.2 Compressing Trie-Tree Tails ................................................................................ 21 4.3 Mathematical Models ............................................................................................ 23 4.4 Implementing Tails ............................................................................................... 32 4.5 Experimental Evaluation ....................................................................................... 35 4.6 Conclusions of Compressed Trie Mining ............................................................. 38 CHAPTER 5: Enhancing PC lntemet Security with Pattern Mining ....................... 40 5. 1 Introduction ........................................................................................................... 40 5.2 Previous Work ...................................................................................................... 42 5.3 Methodology ......................................................................................................... 44 5.4 Experiments .......................................................................................................... 48 5.5 ‘ Benefits of Optimizations ..................................................................................... 54 5.6 Conclusions Drawn From the Application ........................................................... 56 CHAPTER 6: Reducing CPU Requirements For Dense Datasets ......................... 58 6.1 Introduction ........................................................................................................... 
58 6.2 Trie and Frequent Patterns .................................................................................... 59 6.3 The Fri Algorithm ................................................................................................. 61 6.4 Optimizations ........................................................................................................ 64 6.5 Experiments .......................................................................................................... 66 6.6 Conclusions of Reducing CPU Requirements ...................................................... 70 CHAPTER 7: Conclusions and Future Work ............................................................. 73 APPENDIX A: Support vs Space ................................................................................. 75 BIBLIOGRAPHY ............................................................................................................. 79 vi LIST OF TABLES Table 1: Sorted Items ................................................................................................ 22 Table 2: Spyware Bundled with Kazaa .................................................................... 43 Table 3: Events and Records ..................................................................................... 47 Table 4: “Events” for a single pattern ....................................................................... 50 vii LIST OF FIGURES Figure 1: Trie-Tree .................................................................................................... 9 Figure 2: Tree Projection .......................................................................................... 15 Figure 3: FP-Tree ...................................................................................................... 17 Figure 4: Sorted Items in a Trie ................................................................................ 22 Figure 5: Compressed Trie ........................................................................................ 23 Figure 6: Mutual Independence ................................................................................ 24 Figure 7: Mutual Independence for Trie-Trees ......................................................... 26 Figure 8: Predicting d < ln(R)/1n(n) for Random Data ........................................... 29 Figure 9: Predicting d < ln(R)/1n(n) for Random Data ........................................... 29 Figure 10: Most Records Have Frequent Items ........................................................ 31 Figure 11: Most Nodes are in Tails ........................................................................... 31 Figure 12: FP-Tree .................................................................................................... 33 Figure 13: FP-Tail Before Mining FP-Tail ............................................................... 34 Figure 14: FP-Tail After Mining “A”, “B”, and “C” ............................................... 34 Figure 15: Tail Procedures ........................................................................................ 35 Figure 16: RAM for FP-Growth vs FP-Tail ............................................................. 37 Figure 17: Time for F P-Growth vs FP-Tail .............................................................. 38 Figure 18: GUI Display of TEST Computer ............................................................. 53 Figure 19: GUI Display of WORK Computer .......................................................... 
53 Figure 20: GUI Display of MEDIA Computer ......................................................... 54 Figure 21: Space Savings Due to Optimizing ........................................................... 55 Figure 22: CPU Cost of Space Saving from Optimizing ......................................... 55 viii Figure 23: Figure 24: Figure 25: Figure 26: Figure 27: Figure 28: Figure 29: Figure 30: Figure 31: Figure 32: Figure 33: Figure 34: Figure 35: CPU usage with a Dense Data ................................................................ 56 A sample dataset and its Trie .................................................................. 61 The Trie after applying CopyMerge of subTree(B:4, A26) ..................... 62 The Trie after applying CopyMerge of subTree(C:2, A26) ..................... 63 The Trie after applying CopyMerge of two leaf nodes ........................... 63 The fully-expanded Trie .......................................................................... 64 Performance comparison on T2017D500K ............................................. 67 Performance comparison on T2016D*OOK ............................................. 68 Performance comparison on T201*D500K ............................................. 69 Performance comparison on T2017D500K with varying NPats ............. 70 CPU Optimization of Degenerate Case ................................................... 71 Support vs RAM (linear) ......................................................................... 76 Support vs RAM (log) ............................................................................. 76 NOTE: Images in this thesis are presented in color. ix CHAPTER 1: Introduction Frequent pattern mining is a booming area of research on the process of discovery. In its simplest terms, it is finding patterns in large quantities of data. In a broader sense, it is a kind of artificial intelligence that reveals relationships in our complex world that we have never before seen. It took less than two years to get from the first practical algorithm for frequent pattern mining to the creation of an entire industry dedicated to the task. This thesis could not possibly cover in detail all of the various aspects of this research, but it does aim to introduce some important aspects of the field, and to present a new direction for it. To accomplish that aim, the paper is broken down into two distinct parts. The first part introduces some of the concepts, objectives, and methods of frequent pattern mining, as well as some of the more common algorithms used. The second part introduces a new algorithm to geometrically reduce the memory requirements of a large genre of frequent pattern mining algorithms. As a proof-of-concept for the potential of the algorithm, a new client application in PC lntemet security is developed that takes advantage of the optimization. As will be seen, this proves very successful. However, extending the application to do a more detailed analysis revealed a degenerate case in which CPU utilization dramatically increased. Additional research is presented on this degenerate case and how to more efficiently handle it. Combined, this thesis introduces a complete approach for a practical application of fi'equent pattern mining in a personal computing environment. PART ONE FREQUENT PATTERN MINING CHAPTER 2: Definitions and Practices 2.] What is “Frequent Pattern Mining” To better understand what frequent pattern mining is (a.k.a. 
mining'), it helps to understand some of the applications it is being used for, and to know a few key terms. To start with, we will provide a definition of what a fi'equent pattern is: Frequent Pattern: Given a set of discrete items and a set of records composed from these items, a frequent pattern is a combination of items that occurs frequently within the records. NOTE: Ideally, this means patterns that occur more ofien than can be explained by random chance; however other definitions can apply (see minimum support below). Example: Set of Discrete Items: “A”, “B”, “C”, “D”, “E” Set of Records: “ABD”, “BCD”, “ABCDE”, “BE” Frequent Pattern: “BD” occurs in three of the four records You might ask “So what, what is the big deal?”. The answer is association rules. An association rule is a relationship between items in a dataset. Finding association rules between items helps to make predictions about the future. To illustrate, consider if the “Set of Records” from the example represents a database of purchase records where “B” stands for “Bread” being purchased, and “D” stands for “Hot Dogs” being purchased. Then someone buying “Bread” has a I There are many meanings for the word mining, including the more general term data mining representing the discovery of any potentially useful knowledge from data. However, this is a frequent pattern mining thesis, so mining throughout the entire thesis will refer to “frequent pattern mining”. 75% chance of also buying “Hot Dogs”. This is an association rule and can be written “B => D2 with 7 5% confidence”. Another Association Rule would be “AC => E with 100% confidence”. However, this Association Rule needs to be tempered with the fact that the pattern “ACE” only occurs once in this database of four records. That is, the rule “AC => E” only has 25% support. Although both confidence and support are important, confidence can be calculated from support. That is, given the support of the rule’s assertion, “ACE”, and the antecedent, “AC”, the confidence can be calculated as follows: Confidence = Support of the rule’s assertions / Support of the antecedent Confidence(AC => B) = Support(“ACE”) / Support(“AC”) = 25% / 25% = 100% As a result, a minimum support is frequently specified when mining for frequent patterns. Minimum support, like it sounds, is specifying the lowest support a pattern can have and still be considered a frequent pattern. Using the example above, a minimum support of 50% would remove the pattern “ACE” from consideration. A good value for minimum support is very subjective and data dependent. For example, a 10% success rate is small, but a 10% death rate is high. What is not subjective is the definition of the terms “Support”, “Confidence”, and “Association Rule”. Formally, the definitions are as follows: Let I= {i., i;, , in }be a set ofliterals called items. Let D be a set of transactions where each transaction T is a set of items such that T g 1. Given an antecedent of A (where the items of A c l ) and a consequence C (where the items of C c: I ) such that A n C = Q, than an Associate Rule is an implication of the form A => C. As such, a transaction T is said to contain the antecedent A if 2 “=>” stands for implication, so “AC => B” is saying “condition A and C implies condition D” A g T. Similarly, a transaction T is said to contain the consequence C if C Q T. The rule holds in the transaction set D with “Confidence of c%” if c% of the transactions in D that contain A, also contain C. 
The rule A => C has a “Support of 5%” if s% of the transactions in D contain A U C. Associations between items like “Hot Dogs & Bread” or “Beer & Diapers”3 is very important for “market basket” analysis4, but frequent pattern mining has many other uses. For example, it is currently being used to identify patterns of credit card and identity theft, intrusion detection over IP networks, and cancer detection. In addition, “Knowledge Discovery and Data mining” (KDD) is an active area of research for artificial intelligence systems, and as the field grows, I expect discoveries like the effects of “El Nino”5 on global weather patterns and patterns in economic flux will be identified with mining techniques. 2.2 Metric Datasets and Dataset parameters Since the introduction in 1993 of frequent pattern mining, many common practices have been adopted for categorizing data for analyzing the performance of algorithms. Just as different sorting algorithms work best with different kinds of data, so do difi‘erent mining algorithms. Defining the different kinds of data an algorithm works best with involves defining key characteristics of the data that play a significant role in the mining process. Two types of data with known characteristics can assist in analyzing the performance of algorithms. One is a library of datasets with well known properties, and the other is a common algorithm for generating artificial datasets with desired characteristics. 3 In the FIMI'04 data mining competition, the most frequent pattern was “Beer, Diapers”. ’ Argawal et al’s research on finding simple patterns in market basket data has touched nearly every super market shopper with the advent of “Super Market ‘Club’ Cards” for tracking purchase patterns. 5 The data mining technique Conical Correlation Analysis (CCA) has been used to help find the efi'ects of El Nino, however, combining CCA with other data mining techniques may prove even more effective. Four of the common attributes used for measuring the characteristics of datasets are: I) The Number of Records in a Database 2) The Average Record Length 3) The Number of Items Within the Database 4) The Expected Pattern Length‘5 In addition, there is a qualitative attribute commonly used called pattern density. Pattern density is a measure of how many patterns a large number of records will produce7. If a relatively small number of records produce a large number of frequent patterns, the database is said to have a high pattern density. However, if a large number of diverse records only produce a small number of frequent patterns, then the database is said to have low pattern density (or its pattern density is sparse). Altering the minimum support frequently changes the density of a dataset. For example, a dataset may be dense with a minimum support of 5%, but sparse with a minimum support of 25%. One of the sparsest datasets is random data, but it is not effective in testing mining algorithms. The very objective of frequent pattern mining is to find data patterns in seemingly random data. One of the biggest challenges of mining algorithms is to efficiently wade through and filter out random data to reveal the hidden patterns. As such, when random data is mined, it turns out that new patterns are found in only a narrow range of minimum support. That is, when minimum support is above this threshold, the random data is filtered out; below this threshold, there are very few combinations of items to be found that have not already been discovered. 
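Because support, confidence, and minimum support drive everything that follows, it is worth seeing how little machinery they require. The C++ sketch below computes them directly for the four-record example of Section 2.1; it is an illustration of the definitions only (the helper names are ours, not part of any mining library), and it also shows how a minimum support of 50% eliminates the rule AC => E even though its confidence is 100%.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// A transaction is modeled as a set of single-character items, e.g. "ABD".
using Transaction = std::set<char>;

// Support of an itemset: the fraction of transactions containing every one of its items.
double support(const std::vector<Transaction>& db, const std::string& itemset) {
    int count = 0;
    for (const Transaction& t : db) {
        bool containsAll = true;
        for (char item : itemset)
            if (t.count(item) == 0) { containsAll = false; break; }
        if (containsAll) ++count;
    }
    return static_cast<double>(count) / db.size();
}

// Confidence(A => C) = Support(A union C) / Support(A).
double confidence(const std::vector<Transaction>& db,
                  const std::string& antecedent, const std::string& consequent) {
    return support(db, antecedent + consequent) / support(db, antecedent);
}

int main() {
    // The example records from Section 2.1: "ABD", "BCD", "ABCDE", "BE".
    std::vector<Transaction> db = {
        {'A','B','D'}, {'B','C','D'}, {'A','B','C','D','E'}, {'B','E'}};
    const double minSupport = 0.5;
    std::cout << "Support(BD)         = " << support(db, "BD") << "\n";         // 0.75
    std::cout << "Confidence(B => D)  = " << confidence(db, "B", "D") << "\n";  // 0.75
    std::cout << "Confidence(AC => E) = " << confidence(db, "AC", "E") << "\n"; // 1.00
    // The rule AC => E is dropped because Support(ACE) = 0.25 < minSupport.
    std::cout << "ACE frequent? " << (support(db, "ACE") >= minSupport) << "\n";
}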
6 Many factors effect pattern length, especially minimum support. However, for FIMI datasets there is frequently a known set of useful patterns that can be used to define the expected pattern length. 7 An alternate definition of a dense dataset is one in which, on average, most records contain at least half of the frequent items. Although most would agree this would produce a dense dataset, the term is frequently used in a broader sense. So what kind of data is good for frequent pattern mining? Intuitively, it is data with frequent patterns. That is, data that has combinations of items that occur together far more often than can be explained by random chance. Usually such patterns are indicative of forces that we are either ignorant of, or do not fully understand. The real world is full of such forces that interplay in a myriad of fashions. The FIMI (Frequent Itemset Mining Implementations) repository contains several well known datasets from the real world. These datasets have a variety of characteristics that make them ideal for testing the performance of different implementations of mining algorithms. Some examples of the kinds of datasets are: Chess and Connect-4: A set of winning moves for the games chess and Connect-4 Mushroom: Mushroom classification data Webdocs: Terms used in 1.7 million web documents (and many others) These datasets are real data from real problems. The reason data mining programs exist is because of datasets like these, and it is important that mining algorithms work well with this kind of data. However, the F IMI repository does not contain every possible real world dataset. As such, finding a spectrum of datasets to fully test and understand the limits, scalability, and capabilities of a mining program is not possible. As a result, FIMI has a method of generating synthetic datasets. Synthetic datasets are artificial datasets that are guaranteed to contain frequent patterns. Synthetic dataset generators are programs designed to generate synthetic data and were developed to give the researcher more control of the characteristics of a dataset. As a result, a spectrum of datasets can be generated that allows the different components of an algorithm to be tested. These datasets not only give the developer the control to incrementally vary the attributes of a dataset, but they are also guaranteed to contain frequent patternss. One such generator was developed by IBM and creates data using the following process: Procedure Generate_Synthetic_Data 1) The data generator takes many input parameters including: a) I: The number of items used by the artificial dataset b) T: The average length of records c) D: The number of records (and several other parameters) 2) A list of random patterns are generated9 3) Records are generated as follows: a) The size of the record is randomly chosen to be close to the desired length. b) A pattern is selected. c) One or more items are randomly chosen from the selected pattern. d) steps b & c are repeated until the record is the chosen length. e) Steps a-d are repeated until the desired number of records have been created. Usually, when a synthetic dataset is generated, it is given a name similar to TnyDz, where x is the average length of transactions, y is the number of items in the dataset (in 10005) and z is the number of records in the database. So, a dataset called T10110D100K would contain 100,000 records averaging 10 items in length and using about 10,000 items. 
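The following C++ sketch follows the outline of Procedure Generate_Synthetic_Data above. It is a deliberate simplification for illustration: the actual IBM generator draws pattern lengths and pattern weights from distributions and corrupts patterns as it goes, whereas here the pattern length is fixed and every pattern is equally likely. The parameter names mirror the T, I, and D conventions just described.

#include <algorithm>
#include <iostream>
#include <random>
#include <set>
#include <vector>

// Simplified generator in the spirit of the TxIyDz datasets:
// T = average record length, I = number of items, D = number of records.
int main() {
    const int numItems    = 1000;  // I: items are the integers 0..numItems-1
    const int avgRecLen   = 10;    // T: average record length
    const int numRecords  = 1000;  // D: number of records
    const int numPatterns = 50;    // size of the random pattern pool
    const int patternLen  = 4;     // fixed pattern length, for simplicity

    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pickItem(0, numItems - 1);
    std::uniform_int_distribution<int> pickPattern(0, numPatterns - 1);
    std::poisson_distribution<int> pickLen(avgRecLen);

    // Step 2: generate a pool of random patterns.
    std::vector<std::vector<int>> patterns(numPatterns);
    for (auto& p : patterns)
        for (int i = 0; i < patternLen; ++i) p.push_back(pickItem(rng));

    // Step 3: build each record from items drawn out of randomly chosen patterns.
    for (int r = 0; r < numRecords; ++r) {
        int target = std::max(1, pickLen(rng));             // 3a: length near the average
        std::set<int> record;
        while ((int)record.size() < target) {
            const auto& p = patterns[pickPattern(rng)];      // 3b: select a pattern
            std::uniform_int_distribution<int> pickFrom(0, (int)p.size() - 1);
            record.insert(p[pickFrom(rng)]);                 // 3c: copy one of its items
        }
        for (int item : record) std::cout << item << ' ';    // 3d-3e: emit the record
        std::cout << '\n';
    }
}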
Since each record was created from patterns, this forces the existence of frequent patterns within the dataset. Because of the ability to tightly control dataset attributes, synthetic datasets will be used later on in this work to evaluate the performance of algorithms. 8 That is, data that has combinations of items that occur together far more often than can be explained by random chance. 9 Since the items of records only come from pre-generated patterns, the actual number of items used in the synthetic dataset may be less than the desired number. 2.3 Trie-Tree data structure Several effective algorithms in frequent pattern mining have adopted deviations on a structure called a Trie-Tree. Wikipedia defines a Trie as: “A trio, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of any one node have a common prefix of the string associated with that node, and the root is associated with the empty string.” For example, given the strings “ABC”, “ABD”, and “BCD”, the associated Trie-Tree would look similar to Figure 3.1. In this example, “ABC” and “ABD” have a common prefix of “AB”, so both strings share the use of the nodes “A:2” and “8:2”. The number following the “:” is a common convention used by researchers to indicate the number of strings using the node. That is, the “1” of node “B: 1” indicates only one string is using the node. As such, the original records of the dataset are captured within the Trie-Tree. For example, following up from the leaf node “C: 1” of Figure 1, it can be seen the original dataset had a record “ABC”. ® £9 03D 69 (ED @®® Figure 1: Trie-Tree CHAPTER 3: Previous Work 3.1 SimpleMining: Is Not Easy To Do Mining frequent patterns is a complex problem that is a very CPU and RAM intensive application. Given n unique items within a database, the SimpleMining algorithm below has a time complexity and space complexity of 0( 2n — 1 ). As such, even a small database with only 85 unique items'0 would require 38,685,626,227,668,133,590,597,63l patterns to be generated. Even though computers are constantly getting faster and bigger, this brute force method of mining is just not practical. As a result, the first practical algorithm for frequent pattern mining needed to introduce a mechanism for pruning pattern combinations that were not useful. Procedure SimpleMining 1) Scan through the dataset and store each unique item. 2) Generate and store all possible patterns (combinations of items). 3) Scan the dataset again. a) Rearrange the unique items of each record into all possible combinations. b) For each combination, increment the counter of its associated pattern. 4) Scan through the patterns and print any that are above a given support. 3.2 Largeltemsets and Apriori: The First Practical Algorithms In 1993, Agrawal (et. all.) wrote just such an algorithm. The Largeltemsetsll algorithm (see below) was introduced in a paper called “Mining Association Rules Between Sets of Items in Large Databases”. The primary method it used for pruning unneeded patterns involved the Monotonicity Principle. Basically, the Monotonicity Principle says: '0 This would be the approximate number of items from the menu of your local fast food restaurant. 
” This algorithm is sometimes called “AIS” after the authors of the paper (ie: “Agrawal lmielinski Swami”) 10 If pattern B ;.D, then the support(B) 2 support(D) Therefore, if the pattern B is infrequent, then all its supersets are also infrequent. As an example, if the pattern B = “AB”, the pattern D = “ABCD”, and the support(B)=10%, then the Monotonicity Principle says support(D) S 10%. Largeltemsets combined the Monotonicity Principle with the Mutual Independence method of estimating frequencies to greatly reduce the number of patterns that needed to be generated (see Section 5 in Chapter 5 for details on Mutual Independence). However, introducing “estimating” also introduced error. As a result, the Largeltemsets algorithm may “estimate” that a pattern is not frequent when it actually is. As a result, several passes through the database may be needed to eliminate such “estimating errors”. Procedure Largeltemsets 1) Scan through the dataset and count the frequency of each unique item. 2) Based on what is known, “estimate” which patterns are frequent, or might be almost frequent. Call this set the “candidate itemsets” 3) Scan the dataset again. a) Rearrange the unique items of each record into all possible combinations b) For each combination that is in the list of “candidate itemsets”, increment the counter of its associated pattern. 4) If any pattern expected to be infrequent turns out to be frequent, go back to step 2. Although Largeltemsets was a practical algorithm for mining, it still produced a large number of “candidate itemsets”, however, some simple observations helped to suggest an improved algorithm. One observation was that most patterns were short (only three or four items), and the Largeltemsets algorithm usually took three or four scans through the database. As such, an algorithm that “grows” candidate itemsets might be more efficient. ll That is, an algorithm would be more efficient if it generates “length 2” candidate itemsets before the second scan, “length 3” candidate itemsets before the third scan, etc. Another simple observation that helped is that most of the items within a database were not frequent. So, it was possible to significantly reduce the size of the database records by filtering out infrequent items. As such, encoding of the data could usually reduce the size of the database to a size that could fit in RAM. This improved algorithm was introduced by Agrawal only one year after the Largeltemsets algorithm. It was called Apriori (see below), and it turned out to be significantly faster than Largeltemsets for most datasets. To this day, algorithms based upon the Apriori algorithm are commonly used for frequent pattern mining. 
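Before the full procedure is given, the C++ sketch below isolates its central step, the generation and pruning of candidate itemsets (step 4 of the procedure that follows). It uses the standard prefix-join variant, in which two frequent k-itemsets are combined only when they share their first k-1 items; this is just one of the many join mechanisms in use, and the container choices are ours. Run on the length-3 itemsets of the worked example later in this section, it produces exactly the single surviving candidate {1 2 3 4}.

#include <algorithm>
#include <iostream>
#include <set>
#include <vector>

using Itemset = std::vector<int>;               // items kept in sorted order

// Two frequent k-itemsets are joinable when they share their first k-1 items.
static bool joinable(const Itemset& a, const Itemset& b) {
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i + 1 < a.size(); ++i)
        if (a[i] != b[i]) return false;
    return a.back() < b.back();
}

// Generate (k+1)-candidates from the frequent k-itemsets Lk, pruning any
// candidate with an infrequent k-subset (the Monotonicity Principle).
std::vector<Itemset> generateCandidates(const std::vector<Itemset>& Lk) {
    std::set<Itemset> frequent(Lk.begin(), Lk.end());
    std::vector<Itemset> candidates;
    for (size_t i = 0; i < Lk.size(); ++i)
        for (size_t j = i + 1; j < Lk.size(); ++j) {
            if (!joinable(Lk[i], Lk[j])) continue;
            Itemset c = Lk[i];
            c.push_back(Lk[j].back());          // the joined (k+1)-itemset
            bool allSubsetsFrequent = true;
            for (size_t drop = 0; drop < c.size(); ++drop) {
                Itemset sub = c;
                sub.erase(sub.begin() + drop);  // every k-subset must itself be frequent
                if (!frequent.count(sub)) { allSubsetsFrequent = false; break; }
            }
            if (allSubsetsFrequent) candidates.push_back(c);
        }
    return candidates;
}

int main() {
    // The frequent length-3 itemsets from the worked example in this section.
    std::vector<Itemset> L3 = { {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, {3,5,6}, {4,5,6} };
    for (const Itemset& c : generateCandidates(L3)) {       // prints: 1 2 3 4
        for (int item : c) std::cout << item << ' ';
        std::cout << '\n';
    }
}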
The algorithm works as follows: Procedure Apriori 1) Scan through the dataset and count the frequency of each unique item 2) Rescan the dataset and filter out infrequent items from each record and encode the data to create a “compressed dataset”12 3) Initialize K=l 4) Generate all “candidate itemsets” of length K+l a) Generate length K+1 patterns based upon the “length K candidate itemsets” b) For each new pattern, filter out patterns that can not be frequent 5) Scan the compressw dataset a) Rearrange the items of each record into length K+l combinations b) For each combination that is in the list of “length K+1 candidate itemsets”, increment the counter of its associated pattern 6) If any patterns in “length K+1 candidate itemsets” is frequent, Then increment K and repeat steps 4-6 '2 Technically, the first “encoding and compression” of Apriori was introduced in the same paper as Apriori, and was called AprioriTid. However, since then, many other mechanisms have been introduced. 12 The compression of the dataset (step 2) has been implemented in many ways, but there are two common features that are frequently used. One is that duplicate items are removed from each record. A more significant feature is that items are encoded into numbers. That is, each frequent item is assigned a number, and the numbers are stored in 9 the “compressed dataset”. For example, the record { “beer”, “diaper’ , “Purple People Eater”, “diaper” } could be encoded as { 1 2 } if “diaper”=1, “beer”=2, and “Purple People Eater” is not a frequent item. These two mechanisms are so common that a lot datasets targeted for mining will have these compression mechanisms applied as a pre- processing step to mining. Step 4 of the Apriori algorithml3 filters patterns using the Monotonicity Principle. Since one consequence of the Monotonicity Principle is that each frequent “length K pattern” could have supersets of “length K+l” that are frequent, step 4a generates its “length K+l candidate itemsets” from its “length K candidate itemsets”. Another consequence of the Monotonicity Principle is that a “length K+l pattern” can only be frequent if all subsets of “length K patterns” are frequent. An example of Apriori filtering is: Given: K=3 Thepattems L3 ={ {12 3}, {1 24}, {1 3 4}, {2 3 4}, {3 5 6}, {45 6} } are frequent Then: (step 4a)14 {1 2 3} and {l 2 4} differ by one item, so {1 2 3 4} might be frequent {l 2 4} and {1 3 4} differ by one item, also generates {l 2 3 4} {3 5 6} and {4 5 6} differ by one item, so {3 4 5 6} might be frequent '3 This application of the Monotinicity Principle is sometimes referred to as the Apriori principle. " NOTE: There are many mechanisms to make step 4a more efficient; however, only a simple method is demonstrated here. 13 Filtering: (step 4b) For the pattern {1 2 3 4} to be frequent, all “length 3” subsets must be frequent. Since { {l 2 3}, {124}, {1 3 4}, {2 3 4} }isasubsetofL3,{12 3 4}e L4 For the pattern { 3 4 5 6 } to be frequent, all “length 3” subsets must be frequent. Sincethesubset { 3 45} as L3,thepattem { 3 456} 92 L4 The final resulting “candidate length 4 itemsets” is L4 = { {1 2 3 4} } Scanning will need to be done to validate if any patterns in L4 are actually frequent. 3.3 Trie-Tree Approaches Apriori works very well when there are not too many items. However, if there are a lot of frequent items in the dataset, there is a drastic increase in the size of the “candidate itemsets”. This has a significant impact on both the size and processing time of the algorithm. 
One common circumstance in which this occurs is when the support threshold is relatively small. When there are a lot of frequent items, there are Trie-Tree approaches that can help. One of the first such approaches is Tree Projection. Tree Projection is a Trie-Tree frequent pattern mining algorithm that stores compressed and filtered copies of the dataset in the leaf nodes of the tree. It is a top down mining algorithm that grows the depth of the tree as it discovers longer frequent patterns. A key aspect of the algorithm is that it reorders the items within a record. For example, Tree Projection reorders the items within the records “ACB” and “ABCD” to make it easier to see the pattern “ABC”. Tree Projection works as follows: 14 Procedure TreeProjection 1) Scan through the dataset and count the frequency of each item 2) Create a tree structure similar to Figure 2 for the frequent items in the dataset 3) Rescan the dataset. For each record scanned, do the following: Add_Records_To Nodes: a) Filter infrequent items from the record b) Order the items in the record c) Based on the first item in the record, add the record to a node of the tree d) Remove the first item from the record e) Repeat steps C & D until there are no more items in the record 4) For each set of records in a leaf node of the tree a) Count the frequency of items b) Add a branch to the leaf node for each item that is frequent in the node c) For each record of the node, use the “Add_Records_To_Nodes” procedure 5) Repeat step 4 until there are no more frequent items in any of the leaf nodes GE’ 0 o 0 (£300 ooaaoooo CDEF DF DEF E EF F DEF EF F Figure 2: Tree Projection Although Tree Projection was a big step forward in “growing” frequent patterns without generating “candidate itemsets”, the algorithm had significant performance degradation with long records and dense datasets. A few observations in the mining process helps to suggest areas of improvement. First, a single record can contribute to many different patterns, so a record might be copied to many different leaf nodes. This has a significant impact for the initial Tree Projection Trie because the tree branches are short and the leaf 15 nodes contain longer records. An algorithm that uses pointers can significantly reduce this overhead. Another observation is that the number of frequent items within a branch of a Trie-Tree is drastically smaller then the number of frequent items within the entire Trie-Tree. As such, copying a branch to a new Trie-Tree with infrequent items removed has a small overhead and a large benefit. An improved tree algorithm was developed by Han (et. all.) the year after Tree Projection was introduced. It was called F P-Growth (see below), and it worked by combining a Trie-Tree of records with an array of linked lists. Contrary to Tree Projection, FP- Growth starts with a complete Trie-Tree structure from the strings in the dataset. However, each node of the Trie-Tree has an additional pointer. The additional pointer is used to link records together. For example, all nodes of the Trie-Tree containing “A” would have an additional pointer linking them together, as such, the pointers create a linked list of records containing the letter “A”. Similar lists are created for items “B”, “C”, etc. The headers for all these lists are stored in an array. Combined, this data structure is called an “FP-Tree”. As the algorithm progresses, it creates and destroys additional F P-Trees from subsets of the data. 
The algorithm works as follows: Procedure F P-Growth 1) IF this is the first scan of the dataset, THEN Scan through the dataset and count the fiequency of each item 2) Create a linked list for each frequent item The array of linked lists will be called the “header table” 3) Rescan the dataset. For each record scanned, do the following: Procedure Add_Records_To_Tree: a) Filter infrequent items from the record b) Order the items in the record c) Add the record to the tree (See figure 3.2 for an example tree) 16 4) Add each node of the tree to one of the link list created in step 2 5) Create a subset of records by finding all records that contain the last item (ie: “F” in Figure 3). The last “link list” of the header table (the one for “F”) points to the leaf nodes of these records 6) Scan up the subset of records to count the frequency of items within the subset Note: The list of frequent items in this subset represent patterns that begin with “F*” 7) Recurs the Procedure FP-Growth using the subset of records Note: The F P-Tree constructed from the filtered subset of records is called a conditional fi7-tree 8) Repeat step 5-7 for each item of the header table (ie: E, D, C, B, A) @ ,,,,,,,, A «at """ ,,:::®(”/ a» (""”:,’.’.:®( ----- (EDA/Ix, on ’ —————— an ------ «st-5-x; ®(@<--- «ab/‘6’“ ammo—(Ll, Figure 3: FP-Tree In some cases, the most significant benefit of the F P-Growth approach is that longer patterns tend to produce conditional fp-trees with only one branch. For example, the removal of infrequent items might produce the three strings “BDF”, “BDF”, and “BDF”. This would create a conditional Trie-Tree with one branch. As such, a very efficient method of producing all combinations of “BDF” can be used to generate patterns. Typically, the FP-Growth algorithm will re-order the strings of a subset fi'om “most- frequent” to “least-frequent” to help promote this effect. 17 The algorithms presented here represent only a fraction of the many algorithms available. Frequent pattern mining is a very active area of research, and there are many, many more, certainly more than can be presented here. A large percentage of the algorithms are based upon the two approaches “Apriori” and “F P-Growth”. An even larger percentage compare themselves against these two approaches. Many approaches are designed for specialized datasets such as continuous data streams, very dense datasets, etc. There is also a lot of research in special kinds of patterns such as closed itemsets and constraint based rules which are designed to reduce the number of frequent patterns returned by a fiequent pattern mining program. However, the remainder of the paper will work on new approaches to optimize frequent pattern mining as well as exploring a new domain for the task. 18 PART TWO OPTIMIZING MINING 19 CHAPTER 4: Using Compressed Tries in Frequent Pattern Mining 4.1 Introduction The uses of frequent pattern mining is expanding, and with it, the size of datasets that are being analyzed are also increasing. Since a significant portion of the most commonly used frequent pattern mining algorithms use deviations of the Trie-Tree data structure, most programs need significantly more RAM than the original dataset. It is now to the point where even relatively small datasets can exceed the capacity of existing equipment. When this happens, the processing time it takes to mine data increases explosively. 
Although there are methods of sub-dividing the dataset to prevent this from happening (see “An efficient algorithm for mining association rules in large databases”), these methods require analyzing the data multiple times and thus require a great deal of time. This section introduces a simple approach for significantly reducing the memory requirements during mining. This approach works well for a broad range of data and a large genre of frequent pattern mining algorithms. It literally reduces the memory requirements by an order of magnitude, and it does so with no discemable increase on processing time. The approach works by modifying the Trie-Tree data structure. The Trie-Tree data structure can be used to create very time efficient mining algorithms. Because of this, there are even implementations of the Apriori algorithm that use Trie-Tree like structures to store data. However, like most trees, Trie-Trees require a great deal of overhead. One of the fastest Frequent pattern mining algorithms, F P-Growth, adds additional data structure complexity (ie: doubly linked nodes, usage counts, etc.) that increase this 20 requirement even further. For small datasets, this does not have a significant impact, however, for larger datasets, space requirements could be many times that of the original dataset. Statistics can reveal quantitative results to seemingly random events. Analysis of the seemingly random distribution of data within a Trie-Tree used for frequent pattern mining shows a large number of tails. A tail is a branch of a Trie-Tree that is only used by one record, that is, each node has no siblings and all but the leaf node has one child. Knowing this pattern is prevalent, a compressed trie” is an obvious solution to significantly reducing space complexity. However, compressed tries do not lend them selves to the substring searches required for frequent pattern mining. This chapter introduces an algorithm to solve this challenge that not only has negligible overhead, but is occasionally slightly faster. The organization of the rest of the paper is as follows: Sections 4.2-4.3 discusses Trie- Trees in frequent pattern mining. Section 4.4 and 4.5 gives a mathematical model for why tails exist. Next, section 4.6 presents an algorithm for efficiently storing and accessing data within tails, and section 4.7 gives experimental results. The last section presents conclusions and introduces future work. 4.2 Compressing Trie-Tree Tails There are many deviations and customizations for building a Trie-Tree for frequent pattern mining. For efficiency, most mining algorithms using Trie-Trees will sort the items within a record. This usually increases the number of records with common ‘5 Compressed tries is a method of storing a Trie-Tree that takes up less space. See section 4.4 for details. 21 prefixes, so it reduces the number of nodes within the tree, and lends itself to a methodical method of getting strings into and out of the tree. For simplicity, this chapter will sort the items within a record in a lexographical order (see Table 1). Figure 4 is the Trie-Tree created from the records in Table 1. Even with just seven records, you can start to see some usage patterns. One obvious pattern is that nodes near the root of the tree have higher usage counts than nodes near the leaf nodes of the tree. A less obvious pattern is that it is not just leaf nodes that have a usage count of 1. Indeed, of the 20 nodes in the tree, 15 of them have a usage count of 1. 
Further, of the 7 records in the Tree, all of them have tails. For this paper, a tail is: A tail is a branch of a Trie-Tree that is only used by one record, that is, each node has no siblings and all but the leaf node has one child. Record?“ RECORD ACDF ACEF ABDEF ABEF ABCDE ABD BCDF \IQUIAUN" Table 1: Sorted Items Figure 4: Sorted Items in a Trie By this definition, node “B: 1 ” is not part of a tail because it has sibling “Az6”. However, the nodes below “B:1” do create the tail “C:1-D:l-F :1”. 22 Trie-trees with a large number of tails compress very well. Trie-tree compression is a process of reducing the number of nodes in the tree. Although there are several methods to do compression”, for this application, level compression alone will be very effective in reducing the number of nodes in the Trie-tree. In this case, level compression is performed by taking all of the child nodes of a tail and storing them as one string. For example, the tail “B:l-C:l-D:1-F :1” of Figure 14 can be stored as the single node B:l with the string CDF attached. Note that such nodes will always have the following three characteristics: (1) they will always be leaf nodes; (2) they will not have siblings; and (3) they will only be used by one record. As such, compressing the Trie-tree of Figure 14 will produce a Compressed Trie-Tree that looks like Figure 5. Figure 5: Compressed Trie 4.3 Mathematical Models Tails are not just an anomaly, but a statistical probability. That is, the patterns seen in Figure 4 are common, and they are common for a reason. Section 4.3 and 4.4 present that reason using statistical analysis. However, before that can be done, you first need to see how the principles of Mutual Independence can be applied to Trie-Trees. '6 Patricia Trees are a method of doing “path compression”, but, that method will not be used here. 23 4.3.1 Mutual Independence Back in 1993, Agrawal et—al presented frequent pattern mining and some key mathematics used in the algorithm. The foundation behind the mathematics was the prediction of frequent patterns. That is, given the frequency of short patterns, it predicted the expected frequency of longer patterns. It did so by assuming a statistical property called “Mutual Independence”. Mutual Independence basically says that if two (or more) events are independent of each other, then the probability that events will occur together can be calculated. That is, if event A and event B occur with a frequency of f(A) and f(B), then the frequency of both A and B occurring at the same time is f(A+B)=f(A)f(B). This implies that if event A happens frequently and event B happens frequently, then A+B is likely to occur often. Agrawal at-al presented mutual independence as follows: (see Figure 6) IDB Figure 6: Mutual Independence Given: X is a pattern x. xz X3 xn Y is apattem such thatX n Y = {} Then: Sxy is the frequency of the pattern X+Y The formula for Sxy is: Fl) Sxy = f(y.) f(yz) . .. f(yk) (x—c) / dbsize F2) f(yi) = support of item yi within dataset DB Where y; 6 Y 24 F3) x = the # of occurrences of X within DB For this paper, [X] is a set of records containing X IfX = {}, then x = |DB| = dbsize = R F4) c= | C | where C is a subset of [X] — [X+Y] Note: (x-c)= Size of remaining portion of [X] This is treating the occurrence of items within a string as events. The dataset DB represents R independent trials where each string is a combination of simultaneous events. The set [X] represents a set of strings that contain the simultaneous events X. 
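Before turning to that analysis, it may help to see the compression of Section 4.2 in code. The C++ sketch below is a minimal version of the structure behind Figure 5 and assumes the items of every record are already sorted: each node stores a usage count and its children, and as soon as an inserted record leaves the shared part of the trie, its remaining items are kept as a single tail string rather than a chain of single-use nodes. When a later record reaches a node that carries a tail, the first tail item is promoted back to an ordinary node so the two records can diverge. Inserting the seven records of Table 1 reproduces the compressed trie of Figure 5. This is only the skeleton of the structure; the extra bookkeeping FP-Growth needs (item links and the header table) is the subject of Section 4.4.

#include <map>
#include <memory>
#include <string>

// One node of a compressed Trie-Tree. A branch used by only one record is not
// stored as a chain of single-use nodes; the remaining items of that record are
// kept in the 'tail' string instead.
struct Node {
    int count = 0;                              // number of records using this node
    std::string tail;                           // compressed single-record suffix ("" if none)
    std::map<char, std::unique_ptr<Node>> children;
};

// Insert one record whose items are already sorted (as in Table 1).
void insert(Node* node, const std::string& items, size_t pos = 0) {
    if (pos == items.size()) return;            // record ends here; an existing tail stays put
    // If this node carries a tail, promote the tail's first item to a real child
    // so the incoming record can branch off it (de-compression on demand).
    if (!node->tail.empty()) {
        auto promoted = std::make_unique<Node>();
        promoted->count = 1;                    // a tail is used by exactly one record
        promoted->tail = node->tail.substr(1);
        node->children[node->tail[0]] = std::move(promoted);
        node->tail.clear();
    }
    char item = items[pos];
    auto it = node->children.find(item);
    if (it != node->children.end()) {           // shared prefix: descend
        it->second->count++;
        insert(it->second.get(), items, pos + 1);
    } else {                                    // divergence: one new node, the rest as a tail
        auto child = std::make_unique<Node>();
        child->count = 1;
        child->tail = items.substr(pos + 1);
        node->children[item] = std::move(child);
    }
}

int main() {
    Node root;                                  // the root's own count and tail are never used
    for (const char* rec : {"ACDF", "ACEF", "ABDEF", "ABEF", "ABCDE", "ABD", "BCDF"})
        insert(&root, rec);
    // Result: A:6 and B:1+"CDF" under the root; C:2 and B:4 under A:6; and
    // tails such as "EF" on D:2 and "DE" on C:1 below B:4, matching Figure 5.
}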
[X] is subdivided into mutually exclusive subsets. Given the set [X+Y] is such a subset of [X], and knowing the support of X, Sxy predicts the support of the events X+Y. As each subset is processed, it is added to the set C (ie: C’ = C + [X+Y]). 4.3.2 Mutual Independence for Trie-Trees This same analogy can be applied to a Trie-Tree. Given a node Xn in a Trie-Tree (See Figure 7), then the path from the root to Xn represents the pattern X = x; X; x“, and the strings that use node Xn would represent the set [X]. The child nodes of Xn represent mutually exclusive subsets of. [X]. Given the child nodes of Xn are Y1 Y2 Yk, then Y can be represented as the single item yj, and C would be represented as the subset of strings that use nodes Y. YH As such, Sxy = f(Yj) (x—c)/dbsize, where x = |[X]| = the number of strings using x, and c = [CI = the number of strings using nodes Y1 through it... 25 Figure 7: Mutual Independence for Trie-Trees Figure 4 can be used to demonstrate how the analogy can be applied. Let node A:6 represent the pattern X, and item “B” be pattern Y to add to X. As a result, the pattern X+Y is represented by node 3:4. The Sxy predicts how many records should use node B:4. So: X= “A” Y= “B” dbsize = R = 7 records f(Y) = f(“B”) / R = 5 / 7 = 0.71 x=6 c=0 Sxy=f(Y) (x—c)/R =0.71(6—0)/7 = 0.61 z f(X+Y) So, the predicted support of 3:4 is Sxy=0.61, and the actual support is 0.57 ( = f(“AB”) = 4 / 7 ). As in 1993, a model of Mutual Independence produces a fairly good estimate. 26 4.3.3 The Prediction of Trie-Tree Tails So far, what has been presented is only a different perspective of Agrawal’s formulas. However, the formula for Sxy can be generalized for all children of node X. For Trie- Trees, it is given Y will always represent one item from the set it“, i.+2, ..., im. As such, the support of the ik child of X will be given by: F5) va = f(ik) (1 - Zj=t+l k-l f(ij) ) X / R Further, the support for the ip child of Y (a grand child of X) can be calculated as follows: Given: Z = ip y = R Sxy = the size of[X + Y] Then: F6) SXYZ=f(ip)(1'Zj=k+l...p-l f(ij))y/R F7) SXYZ=f(ip)(1‘Zj=k+l...p-l f(ij)) f(ik)(1-Zj=t+l...k-l f(ij) ) X/R F8) sxyzsfup) f(ik)x/R For the root, x=R, so the support of any node path X = x. xn will be bounded by: F9) Sx S H1=1...n f( Xj) As a result, the number of records that are expected to use a node X can be predicted with R Sx. If this is less than or equal to 1, then the node is predicted to be part of a tail. Assuming all items have equal support, the depth of non-tail nodes can be estimated with: Given: f(x1)= f( X2): = f( X“) S l /n (where n = number ofitems in the dataset) d is the depth of a node used by multiple records 27 R is the number of records in the dataset Then: F10) 1 II<> no 9096 _ ._ ___. _t _ _______ 8096 t ____u____ ._..-. . ,_______ , t_____~ - 60% .___.___.____ . . . ._ __.. _._. — _. ————— . 40% We” — —. V ...... . . ._— % Rece with frequent item 0%.- . L, La . - , . . ,_. _ .,L_L, 30 28 26 24 22 20 18 16 14 12 1‘ support Figure 10: Most Records Have Frequent Items % of Nodes in Tails 100.0% I - - L , L ,, g 7 90.0% ~ . ., - _ , . . - -- L- , i 80.0% . 7 ,, 7 , W - 7 . -- -- h- L, L _k 60.0% , - , , --— - , , ~ a ,, 50.0% . - -, u - - - »- a- - , v- 40.0% «» — , 7 . . . . - — -_ 30.0% ,. -» , -_---- ~ ~ .- . ~ ~ g... ; 96 of nodes In tells Figure 11: Most Nodes are in Tails 3] 70% - - .__ _ _ _ 50% ._ _ _ . .___ _ ..... __... 20% — , i+acuenag 10% ”i - fl , —--—Webdocsji 70.0% . ”v ,L, 7. --- .--,--- - 7 A _ -L_--_ , 20.0%.--._ .. 
- .- L - , , - , , f—O-Aocidengig 100%» _ ._. _. ._ .. _ _ _ _ _ ,+Webdocsir 0.0% TIT—“'_ -I‘” T‘ _ _Y" ' _ "T"' '_ Y_ _' —'I "— — I_' _ ‘1_" _’ I ” ‘ TT—_ "__T—"" ._. ' ' ._.... _____. 4.3.5 Exploiting Tails for Frequent Pattern Mining Now that we have explained and justified that tails exist, our challenge is to exploit this characteristic in frequent pattern mining. Since several algorithms of mining use Tri- Tree type structures (see Section 3.3 for examples), they can be made to be more space efficient by using the compressed Trie-Tree structure presented in Section 4.2. However, some Tri-Tree based approaches add additional information and pointers to each node. Most of this can be reduced or eliminated with the realization that tails have no siblings and at most one child. Much of the rest can be eliminated by understanding the methodical nature of mining. The next section will explore how this can be done for the F P-Growth algorithm. 4.4 Implementing Tails The most efficient method of storing a tail within a Trie-Tree is to store only the items of the tail. This can be easily done, but then the challenge for frequent pattern mining is identifying all strings containing a particular item without scanning the entire tree. This section will demonstrate how this can be done without such a scan, and without having a separate pointer for each item of the string. To understand better, it helps to explain how this has been done in the past. 4.4.1 FP-Growth’s Method of DB Scans One of the core processes of F P-Growth is to scan all of the branches of its tree that contain a given item. It can quickly find all of these branches because it links the items of its branches when it builds its tree (see figure 12). With all of the links in place, FP- Growth can find all branches containing a particular item. For example, when FP- Growth wants to find all strings that contain “D”, it starts with the D node for the string 32 “BCDF”. This node points to the D node for the string “ACDF”. The D node of “ACDF” points to the D node of “ABDEF”. The last D node pointed to is for “ABCDE”. By doing so, FP-Growth can methodically search for patterns that contain “D”. F P- Growth will repeat this process for all items. A through F. ® ’’’’’’’’’ A M“"’”,,:: V” 439 (fl:< ----- (9‘ an I ------ an ------ @5ij assume-- —-M"”:"” WWW-(Ll Figure 12: FP-Tree 4.4.2 Modify Scans to Scan Tails The F P-Growth algorithm can be modified to scan tails by observing that the algorithm is methodical in its scans. That is, it will look at all strings containing “A” before it looks at strings containing “8””. When it starts looking at the strings for “B”, it will no longer need to link “A” strings. As a result, each tail needs to only have one link. For example, Figure 13 shows the initial setup of the link lists. Notice that this is a kind of compressed Trie-Tree (ie: the compressed tail under 3:] is “CDF”). Notice also that the “CDF” tail is linked to the list of “C” strings, but not to the list of “D” or “F” strings. 33 As part of scanning the list of “C” strings, the tail “CDF” will be scanned. After the algorithm is done scanning the tail “CDF”, “CDF” is linked to the list of “D” strings (see Figure 14). As such, the tail only needs one pointer for all of its items. This algorithm is outlined in Figure 15 as well as the algorithm for adding strings to the tree. can ...... A an" ———— «m v” (we: ,--—®< ------ g" ,. .. ” F/x” IE G! ------ on . 
-------- , F I ”’ ’ K ,’ ’ I” D 52.13“"- -- f -z’ E F ’,,r F (._- F(—- F V Figure 13: F P-Tail Before Mining GED ........ A __________ ,-—' B m“ -439" ”ur’“ C (ED ("" ..... <2) C __..—D ‘,-—" I v’D" ‘ T . '1’, F ’E (on -------- on F Lll ” ’,’ ””’ D E(—-@("“ s E F F (n. F(-- F V Figure 14: F P-Tail After Mining “A”, “B”, and “C” NOTE: The tail below “B21” is no longer linked to “C” '9 The original F P-Growth algorithm was implemented with a bottom-up approach, however, there have been top—down approaches and the procedure presented here can be adapted for either top-down or bottom-up. 34 Procedure Add_String_To_Tree(String S, Tree T) Node = root of T For i = each item of S lfNode has a tail make lst item of tail a node end if lf Node has children Merge i as a child of Node Node = i child ofNode Else Tail = remainder of string S attach Tail to Node exit For loop end nexti Procedure Process_Tree(Tree T) For i = each item of I For 5 = each object of LinkListi Process_object(s) If (s is a tail) j= the item after i in string 5 add 5 to LinkListj end if next 5 nexti Figure 15: Tail Procedures 4.5 Experimental Evaluation Since tails compose an increasingly greater percentage of a tree as the strings get longer, the advantage of compressing the tails becomes increasingly more significant. In the following experiments, the dense “accidents” and sparse “webdocs” datasets used from section 4.4 were analyzed using two versions of FP-Growth. The F P-Growth algorithm was used to demonstrate results because of its prevalence, however, other Trie-Tree based approaches should work equally well. The comparison of the results will help to demonstrate the algorithm is applicable for a wide range of datasets. 35 4.5.] Experimental Setup The experiments on the performance comparison between FP-tail and F P-growth were conducted on an AMD Atholon 2800+ PC with 1.2 GB RAM, running RedHat Linux 8.0. The programs were written in C++ and compiled with gcc using the "-03" compiler optimization. 4.5.2 Experimental Results Figure 16 is a graph of memory usage for analyzing Accidents and Webdocs data using the FP-Growth program and the FP-Tail program. The bottom two lines (AchP and AccTail) represent the memory usage for analyzing the 35 MB Accidents data file, and the top two lines represent the memory usage for analyzing the 1.5 GB of WebDocs data - (DocFP and DocTail). As the “% Support” goes down, memory usage goes up. This is expected since items with smaller support will need to be included in the analysis. What might be less intuitive is the rate of growth for RAM as support decreases. Since Figure 25 is LoglO, it can be seen decreasing support causes a geometric grth in RAM. Although this growth is bounded, it only reaches this bound when “% Support” reaches 0%. As a result, efficiently handling tails is not just a linear savings in space, but a geometric one. Since the nature of this geometric growth is beyond the scope if this chapter, Appendix A will expound upon its implications. 36 'r — Ram: lib-Growth vs FP-Tail , 1m -- A- -—~ "w ~~~ «a- 10000 i 1 +AchP » y i A 1 wooo—z i 8 3 L m J 100 2 f i i . ~ - 1o 30 25 20 15 1o 5 o l % Support Figure 16: RAM for FP-Growth vs FP-Tail Achieving this geometric savings in space can be achieved with no loss in performance. Figure 17 shows the amount of time it took to process each data file. Notice that the processing times for the accident data for both FP-Growth and F P-Tail (the bottom lines) are overlapping. 
That is, the amount of time it takes to implement the tails algorithm for the accidents data is negligible. Indeed, it produces a small time savings because of the more efficient memory handling. The processing times for the analysis of the WebDocs data also overlap until the support reaches 12%. At that support, using the standard FP-Growth algorithm takes over 5 hours instead of only 240 seconds. The reason this happens is that the algorithm has exceeded physical memory by less than 5%. Since frequent pattern mining requires analysis of the entire dataset on each pass, exceeding physical RAM carries a very heavy price. This helps emphasize that saving memory can directly lead to a significant savings in time.

Figure 17: Time for FP-Growth vs FP-Tail (processing time vs. % support for AccFP, AccTail, DocFP, and DocTail)

4.6 Conclusions of Compressed Trie Mining

Using compressed Trie-Trees for frequent pattern mining has proven extremely effective in saving space. The space savings is not just linear, but geometric; as such, it has changed the order of complexity of the space requirements for mining. Further, this savings is not only effective for diverse datasets, but it becomes more effective as space demands increase. That is, the algorithm works best when it is needed the most, and it does not appear to have any degenerate case with a noticeable cost in space and/or time.

Using this approach allows frequent pattern mining algorithms to be used in at least two new ways. First, with reduced space requirements, it becomes practical to analyze larger datasets. Although time will increase the capacity of systems, it will also increase the size of datasets, and this is a very effective approach to help bridge that gap. For example, 1.5 GB is not very large, but it is by far the largest dataset offered by the FIMI repository, and it is larger than a standard PC can mine using conventional methods. The second way reduced memory can help is that it allows mining to enhance a system instead of overwhelming it. For example, pattern mining can be used to enhance hard drive usage, network security, or word processing preferences; and it can do so without having to size a PC to meet the high resource demands of frequent pattern mining. This point will be demonstrated in the next chapter with the development of an application for frequent pattern mining involving the enhancement of PC Internet security.

As a side note, this chapter also introduced several interesting observations from its experimental results. These observations can be seen in Figures 10, 11, and 16. Together, they help lead to a method of predicting the resources needed for mining. However, predicting resources is not the same as optimizing the use of resources. As such, this is out of the scope of this thesis, and the analysis of these observations has been added as Appendix A.

CHAPTER 5: Enhancing PC Internet Security with Pattern Mining

5.1 Introduction

The previous chapter introduced a mechanism for significantly reducing the memory resources required for frequent pattern mining. As mentioned, this can be used either to mine bigger datasets, or to mine the same size datasets in a smaller environment. For example, mining is not just practical on a personal computer; it is now no longer necessary to size your system to do some forms of mining.
That is, mining can be implemented to improve everyday tasks without having to buy extra RAM to do so. But what task can be improved by mining?

One of the contemporary problems plaguing personal computing today is Internet security. In this case, we are talking about the long-term leaking of information out of and/or into your computer. That is, network applications and applications enhanced with network connectivity are increasing the network traffic of personal computers to the point where it is getting more difficult to determine the network activity of your own system. Worms, viruses, spyware, and even helpful applications will regularly send undesirable information from your computer, or get unwanted data from the Internet. However, considering the volume of information involved, it would be impractical to monitor all of this traffic to determine what is, and is not, intended and necessary traffic. Frequent pattern mining can be used to analyze and classify this data down to a manageable size. An application that finds patterns in Internet traffic can be used to enhance the effectiveness of PC Internet security programs. This chapter introduces the development of a new application called NetAppMine that does just that.

To understand why most systems need an application like NetAppMine, it is important to realize that almost every home/office computer today is connected to the Internet at least intermittently. Frequently, applications on your system will automatically detect when you have connected to the Internet and silently transfer data. However, identifying the applications transferring this data can be challenging, and determining if the data transferred is good or bad is next to impossible, but NetAppMine can help.

There are several types of applications on your PC that might transfer bad data. For example, Adware will intermittently display advertisements on your screen [20]. Another more subtle example is Spyware. Spyware will discreetly collect information about you and your habits [21] for targeting purposes. Worse yet, Malware, Viruses, and Dialers can take control of your computer and do any number of malicious acts without your knowledge. Even purchased software will frequently have Spyware/Adware functions built in.

20 Frequently, Adware will display ads on how to spend money you don't have.
21 A common use of Spyware is to collect e-mail addresses so "spammers" can send you ads on how to enlarge or shrink something.

This does not mean all programs discreetly connecting to the Internet are bad; indeed, some are helpful. A program can remove bugs via auto-update. Well-targeted marketing can help you find products you're interested in. A company that knows more about you can serve you better. The trick is to identify and control the data you're sending and getting.

Several applications have been developed to help control the flow of Internet data from/to your PC. These include Pop-Up Blockers, Spyware/Virus Detection, and Internet security programs. The problem with these solutions is that they only identify known problems or have broad/stateless solutions. For instance, if an Internet security program lets a program access the Internet, the program can do anything it wants without further checks, including IP broadcasts in the middle of the night. One solution is to add fuzzy logic to Internet security programs. That is, look for patterns in the flow of information. Since frequent pattern mining is the process of looking for patterns in data, it is ideally suited for this task.
The scope of NetAppMine is to find such patterns between network applications and network data. This information can be examined or post-processed to help identify problems.

The organization of the rest of this chapter is as follows: Section 5.2 presents previous work in Internet security and describes the unique niche NetAppMine will fill. Section 5.3 explains the methods used to collect and pre-process the data to make mining as effective and efficient as possible. Next, Section 5.4 shows the results of analyzing the data from three environments to demonstrate the effectiveness of NetAppMine. This is followed by Section 5.5, which demonstrates the efficiency of NetAppMine. The last section presents conclusions and introduces future work.

5.2 Previous Work

Programs with dubious Internet communications have been around for many years. By the year 2000, Spyware was so prevalent that Gibson Research released a program dedicated to the removal of Spyware. Kazaa has been bundled with Spyware since at least Dec 2001 (see Table 2). SpywareGuide currently has 2041 Spyware programs listed, and this list is growing fast. A recent study found over 5% of the computers at the University of Washington were infected with Spyware, some of them for over 4 years. Network viruses and worms have been around even longer than that.

But what can be done? Fundamental changes in the Internet, and how it is used, would be needed to preserve privacy. IPv6 would go a long way to helping out. It would allow for better packet route tracing and greatly reduce spoofing. In addition, IPv6 would allow the elimination of NAT and the addition of Kerberos authentication as needed. However, IPv6 would not be enough. Some organization would need to oversee network applications and certify JavaScript, ActiveX applets, etc. to ensure their compliance with privacy standards. However, these are extreme measures that are not likely to happen soon, if at all.

Kazaa versions: 1.3.3, 1.4, 1.5, 1.6, 1.7, 2.0, 2.1, 2.1.1, 2.6
Released:       12/01, 01/02, 02/02, 04/02, 05/02, 09/02, 02/03, 05/03, 11/03
Bundled programs (number of the above versions each was bundled with): Gator (1), SaveNow (8), Cydoor (7), BDE (6), VX2 (2), New.net (6), OnFlow (3), OIL-Ware (3), CmnName (7), PromulGate (1), DirecTVloon (2), MySearch (1)
Table 2: Spyware Bundled with Kazaa
Twelve different programs that were bundled with Kazaa at various points in time.

A more practical approach is for the individual to monitor their own Internet data. One way to do this monitoring is to have software tools that can find patterns in network traffic. IntelliGuard IT has just such a tool. It uses frequent pattern mining in its architecture to create a self-learning firewall for an enterprise environment. It is this kind of process that NetAppMine will be using to enhance Internet security for the PC environment.

Mining can be used to find frequent patterns in network data. That is, finding two or more network events that frequently occur together. For example, a DNS lookup of MSU.EDU may frequently occur with a PING of MSU.EDU and an FTP to MSU.EDU. This would result in a frequent pattern of (DNS, FTP, PING). Additional frequent patterns might be (DNS, FTP) or (FTP, PING). In fact, there are 7 possible combinations for the three items DNS, FTP, and PING. With 9 items, there would be 511 (= 2^9 - 1) combinations. Finding all of the frequent patterns and filtering out the interesting ones is one of the challenges this research will tackle. After all, even with filtering, there are still a lot of patterns to look at.
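As a quick illustration of where the 2^n - 1 count comes from, the small sketch below (illustrative only; it is not part of NetAppMine) enumerates every non-empty combination of the three items above with a bitmask. Running it prints the 7 combinations; the same loop over 9 items would print 511.

    #include <iostream>
    #include <string>
    #include <vector>

    // Enumerate every non-empty combination of n items with a bitmask,
    // which is why n items give 2^n - 1 candidate patterns.
    int main() {
        std::vector<std::string> items = {"DNS", "FTP", "PING"};
        const unsigned n = items.size();
        for (unsigned mask = 1; mask < (1u << n); ++mask) {   // 1 .. 2^n - 1
            std::cout << "(";
            bool first = true;
            for (unsigned i = 0; i < n; ++i) {
                if (mask & (1u << i)) {
                    std::cout << (first ? "" : ", ") << items[i];
                    first = false;
                }
            }
            std::cout << ")\n";
        }
        return 0;   // prints 7 combinations for 3 items
    }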
5.3 Methodology

The last section emphasized that NetAppMine is a real application filling a real niche; this section will demonstrate that every effort was made to make NetAppMine as effective and efficient as possible. It will do so by describing the methods used to collect and pre-process data, as well as describing the kinds of results expected from the frequent pattern mining analysis. As will be seen in the experimental results (see Section 5.5), the optimization presented in Chapter 4 is an integral part of making NetAppMine practical due to the high volume of data.

5.3.1 Collecting Network and Application Data

The program collects and time-stamps information from each network packet sent. The information that NetAppMine logs is Source IP, Source Port, Destination IP, Destination Port, and the protocol (i.e., tcp, udp, icmp, or other). In addition, as packets of data are collected, the program also collects a snapshot of applications with active network connections. This is done up to four times a second. Lastly, all transactions are given a timestamp in milliseconds. However, without processing, the resulting logs quickly grow into many gigabytes of data.

5.3.2 Processing the Logs

Processing the logs prepares and reduces the amount of data that needs to be analyzed. Processing includes the filtering out of extraneous IPs, combining data to create items, and grouping items to create records. The objective here is to balance between having enough data to produce meaningful and useful patterns, while not having so much data that the time and space complexity needed for analysis is too costly to be practical.

Filtering extraneous data is the safest method of reducing the amount of data that needs to be analyzed. For example, the data was collected in promiscuous mode, so network traffic between other systems can be filtered out. This is done by removing any transaction that does not contain the IP address of the local machine, 127.0.0.1, or 0.0.0.0. Some space compression is also done at this stage by encoding the data. That is, IP addresses, application names, etc. are each encoded as a single unique number. However, this still leaves tremendous amounts of data to be analyzed.

Combining data to make items is a key component of reducing the time complexity of analyzing the data. For example, one of the smaller experiments presented in Section 5.4 had 120 IPs, 700 Ports, 15 Applications, and 4 Protocols (i.e., tcp, udp, icmp, and other). Because of the combinatorial explosion problem, this would mean 420 billion combinations to look for, with thousands of them being frequent. To reduce this complexity, each transaction is converted to three items. These items are combinations of the following fields:

1) Application, Domain
2) Application, IP address
3) Application, IP address, Protocol, Destination Port, Send? [22]

Where "Domain" is the first two octets of the IP address.
Where "Send?" is FALSE if the source IP is a remote address.

22 Items of type three are similar to a virtual circuit. That is, a virtual connection between two applications that are communicating over the network.

As a result, the experiment with 420 billion combinations is reduced to less than 22 million, with fewer than 100 of them being frequent. For example, event #1, event #2, and event #3 from Table 3 are all from a single transaction. As such, this transaction creates only three items. Otherwise, this same transaction would create the 6 items: "IE", "35.9", "35.9.20.20", "TCP", "Port=80", and "Send=True".
However, with these definitions of items, we can still capture an application that frequently connects to the "35.9" domain even if it continually changes the ports and/or IP addresses it is connecting to.

The last consolidation done by NetAppMine before analysis is the identification of records. In this case, a record is a collection of items. To reduce the number of records, the definition of an item is further refined to be an event. An event is the starting or ending of a virtual connection. For example, if IE starts sending data to 35.9.20.20 at 8:00 am, and continuously sends IP packets to port 80 until 9:00 am, that would generate 6 events (see Table 3). Events that occur within the same time slice are grouped together. For these experiments, a time slice was defined as 10 seconds. As such, if the applications PING and FTP frequently run within 10 seconds of each other, a frequent pattern will be identified that associates the running of PING with the running of FTP. It is this data that NetAppMine analyzes for frequent patterns, and these are the kinds of patterns that NetAppMine will find.

Event #1   08:00:00  (IE, 35.9)
Event #2   08:00:00  (IE, 35.9.20.20)
Event #3   08:00:00  (IE, 35.9.20.20, tcp, 80, True)
Event #4   09:00:00  (IE, 35.9)
Event #5   09:00:00  (IE, 35.9.20.20)
Event #6   09:00:00  (IE, 35.9.20.20, tcp, 80, True)
Record #1  08:00:00  (IE, 35.9) (IE, 35.9.20.20) (IE, 35.9.20.20, tcp, 80, True)
Record #2  09:00:00  (IE, 35.9) (IE, 35.9.20.20) (IE, 35.9.20.20, tcp, 80, True)
Table 3: Events and Records
This table shows 6 events and the two records created by those events.

5.3.3 Analyzing the Data

Frequent pattern mining can be done using several methods that produce several kinds of results. In this case, the algorithm was modified to reduce the number of patterns found without eliminating interesting patterns.

A pattern is a combination of items that occur together frequently. For example, (IE, 35.9) and (IE, 35.9.20.20) could occur together in 150 records. As a result, the pattern [ (IE, 35.9), (IE, 35.9.20.20) ] has a frequency of 150. Such a pattern would indicate that a lot of the communication to the Domain "35.9" is to the same IP address. For these experiments, a pattern needs to occur in at least 20% of the records to be considered a frequent pattern.

To reduce the number of patterns printed, usually only "maximal patterns" are printed. For example, if [ (IE, 35.9), (IE, 35.9.20.20) ] has a frequency of 150, then [ (IE, 35.9) ] and [ (IE, 35.9.20.20) ] must each have a frequency of at least 150. However, neither of these patterns will be printed unless one of them has a frequency significantly greater than the frequency of [ (IE, 35.9), (IE, 35.9.20.20) ].

In addition, to emphasize the role each application plays in network communications, mining was done per application. For example, network communication by EXPLORER.EXE may indicate file sharing over the network. As a result, all records that reference EXPLORER.EXE are extracted from the logs and that dataset is mined separately. This is done for each application that connects to the network.

If a dataset has only a few records, it is not mined. If there are only 5 records in a dataset, then every record will be considered frequent, and if two records contain several items that are the same, then hundreds of patterns will be generated from just a few records. As a result, any application dataset that contains 15 or fewer records is not mined. Instead, the high port (1000 or greater) items are removed from the records and each record is printed as a pattern.
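Before moving on to the experiments, the sketch below summarizes the pre-processing just described in code form. It is only an illustration of the logic of Sections 5.3.2 and 5.3.3: the structure, field, and function names (Transaction, domainOf, toItems, recordKey) are assumptions, not NetAppMine's actual source.

    #include <string>
    #include <vector>

    // Hypothetical sketch of NetAppMine's log pre-processing; names are illustrative.
    struct Transaction {
        long long timestampMs;   // millisecond timestamp
        std::string app;         // application with the connection, e.g. "IE"
        std::string remoteIp;    // e.g. "35.9.20.20"
        std::string protocol;    // "tcp", "udp", "icmp", or "other"
        int destPort;            // destination port
        bool send;               // false if the source IP is a remote address
    };

    // The "Domain" is the first two octets of the IP address.
    static std::string domainOf(const std::string& ip) {
        std::size_t second = ip.find('.', ip.find('.') + 1);
        return ip.substr(0, second);
    }

    // Each transaction contributes exactly three items (Section 5.3.2).
    static std::vector<std::string> toItems(const Transaction& t) {
        return {
            t.app + "," + domainOf(t.remoteIp),
            t.app + "," + t.remoteIp,
            t.app + "," + t.remoteIp + "," + t.protocol + "," +
                std::to_string(t.destPort) + "," + (t.send ? "True" : "False")
        };
    }

    // Events falling in the same 10-second slice belong to the same record.
    static long long recordKey(long long timestampMs) {
        return timestampMs / 10000;   // 10,000 ms = one time slice
    }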
As a result, any application dataset that contains 15 or fewer records is not mined. Instead, the high port (1000 or greater) items are removed from the records and each record is printed as a pattern. 5.4 Experiments 5.4.1 Experimental Environment 48 The network traffic of three computers were analyzed. All three environments were home PCs running different versions of Microsofi Windows. Each of the computers had a 2+ Mb/s lntemet connection. Each of the computers had network applications other than browsers and were monitored during both active and idle times for at least 12 hours. The names of the three computers were TEST, WORK, and MEDIA. The TEST computer was the computer that NetAppMine was developed and tested. Its network traffic was collected several times over a two week period. Some collections were under 10 minutes and others were over 12 hours. The Test PC was running Windows 2000 and was connected to the lntemet via Cable through 3 D-Link Firewall Router. Some of the collection was made with Norton lntemet Security installed, and some was not. In addition, the Spyware programs “iGetNet” and “BonziBuddy” were intentionally installed. The Work computer’s primary function was to connect to a remote office. Since it was a relatively new computer, it did not have many programs installed on it. It was running Windows XP and was connected to the lntemet via a wireless hub. Data collection was done over a single 24 hour period. The MEDIA computer’s primary function was for entertainment. It had many applications installed on it, including two P2P applications (ie: Kazaa). It was running Windows XP and was connected to the lntemet via Cable through a router. Data collection was done over a single 12 hour period. The collection of network and application data was done with a program that uses the third party tools Ngsniff and Tcpvcon. These tools were chosen because they do not require any setup or installation. That is, they can be run on any computer. 49 5.4.2 Displaying Results Using a GUI Although there may be fewer than a hundred patterns, each pattern might contain 30 or more items. To make it easier to review and recognize interesting patterns, NetAppMine uses a GUI to display the results. Figure 28 (at the end of this chapter) contains a GUI display of the pattern represented by the table below as well as summary columns: Pattern #2 has a frequency of 6 with 13 simultaneously network events for the application “C:\Program files\Intemet Explorere\IEXPLORE.EXE” The13 network events include: Pat Port App [P Protocol Port Direction Name Type # l 1 0 -1 None 2 2001 0 -1 None 3 1 0 192.168.0100 UDP 53 < None 2 2043 0 216.177 None 1 43 0 216.177.73.139 None 3 43 0 216.177.73.139 TCP 80 < None 2 203 5 10 209.244 SymProxySvc.exe 2 2044 10 216.1 77 SymProxySvc.exe 1 44 10 216.177.73.139 SymProxySvc.exe l l 10 10 65.54.140.158 SymProxySvc.exe 1 57 12 127.00.] IEXPLORE.EXE 3 5 7 12 127.001 TCP 1030 < IEXPLORE.EXE 2 2008 12 205.161 IEXPLORE.EXE Table 4: “Events” for a Single Pattern NOTE: An “Event” does not mean a packet was sent, but that a network connection was started or stopped The format of the GUI display makes it easier to see information and patterns that can not be easily seen in the text. The GUI display lists all applications on the right side and all items of a pattern on the left. This makes it easier to see which applications are (and are not) participating in a pattern. Further, the current application dataset is in black and has “*9, an . 
This makes it easy to identify which application dataset is being looked at. In addition, extra information is readily available about the current application dataset as well as other applications. This includes the number of records in other datasets as well as the number of patterns. The top line of the display shows the number of patterns in the dataset, the frequency count of the current pattern, and the full path to the application. All of this data in Figure 18 makes it easier to see that Internet Explorer is using SymProxy to access the Internet for everything except the Domain 205.161. Further analysis of this pattern is given in the next section.

5.4.3 Results of Analysis

In all, there were 89 patterns generated among the three computers. Figures 18, 19, and 20 show three of those patterns. They help to demonstrate that useful results can be gathered from this kind of analysis.

Figure 18 is a pattern from the analysis of the TEST computer. The TEST computer had a total of 45 patterns generated. The local IP address of the TEST computer was 192.168.0.100. Figure 18 displays pattern #2 from the analysis of IE. It shows that IE primarily used SymProxy [23] to access the Internet. This can be seen from the "127.0.0.1" communication while SymProxy was accessing 216.177.73.139 (the iGetNet Spyware web site). However, IE did have an exception. It directly accessed several computers in the Domain 205.161. Most likely, this is an indication that the iGetNet software was using IE to access the web directly. However, it could also be normal communication by IE for tasks like downloading a program. Further collection of data would help to distinguish whether this was an anomaly or just normal behavior.

23 SymProxy is a Norton proxy program for Internet security.

The analysis of the WORK computer produced 23 patterns as well as 27 records from programs that infrequently access the Internet. PokerStarsUpdate.exe is an example of just such a dataset. It helps to show what a pattern looks like for a program that infrequently accesses the Internet. Figure 19 is the first pattern of the PokerStarsUpdate.exe application. Since this program's dataset has fewer than 15 records, each record that contains PokerStarsUpdate.exe generates a pattern with a frequency of 1. This clearly demonstrates that PokerStarsUpdate.exe is accessing the web. In addition, it shows what else is going on while it is doing so. Based on this information, the PokerStarsUpdate.exe program could be blocked from accessing the Internet, or it could be blocked from the 66.212 Domain.

The MEDIA computer has an IP address of 192.168.2.35 and generated over 3,000 records in less than 12 hours. Most of these records were generated while the computer was not being used. During that time, it was generating network traffic at 6 times the rate of the other two computers. Figure 20 shows that a lot of these records were broadcasts. This is a likely indication that there is a misconfigured, poorly designed, or bad application on this computer. Further analysis would be needed to determine which application is causing the problem, but it is most likely the HP program "BackWeb" because of the high volume of network traffic it is generating.
Figure 18: GUI Display of TEST Computer
Internet Explorer is using SymProxy (usually).

Figure 19: GUI Display of WORK Computer
PokerUpdate has fewer than 15 records in its dataset.

Figure 20: GUI Display of MEDIA Computer
Over a 12-hour duration, this computer was very "chatty" despite being idle.

5.5 Benefits of Optimizations

Section 5.3 presented the measures taken to filter and process the data so that the analysis done by NetAppMine is as space and time efficient as it can be. Section 5.4 demonstrated that the analysis of this data still produced meaningful and useful results. This section analyzes the performance of NetAppMine.

Because of the high volume of network traffic that a PC can generate, NetAppMine can have large data files. Figure 21 is a graph of the RAM usage of NetAppMine for analyzing data. In this case, each dataset represents a 24-hour period of traffic, and the datasets are ordered from smallest to largest. As you can see from the figure, the amount of RAM required to mine patterns grows geometrically with the size of the dataset. However, implementing the space optimization gives a geometric reduction in RAM requirements, and it does so without increasing processing time (see Figure 22).

Figure 21: Space Savings Due to Optimizing

Figure 22: CPU Cost of Space Savings from Optimizing

You may notice that although there appears to be a good relationship between the size of the dataset and the amount of RAM it takes to mine, a similar relationship between the dataset size and the runtime does not appear to exist. A lot of parameters affect the runtime of mining, and the size of the dataset is not necessarily the most important. Indeed, while doing performance analysis, the worst dataset was one of the smaller datasets. Figure 23 is a graph of the processing time needed to do a detailed analysis of the network traffic running through the firewall from the experiment done on the TEST computer. From this graph, it can be seen that decreasing the support by a small amount has a dramatic effect on the processing time.
The next chapter will look at why this is occurring, and what can be done about it.

Figure 23: CPU Usage with a Dense Dataset

5.6 Conclusions Drawn From the Application

NetAppMine works, and it does so efficiently. The results of the analysis proved useful in identifying applications with network problems. Further, the combination of reducing the quantity of data analyzed and implementing the RAM optimization technique has made NetAppMine a practical application that should not overwhelm a standard PC.

NetAppMine could be significantly improved. Further studies of normal network activity and a database of acceptable patterns could be added to help automate the process of identifying unusual network traffic. In addition, the application could keep track of user responses to help automate the future distinction between acceptable and unusual patterns. NetAppMine can be further improved by increasing the kinds of data it collects. For example, it could collect packet sizes, or classify the kinds of TCP data being transferred. This would enhance the analysis and widen the kinds of problems that NetAppMine could detect. However, degenerate datasets do exist that make such analysis difficult. The next chapter will look at how such datasets might be handled.

CHAPTER 6: Reducing CPU Requirements For Dense Datasets

6.1 Introduction

The last chapter introduced a frequent pattern mining application; however, this application uncovered a degenerate dataset that was difficult to mine. The reason for this drastic degradation of performance was the high number of patterns produced by the small dataset. In other words, it was an extremely dense dataset. It is common for the performance of mining algorithms to degrade as pattern density increases. Such degradation occurs when you are trying to get past common knowledge (e.g., "it is dry in the summer") to more interesting knowledge (e.g., conditions for "drought in the summer"). This chapter presents an algorithm to increase the efficiency of such analysis.

Recent work on frequent pattern mining has focused on using pattern growth algorithms such as FP-Growth. This approach works well when the patterns are longer or there are a large number of patterns. Although Trie-Tree-like data structures have proven effective for such algorithms, the construction and the traversal of the trees are bottlenecks of these approaches. In this chapter, we propose a method that uses a self-modifying prefix-tree structure that mines the patterns in a predefined order. The method is called Fri; it works by extracting frequent patterns from a Trie-tree structure in a depth-first fashion. Our experiments show that when minimum support is low, or when the pattern density is high, Fri significantly outperforms previous approaches.

FP-tree, the Trie-Tree data structure of FP-growth, has a complex structure of pointers. As such, the cost of constructing and traversing the conditional FP-trees is not trivial, and it adversely affects the performance of FP-growth under certain circumstances. Further, the bottom-up approach produces results in a very erratic fashion. Indeed, the only pattern it tends towards is the production of the least-supported patterns first.
In this work, we propose the following techniques to reduce the cost. First, a prefix-tree structure, the Trie-tree, is constructed to store the critical information about frequent patterns. Second, a depth-first pattern growth method is proposed to mine the frequent patterns incrementally; as such, the Trie-tree grows during the mining process without having to create conditional trees.

The organization of the following sections is as follows. Section 6.2 gives a brief description of the Trie structure and the general idea of mining frequent patterns from it. Section 6.3 presents the Fri algorithm, and Section 6.4 discusses several optimizations that can be applied to it. Section 6.5 presents the performance study, and Section 6.6 summarizes the study and discusses future work.

6.2 Trie and Frequent Patterns

The data structure Trie was first introduced by Fredkin. The construction procedure of the Trie guarantees the sufficiency of this compressed data structure, i.e., it contains all the information in the original dataset. For detailed construction, see Section 2.3 or the article "An Efficient Implementation of Trie Structures" by Aoe et al. Figure 24 illustrates a Trie-tree of an example dataset. We denote the subtree with a root of A:6 under Root as subTree(A:6, Root); so the Trie in Figure 24 has 3 subtrees under Root: subTree(A:6, Root), subTree(B:2, Root), and subTree(D:1, Root). It can be seen that subTree(A:6, Root) represents all the transactions that contain "A", so subTree(A:6, Root) contains exactly the information that is needed to generate frequent patterns containing "A"; the union of subTree(B:4, A:6) and subTree(B:2, Root) contains exactly the information needed to generate frequent patterns containing "B". Due to the structure of the prefix tree, subTree(B:4, A:6) and subTree(B:2, Root) could be merged to form a new Trie, subTree(B:6, Root).

The Fri process is a pattern growth algorithm like FP-growth. However, its data structure is a standard prefix-tree (Trie-tree). Further, the frequent patterns are mined in a depth-first approach. As such, patterns are created in a very predictable fashion. Further, the mining process is a recursive algorithm that can easily be modified for parallel processing. That is, after the first iteration of mining patterns that begin with "A", the mining of the patterns that begin with "B" can be started. The ordering of the pattern generation can be any predefined order, such as alphabetical or numerical. The patterns are generated in this lexicographical order. As a result, if the patterns are added to a tree, the tree would always be adding leaf nodes. Such a tree would be very similar to the tree produced by the Tree Projection algorithm and can be compactly stored and quickly reread for future reference.

The Fri algorithm is a recursive algorithm, with each iteration producing one new pattern. When mining the "A" branch, the first step is to copy the branches under "A" to other parts of the tree. After that has completed, the "B" branch can be mined independently of the "A" branch. Similarly, after the first iteration of the "B" branch, the "C" branch can be mined independently of "A" and "B". Further, once the mining of a branch has started, the nodes under the branch can be copied to another system and mined independently. This partitioning can be applied to each level of the tree.
That is, once the mining of the "AB" branch has started, it too can be copied to another system and mined independently.

[The sample dataset and Trie diagram were not fully recovered from the scan; the transactions containing "A" are "AB", "ABC", "ABC", "ABD", "AC", and "AC", and the Trie has the subtrees A:6, B:2, and D:1 under Root.]
Figure 24: A sample dataset and its Trie

6.3 The Fri Algorithm

The simplified algorithm for producing patterns using a tree is:

Procedure SimpleFri(Branch)
(1)  For (each Limb below Branch)
(2)    CopyMerge the Limb up one level
(3)  Output Pattern
(4)  For (each Limb below Branch)
(5)    SimpleFri(Limb)

The technique behind pattern growth algorithms is to generate combinations of various patterns in an orderly fashion. Since our algorithm searches for frequent patterns using a Trie-tree, we call the algorithm Fri. The basic approach is a depth-first recursive procedure of copying branches up one level. This means the tree will be modifying itself as the patterns are generated. For example, the tree in Figure 24 has subTree(B:4, A:6), which contains the nodes C:2 and D:1. This subtree would be copied up one level into subTree(B:2, Root). The resulting tree would look like Figure 25. In some cases, the result will be to change nodes (e.g., B:2 becomes B:6). Other times, new nodes will need to be created (e.g., D:1 under B:6 is created).

Figure 25: The Trie after applying CopyMerge of subTree(B:4, A:6)

Since subTree(A:6, Root) also has C:2 under it, C:2 would also be copied up one level. This would complete the processing of the "A:6" node. The effect of doing these copies is to create new combinations of items. This is like removing all of the "A"s from the records that contain "A" ("AB", "ABC", "ABC", "ABD", "AC", and "AC") and adding them to the Trie-tree. Now the tree would look like Figure 26.

Since this is a depth-first algorithm, the next subtree to be processed is subTree(B:4, A:6). Copying C:2 and D:1 up one level would have the effect of adding the combinations "ABC", "ABC", and "ABD" without the "B" and without modifying the count of A:6. Processing this new tree would process nodes C:2 and D:1 of subTree(B:4, A:6) and nodes C:4 and D:1 of subTree(A:6, Root). However, since these are leaf nodes, they have no subtrees beneath them. As a result, Figure 27 shows how the tree would look after processing all of the nodes in subTree(A:6, Root).

Figure 27: The Trie after applying CopyMerge of two leaf nodes of subTree(B:4, A:6)

After all of the nodes in subTree(A:6, Root) have been processed, each node in subTree(A:6, Root) will represent a pattern. For example, the "A" branch represents the patterns "A:6", "AB:4", "ABC:2", "ABD:1", "AC:4", and "AD:1". When all of the nodes are processed, the resulting tree will have a similar structure to the tree produced by the "Tree Projection" algorithm. Figure 28 is the fully-expanded Trie-tree; each node whose item has a frequency no less than min_sup represents a pattern generated by Fri, and the frequency of the pattern is equal to that of the item in the current node, due to the fact that nodes at a higher level will have a higher frequency.

Figure 28: The fully-expanded Trie
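As an illustration of the CopyMerge step that drives this expansion, the sketch below shows one way it could be expressed. It is not the thesis implementation; the node layout and function names (Node, mergeInto, copyMergeUp) are assumptions made for the example.

    #include <map>
    #include <memory>

    // Illustrative sketch of CopyMerge: copy every child of a branch up one
    // level and merge it with any sibling holding the same item, summing counts.
    struct Node {
        int item = -1;
        long long count = 0;
        std::map<int, std::unique_ptr<Node>> children;   // item -> child, kept ordered
    };

    // Merge the contents of `src` into `dst` (same item), adding counts and
    // recursively merging children.
    static void mergeInto(Node& dst, const Node& src) {
        dst.count += src.count;
        for (const auto& [item, child] : src.children) {
            auto& slot = dst.children[item];
            if (!slot) { slot = std::make_unique<Node>(); slot->item = item; }
            mergeInto(*slot, *child);
        }
    }

    // CopyMerge the children of `branch` up one level into `parent`, e.g.
    // subTree(B:4, A:6) merged into Root's B:2 to form B:6.
    static void copyMergeUp(Node& parent, const Node& branch) {
        for (const auto& [item, child] : branch.children) {
            auto& slot = parent.children[item];
            if (!slot) { slot = std::make_unique<Node>(); slot->item = item; }
            mergeInto(*slot, *child);
        }
    }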
BI‘ @312 Figure 28: The fully-expanded Trie 6.4 Optimizations The SimpleFri algorithm is very efficient at generating all patterns for all items in all the transactions. However, when min sup is bigger than one transaction, there are a few optimizations that will significantly increase performance. The motivation for these optimizations is driven by the general structure of a database stored in a Trie-tree. That is, leaf nodes of a Trie-tree tend to be at the end of a long string of nodes with only one child. An example of this is in the branch created by the transaction "BCD" in Figure 24. These "tails" are frequent because long records tend to have unique "tails". For example, two records "BCDE" and "BDE" both have the same prefix "B" but have different postfix items(i.e., "CDE" and "DE"). Because of this, they will create different "tails" in a Trie- tree. Copying these tails up one branch is costly and will usually not generate a frequent pattern. As a result, it is a big optimization to allow multiple references to the same "tail" whenever possible. Another optimization can be done when the algorithm is performing its depth first search and it comes across a node that is not frequent. Since the nodes of the Tri-tree are ordered (i.e., alphabetically, a node will only have children that are alphabetically bigger than it). This means that when the children of a node are copied up, they can never add to the current node. It can greatly reduce the copying of infrequent nodes if the branch is pruned of infrequent items. Since the siblings of a node were copied up before traversing the branch, pruning will not eliminate any possible frequent patterns of parent nodes. Similarly, after a node has been processed, it can be deleted to reclaim any resources it is using. The last optimization is a result of the pruning process. When infrequent items are removed from a branch, it frequently leaves a tail of nothing but frequent items. This is especially true if there is a high confidence between the items. As a result, a more efiicient algorithm for printing all combinations of all nodes can be utilized. The new optimized algorithm works very well for processing nodes in memory. The optimized algorithm is: Procedure Fri(Branch) (1) For (each Limb below Branch) (2) CopyMerge theLimb up one level (3) If (Branch Support < min sup) (4) Return (5) Output Pattern (6) CheckForRef(Branch) (7) For (each Node below Branch) (8) Limb = subTree(Node, Branch) (9) If(Node Support < min sup) (10) Prune(Limb) (l I) Else (l2) Fri(Limb) (13) Delete Node 65 The firnction CheckForRef() will ensure that the branch is not multiply referenced. Limb is a reference to a subtree, not a copy of the subtree. 6.5 Experiments The experiments on the performance comparison between Fri and FP-growth were conducted on an AMD Atholon 2800+ PC with 768MB RAM, running RedHat Linux 8.0. The programs were written in C++ and compiled with the "-O3" compiler optimization. We used synthetic datasets for our experiments. The procedure of generating the datasets is described in. In the first experiment, min sup is varied from 0.07% to 0.7% with the dataset D1(T2017D500K) where average transaction size is 20, the average maximal potentially frequent itemset size is 7, there are 500K transactions, and the number of different items is 20K. 
In the second experiment, the dataset size is varied from 100K transactions to 2 million transactions with a min_sup of 1%, 2%, and 3% and the D2 datasets (T20I6D*K), where the average transaction size is 20, the average maximal potentially frequent itemset size is 6, and the number of different items is 20K. In the third experiment, the potentially frequent pattern length (PatLen) varies from 1 (random data) to 13 (very long maximal patterns) with the D3 datasets (T20I*D500K), where the average maximal potentially frequent itemset size ranges from 1 through 14 and the number of items is 20K. The last experiment varied the number of predefined frequent patterns used to generate the artificial datasets. This is the "NPats" parameter of the "gen" program. NPats was varied from 40 patterns to 200 patterns for the D4 datasets (T20I7D500K) with 20K items.

The first experiment (Figure 29) showed that Fri outperforms FP-growth significantly when min_sup is less than 0.2%. This is because of the high density of patterns when min_sup is small. The second experiment (Figure 30) showed that Fri is up to 5 times faster than FP-growth when the dataset has 200K transactions. Further, Fri is consistently faster than FP-growth for datasets of up to at least 2 million transactions. FP-growth does appear to scale better, but the dataset would have to be significantly bigger than 2 million transactions to outperform Fri. Further, because Trie-Trees have a lot of "tails", and Fri does not prune tails until it needs to, Fri's runtime is much more predictable for similar datasets.

Figure 29: Performance comparison on T20I7D500K (runtime vs. support (%) for Fri and FP-growth)

Figure 30: Performance comparison on T20I6D*00K (runtime vs. number of transactions for Fri and FP-growth at 1%, 2%, and 3% support)

The third experiment (Figure 31) showed that Fri works better with longer average maximal pattern lengths. Fri outperforms FP-growth when pattern lengths are greater than 6. Further, as the pattern lengths get longer, the difference becomes very significant. When the pattern length is 1, there is essentially a random or orthogonal relationship between the items in the dataset. Under such circumstances Fri does not perform as well as FP-growth, but neither algorithm will find many patterns because there are not many patterns to be found. This helps demonstrate that Fri works best when frequent patterns are dense.

The last experiment (Figure 32) showed that Fri performs well when there is a large number of shorter patterns. Since the dataset-generation procedure draws its items from a set of predefined potential frequent patterns, a larger number of such predefined patterns will decrease the occurrence of any one potential pattern. This has the effect of reducing the density of frequent patterns. With a reduced pattern density, Fri and FP-growth perform about the same, but when the density gets higher, Fri significantly outperforms FP-growth.
Figure 31: Performance comparison on T20I*D500K (runtime vs. PatLen for Fri and FP-growth)

Figure 32: Performance comparison on T20I7D500K with varying NPats (runtime vs. NPats for Fri and FP-growth at 0.5%, 1%, and 2% support)

6.6 Conclusions of Reducing CPU Requirements

In this chapter, we have used a Trie structure to store the database in a compressed format without information loss, and we have proposed that a pattern growth algorithm, Fri, be used for frequent pattern mining in databases with dense frequent patterns. Fri outperforms other algorithms for two reasons: (1) the temporary data structure for storing the compressed dataset is very compact; (2) the pattern growth method not only avoids candidate set generation and testing, but also significantly reduces the cost of node count re-computation when the pattern density is high.

However, the real question is: does it help with the problem discovered in the previous chapter? The figure below is a comparison between mining without the CPU optimization and mining with Fri. As you can see, it cut the run time in half. This is not to say Fri is best in all situations, but it clearly helped in this one.

Figure 33: CPU Optimization of Degenerate Case (runtime vs. support for optimized and unoptimized mining)

Further research issues related to Trie and Fri are caching and disk-based algorithms. For example, when the dataset is very large and the resulting Trie-tree cannot fit into memory, we can put the Trie onto disk. The locality of the Fri algorithm makes it possible to efficiently generate patterns with a caching strategy; thus very large databases could be mined. Initial studies have shown that data copied forward in the tree accounts for less than 1% of the data processing needed to calculate frequent patterns. That is, the algorithm has better than a 99% cache hit rate as it performs its mining. An algorithm to reduce the cost of cache hit/miss calculations is still being worked on.

The current algorithms repeat a lot of work. The same transactions are repeatedly searched if they contain the items of two or more frequent patterns. As a result, they will need to be repeatedly pruned and/or copied. One approach we are looking into is to store the results of a transaction search. The obvious disadvantage of such an approach is the high RAM requirement. Of course, the technique presented in Chapter 4 will help deal with that.

CHAPTER 7: Conclusions and Future Work

This thesis has presented a road map for expanding frequent pattern mining into a new environment. The new environment is pattern mining as a secondary application. Under this new role, mining needed to be optimized to avoid having to size the resources of a system for a secondary application. Since mining often has high memory requirements, Chapter 4 introduced an approach to significantly reduce the memory usage of pattern mining.
To demonstrate that mining can have practical uses as a secondary application, Chapter 5 implemented enhancements to PC Internet security using frequent pattern mining. And since mining sometimes has high processing costs, Chapter 6 demonstrated how to reduce the processing requirements of mining for dense datasets. Together, these chapters provide a practical map for optimizing and applying frequent pattern mining to a new area of research.

Further research on frequent pattern mining in secondary applications can lead to an enhanced computing environment. For instance, the document you are reading now was created with a program that continually tried to "autocorrect" the formatting, but kept "guessing" wrong. Reconfiguring the program could have been done, but formatting preferences changed as the context of the document changed. Frequent pattern mining could be used to adapt to the usage patterns of the user and enhance the word processing experience. This is just a second example of how frequent pattern mining could enhance the computing environment: it could also be used to enhance disk usage; adjust the display of icons, windows, and colors; reorder search results for files and web pages to match our usage patterns; and serve many, many other secondary applications. But it all depends upon mining being optimized enough to handle the increasing data load without overwhelming the system it is trying to enhance.

Frequent pattern mining is a very active area of research, and some of this research can be used to enhance this work. For instance, because of the nature of permutations, there are often too many frequent patterns to be analyzed. To assist, there is a lot of research on reducing this number to a more manageable size. For instance, there are frequently groups of patterns that use the same items; analyzing these groups instead of the individual patterns reduces the complexity of analysis to something more practical. In addition, having a preset criterion of interesting patterns helps to limit analysis to more useful patterns. Also, analyzing patterns that are missing helps to find missing data. Missing patterns could be an indication of preferences people do not want to see, or of forces that keep patterns from occurring. All of this research, and much more, can be incorporated with this work to enhance the computing environment and make applications more useful.

APPENDIX A: Support vs Space

The focus of this thesis has been on optimizing mining for PC applications. However, as is often the case, while researching one topic, new and interesting topics arise. Figure 16 of Section 4.5 demonstrated a relationship between RAM and minimum support. In this appendix, we have the opportunity to expound on the implications of this relationship, especially the rather surprisingly predictable relationship between minimum support and RAM requirements.

Since the diverse datasets used in Section 4.5 were from unrelated sources, it was surprising to see such a strong and predictable relationship arise in Figure 16. Figures 34 and 35 repeat the memory requirements for WebDocs and Accidents, and add a 160 MB dataset of census data from 2000 and a 500 MB synthetic dataset. Although Figure 34 shows curved lines, Figure 35 demonstrates how straight the lines become when comparing support with the log of RAM. Interestingly enough, this logarithmic relationship holds strongest for the real-world datasets, and such datasets do not normally exhibit such nice linear results without a strong overriding principle governing them.
Although the principle governing the relationship between support and RAM is not strict, it has a dominating effect for minimum support as high as 20%. Working from the y = ax + b formula of a straight line, the principle (hereafter called the support-space principle) would be as follows:

ln(D') = -a*s + ln(D)

where:
  s  = the minimum support
  D  = the space needed for storing the entire dataset
  D' = the space needed for a given support
  a  = a constant dependent upon the dataset and the algorithm used to mine it.

Figure 34: Support vs RAM (linear). Note: the linear graph produces curved lines. (RAM vs. % support for the accidents, census, webdocs, and synthetic datasets)

Figure 35: Support vs RAM (log). Note: the log graph has straight lines. (log-scale RAM vs. % support for the same four datasets)

To understand the implications of this principle, it helps to look at some of the observations of Chapter 4. That chapter observed that for support as high as 20%, most of the nodes within a Trie-Tree are within tails. So, the amount of RAM a Trie-Tree is going to need is proportional to R * L', where R is the number of records in the dataset, and L' is the average length of records after infrequent items have been filtered out. However, Chapter 4 also demonstrated that most records within a dataset contain at least one frequent item. As a result, the number of records within a dataset remains relatively constant with respect to the minimum support.

The consequence of these observations and the support-space principle is that there is a strong relationship between L' and minimum support. Hereafter, this relationship will be called the support-length principle, and it can be deduced from the support-space principle as follows:

Given:
  ln(D') = -a*s + ln(D)    (where a is a constant and s is the minimum support)
  D' = R * L'              (where L' is the average length of filtered records)
  L  = the average length of unfiltered records in the dataset (so D = R * L)

Then:
  ln(L' * R) = -a*s + ln(L * R)
  ln(L' * R) = ln(L * R * e^(-a*s))
  L' * R     = L * R * e^(-a*s)
  L' / L     = e^(-a*s)

Using the support-space and support-length principles, it is possible to estimate the space needs and the average lengths of filtered records. This appears to work best for larger real-world datasets with longer records. Fortunately, it is just such datasets that need good estimates. To do so would only require some basic knowledge of the characteristics of the data, and a way to calculate "a". Calculating "a" can be done by either measuring the space requirements at a relatively high minimum support, or measuring the average length of records after filtering out infrequent items. In any case, these principles are a beginning to getting a handle on analyzing the resource requirements of frequent pattern mining. Further research would be needed to define the environments best suited for these principles, but it is clear that they do occur in the real world. Perhaps further analysis may reveal why these relationships exist; but, until then, the results have an immediate and practical application of estimating resource needs.
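As a purely illustrative worked example of such an estimate, the short program below solves the support-space formula for "a" from one measured point and then predicts the space needed at lower supports. The numbers are invented for the example; only the formulas come from this appendix.

    #include <cmath>
    #include <cstdio>

    // Worked example of the support-space principle, ln(D') = -a*s + ln(D).
    // All dataset sizes here are made up; only the formula is from the text.
    int main() {
        // Suppose the full dataset needs D = 1500 MB in the tree, and a measurement
        // at s = 20% support showed D' = 150 MB.  Solve for the constant a:
        const double D = 1500.0, s1 = 0.20, D1 = 150.0;
        const double a = (std::log(D) - std::log(D1)) / s1;   // a is about 11.5 here

        // Predict the space needed at 10% and 5% support with the same constant.
        for (double s : {0.10, 0.05}) {
            double predicted = D * std::exp(-a * s);
            std::printf("support %.0f%% -> about %.0f MB\n", s * 100.0, predicted);
        }
        // By the support-length principle, L'/L = e^(-a*s) shrinks the same way.
        return 0;
    }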
BIBLIOGRAPHY

Definition of "trie". http://en.wikipedia.org/wiki/Trie

R. Agrawal, T. Imielinski, and A. Swami. (1993) Mining association rules between sets of items in large databases. In SIGMOD '93, pp. 207-216.

R. Agrawal and R. Srikant. (1994) Fast Algorithms for Mining Association Rules. In VLDB '94, pp. 487-499.

R. Agarwal, C. Aggarwal, and V.V.V. Prasad. (1999) A Tree Projection Algorithm for Generation of Frequent Itemsets. In Proc. of the High Performance Data Mining Workshop, 1999.

J. Han, J. Pei, and Y. Yin. (2000) Mining frequent patterns without candidate generation. In SIGMOD '00, pp. 1-12.

J. Park, M. Chen, and P. Yu. (1995) An Effective Hash-based Algorithm for Mining Association Rules. In SIGMOD '95, pp. 175-186.

J. Pei, J. Han, et al. (2001) H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. In ICDM '01.

E. Fredkin. (1960) Trie memory. Communications of the ACM, 3(9), 490-500.

J. Aoe, K. Morimoto, and T. Sato. (1992) An Efficient Implementation of Trie Structures. Software-Practice and Experience, 22, 695-721.

J. Han and M. Kamber. (2000) Data Mining: Concepts and Techniques. Morgan Kaufmann.

Spyware. Wikipedia (2004). http://en.wikipedia.org/wiki/Spyware (Nov 14, 2004)

Stefan Saroiu, Steven D. Gribble, and Henry M. Levy. (2004) Measurement and Analysis of Spyware in a University Environment. (Feb 2004)

SpywareGuide. (2004) http://www.spywareguide.com/product_list_full.php (Nov 2004)

Brad Rosen. (2003) Infrastructural Requirements for a Privacy Preserving Internet. (Fall 2003)

Stephen Withers, ZDNet Australia. http://www.zdnet.com.au/news/security/0,2000061744,20269428,00.htm (Oct 28, 2002)

Markus Hegland. (2002) Algorithms for Association Rules.

Robert F. Erbacher and Deborah Frincke. (2001) Visual Behavior Characterization for Intrusion and Misuse Detection.

Ngsniff, NGSec. http://www.ngsec.com/ngresearch/ngtools (Jun 9, 2004, version 1.2)

Mark Russinovich. Tcpvcon, Sysinternals. http://www.sysinternals.com/ntw2k/source/tcpview.shtml (Aug 9, 2004, version 2.34)

A. Savasere, E. Omiecinski, and S. Navathe. (1995) An efficient algorithm for mining association rules in large databases. In 21st VLDB Conference, 1995.

Geurts, Wets, Brijs, and Vanhoof. Profiling High Frequency Accident Locations Using Association Rules. Limburg University: Data Analysis & Modelling Group.

Lucchese, Orlando, Perego, and Silvestri. WebDocs: a real-life huge transactional dataset. Dipartimento di Informatica, Università Ca' Foscari di Venezia, Venezia, Italy.