DIVIDE AND CONQUER: PACKET CLASSIFICATION BY SMART DIVISION

By

James Edward Daly

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science — Doctor of Philosophy

2017

ABSTRACT

Packet classifiers form the backbone of many networking and security devices. Most packet classifiers are defined by a list of rules, each of which matches a subset of all packets. As new threats and vulnerabilities emerge, these rule lists are growing larger and thus taking longer to search. However, networks have very strict time constraints, and a delay classifying a packet at one stage can cascade, causing congestion throughout the network. Additionally, with the advent of software-defined networks, these rule lists are also constantly changing. Packet classifiers must therefore support both fast classification and fast updates.

We present four methods to improve packet classification. Each method uses smart division to divide the problem into simpler subproblems that are more easily solved. Diplomat uses dimensional reduction to create a smaller rule list by dividing the problem into several lower-dimensional problems and then combining them with as few rules as possible. ByteCuts uses rule list separation to partition the rule list into a small number of trees; this minimizes the rule replication that other tree-based methods such as HyperCuts and HyperSplit require. PartitionSort uses rule list separation to partition the rule list into parts that are sortable; each part can then be binary searched. Finally, TupleMerge uses rule list separation to partition the rule list into parts that are hashable; it requires fewer hash tables than Tuple Space Search, which reduces search times.

Copyright by
JAMES EDWARD DALY
2017

This thesis is dedicated to my wife Couri and the rest of my family, who have supported me along the way.
ACKNOWLEDGMENTS

I am eternally grateful to Dr. Eric Torng, who served as my advisor, guide, and editor throughout this process. None of this would have been possible without his help. I would also like to thank Dr. Alex Liu, who provided additional guidance. I would also like to recognize the rest of my committee, Dr. Abdol-Hossein Esfahanian and Dr. Jian Ren, for their insights and contributions. I would also like to recognize the contributions of Dr. Chad Meiners, whose code I inherited, and Sorrachai Yingchareonthawornchai, who helped with some of the coding and testing. This would have been even longer and more difficult without their contributions. Finally, I would like to thank my family for encouraging me in this endeavor, especially my wife Couri and my parents, who were beside me every step of the way.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO SYMBOLS
Chapter 1  Introduction
    1.1  Offline Packet Classification
    1.2  Online Packet Classification
    1.3  Strategy Overview
    1.4  Organization
Chapter 2  Review of Current Topics
    2.1  Exhaustive Search
    2.2  Decision Tree
    2.3  Decomposition
    2.4  Tuple Space
Chapter 3  Terminology
Chapter 4  Diplomat
    4.1  ACL Compressor
    4.2  Diplomat Overview
    4.3  Subproblem Solver
        4.3.1  Firewall Compressor
        4.3.2  Suri's Algorithm
    4.4  Resolvers
        4.4.1  One-Dimensional Range Resolver
        4.4.2  One-Dimensional Prefix Resolver
        4.4.3  Higher-Dimensional Resolver
    4.5  Schedulers
        4.5.1  Range Scheduler
        4.5.2  Prefix Scheduler
    4.6  Results
        4.6.1  Methodology
        4.6.2  Compression Results
        4.6.3  Efficiency
Chapter 5  ByteCuts
    5.1  Related Work
        5.1.1  HiCuts
        5.1.2  HyperCuts
        5.1.3  EffiCuts
        5.1.4  HyperSplit
        5.1.5  SmartSplit
    5.2  Methodology
        5.2.1  Bit Cutting
        5.2.2  Tree Selection
        5.2.3  Cut Selection
        5.2.4  Splitting
        5.2.5  Node Design
        5.2.6  Packet Classification
    5.3  Theoretical Results
    5.4  Experimental Results
        5.4.1  Experimental Setup
        5.4.2  Classification Time
        5.4.3  Number of Trees
        5.4.4  Tree Height
        5.4.5  Construction Time
        5.4.6  Memory Usage
        5.4.7  Summary of Trade-offs
Chapter 6  PartitionSort
    6.1  Related Work
        6.1.1  Independent Set
    6.2  PartitionSort Overview
    6.3  Rule Sortability
        6.3.1  Field Order Comparison Function
        6.3.2  Usefulness for Packet Classification
    6.4  Multi-dimensional Interval Tree (MITree)
    6.5  Offline Sortable Rule List Partitioning
    6.6  PartitionSort: Putting it all together
        6.6.1  Rule Update (Online Sortable Partitioning)
        6.6.2  Packet Classification
    6.7  Experimental Results
        6.7.1  Experimental Methods
        6.7.2  PartitionSort versus TSS
            6.7.2.1  Classification Time
            6.7.2.2  Update Time
            6.7.2.3  Construction Time
            6.7.2.4  Memory Usage
        6.7.3  Comparison with SmartSplit
            6.7.3.1  Classification Time
            6.7.3.2  Construction Time
            6.7.3.3  Memory Usage
        6.7.4  Comparison with SAX-PAC
        6.7.5  Comparison of Partition Algorithms
Chapter 7  TupleMerge
    7.1  Related Work
        7.1.1  Tuple Space Search
    7.2  Online Algorithms
        7.2.1  Key Differences
        7.2.2  Packet Classification
        7.2.3  Tuple Selection and Rule Insertion
        7.2.4  Rule Deletion
    7.3  Offline Algorithms
        7.3.1  Table Selection
        7.3.2  Table Consolidation
    7.4  Experimental Results
        7.4.1  Experimental Setup
        7.4.2  Comparison with PartitionSort
            7.4.2.1  Online Comparison
            7.4.2.2  Offline Comparison
        7.4.3  Comparison with Tuple Space Search
        7.4.4  Comparison with SmartSplit
        7.4.5  Comparison with SAX-PAC
        7.4.6  TupleMerge Collision Limit
Chapter 8  Conclusion
BIBLIOGRAPHY

LIST OF TABLES

Table 4.1: Size of Largest Dimension
Table 5.1: A sample rule list
Table 5.2: Both (Source, 3) and (Dest, 3) create trees with a maximum collision limit of 2 and omit 2 rules.
Table 5.3: ByteCuts uses the two highlighted bits from the Source field as a key to cut the first tree. The second tree does not need cutting since it contains only two rules.
Table 6.1: GrInd (G), Online (O), and TSS (T) average number of partitions (tuples) and percentage of rules in five partitions with highest priority rules
Table 7.1: Example 2D rule list
Table 7.2: TSS builds 5 tables
Table 7.3: TupleMerge builds 3 tables
Table 7.4: An alternate TupleMerge scheme builds 2 tables

LIST OF FIGURES

Figure 3.1: An effective grid denoted by the dashed lines
Figure 4.1: Merging planes P1 and P2
Figure 4.2: Mean improvement ratios
Figure 4.3: Percentages of classifiers on which Diplomat outperforms previous methods
Figure 4.4: Compression ratios on range classifiers
Figure 4.5: Compression ratios on prefix classifiers
Figure 4.6: Compression ratios on mixed classifiers
Figure 5.1: A graphical representation of the rules from Table 5.1.
Figure 5.2: The tree HyperCuts builds for Table 5.1.
Figure 5.3: The tree HyperSplit builds for Table 5.1.
Figure 5.4: The tree SmartSplit builds for Table 5.1.
Figure 5.5: 8 bits are extracted from this packet to form a key.
Figure 5.6: Rule ra is replicated twice and rb is replicated three times.
Figure 5.7: A graphical representation of the trees produced in Table 5.3.
Figure 5.8: A depiction of the memory layout for each of the three types of nodes. Each includes a 4 byte header and a 4 byte pointer.
Figure 5.9: Classification Time
Figure 5.10: Classification Time on 64k rules
Figure 5.11: Classification Time on real-life rules
Figure 5.12: Number of trees
Figure 5.13: Number of trees across seeds
Figure 5.14: Mean Tree Height
Figure 5.15: Mean Memory Accesses
Figure 5.16: Mean Construction Time
Figure 5.17: Mean Memory Usage
Figure 6.1: PartitionSort is uniquely competitive with prior art on both classification time and update time. Its classification time is almost as fast as SmartSplit's and its update time is similar to Tuple Space Search's.
Figure 6.2: Example of ordered rule list with ≤_xy
Figure 6.3: A sortable rule list with 7 rules and 3 fields, field order f⃗ = XYZ, and the corresponding simple binary search tree where each node contains d fields.
Figure 6.4: We depict the uncompressed (left) and compressed (right) MITree that correspond to the sortable rule list from Figure 6.3. Regular left and right child pointers are black and do not cross field boundaries. Next field pointers are red and cross field boundaries. In the compressed MITree, nodes with weight 1 are labeled by the one matching rule.
Figure 6.5: Example execution of GrInd.
Figure 6.6: Classification Time vs Tuple Space Search
Figure 6.7: Partitions/Tuples Queried. A CDF distribution of the number of partitions required to classify packets with priority optimization (cf. Section 6.6.2).
Figure 6.8: Classification Time on 64K Rules
Figure 6.9: Update Time vs Tuple Space Search
Figure 6.10: Classification Time vs SmartSplit
Figure 6.11: Construction Time in logarithmic scale.
Figure 6.12: Memory Usage vs SmartSplit
Figure 6.13: Number of Partitions vs SAX-PAC
Figure 6.14: Number of Partitions for Offline and Online PartitionSort
Figure 6.15: Classification Time for Offline and Online PartitionSort
Figure 7.1: Average # of partitions required for each rule list size
Figure 7.2: Relative Update Time between PS and TupleMerge
Figure 7.3: Absolute Update Time on 256k rules
Figure 7.4: Relative Online Classification Time between PS and TupleMerge
Figure 7.5: Absolute Online Classification Time on 256k rules
Figure 7.6: Relative Offline Classification Time between PS and TM
Figure 7.7: Absolute Offline Classification Time on 256k rules
Figure 7.8: Relative classification time between SS and TupleMerge
Figure 7.9: Partitions produced
Figure 7.10: Smaller collision limit c
Figure 7.11: Larger collision limit c

KEY TO SYMBOLS

c      An action or color
x, y   Packets
r      A rule
s      An interval
p      A prefix interval
w      Prefix length
d      Number of dimensions
n      Number of rules
n_i    Number of rules in set ℓ_i
m_i    Length of field i
C      Packet domain
K      Set of actions
P : C → K   A pattern
P(x)   The color of packet x
L      A rule list
ℓ      A partial rule list

Chapter 1

Introduction

Routers, firewalls, and other devices perform a variety of network services, such as packet filtering, network address translation (NAT), load balancing, and traffic monitoring. These devices receive a stream of packets from the network and must decide how each packet should be handled; they use a packet classifier to make this determination. While different types of services have different requirements, in general the classifier examines a d-tuple of fields from the packet header to make its decision.

Network devices have real-time constraints. They continuously receive packets to which they must respond promptly; a delay in processing one packet can result in delays in subsequent packets. Additionally, queue sizes are limited. If processing takes too long, the queue will fill up, resulting in dropped packets. This contributes significantly to network congestion and should be avoided. Fast packet classification thus reduces the number of queued packets and reduces congestion.

1.1 Offline Packet Classification

In traditional offline packet classification, each classifier is defined by a list of n rules that each map a set of packet headers to some decision. Usually each rule uses some subset of the following 5 fields: source address, destination address, source port, destination port, and protocol. Since there is no guarantee that the sets are disjoint, if multiple rules match a packet, the first matching rule takes precedence.
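These first-match semantics can be sketched in a few lines. The encoding of a rule as a tuple of inclusive per-field intervals plus an action is our own illustration, not notation from this thesis:

```python
def first_match(rules, packet):
    """Linear scan honoring first-match semantics.

    Each rule is (field_intervals, action); each field interval is an
    inclusive (lo, hi) range, and a packet is a tuple of field values.
    """
    for intervals, action in rules:
        if all(lo <= v <= hi for (lo, hi), v in zip(intervals, packet)):
            return action
    return None  # no rule matched (real classifiers usually end with a default rule)
```

With a specific rule listed before a catch-all, a packet matching both is handled by the earlier, more specific rule.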
Once a rule list is created, it is relatively static, with infrequent changes. Thus, it is sufficient to create a single static structure that searches the rule list as quickly as possible. The fastest methods can classify a packet in O(log n) time, but they require O(n^d) memory [28]. Unfortunately, memory is not an infinite resource, and this requirement is too large for all but the smallest classifiers. Other methods require only O(n) memory but have O(log^d n) search times. As such, packet classification involves trade-offs between time and memory.

From the above, the Offline Packet Classification problem is as follows: given a rule list L and a set of memory and other resource constraints, construct a data structure that minimizes the expected time to find the first matching rule for each packet in a stream.

1.2 Online Packet Classification

OpenFlow and other software-defined networks add multiple difficulties to traditional packet classification [17]. First, the rule list is often updated by a controller, sometimes multiple times a second. The classifier can thus no longer be a static entity; it must support fast updates. Furthermore, additional packet fields are defined for network behavior, and the controller can define other arbitrary fields. Thus, the classifier must be able to scale to larger numbers of fields.

The Online Packet Classification problem thus features revised requirements from the offline problem: given a (many-dimensional) rule list L, a time allowance τ, and a set of memory and other resource constraints, construct a data structure that can be updated in under τ time and that minimizes the expected time to find the first matching rule for each packet in a stream.

1.3 Strategy Overview

Packet classification is a difficult problem. In this thesis, we present four methods that use a strategy we call smart division. In each method, the problem is divided into multiple simpler problems.
We then create a solution for each subproblem, and these solutions are recombined into a solution for the original problem.

We use two types of smart division. In dimensional reduction, a higher-dimensional problem is reduced to a series of lower-dimensional problems. This process is repeated until we reach one-dimensional problems for which good solutions exist; the subsolutions are then recombined to form a solution to the original problem. In rule list separation, the original rule list is divided into a series of sublists, each of which has certain desirable properties. We then create a classifier for each sublist, and these classifiers are combined to form a classifier for the original rule list.

1.4 Organization

The rest of this thesis is organized as follows. In Chapter 2, we review a variety of prior work in this field, grouped according to the taxonomy used by Taylor in [28]. In Chapter 3, we present definitions and terminology common to all of the following chapters. In Chapter 4, we describe Diplomat, a dimensional reduction method for compressing rule lists into smaller lists. In Chapter 5, we describe ByteCuts, a rule list separation method that creates multiple decision trees and focuses on fast search times. In Chapter 6, we describe PartitionSort, a rule list separation method that supports fast updates by imposing a partial ordering over the rules and creating multiple search trees. In Chapter 7, we describe TupleMerge, a rule list separation method that supports fast updates by separating the rules into multiple hash tables. Finally, in Chapter 8, we offer some concluding remarks and identify some open problems.

Chapter 2

Review of Current Topics

This chapter covers currently existing methods. It is divided into four sections according to the Taylor taxonomy [28].

2.1 Exhaustive Search

This category covers both sequential search and TCAMs. In a sequential search, each rule is checked individually until a matching rule is found.
Since it requires O(n) time, sequential search becomes very expensive as the number of rules increases. TCAMs are special hardware chips that allow the entire list to be searched in parallel. However, their memory capacity has not scaled at the rate predicted by Moore's Law [14]. Additionally, they tend to be expensive and consume a large amount of power. As such, the number of rules that can be stored in TCAMs is very limited. Since both sequential search and TCAMs benefit significantly from smaller rule lists, research in this area tends to focus on ways to reduce the size of the rule list.

The Iterated Strip Rule Solver [1] identifies a class of two-dimensional patterns, called strip-rule patterns, which can be solved by coloring an entire row or column at a time. On this type of pattern, it finds a solution that is guaranteed to be within a factor of 2 of optimal. Other patterns are divided into a series of strip-rule patterns, and the authors show that the number of such sub-patterns is not too large. This allows them to achieve an O(min(n^{1/3}, Opt^{1/2}))-approximation ratio on range-ACLs and an O(w^2 · min(n^{1/3}, Opt^{1/2}))-approximation ratio on prefix-ACLs.

Suri et al. created an optimal one-dimensional prefix compressor [27]. Their method starts with the base prefixes and determines which colors work best for their parent prefixes. For a monochromatic prefix, the best color is the given color. Otherwise, if the two child prefixes share any best colors, the parent takes those shared colors as its best colors; the children can put off using such a color by letting the parent take care of it. Otherwise, the parent's best set is the union of the children's best sets; one child or the other will need a new rule, and the other lets the parent take care of it. Once the root prefix is reached, one of its best colors is chosen arbitrarily. The tree is then descended, placing a rule at any child whose best color set does not contain the color used for its parent.
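The bottom-up/top-down idea can be sketched as follows. This is our own reconstruction for a single field of width w, not the authors' code; names are illustrative, and best sets are recomputed rather than memoized for brevity:

```python
def compress(colors):
    """colors: list of 2**w leaf colors over the full domain. Returns a
    compressed prefix rule list, longest prefixes first so that
    first-match semantics apply."""
    w = (len(colors) - 1).bit_length()
    assert len(colors) == 2 ** w

    def best(lo, hi):
        # Best color set for the prefix covering colors[lo:hi]:
        # intersection of the children's sets if nonempty, else their union.
        if hi - lo == 1:
            return {colors[lo]}
        mid = (lo + hi) // 2
        left, right = best(lo, mid), best(mid, hi)
        common = left & right
        return common if common else left | right

    rules = []

    def assign(lo, hi, prefix, inherited):
        # Top-down pass: emit a rule only where the inherited color
        # is not among this prefix's best colors.
        b = best(lo, hi)
        if inherited not in b:
            inherited = min(b)  # any best color works; min() is deterministic
            rules.append((prefix, inherited))
        if hi - lo > 1:
            mid = (lo + hi) // 2
            assign(lo, mid, prefix + "0", inherited)
            assign(mid, hi, prefix + "1", inherited)

    assign(0, len(colors), "", None)
    # More specific (longer) prefixes must precede shorter ones.
    rules.sort(key=lambda r: len(r[0]), reverse=True)
    return rules
```

For leaf colors A, A, B, A the sketch emits just two rules, a specific rule for the lone B leaf plus a catch-all for A, rather than one rule per leaf.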
A generalized version that keeps track of the actual color costs allows for weighted colors.

Firewall Compressor [15] and TCAM Razor [14] observe that the one-dimensional problem can be solved in polynomial time for either range or prefix classifiers. They define a decision tree for the classifier in which each node represents a field, each edge a value that the field can take, and each leaf a rule list for a partial solution. They work up the tree, solving a one-dimensional problem at the lowest nodes and replacing each with the resulting rule list. TCAM Razor uses Suri's prefix solver from [27], while Firewall Compressor puts forth its own range solver. The two were later combined into ACL Compressor, which can handle mixed prefix-range ACLs [16]. The primary limitation of this approach is that similar, but not identical, sub-solutions must each be paid for in full, even if only a single point differentiates the two.

Topological Transform presents two transformation methods: Domain Compression and Prefix Alignment [18]. Domain Compression re-encodes a rule by mapping equivalent values to the same number. This reduces the number of bits required to encode a rule since there are fewer distinct values. Prefix Alignment manipulates the encoded values further so that the ranges actually used by the rules tend to match up with prefixes. This reduces the cost of converting range rules to prefix rules.

2.2 Decision Tree

Decision tree methods partition the rules into multiple subsets without having to search all of the rules. This partitioning is repeated on each part until the number of rules becomes small enough; most methods then fall back on a sequential search over the few remaining rules. Most of these methods operate by cutting the domain into multiple regions. Since region boundaries may not match rule boundaries, some rules may have to be replicated into multiple parts, which increases memory usage.
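The region cutting and rule replication just described can be sketched as one level of equal-width cutting; the encoding of a rule as a dict of inclusive per-field intervals is our own illustration:

```python
def cut_node(rules, field, num_cuts, domain):
    """One level of equal-width cutting: divide `domain` (an inclusive
    (lo, hi) range of `field`) into num_cuts equal-width children.
    A rule is copied into every child it overlaps, which is the source
    of rule replication."""
    lo, hi = domain
    width = (hi - lo + 1) // num_cuts
    children = []
    for i in range(num_cuts):
        c_lo, c_hi = lo + i * width, lo + (i + 1) * width - 1
        children.append([r for r in rules
                         if r[field][0] <= c_hi and r[field][1] >= c_lo])
    return children
```

A rule spanning the whole field, such as one with interval (0, 15) over a 4-bit domain, lands in every child, which is exactly the replication problem described above.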
HiCuts selects a field and divides the domain of that field into multiple equal-sized regions [5]. Each rule is placed in every part that overlaps its domain. In the worst case, a rule may ignore the field being cut; such a rule must be replicated into every part. As such, HiCuts tends to have high memory usage. Once the number of rules in a part falls below a predetermined threshold, further partitioning stops and the remaining rules are searched sequentially.

HyperCuts reduces the tree height by allowing multiple fields to be cut at once [22]. This allows for faster searching since HyperCuts is essentially doing two cuts at once. However, this does not reduce the amount of rule replication, so HyperCuts also tends to have large memory requirements.

EffiCuts reduces the amount of rule replication by partitioning rules into sets that work better together [31]. The rules are classified as being either specified or not specified in each field. Rules that are specified in the same fields end up in the same part, and each part is then made into its own tree. This dramatically reduces rule replication since no rule ever has to be copied into every child region (as some rules do with HiCuts and HyperCuts). However, the multiple trees tend to have a much larger total cost than HyperCuts; each tree is smaller, but their combined cost is greater. EffiCuts selectively merges some of the parts back together to reduce this problem without reintroducing too much replication.

Rather than making many cuts, HyperSplit takes an alternate approach by making a single cut at a time that partitions the rules as effectively as possible [20]. While HiCuts and its successors divide the domain into equal-sized partitions, HyperSplit allows the two parts to be very different in size as long as they contain similar numbers of rules. As such, HyperSplit tends to have lower memory requirements than HyperCuts (but not as low as EffiCuts) at a relatively minor cost in tree height.
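HyperSplit's balanced single cut can be sketched in the same style. This is our own simplification: we pick the interval endpoint that best balances unweighted rule counts, whereas the actual heuristic in [20] is weighted:

```python
def balanced_split(rules, field):
    """Choose a single split point on `field` that best balances the
    number of rules on each side. Each side keeps every rule that
    overlaps it, so the two halves may differ greatly in width."""
    candidates = sorted({r[field][0] for r in rules} |
                        {r[field][1] for r in rules})
    best = None
    for point in candidates:
        left = [r for r in rules if r[field][0] <= point]
        right = [r for r in rules if r[field][1] > point]
        balance = abs(len(left) - len(right))
        if best is None or balance < best[0]:
            best = (balance, point, left, right)
    _, point, left, right = best
    return point, left, right
```

Unlike the equal-width cuts of HiCuts, the chosen split point follows the distribution of the rules, so one half may cover a tiny slice of the domain and the other nearly all of it.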
SmartSplit [7] attempts to combine HyperCuts and HyperSplit. The authors propose several heuristics for estimating the search time and memory usage of both HyperCuts and HyperSplit, which lets them predict which method will produce a more efficient tree. Additionally, if they determine that a single tree will require too much memory, they can divide the rules into separate trees, as EffiCuts does. In SmartSplit, the rules are divided into four sets (unlike the 32 sets that EffiCuts uses) based on the specificity of the address fields. The set that is specified on both addresses is then recombined with the larger of the two single-address sets. This removes the primary source of rule replication (since all of the rules in the two larger sets have at least one field in common) without creating the extraneous trees that EffiCuts does.

In [26], the authors adopt a very different strategy. For a given field, they find the largest number of rules that are non-intersecting in that field, which allows those rules to be ordered by sorting them on that field. They repeat this process until all of the rules have been grouped. They then binary search each group until a matching interval is found; afterwards, only a small number of rules must be searched to find a match. The chief downsides are that the search time is proportional to the number of independent sets required and that certain large rules cannot be placed into the same set as any other rule. We present work in Chapter 6 that improves on this method by significantly reducing the number of independent sets required.

In Grid-of-Tries [25], the authors consider using a multi-tiered set of tries where one field (such as source address) is encoded in a trie which stores pointers to another trie encoding a second field (such as destination address).
To prevent rule replication, each rule is stored in only one of the tries, and a switch pointer is used to jump from the trie associated with a more specific address to that of a less specific one without having to start back at the top of the second trie. This allows for O(w) search times instead of O(w^2).

In SAX-PAC [9], the authors propose dividing the rule list into separate lists that are order-independent. If a rule list is order-independent, then adding extra fields does not affect the overall search times, as they can be ignored without affecting the ability to discriminate between rules. The authors invert this, showing how to remove fields while maintaining order-independence. They observe that for some existing solutions, such as Grid-of-Tries, an order-independent set can be searched in O(log^{d−1} n) time while using only linear memory. Thus, if the order-independent set can be reduced to only 2 dimensions, O(log n) search times can be achieved. To help keep the number of tables down, they recommend using a TCAM to handle the last 5-10% of rules. Because this is only a small fraction of the total rules, it should be small enough to fit in the limited TCAM space.

2.3 Decomposition

There are many ways to efficiently search a single field, such as binary range search and longest prefix match. Methods in this category leverage this by breaking the problem into a series of one-dimensional problems and then aggregating the results. The first stage can be done in parallel, which makes these methods generally faster. However, the challenge lies in the second stage, where the results are aggregated together. This stage is usually memory-inefficient, limiting the capacity of these methods.

In Parallel Bit-Vector (PBV) [11], the authors divide each field into intervals in the same manner as the effective grid discussed in Chapter 3. They then determine which rules intersect with each interval.
They create a bit-vector for each interval where each bit corresponds to one of the rules. After finding a bit-vector for each field, they take the bitwise AND of all of the vectors; any bit set in the result corresponds to a matching rule. The most significant set bit will correspond to the first matching rule. The chief downside is that there can be O(n) vectors that are each O(n) bits long, giving a rather unfavorable O(n^2) memory requirement.

In Aggregate Bit-Vector (ABV) [2], the authors note that most rules do not intersect with most of the intervals. They group the rules into blocks and represent each block as a bit in a second bit-vector. When searching, they combine the aggregate vector first and then only combine and check the blocks corresponding to bits in the combined aggregate vector. Since they don't have to allocate memory for empty blocks, this reduces overall memory usage (although not asymptotically).

The authors of Cross-Producting [25] note that for most fields, the number of unique values is significantly less than the total number of rules. They construct a list of best-matching values for each field. They then construct a table by taking the cross product of these lists, mapping each tuple to the corresponding result. They store the result in a hash table. This method can provide very high throughput, but the size of the hash table can be up to O(n^d), which is generally too big. They introduce a variant called On-Demand Cross-Producting that limits the table size and treats it more like a cache, but then performance is largely dependent on the behavior of the packet stream.

2.4 Tuple Space

In this category, rules are grouped together by the number of specified bits in each field, which is called a tuple. In the Tuple Space Search classifier [24], all rules with the same characteristic tuple are placed into a hash table. During a search, they combine the tuple with a packet to form a hash key, allowing the table to be probed.
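The basic tuple-space probe can be sketched as follows. This is a minimal sketch rather than the implementation from [24]: it assumes just two 32-bit prefix fields, represents each rule as a (source length, destination length, source base, destination base, priority) tuple, and keeps only the highest-priority rule per hash key.

```python
# Minimal Tuple Space Search sketch: one hash table per characteristic
# tuple (here, a pair of prefix lengths); a lookup probes every table.

class TupleSpaceSearch:
    def __init__(self):
        self.tables = {}  # (slen, dlen) -> {masked key -> rule}

    def _key(self, slen, dlen, src, dst):
        # Keep only the specified bits of each 32-bit address.
        return (src >> (32 - slen) if slen else 0,
                dst >> (32 - dlen) if dlen else 0)

    def insert(self, rule):
        slen, dlen, src, dst, prio = rule
        table = self.tables.setdefault((slen, dlen), {})
        key = self._key(slen, dlen, src, dst)
        if key not in table or table[key][4] < prio:
            table[key] = rule

    def classify(self, src, dst):
        best = None
        # Probe one hash table per characteristic tuple.
        for (slen, dlen), table in self.tables.items():
            rule = table.get(self._key(slen, dlen, src, dst))
            if rule and (best is None or rule[4] > best[4]):
                best = rule
        return best
```

Note that an insert or delete touches exactly one hash table, which is the constant-time update property discussed next.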
Since each rule can only be placed in one table, and hash tables have amortized constant-time updates, this method has recently had a resurgence with the Open vSwitch OpenFlow framework where rule updates happen frequently [19]. Nonetheless, Tuple Space Search is hampered by the number of tables required; each unique characteristic tuple requires its own separate table, so potentially w^d tables could be required.

Several solutions to this problem have been proposed, each with its own drawbacks. One solution is to only use certain fields (usually source and destination addresses), which limits the exponential factor [24]. Some rules may be identical on the included fields but different on the excluded fields. These rules will collide and necessitate a linear search to determine the true match. In some cases, a poor field choice may exclude a particularly discriminatory field, causing a prohibitively large search to be required.

An alternate solution is to use one or two one-dimensional classifiers (similar to the decomposition classifiers) as a pre-classifier to prune out tables that do not contain any matching rules [24]. These pre-classifiers usually are more difficult to update and maintain, increasing update times. This unfortunately removes the primary advantage that Tuple Space Search has over other classifiers.

Chapter 3

Terminology

Here we present definitions and terminology used in common for all of the remaining sections. K is the set of all decisions such as {Accept, Deny}, {Black, White}, or {1, 2, 3, 4}. A color c ∈ K is any one of those decisions. An interval s = [l, u] is the range of integers between l and u, inclusive. A prefix s = {0, 1}^k {∗}^(w−k) = p{∗}^(w−k) is a special type of interval that covers the range [p · 2^(w−k), p · 2^(w−k) + 2^(w−k) − 1]. More idiomatically, ∗^w = [0, 2^w − 1], and replacing the first ∗ of s with a 0 reduces the range to the lower half and with a 1 to the upper half.
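The correspondence between prefixes and intervals can be sketched as follows; prefix_to_range is a hypothetical helper name, with p the k specified bits of a w-bit prefix.

```python
def prefix_to_range(p, k, w):
    # Prefix p{*}^(w-k) covers [p * 2^(w-k), p * 2^(w-k) + 2^(w-k) - 1].
    span = 1 << (w - k)
    return (p * span, p * span + span - 1)
```

For example, with w = 3 the prefix 1** covers [4, 7]; fixing the first * to 0 or 1 splits it into the lower half [4, 5] and the upper half [6, 7].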
The domain C is the Cartesian product of d intervals (normally [0, 2^(w_i) − 1] in dimension i). A box B is a subset of C that is also the Cartesian product of d intervals. A pattern is a function P : C → K. P thus assigns a color to every point in C. For any point x ∈ C, P(x) is the color of that point. Two patterns P_1 and P_2 are equivalent if P_1(x) = P_2(x) ∀ x ∈ C.

A rule r = (B, c) assigns B ⊆ C the color c ∈ K. A rule list L = [r_1, r_2, r_3, ..., r_n] is an ordered list of n rules. A rule list defines a pattern with L(x) = c_i if x ∈ B_i and x ∉ B_j ∀ j < i. More simply, the color L(x) is the color of the first rule that matches x. We abuse notation slightly by using L to refer to both the rule list and the pattern it generates. Two rule lists are equivalent if they produce equivalent patterns.

A box B is monochromatic if ∃ c such that P(x) = c ∀ x ∈ B. We can partition C into monochromatic boxes by using the concept of an effective grid from [1].

Figure 3.1: An effective grid denoted by the dashed lines

The effective grid is the set of boundary lines that separate two differently colored regions in P. It has sizes m_1 × m_2 × · · · × m_d with each m_i ≤ 2n − 1. Each cell in the effective grid will be a monochromatic box. We can use this property to create a transformed problem where each point in the transformed problem represents an entire effective grid cell in the original problem. This reduces the domain space that must be considered.

Chapter 4

Diplomat

Generally speaking, if two rule lists are equivalent, the shorter rule list is preferable to the longer one. When doing a basic sequential search, it generally takes less time to search a smaller list than a longer one. Additionally, some devices use TCAMs to search entire lists in parallel. Since TCAMs are expensive in both upfront costs and power, a rule list that can fit into a smaller TCAM can achieve significant savings.
Thus, given a rule list L, the objective is to find the smallest possible equivalent rule list L′; that is, to find the rule list with the fewest rules that has the same decision for every possible input packet. This was shown to be NP-hard in [1] for d ≥ 2 for the range version. The prefix version is also believed to be hard. The one-dimensional versions of both problems are in P.

In this chapter, we review the prior ACL Compressor [16]. We then review our Diplomat work, which corrects a key limitation of ACL Compressor. This chapter is based on our previously published work [4].

4.1 ACL Compressor

ACL Compressor [16] is a combination of Firewall Compressor [15] and TCAM Razor [14]. Firewall Compressor compresses range rule lists while TCAM Razor operates on prefix rule lists. However, the basic mechanism behind the two is the same, allowing ACL Compressor to operate on mixed rule lists.

ACL Compressor begins by creating a Firewall Decision Diagram [13] for the rule list. In an FDD, each of the leaf nodes represents a possible decision, such as Accept or Deny. Each of the branch nodes represents a field, such as source IP address or protocol. Edges between nodes represent ranges of values that a given field can take. Each of the lowest branch nodes represents a one-dimensional problem. They solve this with the Firewall Compressor 1-D solver if it is a range field or Suri's 1-D solver [27] if it is a prefix field, and replace the branch node with the resulting rule list. Both methods are covered in Section 4.3. Higher branches will eventually find that their children have been replaced with rule lists. They then become weighted one-dimensional problems with each rule list being a color with weight equal to the number of rules in the list. Fortunately, the weighted problem is still in P and can be solved with modified versions of the above algorithms.

The downside to this method is that it only knows if two rule lists are the same or not.
If two subproblems are similar, say because their patterns differ at only a single point, it must still pay for both rule lists in their entirety. ACL Compressor runs redundancy removal [12] to remove much of this, but not all of it can be removed this way.

4.2 Diplomat Overview

Each of the lower nodes in the FDD represents a lower-dimensional subproblem. The chief handicap of ACL Compressor is that it considers each of these subproblems to be independent. However, neighboring subproblems usually are similar to one another. Our method, Diplomat, overcomes this limitation by considering neighboring subproblems together.

Instead of solving each subproblem separately, Diplomat compares neighboring subproblems and determines where the patterns are different. Diplomat then defines those differences using the minimum number of rules. This normally requires fewer rules than completely defining the entire subproblem. Afterwards, the two subproblems can be considered to be identical and merged together, as any differences are defined by the previously placed rules. Diplomat repeats this process until only one subproblem remains. It then solves this subproblem fully.

Diplomat requires three parts. The first part is how to solve the final subproblem. The second is how to merge two subproblems together by resolving the differences between them. The final part is how to determine in which order to merge the subproblems.

4.3 Subproblem Solver

Normally, we solve a subproblem by making a recursive call to Diplomat. This subproblem has one fewer field than the original problem and so eventually will be reduced down to a single field. This one-dimensional problem is then handled differently. If it is a range field, we use the one-dimensional solver from Firewall Compressor [15]. If it is a prefix field, we use the one by Suri et al. [27]. We describe both methods here.
4.3.1 Firewall Compressor

Given some input classifier L, compute the effective grid of L and label the cells from 0 to m − 1. The first key insight is that the last rule r_n covers the leftmost cell 0 in some optimal solution. Liu et al. use dynamic programming to determine which other cells are also covered by rule r_n. If no other cells are covered, then the optimal solution consists of rule r_n, which only covers cell 0, plus an optimal solution that covers region [1, m − 1]. Otherwise, let x be the nearest cell that is also covered by r_n. The optimal solution is then one that covers region [1, x − 1] and one that covers region [x, m − 1] where the last rule covers cell x (and cell 0). We use this method for constructing our range resolvers.

4.3.2 Suri's Algorithm

Suri et al.'s method works by dividing the range into regions designated by prefixes. Each region has a designated background color. They then proceed to merge regions together, sharing background colors between neighboring regions at reduced cost when possible. More formally, they begin by dividing P into the largest prefix segments that are monochromatic. For each prefix-color pair, compute Cost(p, c), where c is the background color applied to the given prefix. For the base case of one of the monochromatic prefixes, Cost(p, c) = 0 if c is the color of the segment and 1 otherwise. For other prefixes, Cost(p, c) = min_(c_0, c_1) (Cost(p0, c_0) + Cost(p1, c_1) + [c ≠ c_0] + [c ≠ c_1]). The final cost is Cost(p) = min_c Cost(p, c) + 1 (the extra rule fills in the background color). We utilize Suri's algorithm for constructing our prefix resolvers.

4.4 Resolvers

The second part of Diplomat is the resolver, which is capable of finding and settling the differences between two patterns. The resolver is given two subproblems, P_t and P_b, over the same fields and returns a rule list ℓ. In the one-dimensional case, we refer to P_t and P_b as rows. Otherwise, we call them planes.
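The prefix recurrence from Section 4.3.2 can be sketched concretely as follows. This is a minimal sketch under simplifying assumptions: the pattern is given as a list of 2^w cell colors, and only the rule count, not the rule list itself, is returned.

```python
# A sketch of Suri's 1-D prefix DP (Section 4.3.2). cost(lo, hi, bg) plays
# the role of Cost(p, c): rules needed inside prefix [lo, hi) given
# inherited background color bg.
from functools import lru_cache

def suri_cost(colors):
    n = len(colors)            # must be a power of two (the 2^w field cells)
    palette = sorted(set(colors))

    @lru_cache(maxsize=None)
    def cost(lo, hi, bg):
        seg = colors[lo:hi]
        if seg.count(seg[0]) == hi - lo:    # monochromatic prefix: base case
            return 0 if seg[0] == bg else 1
        mid = (lo + hi) // 2                # split into prefixes p0 and p1
        return min(cost(lo, mid, c0) + (bg != c0) +
                   cost(mid, hi, c1) + (bg != c1)
                   for c0 in palette for c1 in palette)

    # The +1 pays for the final background rule.
    return min(cost(0, n, c) for c in palette) + 1
```

For instance, the pattern AABB costs two rules (background A plus one B rule), while ABAB costs three.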
The resolvers presented here assume that the resulting rules must come from P_t, allowing P_b to be used as the merged subproblem. With this restriction, for any point x ∈ C, either P_t(x) = ℓ(x), or P_t(x) = P_b(x) and ℓ(x) is undefined. In the original paper [4], we also presented alternate versions that allowed rules to come from either subproblem. However, this restriction allows the use of the dynamic programming scheduler in Section 4.5.

Figure 4.1: Merging planes P_1 and P_2.

4.4.1 One-Dimensional Range Resolver

Here, we present an optimal way to merge P_t and P_b between columns 0 and k. Let ℓ_k denote the partial rule list returned by this optimal solution and let C_k = |ℓ_k|. If P_t(k) = P_b(k), then ℓ_k = ℓ_(k−1) and C_k = C_(k−1), as there are no differences in column k. Otherwise, there is a difference between P_t and P_b in position k. This means ℓ_k must include at least one rule containing k of color P_t(k). Call this rule r_m. Similar to Firewall Compressor, we can assume that r_m is the last rule in ℓ_k. Let B_m = [s, k] for some point s in [0, k]. Since ℓ_k is defined over all of [s, k], ℓ(x) must equal P_t(x) for all x in [s, k]. The cheapest way to do that is by using Firewall Compressor. We also need to resolve the range [0, s − 1]. This can be done with another application of this method. This allows us to set up a dynamic program. For each k,

C_k = min_(0≤s≤k) (C_(s−1) + |FC(P_t[s, k])|)

and ℓ_k = ℓ_(s−1) · FC(P_t[s, k]) for the minimum s. As a base case, C_(−1) = 0 and ℓ_(−1) = ∅.

4.4.2 One-Dimensional Prefix Resolver

Here, we present an optimal way to merge P_t and P_b over the range of some prefix p_r using only prefix rules. We begin by observing that Suri(P_t) is a resolver. Let p denote the smallest prefix (and thus the one specifying the largest number of points) used in an optimal resolver. The restrictions on prefix rules mean that either p = p_r, p ⊆ p_r0, or p ⊆ p_r1. If p = p_r, then the optimal resolver is Suri(P_t).
Otherwise, it is the combination of the optimal resolvers for p_r0 and p_r1. Let ℓ_p be the optimal resolver for P_t and P_b over the range denoted by p. Then, ℓ_(p_r) = min(ℓ_(p_r0) · ℓ_(p_r1), Suri(P_t)). Our base case for this recurrence relation occurs when P_t(p_r) is monochromatic, meaning the optimal resolver has 0 rules if P_t(p_r) = P_b(p_r) and 1 rule otherwise.

4.4.3 Higher-Dimensional Resolver

If d ≥ 3, we need to merge two patterns P_t and P_b that are (d − 1)-dimensional hyperplanes. This is more complicated than merging two rows. In particular, we have an optimal algorithm for compressing 1-dimensional patterns, but compressing 2-dimensional patterns is NP-hard as shown in [1]. This difference in complexity carries over to merging patterns. We propose two different in-plane methods for merging (d − 1)-dimensional hyperplanes when d ≥ 3. Without loss of generality, we assume that we wish to merge P_t and P_b, leaving pattern P_b. Thus, all rules in the resolver will be within P_t.

The first method is to enumerate all the differences. That is, we use the FDD comparison algorithm in [13] to find the differences between the two patterns. We then define ℓ_dif by generating rules in P_t to cover all the places where P_t differs from P_b. We may cover additional locations if this reduces the total number of rules required to cover all the differences. This resolver implicitly uses the correct type of rule (range, prefix, or mixed) given the characteristics of the fields.

In the second method, we find box, the minimum bounding box surrounding all differences. If any of the fields are prefix fields, we expand the bounding box in that dimension so that the bounding box can be defined by prefixes. We then solve P_t(box) using the appropriate version of Diplomat (range, prefix, or mixed) and call the resulting solution ℓ_mbb.

To achieve greater compression, we run both methods and use the smaller of the two resolvers. That is, our final resolver ℓ = min(ℓ_dif, ℓ_mbb).
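The one-dimensional range resolver recurrence (Section 4.4.1) can be sketched as follows. This is a minimal sketch: the rows are given as equal-length lists of cell colors, only the rule count is computed, and the Firewall Compressor subroutine |FC(.)| is replaced by a simple stand-in (the number of maximal same-color runs, an upper bound on the optimal count).

```python
def runs(cells):
    # Stand-in for |FC(.)|: count maximal same-color runs in the segment.
    return sum(1 for i in range(len(cells))
               if i == 0 or cells[i - 1] != cells[i])

def resolver_cost(pt, pb, fc_cost=runs):
    m = len(pt)
    C = [0] * (m + 1)          # C[k] = cost of resolving columns 0..k-1
    for k in range(1, m + 1):
        if pt[k - 1] == pb[k - 1]:
            C[k] = C[k - 1]    # no difference in this column
        else:
            # The last resolver rule covers [s, k-1] and is drawn from pt.
            C[k] = min(C[s] + fc_cost(pt[s:k]) for s in range(k))
    return C[m]
```

For example, resolving pt = ABA against pb = BBB costs two rules (one per differing column), while identical rows cost nothing.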
4.5 Schedulers

The final part of Diplomat is a scheduler, which determines the order in which planes are to be merged together. In general, different schedules will give different results. In this section we present two schedulers: one for range rules and one for prefix rules. Both use dynamic programming to consider all possible schedules. This requires the resulting rule list of each merge to always be one of the original planes; otherwise we would need to store all of the partial solutions. For this reason, our resolvers all draw rules from only one plane. On a mixed rule list, Diplomat picks a scheduler appropriate to the field being split on.

4.5.1 Range Scheduler

We begin by precomputing the cost of each possible merge as MergeCost[i, j] = Resolve(P_i, P_j) ∀ i ≠ j ∈ [0, m_1 − 1], where the returned pattern is P_j. This requires O(m_1^2) calls to Resolve. However, because the merged pattern is always one of the original patterns, we do not need to do any more merges. We then define the following subproblem. Given an interval [l, u] for 0 ≤ l ≤ u ≤ m_1 − 1, we find the minimum cost for merging patterns P_l, ..., P_u together while having P_r be the remaining pattern, where l ≤ r ≤ u. We use standard dynamic programming techniques to compute the minimum cost solution to these O(m_1^3) subproblems and the corresponding optimal schedule of merges.

4.5.2 Prefix Scheduler

We also present a dynamic programming method for scheduling that works with prefixes. We begin by precomputing the merge costs as for the range scheduler above. We then define the recursive subproblem. Given a prefix p and row r ∈ p, we find the minimum cost for merging the two halves, p0 and p1, while having P_r be the remaining pattern. If r ∈ p0 then Cost(p, r) = Cost(p0, r) + min_s (Cost(p1, s) + MergeCost[s, r]). The case for r ∈ p1 is similar. As a base case, Cost(p, r) = 0 if p = r.
We use standard dynamic programming techniques to compute the minimum cost solution to each of the O(w^2 m_1^2) subproblems and the corresponding schedule of merges.

4.6 Results

In this section we evaluate the effectiveness of Diplomat on real-life classifiers. Specifically, we assess how much Diplomat improves over ACL Compressor [16], the current state-of-the-art ACL compression algorithm. We test range, prefix, and mixed ACL compression. For mixed ACLs, we consider the port fields as ranges and the IP and protocol fields as prefixes.

4.6.1 Methodology

Diplomat and ACL Compressor (AC) are both sensitive to the ordering of the five packet fields (source IP address, destination IP address, source port, destination port, and protocol). We run both algorithms using each of the 5! = 120 different permutations across all of the fields and use the best of the 120 results for each classifier. We also use redundancy removal [12] as a post-processing step.

We now define the metrics for measuring the effectiveness of our Diplomat variants. Let f and g denote compressors and L denote a classifier. Let f(L) denote the resulting rule list obtained by applying compressor f to L, and |L| the number of rules in L. The compression ratio of f on L is |f(L)| / |L|, and the improvement ratio of f over g on L is (|g(L)| − |f(L)|) / |g(L)|. In our results, we focus on the mean compression ratio and mean improvement ratio given a set of classifiers S. It is desirable to have a low compression ratio and a high improvement ratio.

We assess the performance of our algorithms using a set of 40 real-life classifiers which we first preprocess by removing redundant rules. We divide this set into three smaller sets based on the number of rules after running redundancy removal. The small set contains the 17 smallest classifiers, the middle set contains the next 12 larger classifiers, and the large set contains the 11 largest classifiers.
The classifiers in the large set all have at least 600 rules after redundancy removal, with the largest having more than 7600 rules. In general, larger classifiers also have larger effective grids. Not all of the fields strictly increase as the list size goes up. In particular, the protocol field only uses a few distinct values across all classifiers, usually 6 (TCP) and 17 (UDP). As such, the smallest dimension size is at most 9 for all classifiers. The range of the largest dimension size for each set can be seen in Table 4.1.

            Min   Median    Max
  Small       5       14     37
  Medium     24       66    139
  Large      55      506   1805

Table 4.1: Size of Largest Dimension

4.6.2 Compression Results

Our results show that (i) Diplomat outperforms ACL Compressor and that (ii) the performance gap between Diplomat and ACL Compressor increases as the classifiers grow in size and complexity. Diplomat's superior performance manifests itself in two ways. First, as classifiers grow in size and complexity from small to medium to large, the percentage of classifiers where Diplomat outperforms ACL Compressor increases from almost 0% for the small classifiers to over 50% for the medium classifiers to roughly 100% for the large classifiers (Figure 4.3). Second, again as classifiers grow in size and complexity, Diplomat's average improvement ratio grows from almost 0% for the small classifiers to roughly 10% for the medium classifiers to as high as 30% for the large classifiers (Figure 4.2). It is especially encouraging that Diplomat performs best on the large classifier set, as the large classifiers are the ones that most accurately represent the complex classifiers we are likely to encounter in the future and which most need new compression techniques. More specifically, we hypothesize that for the small classifiers, there is relatively little room for optimization and both methods are finding optimal or near-optimal classifiers.
However, as the classifiers increase in size and complexity, Diplomat is able to find new compression opportunities missed by the earlier methods. More detailed compression ratio results for Diplomat on range, prefix, and mixed ACLs are shown in Figures 4.4, 4.5, and 4.6, respectively.

Figure 4.2: Mean improvement ratios

Figure 4.3: Percentages of classifiers where Diplomat outperforms previous methods

4.6.3 Efficiency

We implemented our algorithms using a combination of C# and VB. We found that Diplomat runs in under a second on the smaller sets, between a second and a minute on the medium sets, and up to a few hours for some permutations on the largest sets. Our implementation has not been designed with an emphasis on speed. In particular, multiple calls to Firewall Compressor or Suri are made for each row when it is compared to any other row. Implementing some optimizations, such as remembering this result, should yield significant speedups.

Figure 4.4: Compression ratios on range classifiers

Figure 4.5: Compression ratios on prefix classifiers

Furthermore, since Diplomat is additive, we can terminate permutations once they have accumulated more rules than a known solution, such as one generated by another permutation. This allows us to safely reduce the amount of work that must be done. Additionally, since better permutations tend to complete relatively quickly, we can terminate permutations that run for a long time with little likelihood of missing the best solution.
Figure 4.6: Compression ratios on mixed classifiers

Chapter 5

ByteCuts

In Chapter 4 we devised a way to minimize the size of a rule list. This is especially helpful for supporting hardware-based packet classifiers, such as TCAMs, which have limited space available. Unfortunately, the size of rule lists has been growing faster than can be supported by TCAMs, even with compression. In this chapter, we will instead develop a software-based packet classifier which we call ByteCuts. ByteCuts builds search trees similar to several existing methods such as HyperCuts [22] and HyperSplit [20]. ByteCuts adds two new improvements over existing search tree classifiers. First, it uses an intelligent rule separation scheme that minimizes the rule replication problem that plagues most of these tree-based classifiers. Second, it introduces a new cutting method that reduces the height of the search trees.

  Rule  Source  Dest
  r1    000     101
  r2    001     101
  r3    010     001
  r4    011     001
  r5    011     000
  r6    100     ***
  r7    110     ***
  r8    ***     011
  r9    ***     010

Table 5.1: A sample rule list

Figure 5.1: A graphical representation of the rules from Table 5.1.

5.1 Related Work

5.1.1 HiCuts

HiCuts [5] is one of the earliest decision-tree-based methods. In HiCuts, one of the fields is selected and the domain of that field is split into several equal-sized pieces. Each rule is then copied into the sublist corresponding to whichever subdomain it is in. Unfortunately, the domain boundaries and the rule boundaries do not always match up. In this case, the rule must be copied into the sublist of each subdomain that partially contains it. In the worst-case scenario, the rule completely ignores the selected field and must be copied into every sublist. This is the rule replication problem, and it causes drastically increased memory sizes, even if only a pointer to the rule is copied.
This process is repeated on each of the sublists. Eventually, the number of rules in a sublist becomes small enough that further splitting is of little benefit. Thus, when the number of rules becomes less than a specified threshold, it switches over to a sequential search instead.

Figure 5.2: The tree HyperCuts builds for Table 5.1.

5.1.2 HyperCuts

HyperCuts [22] improves on HiCuts by allowing multiple fields to be cut at once. This reduces the overall tree height by combining multiple levels into one decision. Unfortunately, this does not reduce the amount of rule replication. They attempt to alleviate replication by allowing branch nodes to also have rule lists, but since few rules meet this criterion (as multiple fields are cut), this is not as helpful as it could be. A sample HyperCuts tree can be seen in Figure 5.2.

5.1.3 EffiCuts

EffiCuts [31] is a variant of HyperCuts that addresses the rule replication problem directly. For each rule, they assign a tuple corresponding to whether it uses a particular field or not. They then group all of the rules that have the same tuple together. They then construct a HyperCuts tree for each group. Since every rule in a group is specified in whichever field is cut, only a few rules need to be replicated into some of the parts. While each tree is also smaller than the single HyperCuts tree, their sum cost tends to be significantly higher. They combat this by selectively merging groups that differ in only one field before creating the trees. This leaves enough options for field choices while reducing the number of trees required.

Figure 5.3: The tree HyperSplit builds for Table 5.1.

Figure 5.4: The tree SmartSplit builds for Table 5.1.

5.1.4 HyperSplit

HyperSplit [20] takes a different approach by making a single cut in an arbitrary location as opposed to the HiCuts family's many equal-sized cuts. They first choose the field that has the most even distribution of rules across the domain.
They then make a cut that balances rule separation and the amount of overlap. This method tends to have significantly less rule replication than HyperCuts (although more than EffiCuts), as it selects cuts that divide only a few rules, and those that are divided are replicated into only two subtrees. A sample HyperSplit tree can be seen in Figure 5.3.

5.1.5 SmartSplit

SmartSplit [7] improves on HyperCuts and HyperSplit in two ways. First, they describe a method for estimating the search times and memory requirements of HyperSplit and HyperCuts. This allows them to predict which algorithm would be better given their needs and constraints. Second, they modify the EffiCuts separation scheme by using only the two address fields, and only if they estimate that there will be too much rule replication without it. This method produces four subsets (specified on source address, specified on destination address, specified on both, and specified on neither). The specified-on-both set is then merged with one of the partially specified sets. This leaves three sets that generally have sufficient cohesion to produce a tree without significant rule replication. A sample SmartSplit tree can be seen in Figure 5.4.

5.2 Methodology

In this section, we describe our ByteCuts (BC) classifier. ByteCuts improves on existing tree-based packet classifiers in three ways. First, it incorporates a new cutting method that is more versatile than the one included in HiCuts [5] and HyperCuts [22]. Second, it includes a tree selection algorithm that is more discriminating than SmartSplit [7]. Finally, it can include both cut nodes and split nodes (like HyperSplit [20]) in the same tree, unlike SmartSplit where each tree is formed entirely of one kind of node (although different trees can be of different types).

Figure 5.5: 8 bits are extracted from this packet to form a key.

Figure 5.6: Rule ra is replicated twice and rb is replicated three times.
5.2.1 Bit Cutting

ByteCuts works by extracting k consecutive bits from a rule or packet, interpreting those bits as a number, and using that number as an index into an array of child nodes. This strategy is designed to work on ternary rules where neighboring bits are likely to appear together. An example of this process can be seen in Figure 5.5. Address fields, the most common ternary fields, are usually represented by prefixes, and due to how addresses are usually allocated, certain prefix lengths are much more common than others. We leverage this to reduce the number of cases to consider. We will only consider nybbles, which are blocks of 4 bits (half a byte). There are 8 nybbles per address field. For simplicity, our examples use only a few bits. This precludes demonstrating the nybble alignment.

Port fields are normally not represented using prefixes or other ternary expressions but are instead represented using ranges. We do not use our cutting method on these fields. Instead, we borrow the splitting method from HyperSplit [20] on these fields.

  Rule  Source  Dest        Rule  Source  Dest
  r1    000     101         r5    011     000
  r2    001     101         r3    010     001
  r3    010     001         r4    011     001
  r4    011     001         r9    ***     010
  r5    011     000         r8    ***     011
  r6    100     ***         r1    000     101
  r7    110     ***         r2    001     101
  r8    ***     011         r6    100     ***
  r9    ***     010         r7    110     ***

Table 5.2: Both (Source, 3) and (Dest, 3) create trees with a maximum collision limit of 2 and omit 2 rules.

5.2.2 Tree Selection

Tree-based methods, including ByteCuts, suffer from a problem called rule replication. Rule replication is caused by rules falling into multiple partitions due to wildcards, which forces them to be copied multiple times. This can be seen in Figure 5.6, where rules ra and rb are both replicated into multiple partitions. Existing methods, such as EffiCuts [31] and SmartSplit [7], have attempted to alleviate this problem by partitioning rules such that all rules in the same partition have similar characteristics.
This significantly reduces the amount of replication required, as it is easier to find good cuts that do not trigger any of the wildcards. We also divide the rule list L into a list of trees T based on the characteristics of the rules. We consider each possible length for a prefix field (in nybble increments); this gives 8 possible lengths for each address field. For a particular field-length pair, we divide the rules into two sets: those that are sufficiently long and those that are not. For each rule that is sufficiently long, we find its prefix of the given length. We find the maximum occurrence of any of the prefixes. This approximates the largest partition of a single cut; smaller maximums are preferable. The other set represents rules that will be excluded from the tree; smaller numbers here are also preferable.

These two goals are at odds with each other. Decreasing the length excludes fewer rules but increases the maximum partition size. We deal with this trade-off by striking a balance between the number of excluded rules and the maximum partition size. We define two thresholds, one for each objective. We define c to be the maximum collision ratio and n_c = c · n_L, where n_L is the number of rules in L that have not yet been placed into a tree. Likewise, we define χ to be the maximum exclusion ratio and n_χ = χ · n_L. Both n_c and n_χ will be smaller for later trees since n_L will be smaller; n_c defines what an acceptable collision limit is while n_χ defines the desired maximum number of excluded rules. If any field-length pair (f, w_f) has a collision limit below n_c, then we select the pair that minimizes the number of excluded rules. Otherwise, from those that exclude at most n_χ rules, we select the pair that minimizes the collision limit. If neither objective is reachable, we select the pair that minimizes the sum collision size + rules excluded.
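The selection procedure above can be sketched as follows. This is a minimal sketch, not ByteCuts' actual implementation: rules are assumed to map each field name to a (value, prefix length) pair, the default ratios c and chi are made-up illustrative values, and ties are broken arbitrarily.

```python
from collections import Counter

def pick_field_length(rules, fields, field_width=32, c=0.08, chi=0.05):
    n = len(rules)
    nc, nchi = c * n, chi * n        # collision and exclusion thresholds
    candidates = []                  # (field, length, collision, excluded)
    for f in fields:
        for wf in range(4, field_width + 1, 4):     # nybble increments
            long_enough = [r for r in rules if r[f][1] >= wf]
            excluded = n - len(long_enough)
            counts = Counter(r[f][0] >> (field_width - wf)
                             for r in long_enough)
            collision = max(counts.values()) if counts else 0
            candidates.append((f, wf, collision, excluded))
    ok_collision = [x for x in candidates if x[2] <= nc]
    if ok_collision:                 # then minimize exclusions
        return min(ok_collision, key=lambda x: x[3])
    ok_excluded = [x for x in candidates if x[3] <= nchi]
    if ok_excluded:                  # then minimize collisions
        return min(ok_excluded, key=lambda x: x[2])
    return min(candidates, key=lambda x: x[2] + x[3])   # fall back to sum
```

With a handful of /8 rules plus two fully wildcarded rules on an 8-bit field, neither threshold is met and the fallback picks the length with the smallest collision + exclusion sum.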
Once a field-length pair (f, wf) is selected, we create a tree from all of the rules whose prefix length is at least wf on field f as described in the next section. We then repeat the tree selection process on the remaining rules. Once the number of remaining rules falls below a certain threshold, τ (5% in our experiments), we declare the remaining rules “bad” (and by extension, the previous rules “good”). Bad rules typically have relatively few bits specified in both address fields. If TCAM is available, we place the bad rules into TCAM. For a pure ByteCuts solution, we try to build a single tree with all the bad rules and only remove rules if they require too much rule replication in the cutting phase (see Section 5.2.3). Specifically, we only remove a rule from a tree because of rule replication if it has 5 or more wildcard bits within any individual cut in the tree.

Consider the rules in Table 5.1. Both (Source, 3) and (Dest, 3) have a maximum partition size of 2 and 2 excluded rules, making them equally valid. Other options have a larger maximum partition size; they are not considered. Choosing (Source, 3) will create a tree from rules r1 to r7. Rules r8 and r9 will be turned into a separate tree.

When we construct each tree Ti, we associate with it the highest priority of the rules that it contains. We then sort T so that if i < j, then Ti has a higher priority than Tj. This will be helpful for performing packet classification. Specifically, if a high-priority rule is matched in an earlier tree, the trees containing only lower-priority rules do not need to be searched.

5.2.3 Cut Selection

Once the rules for a tree are selected, ByteCuts partitions the list into several sublists using the method described below. These sublists are then partitioned in the same manner. This process repeats until a partition has at most nleaf rules. This leaf node will then be searched sequentially. In our experiments, we set nleaf = 8.
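The tree selection loop of Section 5.2.2, together with the priority ordering of trees, can be sketched as below. This is a schematic under our own simplifications: `choose_rules` stands in for the field-length pair selection, and rules are dicts with a `priority` key.

```python
def build_forest(rules, choose_rules, tau=0.05):
    """Carve trees out of `rules` until at most tau * n remain; leftover
    "bad" rules form one final tree (or would be placed in TCAM).  Trees
    are sorted by their highest-priority rule so that classification can
    stop early once a sufficiently high-priority match is found."""
    n = len(rules)
    remaining = list(rules)
    trees = []
    while len(remaining) > tau * n:
        selected = choose_rules(remaining)
        if not selected:
            break
        trees.append(selected)
        selected_ids = {id(r) for r in selected}
        remaining = [r for r in remaining if id(r) not in selected_ids]
    if remaining:
        trees.append(remaining)
    # Sort trees by their highest-priority rule, descending.
    trees.sort(key=lambda t: max(r["priority"] for r in t), reverse=True)
    return trees
```

The descending sort is what licenses the early-exit search described in Section 5.2.6.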
Our primary means of partitioning is with cuts on ternary fields (including prefix and exact-match fields). For each cut, we select up to k consecutive bits. We extract those k bits from each rule and interpret them as a number. All rules that have the same number will end up in the same partition. If a rule has a wildcard in any of those bit positions, then the rule is replicated into several partitions.

For each field, we consider each possible left-most bit and, for each 0 < δ ≤ k, the window of δ bits starting there. From each rule ri, we then extract these δ bits and interpret them as a number. Rules with wildcards count for multiple numbers (by replacing each * with both 0 and 1). We count how many times each number appears. We select the cut that has the smallest maximum frequency; this cut will minimize the largest subtree. In case of ties, we prefer cuts with smaller δ since this saves memory.

Rule | Source | Dest | Key
r1   | 000    | 101  | 00
r6   | 100    | ***  | 00
r2   | 001    | 101  | 01
r3   | 010    | 001  | 10
r7   | 110    | ***  | 10
r4   | 011    | 001  | 11
r5   | 011    | 000  | 11
r8   | ***    | 011  |
r9   | ***    | 010  |

Table 5.3: ByteCuts uses the two highlighted bits from the Source field as a key to cut the first tree. The second tree does not need cutting since it contains only two rules.

Figure 5.7: A graphical representation of the trees produced in Table 5.3.

Once a cut is selected, we partition the rules. We create 2^δ partitions. Similar to above, from each rule ri, we extract the selected δ bits, interpret them as a number, and then place ri into the partition corresponding to that number. Rules with wildcards will be placed into multiple partitions. We then recurse on each partition to create a new subtree. Some sublists contain identical sets of rules (because of wildcards). We detect this and create only one child node that is shared by each of these sublists.

For example, consider the rules from Table 5.1. In Section 5.2.2, we selected rules r1 to r7. For this example, we let nleaf = 2 and k = 2.
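The cut selection procedure can be sketched on ternary strings. The string representation and helper names are our own simplification; the dissertation operates on packed bits with nybble alignment.

```python
from itertools import product

def partition_counts(rules, start, delta):
    """Count rules per partition for a cut of `delta` bits beginning at
    bit `start`.  Rules are ternary strings such as '0*1'; a '*' inside
    the window replicates the rule into both matching partitions."""
    freq = [0] * (1 << delta)
    for r in rules:
        window = r[start:start + delta]
        choices = [('0', '1') if b == '*' else (b,) for b in window]
        for bits in product(*choices):
            freq[int(''.join(bits), 2)] += 1
    return freq

def best_cut(rules, k):
    """Pick (start, delta) minimizing the largest partition; ties go to
    the smaller delta, which keeps child arrays small."""
    width = len(rules[0])
    best = None
    for delta in range(1, k + 1):
        for start in range(width - delta + 1):
            worst = max(partition_counts(rules, start, delta))
            if best is None or worst < best[0]:
                best = (worst, start, delta)
    return best[1], best[2]
```

On the Source bits of rules r1 to r7 from the running example ('000', '001', '010', '011', '011', '100', '110'), `best_cut` selects the last two bits, matching the choice made in the text.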
If we select the first two bits in the Source field, then one partition will contain rules r3, r4, and r5, but if we select the last two bits, then each partition has at most 2 rules, making this the better choice. None of the cuts on the Dest field are as good since they require replicating r6 and r7 into all partitions. We thus select (011, 000) as the mask for our cut. This yields four children, with part 00 containing r1 and r6, 01 containing r2, 10 containing r3 and r7, and 11 containing r4 and r5. After this cut, each partition has at most 2 ≤ nleaf rules, so we create a leaf node for each part.

We have two optimizations to make cut selection faster. First, if a rule has more than 4 wildcards in the selected bits, we assume that it ends up in every partition. We add the number of such rules as a penalty rather than counting them towards individual partitions. This prevents having to increment thousands of frequencies for rules that have many wildcards. The second optimization is that we align all of our cuts to nybble boundaries: our minimum cut size is 1 nybble, we increment the cut length by 1 nybble, and in our experiments our maximum cut length is k = 4 nybbles (16 bits). This limits the size of the child array to at most 65536 nodes; increasing this limit increases the array size, and thus potential memory usage, exponentially. We select only nybbles because, in practice, these lengths are significantly more common than other lengths. Thus finer cuts add few benefits at significant time costs.

5.2.4 Splitting

For range fields, we borrow the splitting method from HyperSplit [20]. In a split, a field f and split point p are selected. The rules are divided into two sublists; all rules where low(r[f]) ≤ p are placed into the first list and all rules where high(r[f]) > p are placed into the second list (some large rules will end up in both lists, causing replication).
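The split step just described can be sketched as follows; a minimal illustration, with our own rule representation (a dict mapping a field name to a (low, high) range).

```python
def split_rules(rules, field, p):
    """HyperSplit-style split on `field` at point p: rules whose range
    starts at or below p go left, rules whose range ends above p go
    right.  A rule whose range spans p lands in both lists (replication)."""
    left = [r for r in rules if r[field][0] <= p]
    right = [r for r in rules if r[field][1] > p]
    return left, right
```

Note that the two list comprehensions are not mutually exclusive; a rule satisfying both conditions is copied into both sublists, which is exactly the replication mentioned above.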
We currently use a selection heuristic not detailed in the HyperSplit paper, which we call rule-balanced. In the rule-balanced heuristic, the split point is selected so that the larger sublist contains as few rules as possible. The dimension with the lowest maximum size is selected as the best split. This works well with our bit cutting selection; in both cases we are trying to minimize the worst case size of any sublist.

Figure 5.8: A depiction of the memory layout for each of the three types of nodes. Each includes a 4 byte header and a 4 byte pointer.

Since we only allow rule splitting on port fields, we only need 16 bits to represent the splitting point, whereas HyperSplit needs 32 bits. This frees the remaining 16 bits to convey other information such as the split dimension and node type. In turn, we can use 32 bits for the child pointer compared to HyperSplit's 24 bits, increasing the maximum rule capacity.

5.2.5 Node Design

ByteCuts uses nodes that are 64 bits in size. We use 2 bits to represent the type of node: leaf, cut, or split. We pad this to 8 bits to better align each field. In each case, 32 bits are used for an array pointer. The remaining 24 bits are used slightly differently by each of the node types. The layout for each type can be seen in Figure 5.8. For the cut nodes, we use 8 bits each for the dimension, δ, and right endpoint. Split nodes use 8 bits for the dimension and 16 bits for the split boundary. Since we only split on the 16 bit range fields, we do not need the full 32 bits that HyperSplit used for the split boundary. Finally, for a leaf node we can use 8 bits to store the number of rules associated with this node (leaving 16 bits unused).

For the cut node we created in Section 5.2.3, we need to encode the mask (011, 000). This selects δ = 2 bits from dimension 0. This uses the least significant bits, so the right endpoint is 0.
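The cut node layout just described can be sketched as a pack/unpack pair. The ordering of the header fields within the 64-bit word is our assumption; only the field widths (8-bit type, dimension, δ, and right endpoint, plus a 32-bit pointer) come from the text.

```python
def pack_cut_node(d, delta, rt, child_ptr):
    """Pack a cut node into 64 bits: 8-bit type (ordinal 1 for cut
    nodes), 8-bit dimension, 8-bit delta, 8-bit right endpoint, and a
    32-bit child-array pointer.  Header field order is an assumption."""
    header = (1 << 24) | (d << 16) | (delta << 8) | rt
    return (header << 32) | child_ptr

def unpack_cut_node(node):
    """Recover (type, d, delta, rt, child_ptr) from a packed cut node."""
    header, child_ptr = node >> 32, node & 0xFFFFFFFF
    return ((header >> 24) & 0xFF, (header >> 16) & 0xFF,
            (header >> 8) & 0xFF, header & 0xFF, child_ptr)
```

Keeping every node in a single 64-bit word means traversing one level costs a single aligned memory read.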
If we use 1 as the ordinal for a cut node, then the final representation for this node (less the child pointer) will be (type, d, δ, rt) = (1, 0, 2, 0).

5.2.6 Packet Classification

To classify a packet p, ByteCuts searches each Ti ∈ T in sequence. If the priority of the best match found so far exceeds that of Ti, then we do not need to search Ti as none of its rules can have a higher priority than those already found. Since T is sorted based on priority, this means that we can stop searching.

To search a tree T, we first look at its root node and determine which type of node it is. If it is a leaf node or split node, we search it in exactly the same manner as HyperSplit [20]. Otherwise, it is a cut node. We right-shift p[d] by rt bits and then extract the rightmost δ bits. We interpret this as an index into the child array, which we then search recursively.

Consider classifying packet (3, 0). For the first tree, we have a cut node with d = 0, δ = 2, and rt = 0 as shown in Section 5.2.5. If we right-shift 3 by 0 bits and then extract the last 2 bits, we get 3 (11 in binary). This points to a leaf containing r4 and r5. We see that r4 does not match this packet but r5 does. We do not need to search the second tree because r5 has a higher priority than both r8 and r9. We thus conclude that r5 is the best matching rule.

5.3 Theoretical Results

In this section, we prove some theoretical results about ByteCuts. For these proofs, we make four assumptions.

1. Each good tree (those containing only good rules and no bad rules) leaves at most χ · nL rules unselected.
2. The number of bad trees produced is constant.
3. At least some fraction β of the cuts on m rules contain no more than α · m rules in the largest child node, for constants 0 < α, β < 1.
4. The replication factor is at most ρ ≥ 1.

Theorem 1. Given assumptions 1 and 2, the maximum number of trees produced is constant.

Proof.
From assumption 1, if we start with n rules, then after k trees are produced, at most nr = n · χ^k rules remain. Additionally, ByteCuts stops producing good trees when nr ≤ τ · n. Thus τ · n ≥ n · χ^k. Solving for k, we get that k log χ ≤ log τ and k ≤ log(τ)/log(χ). Since we can only produce an integer number of trees, k ≤ ⌈log(τ)/log(χ)⌉. Since τ and χ are constants, then so is k. Since both the number of good trees and the number of bad trees are constant, so is the total number of trees.

If we set χ = 0.2 and τ = 0.05, then we expect 2 good trees. If we assume that there will be 1-3 bad trees (such as for non-TCP/UDP traffic and wide source or destination addresses), then we expect 3 to 5 trees total. In Section 5.4, we find experimentally that the preconditions of Theorem 1 usually hold and that we achieve these expected results.

We use Theorem 1 to compute the expected search and construction times.

Corollary 2. Given assumptions 1, 2, and 3, the search time for ByteCuts is O(log n).

Proof. Similar to a quicksort proof, we can use assumption 3 to show that the maximum tree height is O(log_{1/α}(n)/β) = O(log n). Since Theorem 1 shows the number of trees is constant, the total search time is O(log n).

Theorem 3. Given all four assumptions, the construction time for ByteCuts is O(dw · ρn log n).

Proof. There are two costs to building the classifier: selecting the rules and building the tree; tree construction is more expensive. Building a tree node requires considering each pair of d fields and w lengths. For each such pair, we must build a key for each rule. From assumption 4, there are at most ρn rules shared between all of the nodes of a given level. Each level of the tree thus costs O(dw · ρn). From Corollary 2, we know that the height of a tree is O(log n), and from Theorem 1 we know that the number of trees is constant. Thus, the total construction time for ByteCuts is O(dw · ρn log n).
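The bound from Theorem 1 can be checked numerically; this small sketch simply evaluates ⌈log τ / log χ⌉.

```python
import math

def max_good_trees(chi, tau):
    """Theorem 1 bound on the number of good trees:
    k <= ceil(log(tau) / log(chi)) for 0 < chi, tau < 1."""
    return math.ceil(math.log(tau) / math.log(chi))
```

With χ = 0.2 and τ = 0.05 this evaluates to 2, matching the expectation of 2 good trees stated above.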
5.4 Experimental Results

In this section, we compare ByteCuts against existing methods including HyperCuts [22], HyperSplit [20], and SmartSplit [7].

5.4.1 Experimental Setup

We ran our experiments on rule lists created with ClassBench [29]. ClassBench includes 12 different seed files in 3 different categories: 5 access control lists (ACL), 5 firewalls (FW), and 2 IP-chains (IPC). Seeds in the same category can have very different properties. In addition, we have 18 real-life rule lists ranging in size from 100 rules to 7600 rules that we used for timing comparisons. From ClassBench, we generated rule lists of 7 different sizes. For each size, we generated 5 classifiers for each seed. This yields 60 classifiers per size and 420 classifiers total.

We compare ByteCuts to HyperCuts, HyperSplit, and SmartSplit using implementations provided by the authors. For each method, we try leaf sizes (binth in the original papers, equivalent to nleaf) of 8 and 16 and report the better result. We leave the other variables at their default values.

We measure six metrics: classification time, number of trees, tree height, memory accesses, construction time, and memory usage. Classification time, tree height, and memory accesses are all measures of classification speed. Classification time measures average classification speed as it represents how long an algorithm takes to process an actual packet trace. Tree height and memory accesses instead measure the worst case classification speed. Tree height estimates the worst cost of traversing each tree, but not the linear search of rules in the leaves. Memory accesses estimates the combined cost of both the trees and the rules in the leaves. For multiple tree methods, such as ByteCuts and SmartSplit, these sums are upper bounds; it may not be possible for an individual packet to be compared to the deepest rules in every tree.
For each metric, we report the mean value and standard deviation across all of the rule lists under consideration (such as for a specific rule list size). We also compare algorithm A to ByteCuts by reporting the ratio A/BC (usually as a percentage). We ran these experiments on a computer with an Intel Xeon CPU @ 2.8 GHz, 4 cores, and 6 GB of RAM running Ubuntu 16.04, compiled with gcc 5.4.0. We verified the correctness of ByteCuts and SmartSplit by ensuring each classifier returns matching results for each test packet.

5.4.2 Classification Time

We compare the mean packet classification time of ByteCuts and SmartSplit in Figure 5.9. We do not consider HyperCuts and HyperSplit because the provided implementations do not support classification (they report the static qualities of the created tree) and because they were unable to produce a tree given the time and memory constraints for many rule lists.

Figure 5.9: Classification Time
Figure 5.10: Classification Time on 64k rules

We measure classification time by generating a stream of 1 million packets, measuring how long each classifier takes to classify the whole stream, and then computing the average time per packet. Each method receives the same packet stream. Overall, we find that ByteCuts performs 58% better than SmartSplit. The improvement increases as the rule list becomes larger; on 1k rules, ByteCuts and SmartSplit perform roughly equally well, but on 64k rules, ByteCuts performs 84% better. On the 64k rule lists, SmartSplit requires an average of 0.127 µs per packet, but ByteCuts requires only 0.069 µs per packet.

Figure 5.11: Classification Time on real-life rules

We also analyze the classification time across each of the individual seeds. For 11 of the 12 seeds, ByteCuts performs better than SmartSplit. SmartSplit is only faster than ByteCuts on the IPC2 seed, and it is only 6.9% faster.
This is smaller than the difference for every other seed; the next smallest difference is that SmartSplit is 19.9% slower for the FW2 seed. In extreme cases, such as the ACL4 seed, SmartSplit is 155% slower than ByteCuts.

These results are consistent with the experiments on the real-life rule lists. On these rule lists, ByteCuts is faster than SmartSplit on 14 of the 18 rule lists. Overall, ByteCuts averages 0.049 µs / packet while SmartSplit requires 0.056 µs / packet. These values both fall in the range between the averages for our 1k experiments and our 8k experiments, which roughly brackets the sizes of our real-life classifiers.

Finally, ByteCuts has much less variance than SmartSplit. For example, the standard deviation of SmartSplit across the 64k rule lists is 2.5 times that of ByteCuts. This makes ByteCuts more consistent and predictable.

5.4.3 Number of Trees

We measured the number of trees required by both ByteCuts and SmartSplit. Since HyperCuts and HyperSplit always produce a single tree, they are not shown. The results can be seen in Figure 5.12.

Figure 5.12: Number of trees
Figure 5.13: Number of trees across seeds

We find that ByteCuts produces a relatively stable number of trees, averaging 3.52 trees across all of the rule lists. Instead of varying with the number of rules, the count depends on the qualities of the individual seed, as shown in Figure 5.13. By contrast, SmartSplit uses more trees when the number of rules increases. These results match the theoretical results from Theorem 1, which predicts that the number of trees should be constant. This suggests that the assumptions made in Section 5.3 are valid for our classifiers.

5.4.4 Tree Height

We compare the mean tree height of all four methods in Figure 5.14. Since ByteCuts and SmartSplit produce multiple trees, we report the performance of each method with stacked bars. The lowest bar represents the average height of the first tree, the second bar the second tree, and so on.
For ByteCuts this is an upper bound on the actual worst case, as it may not be possible to hit all of the deepest branches simultaneously.

Figure 5.14: Mean Tree Height
Figure 5.15: Mean Memory Accesses

The height of each individual ByteCuts tree is significantly smaller than for the other methods, but the total height is similar to the other methods. On the 1k rule lists, the sum height of the ByteCuts trees is 1% smaller than those of SmartSplit. This improves to 10% for the 64k rule lists. Comparing individual trees, we find that SmartSplit's first tree tends to be, on average, 2.1 times the size of ByteCuts's. This ratio decreases as the number of rules increases because SmartSplit produces more (generally smaller) trees. For the 64k rule lists, where SmartSplit produces multiple trees, this reduces to only 1.59 times on average.

If a high-priority match is found, it is not necessary to search later trees (if they only contain lower-priority rules) as no rule can have a higher priority. This is thus a significant advantage for ByteCuts as it may only need to look at a single, smaller tree. This explains why ByteCuts performs so much better than SmartSplit.

We also compare the number of accesses (tree height + leaf size) in Figure 5.15. As before, we represent the accesses per tree with stacked bars. Again, each individual ByteCuts tree is significantly smaller than the other individual trees and the sum is similar to the other methods. We find that when SmartSplit produces multiple trees, the resulting sum is larger than the corresponding sum for ByteCuts.

Figure 5.16: Mean Construction Time
Figure 5.17: Mean Memory Usage

Unlike the classification time metric, which measures the expected average case time, the tree height and number of accesses measure the theoretical worst case time.
Under an adversarial model where a classifier continuously receives the worst possible packets for it, our results show that ByteCuts is competitive with all the other algorithms in the worst case and outperforms SmartSplit as the number of rules increases.

5.4.5 Construction Time

ByteCuts has significantly faster construction times than any of the other methods, finishing in under a minute even on the largest rule lists. SmartSplit is the only one of the other algorithms that was able to finish on all of the rule lists. Both HyperCuts and HyperSplit have problems with time and memory for larger rule lists (16k for HyperCuts and 8k for HyperSplit). Of the two, HyperSplit has faster construction times; its times are faster than SmartSplit's for smaller rule lists, but it is still slower than ByteCuts.

5.4.6 Memory Usage

We compare the mean memory usage for each classifier in Figure 5.17. This includes all of the memory for tree nodes, arrays, and the rule list. ByteCuts requires multiple orders of magnitude less memory than the other methods, with HyperCuts having the worst requirements by an order of magnitude. ByteCuts's tree separation scheme allows it to effectively minimize rule replication, keeping its memory usage modest. HyperCuts and HyperSplit do not have any way to limit rule replication. Additionally, the wide branching factor of HyperCuts causes rule replication to run rampant.

HyperSplit has problems with certain incompatibilities. Some rules cannot be split on either address field, which is the primary separator for the other rules. These rules are thus replicated into nearly every leaf. If there are too many of these bad rules (in particular, if their number exceeds binth), they need to be separated individually. This can potentially take O(n_bad^d) cuts, yielding a sudden memory explosion. For this reason our data for HyperSplit only goes to 4k rules; some seeds exceeded the memory limits for each of the larger sizes.
SmartSplit has erratic memory requirements depending on whether it chose to build one or multiple trees and on whether those were HyperCuts or HyperSplit trees. This is why its curve oscillates.

5.4.7 Summary of Trade-offs

In packet classification there is an inherent trade-off between time and memory. Multi-tree packet classifiers, such as ByteCuts and SmartSplit, balance this trade-off primarily by adjusting the number of trees produced.

Different rules will have different properties. For example, one rule may use the source address field while another uses the destination address field instead. Placing both rules into the same tree will result in rule replication, as any cut that can differentiate one rule will divide the other rule into multiple parts. This in turn increases both memory consumption and tree height. Creating multiple trees allows rules with different properties to be separated. This reduces the amount of rule replication required, which directly reduces memory consumption. This can be most easily seen by comparing Figure 5.12 and Figure 5.17; memory consumption normally increases with the number of rules, but when SmartSplit produces multiple trees instead of a single tree, its memory consumption decreases instead.

Producing more trees can both increase and decrease the search times required. With multiple trees, each individual tree tends to be smaller than before, so each individual tree is faster to search. However, collectively they may take more time to search than a single tree. Thus a small number of trees may be faster than a single tree, but additional trees may increase the time required. This both explains the dip in SmartSplit's classification time in Figure 5.9 and is one of the reasons that ByteCuts is faster than SmartSplit.

Chapter 6

PartitionSort

In Chapter 5, we looked at a software-based searching method for traditional offline packet classification.
With modern software-defined networks (SDN), such as OpenFlow [17], packet classification has become an online problem. By this, we mean that rather than simply classifying packets, the classifier must also be capable of modifying itself to match a changing rule list. For many traditional packet classifiers, ByteCuts included, the only solution is to completely rebuild the classifier. For SDN, this solution is infeasible.

In this chapter, we will instead develop an online software-based packet classifier which we call PartitionSort. Like before, our approach will involve separating the rule list into multiple trees. Unlike before, our trees will not directly partition rules spatially. Instead, we will establish a partial order over the rule list. Our partition strategy will then ensure that each sublist is totally ordered. This allows each sublist to be binary searched.

6.1 Related Work

6.1.1 Independent Set

In [26], the authors propose a method for dividing the rule list into sets where all of the rules in a given set are disjoint in a particular field. Since the rules are non-overlapping in this field, the rules can be sorted based upon it. This allows the list to be binary searched and guarantees a search time for any given tree.

Figure 6.1: PartitionSort is uniquely competitive with prior art on both classification time and update time. Its classification time is almost as fast as SmartSplit and its update time is similar to Tuple Space Search.

Unfortunately, the number of sets is large enough that their sum cost is prohibitive. PartitionSort generalizes this method with a more effective comparison function and other enhancements that produce fewer sets.
6.2 PartitionSort Overview

PartitionSort combines the speed of traditional decision tree classifiers with the sortability strategy of Independent Set [26] to produce a software packet classifier that is competitive both with high-speed classification methods on classification time and with fast update methods on update time, as shown in Figure 6.1. We generalize the sortability criterion from [26] into multiple dimensions. PartitionSort partitions a rule list into a small number of sortable rule lists. We store each sortable rule list in a multi-key binary search tree, leading to O(d + log n) classification and update time per tree. The memory requirement is linear in the number of rules.

This method requires us to overcome four technical challenges. The first is defining sortability for multi-dimensional rules. It is challenging to sort multi-dimensional rules in a way that helps packet classification. We address this challenge by describing a necessary requirement that any useful ordered rule list must satisfy and then introducing our field order comparison that progressively compares rules one field at a time.

The second technical challenge is designing an efficient data structure for sortable rule lists that supports high-speed classification and fast update using only linear space. We show that our notion of rule list sortability can be translated into a multi-key set data structure and hence it is possible to achieve O(d + log n) search / insert / delete time using a multi-key balanced binary search tree. We also propose a path compression optimization to speed up classification time.

The third technical challenge is effectively partitioning a non-sortable rule list into as few sortable rule lists as possible. We address this challenge by a reduction to a new graph coloring problem with geometric constraints. We then develop a fast offline sortable rule list partitioning algorithm that runs in milliseconds.
The fourth challenge is designing an effective online partitioning algorithm. This is particularly challenging because we must not only achieve fast updates but also preserve fast classification time in the face of multiple updates. We use an offline rule list partitioning algorithm as a subroutine to design a fast online rule list partitioning algorithm that still achieves fast classification time.

6.3 Rule Sortability

We must overcome two main challenges in defining sortable rule lists. The first is that we must define an ordering function ≤ that allows us to order rules, which are d-dimensional hypercubes. Consider the two boxes r1 and r4 in Figure 6.2; one might argue that r1 ≤ r4 since r1's projection in the X dimension is smaller than r4's projection in the X dimension; however, one could also argue that r4 ≤ r1 because r4's projection in the Y dimension is smaller than r1's projection in the Y dimension.

Figure 6.2: Example of ordered rule list with ≤XY

The second challenge is that ≤ must be useful for packet classification purposes. Specifically, suppose ≤ defines a total ordering on rules r1 through rn and we are trying to classify packet p. If we compare packet p to rule ri and determine that p > ri, then packet p can only match some rule among rules ri+1 through rn. That is, packet p must not be able to match any of rules r1 through ri.

6.3.1 Field Order Comparison Function

We propose the following field order comparison function for sorting rules. The intuition comes from the fact that sorting one-dimensional rules is straightforward. The basic idea is that we choose a field order (permutation) f~ and then compare two rules by comparing their projections field by field, in that order, until we can resolve the comparison. Continuing the example from above, if we choose field order XY, then r1 ≤XY r4 because [1, 3], r1's projection in field X, is smaller than [5, 7], r4's projection in field X.
A more interesting example is that r4 ≤XY r2 because their projections in field X are both [5, 7], which forces us to resolve the ordering using their projections in field Y.

We now formally define our field order comparison function. Let f~i denote the ith field of f~. For any rule r, let f~i(r) denote the projection of r in f~i, that is, its component in field f~i. We first define how to compare two intervals. For any two intervals s1 = [min(s1), max(s1)] and s2 = [min(s2), max(s2)], we say that s1 < s2 if max(s1) < min(s2). We say that two intervals are not comparable, s1 ∥ s2, if they overlap but are not equal. Together this defines a partial ordering on intervals. To illustrate, [1, 3] < [4, 5], but [1, 3] ∥ [3, 5] as they overlap at point 3.

We define the field order comparison function ≤f~ based on field order f~ as follows.

Definition 1 (Field Order Comparison). Consider two rules ri and rj and a field order f~. If f~1(ri) ∥ f~1(rj), then ri ∥f~ rj. If f~1(ri) < f~1(rj), then ri <f~ rj. If f~1(ri) > f~1(rj), then ri >f~ rj. Otherwise, f~1(ri) = f~1(rj). If f~ contains only one field, then ri =f~ rj. Otherwise, the relationship between ri and rj is determined by f~′ = f~ \ f~1. That is, we eliminate the first field and recursively compare ri and rj starting with field f~2, which is f~′1.

To simplify notation, we often use the field order f~ to denote its associated field order comparison function ≤f~, and we drop f~ from ≤ if it is clear from context. For example, consider again the rectangles in Figure 6.2. We have r1 ≤ r2 since X(r1) < X(r2) and X is the first field. We also have r3 ≤ r1 because X(r3) = X(r1) and Y(r3) ≤ Y(r1). The set of 5 rectangles is totally ordered by XY; that is, r3 < r1 < r4 < r2 < r5. On the other hand, the set of 5 rectangles is not totally ordered by YX because r4 ∥ r5.

We now define sortable rule lists.

Definition 2 (Sortable Rule List).
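The field order comparison of Definition 1 can be sketched directly; the rule representation (a dict mapping a field name to a (lo, hi) interval) and the return encoding are our own.

```python
LESS, GREATER, EQUAL, INCOMPARABLE = -1, 1, 0, None

def compare_rules(ri, rj, field_order):
    """Compare two rules field by field in the given order.  Disjoint
    intervals resolve the comparison; equal intervals defer to the next
    field; overlapping but unequal intervals make the rules
    incomparable."""
    for f in field_order:
        a, b = ri[f], rj[f]
        if a == b:
            continue                   # resolved by a later field
        if a[1] < b[0]:
            return LESS
        if b[1] < a[0]:
            return GREATER
        return INCOMPARABLE            # overlap without equality
    return EQUAL
```

For instance, using the rules of Figure 6.3 with field order XYZ, r1 < r2 is resolved on X alone, while r2 < r3 requires falling through the equal X projections to Y.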
A rule list L is sortable if and only if there exists a field order f~ such that L is totally ordered by f~.

6.3.2 Usefulness for Packet Classification

We now prove that our field order comparison function is useful for packet classification. We first must define how to compare a packet p to a rule r. Using the same comparison function does not work because we want p to match r if the point defined by p is contained within the hypercube defined by r, but ≤f~ would say p ∥ r in this scenario. We thus define the following modified comparison function for comparing a packet p with a rule r using field order f~.

Definition 3 (Packet Field Order Comparison). If f~1(p) < min(f~1(r)), then p <f~ r. If f~1(p) > max(f~1(r)), then p >f~ r. Otherwise, min(f~1(r)) ≤ f~1(p) ≤ max(f~1(r)). If f~ contains only one field, then p matches r. Otherwise, the relationship between p and r is determined by f~′ = f~ \ f~1.

Lemma 4. Let n d-dimensional rules r1 through rn be totally ordered by field order f~ and let p be a d-dimensional packet. For any 2 ≤ i ≤ n, if p > ri then p > rj for 1 ≤ j < i. Likewise, for any 1 ≤ i ≤ n − 1, if p < ri then p < rj for i < j ≤ n.

Proof Sketch. We discuss only the first case where p > ri as the other case is identical by symmetry. Consider any packet p and rules ri and rj where p > ri and ri > rj. Applying Definitions 1 and 3, we can prove that transitivity holds and p > rj.

6.4 Multi-dimensional Interval Tree (MITree)

We describe an efficient data structure called a Multi-dimensional Interval Tree (MITree) for representing a sortable rule list given field order f~. In particular, we show that MITrees achieve O(d + log n) time for search, insertion, and deletion; we refer to this as O(d + log n) dynamic time.
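Definition 3 and Lemma 4 together are what make binary search possible on a sortable rule list. A minimal sketch follows, using our own representation and a plain binary search rather than the MITree developed next.

```python
def compare_packet(p, r, field_order):
    """Packet field order comparison: -1 if p sorts before rule r, 1 if
    after, 0 if p matches r.  p maps a field to a value, r maps a field
    to a (lo, hi) interval."""
    for f in field_order:
        v, (lo, hi) = p[f], r[f]
        if v < lo:
            return -1
        if v > hi:
            return 1
    return 0                            # contained in every field: a match

def classify_sorted(rules, p, field_order):
    """Binary search a totally ordered rule list for the rule matching
    packet p; returns its index, or None if no rule matches."""
    lo, hi = 0, len(rules) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        c = compare_packet(p, rules[mid], field_order)
        if c == 0:
            return mid
        lo, hi = (mid + 1, hi) if c > 0 else (lo, mid - 1)
    return None
```

Lemma 4 is exactly what justifies discarding half the list at each step: if p sorts after rules[mid], it sorts after every earlier rule as well.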
        X      Y      Z
  r1  [0,2]  [0,6]  [1,2]
  r2  [3,4]  [0,1]  [1,2]
  r3  [3,4]  [2,3]  [3,5]
  r4  [3,4]  [4,5]  [3,6]
  r5  [5,6]  [0,5]  [1,2]
  r6  [5,6]  [0,5]  [4,6]
  r7  [5,6]  [0,5]  [7,9]

Figure 6.3: A sortable rule list with 7 rules and 3 fields, field order f~ = XYZ, and the corresponding simple binary search tree where each node contains d fields. (In the tree, r4 is the root, r2 and r6 are its children, and r1, r3, r5, r7 are the leaves.)

An obvious (and naive) approach is to store an entire rule in each node and construct a balanced binary search tree, which is possible because the rules are totally ordered by field order f~. The running time is clearly O(d log n) because the height of a balanced tree is O(log n) and each comparison requires O(d) time. The problem is that this running time does not scale to higher dimensions. Figure 6.3 shows an example sortable rule list with its corresponding naive binary search tree.

We can speed up the naive approach by a factor of d by comparing one field at a time with a clever balancing heuristic. We show this in two steps. First, we show O(d + log n) dynamic time for the special case where all intervals are points. We then extend the solution to the general case of sortable rule lists.

The special case where all intervals are points is exactly the multi-key dictionary problem. A multi-key dictionary is a dynamic set of objects that are compared by a lexical order of keys. One example of a multi-key dictionary is a database table of n rows and d columns. There exists a multi-key binary search tree supporting O(d + log n) dynamic time for the multi-key dictionary problem [3, 30].

In the general case of sortable rule lists, we show that the O(d + log n) result is still attainable. First, all keys are compared in lexical order because, by definition of sortable rule lists, field order f~ is essentially a permutation of the natural order [1..d]. Furthermore, it is immediate from the definition of sortable rule lists that if two rules are identical in fields [f(1), f(2), . . . , f(i−1)], then at f(i) they either overlap fully or have no intersection.
This property allows us to construct an MITree that treats each interval as if it were only a point. Hence, we can use the same balancing heuristic to achieve O(d + log n) dynamic time. The search procedure is identical to that for point keys except that we need additional checking at each interval tree node. These results yield the following theorem:

Theorem 5. Given a sortable rule list, a Multi-dimensional Interval Tree (MITree) supports search, insertion, and deletion in O(d + log n) time using only linear space.

Figure 6.4: We depict the uncompressed (left) and compressed (right) MITree that correspond to the sortable rule list from Figure 6.3. Regular left and right child pointers are black and do not cross field boundaries. Next field pointers are red and cross field boundaries. In the compressed MITree, nodes with weight 1 are labeled by the one matching rule.

We perform one additional path compression optimization. We remove duplicate intervals for each field in the MITree so that each unique interval is shared by multiple rules. Each node v thus represents a particular interval in a given field, and each v has a third child containing the rules that match on v. Figure 6.4 shows an MITree for the rule list in Figure 6.3. Let c(v) be the number of rules that match node v. We observe that c decreases from earlier fields to later fields. Thus if c(v) = 1, then there is only one rule r that can match. In this case, we store the remaining fields of r with v as shown in Figure 6.4. For the rest of this chapter, we use MITree to refer to a compressed MITree.

6.5 Offline Sortable Rule List Partitioning

Rule lists are not necessarily sortable, so we must partition rule lists into a set of sortable rule lists.
Because the classification time depends on the number of sortable rule lists after partitioning, our goal is to develop efficient sortable rule list partitioning algorithms that minimize the number of sortable rule lists. We first describe efficient and effective offline algorithms based on a reduction to a graph coloring problem called Field Order Graph Coloring. By offline, we mean that all the rules to be partitioned are known a priori.

We start by observing that the offline rule list partitioning problem is essentially a coloring of an intersection graph (to be defined shortly) induced by the d! possible field orders for sortable rule lists. Formally, the basic approach is to partition rule list L into χ ordered rule lists (Li, f~(i)) for 1 ≤ i ≤ χ where L1 ∪ · · · ∪ Lχ = L, Li ∩ Lj = ∅ for i ≠ j, and each sublist Li is sortable under field order f~(i). There is no relation between field orders f~(i) and f~(j); they may be the same or different.

Let L consist of n, possibly overlapping, d-dimensional rules v1 through vn. For each of the d! field orders f~(i), 1 ≤ i ≤ d!, we define the corresponding intersection graph Gi = (V, Ei) where V = L and (vj, vk) ∈ Ei if vj ∥f~(i) vk for 1 ≤ j, k ≤ n. We then define the intersection multi-graph G = (V, E1 ∪ · · · ∪ Ed!) where each edge is labeled with the field order it corresponds to. Our sortable rule list partitioning problem is that of finding a minimum coloring of V where the nodes with the same color form an independent set in Gi for some 1 ≤ i ≤ d!; the minimum number of colors is denoted χ(G).

Our offline partitioning framework is to find a maximal independent set of uncolored nodes in some Gi. We "color" these nodes, which means we remove them from the current rule list and use them to construct an MITree. We then repeat until all nodes are colored. We now describe how to find a large number of rules that are sortable in one of the d! possible field orders.
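The framework can be made concrete with a deliberately brute-force sketch that tries all d! field orders for each partition; GrInd, described next, avoids this factorial blowup. `comparable` and `greedy_partition` are illustrative names, not the dissertation's code, and rules are modeled as tuples of inclusive (lo, hi) intervals:

```python
from itertools import permutations

def comparable(ri, rj, order):
    """True iff two rules are ordered (or equal) under the field order,
    i.e. they are NOT an edge in the intersection graph for `order`."""
    for f in order:
        a, b = ri[f], rj[f]
        if a == b:
            continue
        return a[1] < b[0] or b[1] < a[0]  # comparable iff disjoint here
    return True  # identical in every field

def greedy_partition(rules, d):
    """Repeatedly peel off a maximal sortable sublist (one color class),
    trying every field order per round (exponential in d; for intuition)."""
    parts, remaining = [], list(rules)
    while remaining:
        best, best_order = [], None
        for order in permutations(range(d)):
            part = []
            for r in remaining:  # greedy maximal set, not maximum
                if all(comparable(r, s, order) for s in part):
                    part.append(r)
            if len(part) > len(best):
                best, best_order = part, order
        parts.append((best_order, best))
        remaining = [r for r in remaining if r not in best]
    return parts

# All 7 rules of Figure 6.3 are mutually comparable under XYZ, so the
# greedy partitioner should produce a single partition.
rules = [((0, 2), (0, 6), (1, 2)), ((3, 4), (0, 1), (1, 2)),
         ((3, 4), (2, 3), (3, 5)), ((3, 4), (4, 5), (3, 6)),
         ((5, 6), (0, 5), (1, 2)), ((5, 6), (0, 5), (4, 6)),
         ((5, 6), (0, 5), (7, 9))]
```

Each returned partition pairs a field order with a sortable sublist, exactly the (Li, f~(i)) decomposition above.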
We use the weighted maximum independent set algorithm for intervals from [8] to compute a maximum independent set for any Gi in O(dn log n) time. Since there are d! possible field orders, finding the maximum set over all of the field orders is infeasible. Instead, we propose a greedy approximation, GrInd, that finds a maximal independent set of rules in time polynomial in both n and d. GrInd achieves this by avoiding full recursion for each subproblem and choosing f~ and the maximal independent set simultaneously.

We now describe GrInd, which uses d rounds given d fields. In the first round, we have one set of rules which is the complete set of rules. GrInd selects the first field in the field order as follows. For each field f, it projects all rules onto the corresponding interval in field f. It then weights each unique interval by the number of rules that project onto it plus 1. We "add one" to bias GrInd towards selecting more unique intervals. GrInd then computes a maximum weighted independent set (using the method in [8]) of these unique weighted intervals. The first field is the field f that results in the largest weighted independent set. GrInd then eliminates all other intervals and their associated rules. This process is repeated until all fields have been used or every set has only one rule. The remaining rules are sortable under field order f~ and ready to be represented by an MITree. Pseudocode for GrInd is shown in Algorithm 1. The running time is O(d²n log n) as shown in Theorem 6 below. We give an example of GrInd in action in Figure 6.5.
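The MaxWeightIndSet subroutine that GrInd invokes on a single field is, for intervals, the classic weighted interval scheduling problem, solvable in O(n log n) by dynamic programming over intervals sorted by right endpoint. A sketch assuming inclusive endpoints and input triples (lo, hi, weight); this is an illustration of the technique, not the exact code of [8]:

```python
import bisect

def max_weight_ind_set(intervals):
    """Return (best total weight, one optimal set) for triples
    (lo, hi, w); two intervals are independent iff they do not overlap."""
    ivs = sorted(intervals, key=lambda t: t[1])  # sort by right endpoint
    ends = [hi for _, hi, _ in ivs]
    n = len(ivs)
    # p[i]: number of intervals ending strictly before ivs[i] starts
    p = [bisect.bisect_left(ends, ivs[i][0]) for i in range(n)]
    best = [0] * (n + 1)  # best[i]: optimum over the first i intervals
    for i in range(n):
        best[i + 1] = max(best[i], best[p[i]] + ivs[i][2])
    chosen, i = [], n     # backtrack to recover one optimal set
    while i > 0:
        if best[i] == best[i - 1]:
            i -= 1
        else:
            chosen.append(ivs[i - 1])
            i = p[i - 1]
    return best[n], chosen[::-1]

# Round 1 of GrInd on field X of Figure 6.3: the unique X-projections are
# [0,2] (1 rule), [3,4] (3 rules), [5,6] (3 rules); weight = rules + 1.
value, chosen = max_weight_ind_set([(0, 2, 2), (3, 4, 4), (5, 6, 4)])
```

Since the three X-projections are pairwise disjoint, all of them are selected with total weight 10, matching GrInd's "rules plus partitions" weighting.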
Algorithm 1: GrInd(R)
  Input: Rule list R
  Output: A maximal sortable rule list R′ and field order f~′
   1:  L = [R]
   2:  f~′ = []
   3:  for i = 1 to d do
   4:      for each unselected field f do
   5:          foreach rule list ℓ in L do
   6:              let S be the set of projections from ℓ onto f
   7:              let ws = 1 + num rules with projection s, for each s ∈ S
   8:              let Sℓ = MaxWeightIndSet(S, w)
   9:          let weight wf = Σℓ∈L |Sℓ|
  10:      add f with max weight to f~′
  11:      remove from each ℓ ∈ L any r where f(r) ∉ Sℓ
  12:      subpartition each ℓ ∈ L based on projection f(r)
  13:  let R′ equal one rule from every list in L
  14:  return (R′, f~′)

The input set of boxes with two fields is shown in Figure 6.5 (a). For both possible first fields, we create the corresponding intervals and the corresponding rules. In this example, the maximum weight choice is to use field X, as we can get four rules with a weight of 6 (4 rules plus 2 partitions). If we use field Y instead, the maximum weight choice is only 3 rules of total weight 5. We then proceed with the remaining rules r1, r2, r6 and r7 as illustrated in Figure 6.5 (b). We process both partitions independently and see that we can choose all 4 rules, giving a final result of 4 rules.

Theorem 6. GrInd computes a field order f~(i) for some 1 ≤ i ≤ d! and a maximal independent set Li for Gi such that Li is sortable on f~(i). GrInd requires O(d²n log n) time.

Proof. The correctness follows from choosing a maximum weighted independent set in each round. For each round 1 ≤ i ≤ d and each of the d − i + 1 choices for fields in round i, the running time of computing the maximum weighted independent set is O(n log n).
In later rounds, we likely have fewer than n boxes or rules left, but in the worst case for running time, all n boxes are considered in each round. This leads to the O(d²n log n) running time.

Figure 6.5: Example execution of GrInd. (a) The input rules and the two candidate first fields, with maximum weighted independent sets (MWIS) of weight 6 for X and weight 5 for Y. (b) The second round on the two partitions produced by selecting X. (c) The selected field order xy with {r1, r2, r7, r6} in total order.

6.6 PartitionSort: Putting it all together

In this section, we complete the picture of our proposed packet classifier, PartitionSort. We present a fully online rule partitioning scheme where we assume the rule list is initially empty and rules are inserted and deleted one at a time. We first describe how PartitionSort manages multiple MITrees. We then describe our rule update mechanism, which still achieves a small number of partitions. We conclude by describing how PartitionSort classifies a packet given multiple MITrees.

To facilitate fast rule update and packet classification, PartitionSort sorts the multiple MITrees by their maximum rule priorities. Let T be the set of trees, t = |T|, pr(T) be the maximum priority in tree T ∈ T, and f~(T) be the field order of T ∈ T. PartitionSort sorts T so that for 1 ≤ i < j ≤ t, pr(Ti) > pr(Tj). To maintain this sorted order, PartitionSort uses the following data structures. First, a lookup table M : L → T identifying which tree each rule is in. Second, for each tree T ∈ T, PartitionSort maintains a heap H(T) of the priorities in T.

6.6.1 Rule Update (Online Sortable Partitioning)

Given an existing set of rules, we now describe how PartitionSort performs rule updates (insertions and deletions). We start with the simpler operation of deleting a rule r. Using M, we identify which tree T contains r. We then delete r from T, remove pr(r) from H(T), and resort the tree list T if necessary. We use insertion sort since T is otherwise sorted.
If no rules remain in T, we delete T from M and from the tree set T.

We now consider the more complex operation of inserting a rule r. We face two key challenges: choosing a tree Ti ∈ T to receive r and choosing a field order for Ti. We consider trees Ti ∈ T in priority order starting with the highest priority tree. When we consider Ti, we first see if Ti can receive r without modifying f~(Ti). If so, we insert r into Ti. However, before committing to this updated tree, if |Ti| ≤ τ for some small threshold such as τ = 10, we run GrInd (cf. Section 6.5) on the rules in Ti (now including r); if GrInd produces a single partition with a different field order, we update the field order and use the new tree Ti′. This reconstruction of Ti will not take long because GrInd is fast and τ is small. If we cannot insert r into Ti, we reject Ti as incompatible and move on to the next Ti ∈ T. If no existing tree will work, we create a new tree Ti for r with an empty heap H(Ti). In all cases, we clean up by inserting pr(r) into H(Ti) and resorting T if necessary.

6.6.2 Packet Classification

We now describe packet classification in the presence of multiple MITrees. We keep track of hp, the priority of the highest priority matching rule seen so far. Initially, hp = 0 to denote that no matching rule has been found. We sequentially search Ti for 1 ≤ i ≤ t for packet p. Before searching Ti, we first compare pr(Ti) to hp. If pr(Ti) < hp, we stop the search process and return the current best rule, as no unsearched rule has a higher priority than hp. Otherwise, we search Ti for p and update hp if a higher priority matching rule is found. If we reach the end of the list, we either return the highest priority matching rule or we report that no rule was found. This priority technique was used in Open vSwitch [19] to speed up their implementation of TSS. We argue that this priority optimization benefits PartitionSort more than other methods.
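The search loop just described can be sketched as follows. This is an illustrative sketch with hypothetical names: each tree is represented by its maximum priority and a search function returning the priority of the best matching rule in that tree, or None:

```python
def classify(trees, packet):
    """trees: list of (max_priority, search) pairs sorted by decreasing
    max_priority; returns the highest matching priority, or None."""
    hp = None
    for max_pr, search in trees:
        if hp is not None and max_pr < hp:
            break  # no unsearched tree can hold a higher priority match
        pr = search(packet)
        if pr is not None and (hp is None or pr > hp):
            hp = pr
    return hp

# Toy example: three "trees" that match fixed packet sets.
trees = [(90, lambda p: 90 if p == "a" else None),
         (50, lambda p: 50 if p in ("a", "b") else None),
         (10, lambda p: 10 if p == "c" else None)]
```

For packet "a", the first tree returns priority 90 and the loop stops before querying the second tree, since its maximum priority is only 50; this early exit is exactly the priority optimization.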
The reason for this is that our partitioning scheme is based on iteratively finding maximum independent sets. This produces partitions where almost all of the rules are in the first few partitions. In our experiments, we measured the percentage of rules contained in the first five partitions for both PartitionSort and TSS. This data is highlighted in Table 6.1 below. To summarize, we find that 99% or more of the rules in all classifiers reside in the first five partitions when ordered by maximum priority. In contrast, TSS exhibits more random behavior where only some classifiers have many rules in the first five partitions.

              Average #Partitions        % rules in first five partitions/tuples
  32k        |G|     |O|     |T|          %G       %O       %T
  acl        16.4    18.0    241.4        99.68    99.52    44.71
  fw         22.3    25.0     99.2        99.20    99.12    10.18
  ipc         9.5    10.1    170.9        99.92    99.88    25.99

Table 6.1: GrInd (G), Online (O), and TSS (T) average number of partitions (tuples) and percentage of rules in the five partitions with the highest priority rules.

6.7 Experimental Results

6.7.1 Experimental Methods

We compare PartitionSort to two classification methods: Tuple Space Search (TSS) and SmartSplit (SS). TSS is the de facto standard packet classifier in OpenFlow classification because it has the fastest updates. It is a core packet classifier in Open vSwitch. SmartSplit is the fastest decision tree method as it analyzes the rule list to then choose between HyperSplit and HyperCuts. Like EffiCuts, it may construct multiple decision trees. We also compare PartitionSort to SAX-PAC to assess the effectiveness of our partitioning strategies. Our implementation of PartitionSort is available on GitHub¹.

We use the ClassBench utility [29] to generate rule lists. These rule lists are designed to mimic real rule lists. It includes 12 parameter files that are divided into three different categories: 5 access control lists (ACL), 5 firewalls (FW), and 2 IP chains (IPC). We generate lists of 1K, 2K, 4K, 8K, 16K, 32K, and 64K rules.
For each size, we generate 5 classifiers for each parameter file, yielding 60 classifiers per size and 420 classifiers total. We also consider the effect of the number of fields on the packet classifier. We modified ClassBench to output rules with multiple sets of 5 fields. Although this does not necessarily model real rule lists with more fields, it does allow us to project how our algorithms would perform given rules with more fields. We generate rule lists with 5, 10, 15, and 20 fields and 64K rules. As before, we generate 5 classifiers for each parameter file, yielding 20 rule lists per parameter file, or 240 rule lists in total.

Our TSS implementation is the one from GitHub described in Pfaff et al. [19]. Specifically, we use their hash table implementation in the TSS scheme. We use their Priority TSS implementation that searches tuples in decreasing order of maximum priority. For SmartSplit, we use an implementation written by the authors of the paper, which is hosted on GitHub². We implemented SAX-PAC based on their paper [9] as there is no implementation available. We used the Coin-LP library [21] to select the order-independent sets; this is the current best algorithm for maximizing the size of each selected order-independent set for general d. We can use the fractional solution to lower-bound the number of sets required by SAX-PAC.

For all comparisons, we run PartitionSort with a fully online partitioning scheme unless explicitly stated otherwise; that is, the classifiers are generated by starting with an empty classifier and then repeatedly inserting rules. We perform internal comparisons between online and offline partitioning strategies.

We use four standard metrics: classification time, update time, space, and construction time. Space is simply the memory required by the classifier.

¹ https://github.com/sorrachai/PartitionSort
We measure classification time as the time required to classify 1,000,000 packets generated by ClassBench when it constructed the corresponding classifier. To focus on classification time, we omit caching and consider only slow-path packet classification on the first packet of each flow. We measure update time as the time required to perform one rule insertion or deletion. We perform 1,000,000 updates for each classifier as follows. We begin with each classifier having half the rules. We then perform a sequence of 500,000 insertions intermixed with 500,000 deletions, choosing the rule to insert or delete uniformly at random from the currently eligible rules.

Finally, we report all data by averaging over our rule lists. When we compare with TSS, for each unique rule list size, each number of fields, and every metric, we average together all matching rule lists of the same type: ACL, FW, and IPC. When we compare with SmartSplit, for each unique rule list size, each number of fields, and every metric, we average together all matching rule lists. We then report³ the average value of the metric and, in some cases, a box plot of the metric.

All experiments are run on a machine with an Intel i7-4790k CPU @ 4.00 GHz with 4 cores, 32 KB L1, 256 KB L2, and 8 MB L3 caches, and 32 GB of DRAM. Most of the experiments were run on Windows 7. The exception is the SmartSplit test; both SmartSplit and PartitionSort were run on Ubuntu 14.04. To ensure correctness, we generate a stream of 1 million packets per rule list and test that all of the classifiers agree on how to classify them. All of the classifiers agreed on every input, so it is extremely probable that all of them work correctly.

² https://github.com/xnhp0320/SmartSplit

6.7.2 PartitionSort versus TSS

We first compare PartitionSort with TSS. We compare these algorithms using the metrics of classification time, update time, construction time, and memory usage.
We note that both algorithms have extremely fast construction times, taking less than a second to construct all rule lists.

6.7.2.1 Classification Time

Our experimental results show that, on average, PartitionSort classifies packets 7.2 times faster than TSS over all types and sizes of rule lists. That is, for each of the 420 rule lists, we computed the ratio of TSS's classification time divided by PartitionSort's classification time and then averaged these 420 ratios. When restricted to the 60 size-64k rule lists, PartitionSort classifies packets 6.7 times faster than TSS, with a maximum ratio of 20.1 and a minimum ratio of 2.6. Figure 6.6 shows the average classification times for both PartitionSort and TSS across all rule list types and sizes. If we invert these classification times to get throughputs, PartitionSort achieves an average throughput of 4.5 Mpps (million packets per second) whereas TSS achieves an average throughput of 0.79 Mpps.

Figure 6.6: Classification Time vs Tuple Space Search

Figure 6.7: Partitions/Tuples Queried. A CDF distribution of the number of partitions required to classify packets with the priority optimization (cf. Section 6.6.2).

Figure 6.8: Classification Time on 64K Rules

Figure 6.9: Update Time vs Tuple Space Search

³ To minimize side-effects from hardware and the operating system, we run 10 trials for each classification/update time measurement and report the average.
We note that 4.5 divided by 0.79 is less than 7.2; this is because an average of ratios differs from a ratio of averages. Besides being much faster than TSS, PartitionSort is also much more consistent than TSS. Specifically, the quartiles and extremes are much tighter for PartitionSort than for TSS, whose higher extremes do not even fit on the scale in Figure 6.6. The reason is that PartitionSort queries significantly fewer tables than TSS. As seen in Figure 6.7, PartitionSort needs to query fewer than 20 tables 95% of the time, while TSS does so only 23% of the time. Since search times should be roughly proportional to the number of tables queried, search times should be similar to the area under the respective curves. Since TSS's area is significantly larger, so too are its search times. Both methods are relatively invariant in the number of fields, as seen in Figure 6.8. Because more field permutations are available, PartitionSort is able to pick the best one, yielding slightly bigger partitions. This results in a slight decrease in lookup times.

6.7.2.2 Update Time

Our experimental results show that PartitionSort achieves very fast update times, with an average update time of 0.65 µs and a maximum update time of 2.1 µs. This should be fast enough for most applications; for example, PartitionSort can handle several hundred thousand updates per second. Figure 6.9 shows the average update times across all rule list types and sizes for both PartitionSort and TSS. As expected, TSS is faster at updating than PartitionSort, but the difference is not large. Similar to classification time, we compute the ratio of PartitionSort's update time divided by TSS's update time for each rule list and then compute the average of those ratios. On average, PartitionSort's update time is only 1.7 times larger than TSS's update time. Restricted to the 64k rule lists, this average ratio is also 1.7.
The data from Figure 6.9 does show that PartitionSort has an O(log n) update time.

6.7.2.3 Construction Time

Our experimental results show that PartitionSort has very fast construction times, with an average construction time of 83 ms for the largest classifiers. This is small enough that even significant changes to the rule list cause only minor disruption to the packet flow. As expected, TSS builds faster than PartitionSort, but the difference is not large. On average, PartitionSort's construction time is only 1.9 times larger than TSS's construction time. Restricted to the 64k rule lists, this average ratio decreases to only 1.4.

6.7.2.4 Memory Usage

Our experimental results show that PartitionSort requires less space than TSS. Both are space-efficient, requiring O(n) memory, with PartitionSort requiring slightly less space than TSS.

Figure 6.10: Classification Time vs SmartSplit

6.7.3 Comparison with SmartSplit

We now compare PartitionSort with SmartSplit. We compare these algorithms using the metrics of classification time, construction time, and memory usage. We do not compare update time because SmartSplit does not support updates.

6.7.3.1 Classification Time

Our experimental results show that, on average, PartitionSort takes 2.7 times longer than SmartSplit to classify packets. This can be seen in Figure 6.10. There are two factors leading to this result. First, SmartSplit uses fewer trees than PartitionSort (the same advantage that PartitionSort has over TSS). Second, the wider branching factor provided by the HyperCuts trees used by SmartSplit reduces the overall tree height. While both classifiers have logarithmic search times, these two factors make SmartSplit's constant factor smaller.

6.7.3.2 Construction Time

PartitionSort can be built faster than SmartSplit by several orders of magnitude.
This becomes increasingly pronounced as the number of rules increases; for the largest rule lists, PartitionSort takes less than a second while SmartSplit requires almost 10 minutes. This can be seen in Figure 6.11. Additionally, SmartSplit does not support updating. Together, this means that SmartSplit is not appropriate for cases where the rule list is updated frequently.

Figure 6.11: Construction Time in logarithmic scale

Figure 6.12: Memory Usage vs SmartSplit

6.7.3.3 Memory Usage

Our experimental results show that PartitionSort requires less space than SmartSplit. The average space requirement for each rule list size is shown in Figure 6.12. SmartSplit's memory usage is very erratic, depending on whether it decided to make one tree or several, and whether those trees are HyperCuts or HyperSplit trees. SmartSplit tends to select a single HyperCuts tree when the rule lists are smaller and multiple HyperSplit trees when they are larger. This causes the apparent drop in memory usage as the number of rules increases. We expect that memory usage will continue to rise dramatically as rule replication again becomes a problem, but no further remedies remain for preventing it.

Figure 6.13: Number of Partitions vs SAX-PAC

6.7.4 Comparison with SAX-PAC

The number of partitions required by PartitionSort is competitive with SAX-PAC. When using the same number of fields, PartitionSort requires 1.7 times more tables than SAX-PAC. However, the classifier used by SAX-PAC requires O(log^(d−1) n) time. For this reason, they recommend using only 2 fields for order-independent partitioning. This significantly increases the number of partitions required.
In this case, SAX-PAC requires 1.4 times more tables than PartitionSort. These results can be seen in Figure 6.13. SAX-PAC is too expensive to construct. Most of the algorithms used by SAX-PAC require O(dn²) time [9]. Additionally, their set-cover approximation requires O(n²) space. Thus, even 4000 rules can take hours to build, making it infeasible to use. For this reason, Figure 6.13 only goes up to 4000 rules.

6.7.5 Comparison of Partition Algorithms

We now compare our offline (cf. Section 6.5) and online (cf. Section 6.6) partitioning algorithms. Figure 6.14 shows the number of partitions required by each version.

Figure 6.14: Number of Partitions for Offline and Online PartitionSort

Figure 6.15: Classification Time for Offline and Online PartitionSort

Online partitioning performs almost as well as offline partitioning. Online partitioning requires only 17% more trees than offline partitioning. This translates into 31% larger classification times, as seen in Figure 6.15. We conclude that a single offline construction at the beginning followed by online updates should be more than sufficient for most purposes.

Chapter 7 TupleMerge

In Chapter 6, we developed an online tree-based packet classifier. This classifier is reasonably fast at classifying packets while also supporting frequent updates. However, PartitionSort is hampered by two problems. First, the O(d + log n) search times of the trees limit the maximum classification speed. Second, Tuple Space Search already has market deployment. In this chapter, we present a hashing-based online packet classifier, which we call TupleMerge. Its hash tables have O(d) search times, allowing it to be faster than PartitionSort, both for classification and updates.
Additionally, the classifier produced is compatible with that of Tuple Space Search, which can ease deployment.

  Rule  Source  Dest  Tuple
  r1    000     111   (3, 3)
  r2    010     100   (3, 3)
  r3    011     100   (3, 3)
  r4    010     11*   (3, 2)
  r5    10*     101   (2, 3)
  r6    110     ***   (3, 0)
  r7    ***     011   (0, 3)

Table 7.1: Example 2D rule list

  Table  Tuple   Rules
  T1     (3, 3)  r1, r2, r3
  T2     (3, 2)  r4
  T3     (2, 3)  r5
  T4     (3, 0)  r6
  T5     (0, 3)  r7

Table 7.2: TSS builds 5 tables

7.1 Related Work

7.1.1 Tuple Space Search

Tuple Space Search (TSS) is one of the oldest offline packet classification methods and was left behind as faster methods were developed [24]; TSS has reemerged for online packet classification because it supports fast updates. In particular, Open vSwitch uses TSS as its packet classifier because TSS offers constant-time updates and linear memory [17].

In TSS, each partition is defined by a corresponding mask tuple m~ which represents the actual bits used in each of the prefix or ternary fields of all rules in that partition. Since all the rules in a partition have the same m~, we can store these rules using a hash table where the masked bits m~ of each rule form the hash key. During classification, those same bits are extracted from the packet to form the search key. For example, consider r5 = (10*, 101) from Table 7.1. We produce the hash key h5 = hash(10, 101) and store h5 → r5. Later, suppose we wish to query for packet (5, 5) = (101, 101). Extracting the same bits, we get the search key (10, 101). We then search for the key hash(10, 101) and get r5 as a match.

We now describe how to manage the multiple tables that result from partitioning. To speed up the search process, we sort the tables by the highest priority rule each table contains and search them in this order. This allows us to avoid searching a table if we have already found a matching rule whose priority exceeds the maximum priority of any rule in that table.
This speeds up searches when a high priority matching rule is found in an early table, but it does not help when the correct rule has low priority or no matching rule is found. For example, the rules in Table 7.1 would be placed in 5 tables, as seen in Table 7.2. Rules r1, r2, and r3 each use all of the bits and are placed together. The other rules all have unique tuples, so each gets its own table. If we search for packet (110, 011), then T4 returns r6 as a match. Since r6 has a higher priority than r7, it is not necessary to search T5 even though r7 also matches.

The main drawback of TSS is that the number of partitions or tables is large. This results in relatively slow packet classification as many tables must be searched. We improve upon this by putting rules from multiple TSS tables into a single hash table. This results in many fewer hash tables.

Srinivasan et al. suggest using a (usually 1-dimensional) preclassifier to identify which tables contain rules that might match the packet [24]. This allows fewer tables to be searched, potentially resulting in significant time savings. The tradeoff is that the preclassifier requires extra time and memory; for some rule lists, it may take more time to query the preclassifier than is saved by reducing the number of tables queried. In contrast, TupleMerge reduces the total number of tables and does not need a preclassifier.

Another option for reducing the number of tables is to reduce the number of tuples by ignoring certain fields (such as the port fields). This results in a drastic reduction in the number of tables, which is exponential in the number of fields. However, some rules may be identical on the reduced fields; extra effort must be spent to separate these rules. We also exploit this optimization which, combined with other optimizations, results in far fewer tables than TSS.
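Returning to the r5 example from Table 7.1, the tuple masking and hash lookup can be sketched as follows. This assumes 3-bit fields as in the toy rule list; `make_key` and the dict-of-dicts layout are illustrative, not Open vSwitch's implementation:

```python
FIELD_WIDTH = 3  # the toy rules of Table 7.1 use 3-bit fields

def make_key(values, tuple_lens):
    """Keep only the first tuple_lens[i] bits of each field value."""
    return tuple(v >> (FIELD_WIDTH - l) for v, l in zip(values, tuple_lens))

# One TSS table per tuple; here just the (2, 3) table holding
# r5 = (10*, 101): source value 0b100 with prefix length 2, dest 0b101.
tables = {(2, 3): {make_key((0b100, 0b101), (2, 3)): "r5"}}

def lookup(packet):
    """Search every table, extracting the packet bits named by its tuple."""
    for tup, table in tables.items():
        rule = table.get(make_key(packet, tup))
        if rule is not None:
            return rule
    return None
```

Querying packet (101, 101) masks the source down to its first two bits, producing the same key as r5 and returning a match, mirroring the hand-worked example above.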
Yet another strategy to avoid querying hash tables is to use Bloom filters [23]. A Bloom filter is a very space-efficient probabilistic hash set that can answer whether an item is probably in the set (with the false positive probability related to memory usage) or definitely not in the set. Since Bloom filters are space efficient, they can be stored in cache and used to check whether a matching rule probably exists before performing a query to slower memory. This is a complementary optimization that TupleMerge can also leverage; in fact, it may be more effective for TM since TM has fewer hash tables than TSS.

7.2 Online Algorithms

We now describe our TupleMerge (TM) classifier that improves upon TSS [24] by significantly reducing the number of hash tables required to represent the classifier. Since the search time is proportional to the number of tables searched, this results in a significant improvement in search times. Additionally, since our method does not use any sort of preclassifier, we are not penalized by losses elsewhere. We first describe the two key differences that allow TM to outperform TSS. We then review the classification process. Finally, we describe how to update the classifier during rule changes.

7.2.1 Key Differences

There are two key differences that make TupleMerge faster at packet classification than TSS. First, TM exploits the observation that not all bits are always required to identify which rules are potential matches. In TSS, rules are separated so that rules that use exactly the same tuple end up in the same hash table and every other rule ends up in a different table. Many rules have similar but not identical tuples, which TSS places in separate tables. TM allows these rules to be placed into the same table, reducing the number of tables required; this leads to faster classification. Second, TM has methods for reducing the number of hash collisions. It is possible for two keys to be hashed to the same value.
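As a rough illustration of this pre-check, here is a toy Bloom filter. The bit-array size, the SHA-256-based double hashing, and the class name are assumptions of this sketch rather than details from [23].

```python
# A toy Bloom filter: before probing a hash table in slow memory, a
# compact in-cache bit array answers "definitely absent" or "probably
# present". No false negatives; false positives are possible but rare.
import hashlib

class Bloom:
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits)

    def _positions(self, item):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        h = hashlib.sha256(repr(item).encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._positions(item):
            self.bits[i] = 1

    def maybe_contains(self, item):
        # False => definitely not present; True => probably present.
        return all(self.bits[i] for i in self._positions(item))

bf = Bloom()
bf.add(("10", "101"))                       # key of r5 in its table
print(bf.maybe_contains(("10", "101")))     # → True (never a false negative)
print(bf.maybe_contains(("00", "000")))     # almost always False here
```

A negative answer lets the classifier skip the hash table entirely; a positive answer still requires the real probe to confirm the match.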
Additionally, both TSS and TM may produce the same hash key from different rules because of information that is omitted. For TSS, this is because most implementations will omit port fields (which are ranges) and sometimes other fields as well; the exact field selection is usually chosen by the designer ahead of time and is the same for all tables. For TM, this occurs because it has decided to omit some bits from the tuple. TSS has only one option when this happens: it must sequentially search all of the colliding rules until it finds a match. TM uses the number of collisions to control the number and specificity of its tables; if there are too many collisions (determined by parameter c), it adds more specific tables to reduce the number of collisions so that no table will have more than c rules collide.

We now explain how TM allows rules with different tuples to be placed in the same hash table. Like TSS, each TM hash table T has an associated tuple m⃗_T. In TM, unlike TSS, a rule r may be placed in T if m⃗_T ⊆ m⃗_r; in TSS, r may be placed in T only if m⃗_T = m⃗_r. We create a hash key k_{T,r} from m⃗_T and r by extracting the bits matching m⃗_T from r, similarly to TSS. The key issue is identifying what tuples m⃗_T should be constructed for each rule set, particularly as we construct the classifier in an online fashion. We describe this in more detail in our Rule Insertion section below.

For example, in Table 7.1, r2 has tuple (3, 3) and r4 has tuple (3, 2). We could place both r2 and r4 into the same table if the table used (3, 2) for its tuple. If we wanted to use (3, 3) for the tuple, we would find that r4 is incompatible since 3 > 2; this would require splitting r4 into two rules, which we do not do because it complicates updating. Another possibility is to use (3, 1) for the tuple; both rules would then have a hash key of (010, 1**), which creates a hash value collision in the table.
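The compatibility condition m⃗_T ⊆ m⃗_r and the shared-key collision in the example above can be checked directly with a small sketch; the prefix-string encoding and helper names are illustrative assumptions, not the evaluated implementation.

```python
# Compatibility and key extraction for the TM rule m⃗_T ⊆ m⃗_r: a rule fits
# a table whose tuple uses no more bits than the rule specifies in any field.

def tuple_of(rule):
    return tuple(len(f.rstrip('*')) for f in rule)

def compatible(table_tuple, rule):
    return all(t <= b for t, b in zip(table_tuple, tuple_of(rule)))

def key(fields, tup):
    return tuple(f[:t] for f, t in zip(fields, tup))

r2, r4 = ("010", "100"), ("010", "11*")      # tuples (3, 3) and (3, 2)
print(compatible((3, 2), r2), compatible((3, 2), r4))  # → True True
print(compatible((3, 3), r4))                # → False: (3, 3) needs bits r4 lacks
# With tuple (3, 1) the two rules collide on one hash key, as in the text:
print(key(r2, (3, 1)), key(r4, (3, 1)))      # → ('010', '1') ('010', '1')
```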
This tuple would be allowed if the number of collisions for this hash value does not exceed c.

TM maintains a list T of tables created along with their tuples, where the tables are sorted in order of the maximum priority rule contained, and a mapping F : L → T of rules to tables. During packet classification, only T is used. The mapping F supports fast deletion of rules. Both require space that is linear in n. Standard online construction (where rules are received one at a time) is represented by a series of rule insertions as described below. If the rules are mostly known ahead of time, we use the offline construction scheme described in Section 7.3 to create a better classifier.

7.2.2 Packet Classification

TupleMerge classifies packets just like TSS classifies packets. As with TSS, once a high priority matching rule is found, later tables with lower priority rules do not need to be searched. While the table construction algorithm changes, the packet classifier can remain the same. This is a potential implementation advantage that allows TM to be easily integrated into existing implementations, such as Open vSwitch, that use TSS.

7.2.3 Tuple Selection and Rule Insertion

We now describe how TupleMerge chooses which tuples to use for tables. For TSS, no choice was necessary; we simply used the tuples defined by the used bits of each rule. For TM, we have the opportunity to use fewer bits to define tuples, which can reduce the number of tables but may come at the cost of increasing the number of collisions. We strive to balance between using too few bits and having too many collisions and using too many bits and having too many tables. One key observation is that if two rules overlap, having them collide in the same table is a better option than creating separate tables for the two rules. For this reason, we maintain a relatively high collision limit parameter c.

Consider the situation when we have to create a new table T for a given rule r.
This occurs for the first rule inserted and for later rules that are incompatible with all existing tables. In this event, we need to determine m⃗_T for the new table. Setting m⃗_T = m⃗_r is not a good strategy; if another similar, but slightly less specific, rule appears, we will be unable to add it to T and will thus have to create another new table. We thus consider two factors: whether the rule is more strongly aligned with source or destination addresses (usually the two most important fields) and how much slack needs to be given to allow for other rules.

If the source and destination addresses are close together (within 4 bits for our experiments), we use both of them. Otherwise, we drop the smaller (less specific) address and its associated port field from consideration; r is predominantly aligned with one of the two fields and should be grouped with other similar rules. This is similar to TSS dropping port fields, but since it is based on observable rule characteristics, it is more likely to keep important fields and discard less useful ones.

We then look at the absolute lengths of the addresses. If an address is long, we are more likely to try to admit shorter lengths, and likewise the reverse. We thus remove a few bits from both address fields, with more bits removed from longer addresses. For 32-bit addresses, we remove 4 bits, 3 for more than 24 bits, 2 for more than 16, and so on (so addresses of 8 or fewer bits do not have any removed). We only do this for prefix fields like addresses; both range fields (like ports) and exact match fields (like protocol) remain as they are.

Now consider the case where we need to insert a new rule r into TM's classifier. We first try to find an existing table that is compatible with r. We search the tables T ∈ T in order of maximum priority rule. For each table T, r is compatible with T if m⃗_T ⊆ m⃗_r. If no compatible table is found, we create a new table for r as described above.
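The insertion path just described can be condensed into a runnable sketch for 2D (source, destination) prefix rules. The bit-trimming thresholds follow the text; everything else (the names, treating a dropped field as length 0, and skipping the table-split step when the collision limit is hit) is a simplifying assumption of this sketch, not the evaluated code.

```python
# Sketch of TM tuple relaxation plus insertion/deletion bookkeeping.

def tuple_of(rule):
    """Nominal tuple: number of specified (non-'*') bits in each field."""
    return tuple(len(f.rstrip('*')) for f in rule)

def trim(length):
    """Bits to shave from a prefix of the given length (8 or fewer: none)."""
    if length > 24:
        return 4 if length >= 32 else 3
    if length > 16:
        return 2
    return 1 if length > 8 else 0

def relaxed_tuple(src_len, dst_len, closeness=4):
    """Tuple for a brand-new table: drop the far-less-specific address,
    then shave a few bits so slightly less specific rules still fit."""
    if abs(src_len - dst_len) > closeness:
        if src_len > dst_len:
            dst_len = 0
        else:
            src_len = 0
    return (src_len - trim(src_len), dst_len - trim(dst_len))

def key(rule, tup):
    return tuple(f[:t] for f, t in zip(rule, tup))

class TupleMergeSketch:
    def __init__(self, c=4):
        self.c = c              # collision limit
        self.tables = []        # list of (tuple, {key: [rules]})
        self.F = {}             # rule -> table index, for O(1) deletion

    def insert(self, rule):
        bits = tuple_of(rule)
        for i, (tup, tab) in enumerate(self.tables):
            if all(t <= b for t, b in zip(tup, bits)):      # m⃗_T ⊆ m⃗_r
                bucket = tab.setdefault(key(rule, tup), [])
                if len(bucket) < self.c:                    # under the limit
                    bucket.append(rule)
                    self.F[rule] = i
                    return
        tup = relaxed_tuple(*bits)          # new table gets a relaxed tuple
        self.tables.append((tup, {key(rule, tup): [rule]}))
        self.F[rule] = len(self.tables) - 1

    def delete(self, rule):
        tup, tab = self.tables[self.F.pop(rule)]            # one map lookup
        tab[key(rule, tup)].remove(rule)

print(relaxed_tuple(32, 30))   # → (28, 27): close lengths, keep both trimmed
print(relaxed_tuple(32, 8))    # → (28, 0): far apart, drop the destination

# With the 3-bit toy rules no bits are trimmed, but the subset condition
# already merges tables: inserting r4 first gives a (3, 2) table that then
# absorbs r1, r2, and r3, where TSS would need two tables.
tm = TupleMergeSketch()
for r in [("010", "11*"), ("000", "111"), ("010", "100"), ("011", "100")]:
    tm.insert(r)
print(len(tm.tables))          # → 1
```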
If a compatible table is found, we perform a hash probe to check whether c rules with the same key already exist within T. If not, we add r to T and add the mapping r → T to F. Otherwise, this is a signal that we ignored too many bits in m⃗_T and we need to split the table into two tables. We select all of the colliding rules ℓ and find their maximum common tuple m⃗_ℓ. Normally m⃗_ℓ is specific enough to hash ℓ with few or no collisions. We then create a new table T2 with tuple m⃗_ℓ and transfer all compatible rules from T to T2. If m⃗_ℓ is not specific enough, we find the field with the biggest difference between the minimum and maximum tuple lengths over all of the rules in ℓ and set that field to the average of those two values. We then transfer all compatible rules as before. This guarantees that some rules from ℓ will move and that T2 will have fewer collisions than T did.

7.2.4 Rule Deletion

Deleting a rule r from TupleMerge is straightforward. We can find which table T contains r with a single lookup in F. We then remove r from T and the mapping r → T from F. This requires O(1) hash updates.

7.3 Offline Algorithms

Up until now, we have discussed online algorithms; here we discuss offline construction. When most of the rules are known a priori, we can use this extra information to make better decisions about how to partition the rules. This is relevant for several online scenarios, including the case where we periodically rebuild the classifier, assuming the rebuild process can be done in roughly a second. As noted earlier, OpenFlow hardware switches can require roughly half a second for the data plane to catch up to the control plane.

7.3.1 Table Selection

The main principle behind our table selection strategy is that if a high-priority rule is matched in an early table, then any future tables that contain only lower priority rules do not need to be searched.
However, if a lower-priority match is found in an early table, we must still search later tables as they could contain a better match. To this effect, we try to place as many high-priority rules as early as possible to maximize the chances that we will not need to search later tables.

Let L_i be the first i rules in L. Let m⃗_{L_i} be the maximum common tuple of these rules; that is, the tuple that contains all of the bits that they have in common. We now create a counter H of hashes, ℓ_i, a proposed table, and ℓ̄_i, the rules not in ℓ_i. For each rule r ∈ L, we compute h(r). If H(h(r)) < c, where c is an argument representing the maximum number of collisions, we add r to ℓ_i and increment H(h(r)). Otherwise we add r to ℓ̄_i. Each different m⃗_{L_i} will yield an ℓ_i and ℓ̄_i. To maximize the probability that we do not need to search future tables, we select the ℓ_i that maximizes the maximum priority of ℓ̄_i (so that the first missing rule is as late as possible), breaking ties by maximum |ℓ_i|, thus reducing the number of tables required.

Rule  Source  Dest  Nominal Tuple  Used Tuple
r1    000     111   (3, 3)         (3, 2)
r2    010     100   (3, 3)         (3, 2)
r3    011     100   (3, 3)         (3, 2)
r4    010     11*   (3, 2)         (3, 2)
r5    10*     101   (2, 3)         (2, 0)
r6    110     ***   (3, 0)         (2, 0)
r7    ***     011   (0, 3)         (0, 3)

Table 7.3: TupleMerge builds 3 tables

Rule  Source  Dest  Nominal Tuple  Used Tuple
r1    000     111   (3, 3)         (3, 0)
r3    011     100   (3, 3)         (3, 0)
r4    010     11*   (3, 2)         (3, 0)
r6    110     ***   (3, 0)         (3, 0)
r2    010     100   (3, 3)         (0, 3)
r5    10*     101   (2, 3)         (0, 3)
r7    ***     011   (0, 3)         (0, 3)

Table 7.4: An alternate TupleMerge scheme builds 2 tables

We make a new table T = (ℓ_i, m⃗_{L_i}) and append it to T, and repeat on the remaining rules L′ = ℓ̄_i until L′ is empty. For example, consider the rules in Table 7.1. Tuple (3, 2) can contain 4 rules without any collisions: r1, r2, r3, and r4. Tuple (3, 3) can only contain the first three rules, while (2, 2) has a collision between r2 and r3, making them worse choices.
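This greedy selection can be verified with a short Python sketch of one round over the rules of Table 7.1, using c = 1 (no collisions allowed). The prefix-string encoding and the explicit compatibility test, which the prose leaves implicit (a rule can only be hashed under a tuple that uses no more bits than the rule specifies), are assumptions of this sketch.

```python
# One round of offline table selection: try the maximum common tuple of
# every prefix L_i, keep the candidate whose first excluded rule comes
# latest, breaking ties by table size.

def tuple_of(rule):
    return tuple(len(f.rstrip('*')) for f in rule)

def key(rule, tup):
    return tuple(f[:t] for f, t in zip(rule, tup))

def best_table(rules, c=1):
    best = None
    for i in range(1, len(rules) + 1):
        # Maximum common tuple of the first i rules (elementwise minimum).
        tup = tuple(min(t) for t in zip(*(tuple_of(r) for r in rules[:i])))
        H, inside, outside = {}, [], []
        for r in rules:
            ok = all(t <= b for t, b in zip(tup, tuple_of(r)))  # compatible?
            k = key(r, tup)
            if ok and H.get(k, 0) < c:
                H[k] = H.get(k, 0) + 1
                inside.append(r)
            else:
                outside.append(r)
        first_out = rules.index(outside[0]) if outside else len(rules)
        cand = (first_out, len(inside), tup, inside, outside)
        if best is None or cand[:2] > best[:2]:
            best = cand
    return best

rules = [("000", "111"), ("010", "100"), ("011", "100"), ("010", "11*"),
         ("10*", "101"), ("110", "***"), ("***", "011")]
first_out, size, tup, inside, outside = best_table(rules)
print(tup, size)   # → (3, 2) 4: holds r1..r4 without collisions
```

As in the worked example, tuple (3, 3) would exclude r4 and tuple (2, 2) would collide on r2 and r3, so the sketch picks (3, 2).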
We create a table containing rules r1, r2, r3, and r4 and repeat on the remaining three rules. The results of this process can be seen in Table 7.3. Other strategies exist that may yield fewer tables. For example, Table 7.4 contains only two tables. However, unless the incoming packet matches rule r1, both tables will need to be searched. With Table 7.3, only 1 table needs to be searched if any of r1, r2, r3, or r4 match, which may give it better results in practice even if the worst case is not as good.

7.3.2 Table Consolidation

This strategy sometimes produces multiple tables that have the same tuple. When this happens, we merge the two tables together into a single table. This removes a table query while examining at most the same number of rules as before.

7.4 Experimental Results

In this section, we perform experiments to compare the effectiveness of TupleMerge against PartitionSort (PS), Tuple Space Search (TSS), SmartSplit (SS), and SAX-PAC (SP).

7.4.1 Experimental Setup

We ran our experiments on rule lists created with ClassBench [29]. ClassBench comes with 12 seed files that can be used to generate rule lists with different properties. While the seeds are divided into three categories (5 access control lists (ACL), 5 firewalls (FW), and 2 IP chains (IPC)), different seeds in the same category can have very different properties. We generate rule lists of nine different sizes, from 1k up to 256k rules. For each of the 12 ClassBench seeds, we generate 5 rule lists of each size, for 540 rule lists total.

We use these rule lists in three types of experiments. For an offline experiment, we construct a static data structure from all the rules and then use the data structure to classify packets. For an online experiment, we have two protocols. In the first protocol, we create a dynamic data structure where the rules are fed to the online algorithm one at a time in a random order, with each algorithm receiving the rules in the same order.
The resulting classifier is then used to classify packets. It classifies 1,000,000 packets which are generated using the ClassBench method for generating packets given a rule list. In the second protocol, which we use to measure update time, half of the rules are selected at random to be in the initial active rule list. We create an initial data structure by inserting these rules one at a time into the data structure in a random order, where each algorithm receives the rules in the same order. A list of 500,000 insertions and 500,000 deletions is then created and shuffled. Each insertion inserts one of the excluded rules and each deletion removes one of the included rules, each chosen at random from the available rules. This sequence is the same for each algorithm. In this protocol, we do not use the resulting data structure to classify packets.

Figure 7.1: Average # of partitions required for each rule list size

For a given rule list L and a given algorithm A, we measure four things. First is classification time, denoted CT(A, L), which is the total time required to classify all 1,000,000 packets divided by 1,000,000. Second is the number of partitions P(A, L) in the resulting data structure. This is useful as a static estimator of each data structure's classification speed. Third is update time, denoted UT(A, L), which is the total time required to perform all 1,000,000 rule updates divided by 1,000,000. Note we do not include the time to create the initial data structure with half the rules. Last is memory, denoted M(A, L), which is the total memory required by the data structure. For each metric, we average the results across all rule lists of a given size or seed.
To better compare an algorithm A, typically PS, against TM, we compute the relative classification time of A and TM on a rule list L, denoted RCT(A, TM, L), as the ratio CT(A, L)/CT(TM, L); higher values are better for TM. Likewise, we compute the relative update time of A and TM for rule list L, denoted RUT(A, TM, L), to be the ratio UT(A, L)/UT(TM, L). We average these relative metrics across all rule lists of a given size or seed.

Figure 7.2: Relative Update Time between PS and TupleMerge

Figure 7.3: Absolute Update Time on 256k rules

We note that for the two relative time metrics, the range of ratios is relatively small, whereas for the raw metrics, some values are much larger than others and thus have a much larger impact on the final average. This leads to some apparent discrepancies between the two averages.

These experiments were run on a machine with an Intel Xeon CPU @ 2.8 GHz, 4 cores, and 6 GB of RAM running Ubuntu 16.04. We verified the correctness of classification results by ensuring each classifier returns the same answer for each test packet.

Figure 7.4: Relative Online Classification Time between PS and TupleMerge

Figure 7.5: Absolute Online Classification Time on 256k rules

7.4.2 Comparison with PartitionSort

We compare TupleMerge against PartitionSort from Chapter 6 since PS is the state-of-the-art online packet classifier. We first compare their online versions and then compare their offline versions.

7.4.2.1 Online Comparison

In our experiments, online TM outperforms online PS in both classification speed and update time. Focusing on the relative metrics averaged over all seeds and sizes, TM performs rule updates 30% faster than PS and classifies packets 34.2% faster than PS. PS also requires 2.71 times the amount of memory that TM does, though both require only linear memory in the number of rules. We now examine these results in more detail.
The number of tables required for online TM, online PS, and TSS is shown in Figure 7.1. Relative update time results averaged over each rule list size are shown in Figure 7.2, and raw update time results for online TM, online PS, and TSS for each seed on the 256k rule lists are shown in Figure 7.3. Relative classification time results averaged over each rule list size are shown in Figure 7.4, and raw classification time results for online TM and online PS for each seed on the 256k rule lists are shown in Figure 7.5.

Looking more closely at update time, we see that TM consistently outperforms PS: TM has a faster update time for all rule list sizes and for all 12 seeds. For the largest rule lists, TM has an average update time of only 1.78µs whereas PS has an average update time of 2.27µs.

Looking more closely at classification time, TM again outperforms PS for all rule list sizes, with the gap between TM and PS generally increasing with rule list size, but we do observe that PS outperforms TM for some seeds. On the largest rule lists, TM takes 0.64µs on average to classify a packet, whereas PS takes 0.65µs. TM classifies faster on 10 of the 12 seeds, but the two remaining seeds (ACL2 and FW1) are the slowest for both PS and TM, which skews the average classification times to be closer than the average relative classification time. Focusing on the other 10 seeds, TM takes 0.15µs to 0.76µs (average 0.34µs) to classify a packet and PS takes 0.26µs to 1.00µs (average 0.47µs). These results can be seen in Figure 7.5.

These results can be explained by the number of partitions required and the time required to search each partition. We plot the average number of partitions required by TM and PS for each rule list size in Figure 7.1. As expected, TM typically requires a few more partitions than PS.
Focusing on the 256k rule lists, TM requires an average of 30 partitions whereas PS requires an average of 11 partitions. TM typically achieves a smaller classification time because TM uses hash tables (O(d) time) whereas PS uses trees (O(d + log n) time). However, for the worst case seeds, ACL2 and FW1, TM requires 131 and 53 partitions, respectively, whereas PS requires only 24 and 17 partitions, respectively. For the remaining ten seeds, TM requires at most 38 partitions. The number of partitions has more effect on classification time than on rule update time. For rule insertion, once we find an available partition, we do not need to search later partitions. For rule deletion, both schemes store a pointer to the partition containing each rule. For memory, on the 256k rule lists, TM requires only 4.47 MiB on average whereas PS requires 12.19 MiB on average.

Figure 7.6: Relative Offline Classification Time between PS and TM

Figure 7.7: Absolute Offline Classification Time on 256k rules

7.4.2.2 Offline Comparison

We now compare the offline versions of TM and PS, focusing on classification time, memory usage, and construction time. Offline TM again outperforms offline PS for all rule list sizes except the 1k rule lists, and the gap between TM and PS generally increases with rule list size. However, we again observe that PS does outperform TM for some seeds. Focusing on the 256k rule lists, on average, offline TM classifies packets 39% faster than PS. These results can be seen in Figure 7.6. The outliers again are the two seeds ACL2 and FW1, where offline TM requires 78 and 47 partitions, respectively, whereas offline PS requires only 20 and 13 partitions, respectively.

When comparing memory, the results are similar to those for the online versions. This is to be expected since the memory usage is tied to the number of rules rather than the number of tables. Finally, on average, offline TM builds its dynamic data structure 3.8 times faster than offline PS does.
For the 256k rule lists, TM requires an average of 2.9s whereas PS requires an average of 9.0s.

One of the main reasons to consider offline TM and PS is to periodically rebuild the dynamic data structures to support faster packet classification and rule updates. Averaging performance over all rule list sizes and all seeds, offline TM is 38% faster than online TM at classifying packets. Given that it takes an average of 2.9s to rebuild the dynamic data structures for a 256k rule list, this may be acceptable, particularly if the rebuild process can happen in parallel offline. We further note that existing hardware switches may be behind the controller by 400ms [10], so 2.9s may be acceptable.

Figure 7.8: Relative classification time between SS and TupleMerge

7.4.3 Comparison with Tuple Space Search

We only compare online TM against TSS since TSS is inherently an online algorithm. TM is, on average, 7.43 times faster at classifying packets than TSS, and TM is, on average, 39% slower at performing rule updates than TSS. For our 256k rule lists, TM takes an average of 0.64µs to classify a packet whereas TSS takes an average of 2.93µs. For our 256k rule lists, TM takes on average 1.78µs to update a rule whereas TSS takes on average 1.28µs. We thus see that for only a small penalty in rule updates, TM achieves a significant gain in classification time. This improvement in classification time is largely explained by the fact that TM requires many fewer tables than TSS, as can be seen in Figure 7.1. Besides improving classification time, TM also requires 1.93 times less memory on average than TSS. Given the significant improvement in classification time and even memory, TM is clearly a better choice than TSS as an online packet classification algorithm.
7.4.4 Comparison with SmartSplit

We now compare offline TupleMerge with SmartSplit (SS), the current state-of-the-art offline packet classification algorithm, to assess how much we give up in classification time to achieve fast updates. We compare these algorithms on classification time, construction time, and memory usage. The largest rule lists we use are the 64k rule lists because of SS's slow construction time.

Figure 7.9: Partitions produced

SS classifies significantly faster than TM (and the other methods studied). For our 64k rule lists, on average, SS classifies packets in 0.12µs whereas offline TM classifies packets in 0.24µs. The relative classification time required by SS and TM can be seen in Figure 7.8.

TM requires orders of magnitude less time to build a classifier than SS. Whereas TM can usually generate a classifier in seconds even for very large rule lists, SS can take hours, even for reasonably sized rule lists.

Finally, SS has very unpredictable memory requirements. For smaller rule lists, it produces a single HyperCuts tree, maximizing classification speed. As the number of rules increases, it switches over to multiple trees and/or HyperSplit trees, both of which reduce the rule replication and thus the memory required. This does not fix the underlying problem, though; eventually, there are enough problematic rules that even this is not enough.

Figure 7.10: Smaller collision limit c

7.4.5 Comparison with SAX-PAC

SAX-PAC (SP) [9] splits the rule list into order-independent sets and then builds a classifier for each set. Since order-independent sets are the most general type of collisionless rule list, this serves to lower-bound the number of partitions required for a variety of methods. We compare against SP to estimate how well our table selection scheme performs. TM requires a comparable number of partitions to SP. When SP uses all of the fields, TM requires on average 2.1 times more partitions than SAX-PAC.
This shows that the number of partitions required by TM is not too much larger than the theoretical minimum. However, the classifier recommended by SP requires O(log^(d-1) n) time. To counteract this, they recommend using only 2 fields, which reduces the search cost to O(log n). This increases the number of partitions that they require; TM then requires only 68% more tables. These results can be seen in Figure 7.9. In addition, since they recommend using classifiers with O(log^(d-1) n) search time whereas TM's hash tables require O(d) time, the advantage of fewer partitions disappears.

Figure 7.11: Larger collision limit c

7.4.6 TupleMerge Collision Limit

We now measure how the collision limit c influences TM's effectiveness. We show how the classification speed for both online and offline TM varies as a function of c in Figures 7.10 and 7.11. A higher c allows us to produce fewer tables, but a particular table may take longer to search since we must do a linear search of the colliding rules. For online TM, classification speed is maximized with c = 40. Smaller rule lists generally need smaller c, and the incremental gain tapers off as c exceeds 20. Smaller values, such as c = 10, give acceptable results. The best classification speed can be seen in Figure 7.11. Classification speed for offline TM is maximized earlier, at c = 8. This best speed can be seen in Figure 7.10.

Chapter 8

Conclusion

We presented four methods for packet classification. Each of these methods works by using smart division to divide the problem into simpler subproblems. In Chapter 4, we presented Diplomat, a method for using dimensional reduction to compress rule lists into smaller equivalent lists. We presented both a general framework for Diplomat as well as specific resolvers and schedulers for both range and prefix fields. In a side-by-side comparison with ACL Compressor, Diplomat performed 30.6% better on large range-ACLs, 12.1% better on large prefix-ACLs, and 9.4% better on large mixed-ACLs.
In Chapter 5, we presented ByteCuts, a packet classifier that utilizes rule list separation to create multiple decision trees. It builds the trees using a new type of cut node that selects several consecutive bits from a packet and interprets them as an index into the child array. This allows it to classify packets 58% faster than SmartSplit, the previous best decision tree-based packet classifier.

In Chapter 6, we presented PartitionSort, a packet classifier that partially orders rules by sorting them on their individual fields. It then uses rule list separation to partition the rules into sets that are each totally ordered, allowing for binary searching. This allows it to classify packets 7.2 times faster than Tuple Space Search, the previous choice for OpenFlow packet classification.

Finally, in Chapter 7, we presented TupleMerge, a packet classifier that uses rule list separation to divide the rules into several hash tables. This improves upon Tuple Space Search by allowing rules with similar, but not identical, tuples to be placed into the same table. This reduces the number of tables compared to Tuple Space Search by 5.4 times, which in turn reduces the search time by 7.43 times.

Further research can be done in this area. ByteCuts currently does not support incremental updates. Adding this ability would allow it to support online packet classification. Additionally, ByteCuts, PartitionSort, and TupleMerge could each be used for different partitions in a rule list separation classifier. Identifying new ways to partition the rules to take advantage of the strengths of each algorithm could allow for further improvements.

BIBLIOGRAPHY

[1] David A. Applegate, Gruia Calinescu, David S. Johnson, Howard Karloff, Katrina Ligett, and Jia Wang. Compressing rectilinear pictures and minimizing access control lists. In SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1066–1075, Philadelphia, PA, USA, 2007.
Society for Industrial and Applied Mathematics.

[2] Florin Baboescu and George Varghese. Scalable packet classification. In ACM SIGCOMM, pages 199–210, 2001.

[3] Samuel W. Bent, Daniel D. Sleator, and Robert E. Tarjan. Biased search trees. SIAM Journal on Computing, 14(3):545–568, 1985.

[4] James Daly, Alex X. Liu, and Eric Torng. A difference resolution approach to compressing access control lists, 2013.

[5] Pankaj Gupta and Nick McKeown. Packet classification using hierarchical intelligent cuttings. In Hot Interconnects VII, pages 34–41, 1999.

[6] Pankaj Gupta and Nick McKeown. Packet classification on multiple fields, 1999.

[7] Peng He, Gaogang Xie, Kavé Salamatian, and Laurent Mathy. Meta-algorithms for software-based packet classification. In Proceedings of the 22nd IEEE International Conference on Network Protocols (ICNP), pages 308–319, Washington, DC, USA, 2014. IEEE Computer Society.

[8] Ju Yuan Hsiao, Chuan Yi Tang, and Ruay Shiung Chang. An efficient algorithm for finding a maximum weight 2-independent set on interval graphs. Information Processing Letters, 43(5):229–235, October 1992.

[9] Kirill Kogan, Sergey Nikolenko, Ori Rottenstreich, William Culhane, and Patrick Eugster. SAX-PAC (scalable and expressive packet classification). In Proceedings of the ACM SIGCOMM Conference, pages 15–26, New York, NY, USA, 2014. ACM.

[10] Maciej Kuźniar, Peter Perešíni, and Dejan Kostić. What you need to know about SDN flow tables. In Passive and Active Measurement - 16th International Conference, PAM 2015, New York, NY, USA, March 19-20, 2015, Proceedings, pages 347–359, 2015.

[11] T. V. Lakshman and Dimitrios Stiliadis. High-speed policy-based packet forwarding using efficient multi-dimensional range matching. In ACM SIGCOMM, pages 203–214, 1998.

[12] Alex X. Liu and Mohamed G. Gouda. Complete redundancy detection in firewalls. In Proceedings of the 19th Annual IFIP Conference on Data and Applications Security, LNCS 3654, S.
Jajodia and D. Wijesekera, editors, pages 196–209, August 2005.

[13] Alex X. Liu and Mohamed G. Gouda. Diverse firewall design. IEEE Transactions on Parallel and Distributed Systems, 19:1237–1251, 2008.

[14] Alex X. Liu, Chad R. Meiners, and Eric Torng. TCAM Razor: A systematic approach towards minimizing packet classifiers in TCAMs. IEEE/ACM Transactions on Networking, 18(2):490–500, 2010.

[15] Alex X. Liu, Eric Torng, and Chad Meiners. Firewall compressor: An algorithm for minimizing firewall policies. In Proceedings of the 27th Annual IEEE Conference on Computer Communications (INFOCOM), pages 176–180, Phoenix, Arizona, April 2008.

[16] Alex X. Liu, Eric Torng, and Chad R. Meiners. Compressing network access control lists. IEEE Transactions on Parallel and Distributed Systems, 22:1969–1977, 2011.

[17] Nick McKeown, Scott Shenker, Tom Anderson, Larry Peterson, Jonathan Turner, Hari Balakrishnan, and Jennifer Rexford. OpenFlow: Enabling innovation in campus networks.

[18] Chad R. Meiners, Alex X. Liu, and Eric Torng. Topological transformation approaches to optimizing TCAM-based packet classification systems.

[19] Ben Pfaff, Justin Pettit, Teemu Koponen, Ethan Jackson, Andy Zhou, Jarno Rajahalme, Jesse Gross, Alex Wang, Joe Stringer, Pravin Shelar, Keith Amidon, and Martin Casado. The design and implementation of Open vSwitch. In Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 117–130, Oakland, CA, May 2015. USENIX Association.

[20] Yaxuan Qi, Lianghong Xu, Baohua Yang, Yibo Xue, and Jun Li. Packet classification algorithms: From theory to practice. In INFOCOM 2009, IEEE, pages 648–656, 2009.

[21] Matthew J. Saltzman. COIN-OR: An open-source library for optimization, pages 3–32. Springer US, Boston, MA, 2002.

[22] Sumeet Singh, Florin Baboescu, George Varghese, and Jia Wang. Packet classification using multidimensional cutting.
In Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM '03, pages 213–224, New York, NY, USA, 2003. ACM.

[23] Haoyu Song, Sarang Dharmapurikar, Jonathan Turner, and John Lockwood. Fast hash table lookup using extended Bloom filter: An aid to network processing. In Proceedings of the 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM '05, pages 181–192, New York, NY, USA, 2005. ACM.

[24] V. Srinivasan, S. Suri, and G. Varghese. Packet classification using tuple space search. SIGCOMM Computer Communication Review, 29(4):135–146, August 1999.

[25] V. Srinivasan, George Varghese, Subhash Suri, and Marcel Waldvogel. Fast and scalable layer four switching, 1998.

[26] Xuehong Sun, Sartaj K. Sahni, and Yiqiang Q. Zhao. Packet classification consuming small amount of memory. IEEE/ACM Transactions on Networking, 13(5):1135–1145, October 2005.

[27] Subhash Suri, Tuomas Sandholm, and Priyank Ramesh Warkhede. Compressing two-dimensional routing tables. Algorithmica, 35(4):287–300, 2003.

[28] David E. Taylor. Survey & taxonomy of packet classification techniques. Technical report, ACM Computing Surveys, 2004.

[29] David E. Taylor and Jonathan S. Turner. ClassBench: A packet classification benchmark. IEEE/ACM Transactions on Networking, 15(3):499–511, June 2007.

[30] V. K. Vaishnavi. Multidimensional balanced binary trees. IEEE Transactions on Computers, 38(7):968–985, July 1989.

[31] Balajee Vamanan, Gwendolyn Voskuilen, and T. N. Vijaykumar. EffiCuts: Optimizing packet classification for memory and throughput. SIGCOMM Computer Communication Review, 40(4):207–218, August 2010.