HARDWARE ALGORITHMS FOR HIGHSPEED PACKET
PROCESSING
By
Eric Norige
A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Computer Science — Doctor of Philosophy
2017
ABSTRACT
HARDWARE ALGORITHMS FOR HIGHSPEED PACKET PROCESSING
By
Eric Norige
The networking industry is facing enormous challenges of scaling devices to support the
exponential growth of internet traffic as well as increasing number of features being implemented inside the network. Algorithmic hardware improvements to networking components
have largely been neglected due to the ease of leveraging increased clock frequency and compute power and the risks of implementing complex hardware designs. As clock frequency
slows its growth, algorithmic solutions become important to fill the gap between current
generation capability and next generation requirements. This paper presents algorithmic
solutions to networking problems in three domains: Deep Packet Inspection (DPI), firewall
ruleset compression and noncryptographic hashing. The improvements in DPI are twopronged: first in the area of applicationlevel protocol field extraction, which allows security
devices to precisely identify packet fields for targeted validity checks. By using counting
automata, we achieve precise parsing of nonregular protocols with small, constant perflow
memory requirements, extracting at rates of up to 30 Gbps on real traffic in software while
using only 112 bytes of state per flow. The second DPI improvement is on the long standing
regular expression matching problem, where we complete the HFA solution to the DFA state
explosion problem with efficient construction algorithms and optimized memory layout for
hardware or software implementation. These methods construct automata too complex to
be constructed by previous methods in seconds, while being capable of 29 Gbps throughput
with an ASIC implementation. Firewall ruleset compression enables more firewall entries to
be stored in a fixed capacity pattern matching engine, and can also be used to reorganize a
firewall specification for higher performance software matching. A novel recursive structure
called TUF is given to unify the best known solutions to this problem and suggest future
avenues of attack. These algorithms, with little tuning, achieve a 13.7% improvement in
compression on large, reallife classifiers, and can achieve the same results as existing algorithms while running 20 times faster. Finally, noncryptographic hash functions can be used
for anything from hash tables to track network flows to packet sampling for traffic characterization. We give a novel approach to generating hardware hash functions in between the
extremes of expensive cryptographic hash functions and low quality linear hash functions.
To evaluate these midrange hash functions properly, we develop new evaluation methods to
better distinguish noncryptographic hash function quality. The hash functions described in
this paper achieve lowlatency, wide hashing with good avalanche and universality properties
at a much lower cost than existing solutions.
To my advisor, who didn’t give up on me despite all the reasons to.
iv
ACKNOWLEDGMENTS
I would like to thank Chad Meiners and Sailesh Kumar for providing strong shoulders to
stand upon.
This work is partially supported by the National Science Foundation under Grant Numbers CNS0916044, CNS0845513, CNS1017588, CCF1347953, CNS1318563 and CNS1017598, and the National Natural Science Foundation of China under Grant Numbers
61272546 and 61370226, and by a research gift from Cisco Systems, Inc..
v
TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
ix
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
x
Chapter 1
1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Chapter 2 Protocol Parsing . . . . . . . . . . .
2.1 Introduction . . . . . . . . . . . . . . . . . .
2.1.1 Motivation . . . . . . . . . . . . . . .
2.1.2 Problem Statement . . . . . . . . . .
2.1.3 Limitations of Prior Art . . . . . . .
2.1.4 Proposed Approach . . . . . . . . . .
2.1.5 Key Contributions . . . . . . . . . .
2.2 Related Work . . . . . . . . . . . . . . . . .
2.3 Protocol and Extraction Specifications . . .
2.3.1 Counting Context Free Grammar . .
2.3.2 Protocol Specification in CCFG . . .
2.3.3 Extraction Specification in CCFG . .
2.4 Grammar Optimization . . . . . . . . . . . .
2.4.1 Counting Regular Grammar . . . . .
2.4.2 Normal Nonterminal Identification .
2.4.3 Normal Nonterminal Regularization .
2.4.4 Counting Approximation . . . . . . .
2.4.5 Idle Rule Elimination . . . . . . . . .
2.5 Automated Counting Automaton Generation
2.5.1 Counting Automata . . . . . . . . .
2.5.2 LPDFA . . . . . . . . . . . . . . . .
2.5.3 CA Specific Optimizations . . . . . .
2.6 Counting Automaton Implementation . . . .
2.6.1 Incremental Packet Processing . . . .
2.6.2 Simulated CA . . . . . . . . . . . . .
2.6.3 Compiled CA . . . . . . . . . . . . .
2.7 Extraction Generator . . . . . . . . . . . . .
2.8 Experimental Results . . . . . . . . . . . . .
2.8.1 Methods . . . . . . . . . . . . . . . .
2.8.1.1 Traces . . . . . . . . . . . .
2.8.1.2 Field Extractors . . . . . .
2.8.1.3 Metrics . . . . . . . . . . .
2.8.2 Experimental Results . . . . . . . . .
2.8.2.1 Parsing Speed . . . . . . . .
2.8.2.2 Memory Use . . . . . . . .
vi
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
4
4
4
6
7
8
10
10
12
13
14
16
18
18
19
20
22
24
25
26
29
31
32
32
35
37
39
42
43
43
45
46
47
47
50
2.9
2.8.2.3 Parser Definition Complexity . . . . . . . . . . . . . . . . .
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
51
51
Chapter 3 Regex Matching . . . . . . . . . . . .
3.1 Introduction . . . . . . . . . . . . . . . . . . .
3.1.1 Motivation . . . . . . . . . . . . . . . .
3.1.2 Limitations of Prior Art . . . . . . . .
3.1.3 Proposed Approach . . . . . . . . . . .
3.1.4 Challenges and Proposed Solutions . .
3.1.5 Key Novelty and Contributions . . . .
3.2 Related Work . . . . . . . . . . . . . . . . . .
3.3 Automatic HFA Construction . . . . . . . . .
3.3.1 Basic Construction Method . . . . . .
3.3.2 Bit State Selection . . . . . . . . . . .
3.3.3 HFA Construction without DFA . . . .
3.3.4 Transition Table Optimization . . . . .
3.4 Fast HFA Construction . . . . . . . . . . . . .
3.4.1 Observation and Basic Ideas . . . . . .
3.4.2 Bit State Pruning . . . . . . . . . . . .
3.4.3 Mixin Table Generation . . . . . . . .
3.4.4 HFA Transition Table Generation . . .
3.4.5 Correctness of Fast HFA Construction
3.5 Fast Packet Processing . . . . . . . . . . . . .
3.5.1 Design Considerations . . . . . . . . .
3.5.2 Transition Format . . . . . . . . . . .
3.5.3 Action Compression Algorithm . . . .
3.5.4 Transition Table Image Construction .
3.6 Hardware Design . . . . . . . . . . . . . . . .
3.7 Experimental Results . . . . . . . . . . . . . .
3.7.1 Data Set . . . . . . . . . . . . . . . . .
3.7.2 Metrics & Experimental Setup . . . . .
3.7.3 Automaton Construction: Time & Size
3.7.4 Packet Processing Throughput . . . . .
3.8 Conclusions . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
54
54
54
55
57
59
60
61
63
63
64
66
67
69
69
71
71
74
76
79
79
79
81
84
86
90
91
93
93
96
98
Chapter 4 Firewall Compression . . .
4.1 Introduction . . . . . . . . . . . . .
4.1.1 Background and Motivation
4.1.2 Problem Statement . . . . .
4.1.3 Limitations of Prior Art . .
4.1.4 Proposed Approach . . . . .
4.1.5 Key Contributions . . . . .
4.2 Related Work . . . . . . . . . . . .
4.3 TUF Framework . . . . . . . . . .
4.3.1 TUF Outline . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
99
99
99
101
102
103
104
104
106
106
.
.
.
.
.
.
.
.
.
.
vii
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
4.3.2 Efficient Solution Merging . . . . . . . . . . . . .
4.4 Prefix Minimization using Tries . . . . . . . . . . . . . .
4.4.1 1Dimensional Prefix Minimization . . . . . . . .
4.4.2 Multidimensional prefix minimization . . . . . .
4.5 Ternary Minimization using Terns . . . . . . . . . . . . .
4.6 Ternary Minimization using ACLs . . . . . . . . . . . . .
4.7 Revisiting prior schemes in TUF . . . . . . . . . . . . . .
4.8 Ternary Redundancy Removal . . . . . . . . . . . . . . .
4.9 Experimental Results . . . . . . . . . . . . . . . . . . . .
4.9.1 Evaluation Metrics . . . . . . . . . . . . . . . . .
4.9.2 Results on realworld classifiers . . . . . . . . . .
4.9.2.1 Sensitivity to number of unique decisions
4.9.2.2 Comparison with stateoftheart results
4.9.3 Efficiency . . . . . . . . . . . . . . . . . . . . . .
4.9.4 Ternary Redundancy Removal . . . . . . . . . . .
4.10 Problem Variants . . . . . . . . . . . . . . . . . . . . . .
4.11 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . .
Chapter 5 Hashing . . . . . . . . . . . . . . . .
5.1 Introduction . . . . . . . . . . . . . . . . . .
5.1.1 Networking Hash Trends . . . . . . .
5.2 Related Work . . . . . . . . . . . . . . . . .
5.2.1 Cryptographic hashes . . . . . . . . .
5.2.2 Noncryptographic Hardware Hashes
5.2.3 Software Hashes . . . . . . . . . . . .
5.2.4 Existing Evaluation Methods . . . .
5.3 Design Methodology . . . . . . . . . . . . .
5.3.1 Design considerations . . . . . . . . .
5.3.2 The framework . . . . . . . . . . . .
5.3.3 XOR stage . . . . . . . . . . . . . . .
5.3.4 Sbox stage . . . . . . . . . . . . . .
5.3.5 Permutation stage . . . . . . . . . .
5.4 Evaluation . . . . . . . . . . . . . . . . . . .
5.4.1 Hash functions for Comparison . . .
5.4.2 Generalized Uniformity Test . . . . .
5.4.3 Avalanche Test . . . . . . . . . . . .
5.4.4 Universality Test . . . . . . . . . . .
5.5 Future Work . . . . . . . . . . . . . . . . . .
5.6 Conclusion . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
108
112
113
116
117
121
123
126
128
128
130
132
132
135
136
138
141
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
142
142
143
146
146
146
147
148
150
150
153
154
156
158
159
160
161
168
169
173
173
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
viii
LIST OF TABLES
Table 3.1: Example transitions before optimization . . . . . . . . . . . . . . . . . . .
67
Table 3.2: HFA transition mergeability table . . . . . . . . . . . . . . . . . . . . . .
68
Table 3.3: Table 3.1 transitions after optimization . . . . . . . . . . . . . . . . . . .
69
Table 3.4: Mixin Incoming Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
72
Table 3.5: Bit State 1 Outgoing Table . . . . . . . . . . . . . . . . . . . . . . . . . .
73
Table 3.6: Bit State 3 Outgoing Table . . . . . . . . . . . . . . . . . . . . . . . . . .
73
Table 3.7: RegEx set Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
92
Table 4.1: An example packet classifier . . . . . . . . . . . . . . . . . . . . . . . . . 100
Table 4.2: A classifier equivalent to over 2B range rules . . . . . . . . . . . . . . . . 127
Table 4.3: An equivalent classifier equivalent to 131 range rules . . . . . . . . . . . . 127
Table 4.4: An equivalent classifier equivalent to 3 range rules . . . . . . . . . . . . . 128
Table 4.5: Classifier categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Table 4.6: ACR on real world classifiers . . . . . . . . . . . . . . . . . . . . . . . . . 132
Table 4.7: Small Classifiers Compressed # of Rules . . . . . . . . . . . . . . . . . . . 134
Table 4.8: Medium Classifiers Compressed # of Rules . . . . . . . . . . . . . . . . . 134
Table 4.9: Large Classifiers Compressed # of Rules . . . . . . . . . . . . . . . . . . . 134
Table 4.10: Runtimes in seconds of Red. Rem. algorithms . . . . . . . . . . . . . . . 137
Table 5.1: Avalanche with real trace and +1 sequence . . . . . . . . . . . . . . . . . 169
ix
LIST OF FIGURES
Figure 2.1: FlowSifter architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9
Figure 2.2: Two protocol specifications in CCFG . . . . . . . . . . . . . . . . . . . .
15
Figure 2.3: Derivation of “10 ba” from the Varstring grammar in Figure 2.2(a) . . .
16
Figure 2.4: Two extraction CCFGs Γxv and Γxd . . . . . . . . . . . . . . . . . . . .
17
Figure 2.5: Varstring after decomposition of rule S→BV. . . . . . . . . . . . . . . .
21
Figure 2.6: General Approximation Structure . . . . . . . . . . . . . . . . . . . . . .
23
Figure 2.7: Approximation of Dyck S . . . . . . . . . . . . . . . . . . . . . . . . . .
24
Figure 2.8: CRG for Dyck example from Figures 2.2b and 2.4b . . . . . . . . . . . .
27
Figure 2.9: Exploded CA for Dyck in Figures 2.2b and 2.4b; each cluster has start
CA state on left, and destination CA state on right . . . . . . . . . . . .
28
Figure 2.10: Extraction Generator Application Screenshot . . . . . . . . . . . . . . .
40
Figure 2.11: Intuitive MetaTree and Extraction Grammar for Dyck example . . . . .
41
Figure 2.12: Comparison of parsers on different traces . . . . . . . . . . . . . . . . . .
48
Figure 2.13: Various experimental results . . . . . . . . . . . . . . . . . . . . . . . . .
53
Figure 3.1: NFA, HFA, and DFA generated from RegEx set: {EFG, X.*Y} . . . . . .
56
Figure 3.2: Input NFA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
70
Figure 3.3: Pruned NFA
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
Figure 3.4: Mixin Outgoing Table for 1&3 . . . . . . . . . . . . . . . . . . . . . . . .
74
Figure 3.5: Output HFA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
76
Figure 3.6: Transition Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
81
Figure 3.7: Action Mask Example . . . . . . . . . . . . . . . . . . . . . . . . . . . .
82
x
Figure 3.8: Effects of Table Width . . . . . . . . . . . . . . . . . . . . . . . . . . . .
85
Figure 3.9: Hardware design for HFA module . . . . . . . . . . . . . . . . . . . . . .
89
Figure 3.10: Construction Time BCS Sets . . . . . . . . . . . . . . . . . . . . . . . .
94
Figure 3.11: Construction Time Scale Sequence . . . . . . . . . . . . . . . . . . . . .
95
Figure 3.12: Memory Image Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
95
Figure 3.13: Throughput Synthetic Traces . . . . . . . . . . . . . . . . . . . . . . . .
96
Figure 3.14: Throughput Real Traces . . . . . . . . . . . . . . . . . . . . . . . . . . .
97
Figure 3.15: Transition Order Optimization . . . . . . . . . . . . . . . . . . . . . . .
97
Figure 3.16: HASIC hardware implementation throughput . . . . . . . . . . . . . . .
98
Figure 4.1: Structural Recursion Example, converting a BDD to a TCAM Classifier
108
Figure 4.2: TUF operations w/ backgrounds . . . . . . . . . . . . . . . . . . . . . . 109
Figure 4.3: TUF Trie compression of simple classifier . . . . . . . . . . . . . . . . . . 115
Figure 4.4: Encapsulation at a field boundary . . . . . . . . . . . . . . . . . . . . . . 117
Figure 4.5: Example Tern compression . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Figure 4.6: Recursive merging left and right ACLs . . . . . . . . . . . . . . . . . . . 122
Figure 4.7: ACL pairing example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Figure 4.8: Razor hoisting the null solution as a decision . . . . . . . . . . . . . . . . 125
Figure 4.9: TUF and Razor comparison on same input . . . . . . . . . . . . . . . . . 126
Figure 4.10: Razor vs. Redundancy Removal, for classifier grouping purposes . . . . . 130
Figure 4.11: Improvement of TUF over the state of the art for real life classifiers . . . 131
Figure 4.12: Incomplete compression time comparison . . . . . . . . . . . . . . . . . . 136
Figure 5.1: Framework of hash function . . . . . . . . . . . . . . . . . . . . . . . . . 153
Figure 5.2: One stage of XOR Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . 154
xi
Figure 5.3: Equivalent matrix equation . . . . . . . . . . . . . . . . . . . . . . . . . 154
Figure 5.4: One stage of Sboxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Figure 5.5: AOI222 gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Figure 5.6: Example permutation with 6 layer separated and then combined . . . . . 158
Figure 5.7: Boxandwhisker plot of the results of Uniformity tests on real trace . . . 164
Figure 5.8: Uniformity results for “+10” series . . . . . . . . . . . . . . . . . . . . . 166
Figure 5.9: QQ plot for “+10” series Many functions which perform well are removed
from the figure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Figure 5.10: Procedure for testing universality . . . . . . . . . . . . . . . . . . . . . . 171
Figure 5.11: Universality testing results . . . . . . . . . . . . . . . . . . . . . . . . . . 172
xii
Chapter 1
Introduction
The packet processing domain offers algorithm authors a unique combination of challenges
and opportunities. The phrase “packet processing domain” means the wide range of tasks
done in networking components’ data planes. This includes topics as varied as buffer management strategies, to manage buffering packets between reception and transmission, and
IP lookup, to determine how to forward the packet. It also includes security topics such as
deep packet inspection and statistics gathering topics such as packet sampling and counter
implementations. Until coming into the networking field, I had no idea how much work has
been put into the implementation of simply counting how many packets/bytes were transferred by a router. It turns out that there’s a lot of improvement that can be done on even
this simple task.
One reason these tasks need improvement is the incredible demand for Internet bandwidth. Internet use is still growing at astronomical rates, and the infrastructure needed to
support this growth keeps being pushed to its limits. Electrical and optical engineers keep
improving the “pipes” of the internet, giving the ability to send more and more data through
wires and fiber optic strands. Historically, the data path in the boxes that connected these
“pipes” had little difficulty keeping up with the data transfer rates. Further, as semiconductor technologies scaled up in frequency, the data path logic performance was automatically
upgraded, so less attention was paid to its implementation. Semiconductor technology is
1
not scaling in frequency anywhere near as fast as it has in the past, so this easy source of
performance has been lost, making algorithmic improvements valuable to keep pace with
demand.
Further, new security and monitoring requirements are being added to networks, increasing the burden on the datapath to do complex processing on packets in flight. Firewall
rulesets and IP routing tables are growing in size even as the time available to process each
packet decreases with higher line rates. Deep packet inspection is moving up the complexity ladder to being able to deeply understand application protocols with a combination of
string matching, regular expression matching and protocol parsing. Monitoring of networks
is becoming more involved, with network analytics becoming important to managing even
midsize networks, with predictive systems alerting administrators to problems before failures occur. For high security networks, network situational analysis and network behavior
anomaly detection tools can be deployed to monitor for and identify a wide range of security
problems from information exfiltration to botnets for further investigation.
The opportunity is that the networking industry is able to make use of custom hardware
solutions. Normally, custom hardware is well out of the reach of most algorithm designers
for three reasons: lifetime, wide scope and experience. An algorithm that will have to
be frequently changed, such as Google PageRank, is not a good candidate for hardware
implementation, as the hardware will take much time to develop and will be out of date by
the time it can be used. In the networking field, the task of a router hasn’t fundamentally
changed in decades. Even the new requirements being added to network datapaths can
be built on top of hardware primitives for orders of magnitude performance improvements.
The cost of custom hardware is high, but the large number of installed units allows this
cost to be amortized across hundreds of thousands to millions of devices. Further, the cost
2
of hardware is a small part of the total cost of ownership of a network, allowing expensive,
highperformance hardware to be in high demand. Finally, the networking industry has a
long history of developing custom hardware as part of the “pipes” portion of networking.
This has led to the industry having experience implementing custom hardware solutions,
making it possible for a customhardware solution to their problems to be integrated in new
products.
This paper develops algorithmic solutions to four separate problems. It first tackles
deep packet inspection at two levels: protocol parsing and regular expression matching. An
efficient protocol parsing engine is developed in Chapter 2 to extract applicationlayer fields
from raw flows. Chapter 3 follows with novel methods of extended automaton construction
for regular expression matching. Chapter 4 develops a framework for building firewallstyle
ruleset compression. It ends with Chapter 5 on ASICoptimized noncryptographic hashing,
useful in hash tables and measurement sampling.
The results described in Chapter 2 have been published in JSAC Vol 32 No. 10. The
results described in Chapter 3 chapter have been published in ICNP 2013. The results
described in Chapter 4 have been published in ANCS 2013 and are in preprint for TON
2015. The results described in Chapter 5 have been published in ANCS 2011.
3
Chapter 2
Protocol Parsing
2.1
2.1.1
Introduction
Motivation
Content inspection is the core of a variety of network security and management devices
and services such as Network Intrusion Detection/Prevention Systems (NIDS/NIPS), load
balancing, and traffic shaping. Currently, content inspection is typically achieved by Deep
Packet Inspection (DPI), where packet payload is simply treated as a string of bytes and
is matched against content policies specified as a set of regular expressions. However, such
regular expression based content policies, which do not take content semantics into consideration, can no longer meet the increasing complexity of network security and management.
Semanticaware content policies, which are specified over application protocols fields, i.e.,
Layer 7 (L7) fields, have been increasingly used in network security and management devices
and services. For example, a load balancing device would like to extract method names and
parameter fields from flows (carrying SOAP [1] and XMLRPC [2] traffic for example) to
determine the best way to route traffic. A network web traffic analysis tool would extract
message length and content type fields from HTTP header fields to gain information about
current web traffic patterns. Another example application that demonstrates the need for
and the power of semanticaware content policies is vulnerabilitybased signature checking for
4
detecting polymorphic worms in NIDS/NIPS. Traditionally, NIDS/NIPS use exploitbased
signatures, which are typically specified as regular expressions. The major weakness of an
exploitbased signature is that it only recognizes a specific implementation of an exploit.
For example, the following exploitbased signature for Code Red,
urlcontent:‘‘ida?NNNNNNNNNNNN...’’,
where the long string of N s is used to trigger a buffer overflow vulnerability, fails to detect
Code Red variant II where each N is replaced by an X. Given the limitations of exploitbased
signatures and the rapidly increasing number of polymorphic worms, vulnerabilitybased
signatures have emerged as effective tools for detecting zeroday polymorphic worms [3–6].
As a vulnerabilitybased signature is independent of any specific implementation, it is hard
to evade even for polymorphic worms. Here is an example vulnerabilitybased signature for
Code Red worms as specified in Shield [3]:
c = MATCH STR LEN(>> P Get Request.URI,‘‘id[aq]
?(.*)$’’,limit);
IF (c > limit) # Exploit!
The key feature of this signature is its extraction of the string beginning with “ida?” or
“idq?”. By extracting this string and then measuring its length, this signature is able to
detect any variant of the Code Red polymorphic worm.
We call the process of inspecting flow content based on semanticaware content policies
Deep Flow Inspection (DFI). DFI is the foundation and enabler of a wide variety of current
and future network security and management services such as vulnerabilitybased malware
filtering, applicationaware load balancing, network measurement, and contentaware caching
and routing. The core technology of DFI is L7 field extraction, the process of extracting the
values of desired application (Layer 7) protocol fields.
5
2.1.2
Problem Statement
In this paper, we aim to develop a highspeed online L7 field extraction framework, which
will serve as the core of next generation semantic aware network security and management
devices. Such a framework needs to satisfy the following six requirements. First, it needs to
parse at highspeed. Second, it needs to use a small amount of memory per flow so that the
framework can run in SRAM; otherwise, it will be running in DRAM, which is hundreds of
times slower than SRAM.
Third, such a framework needs to be automated ; that is, it should have a tool that takes
as input an extraction specification and automatically generates an extractor. Handcoded
extractors are not acceptable because each time when application protocols or fields change,
they need to be manually written or modified, which is slow and error prone.
Fourth, because of time and space constraints, any such framework must perform selective
parsing, i.e., parsing only relevant protocol fields that are needed for extracting the specified
fields, instead of full parsing, i.e., parsing the values of every field. Full protocol parsing is too
slow and is unnecessary for many applications such as vulnerabilitybased signature checking
because many protocol fields may not be referenced in given vulnerability signatures [5].
Selective parsing that skips irrelevant data leads to faster parsing with less memory. To
avoid full protocol parsing and improve parsing efficiency, we want to dramatically simplify
parsing grammars based on extraction specification. We are not aware of existing compiler
theory that addresses this issue.
Fifth, again because of time and space constraints, any such framework must support
approximate protocol parsing where the actual parser does not parse the input exactly as
specified by the grammar. While precise parsing, where a parser parses input exactly as
6
specified by the grammar, is desirable and possibly feasible for end hosts, it is not practical
for highspeed network based security and management devices. First, precise parsing for
recursive grammars is time consuming and memory intensive; therefore, it is not suitable for
NIDS/NIPS due to performance demand and resource constraints. Second, precise parsers
are vulnerable to Denial of Service (DoS) attacks as attackers can easily craft an arbitrarily
deep recursive structure in their messages and exhaust the memory used by the parser.
However, it is technically challenging to perform approximate protocol parsing. Again, since
most existing compiler applications run on end hosts, we are not aware of existing compiler
theory that addresses this issue.
At last, any such framework must be able to parse application protocols with field length
descriptors, which are fields that specify the length of another field. An example field length
descriptor is the HTTP ContentLength header field, which gives the length of the HTTP
body. Field length descriptors cannot be described with CFG [7,8], which means that CFGs
are not expressive enough for such framework.
2.1.3
Limitations of Prior Art
To the best of our knowledge, there is no existing online L7 field extraction framework
that is both automated and selective. The only selective protocol parsing solution (i.e.,
[5]) is handcoded and all automated protocol parsing solutions (i.e., [7–16]) perform full
parsing. Furthermore, prior work on approximate protocol parsing is too inaccurate. To
reduce memory usage, Moscola et al.proposed ignoring the recursive structures in a grammar
[13–15]. This crude approximate parsing is not sufficient because recursive structures often
must be partially parsed in order to extract the desired information; for example, the second
field in a function call.
7
2.1.4
Proposed Approach
To address the above limitations of prior work on application protocol parsing and extraction,
in this paper, we propose FlowSifter, an L7 field extraction framework that is automated,
selective, and approximate. The architecture of FlowSifter is illustrated in Figure 2.1. The
input to FlowSifter is an extraction specification that specifies the relevant protocol fields
that we want to extract values. FlowSifter adopts a new grammar model called Counting
Regular Grammars (CRGs) for describing both the grammatical structures of messages carried by application protocols and the message fields that we want FlowSifter to extract. To
our best knowledge, CRG is the first protocol grammar model that facilitates the automatic
optimization of extraction grammars based on their corresponding protocol grammars. The
extraction specification can be a partial specification that uses a corresponding complete
protocol grammar from FlowSifter’s builtin library of protocol grammars to complete its
specification. Changing the field extractor only requires changing the extraction specification, which makes FlowSifter highly flexible and configurable. FlowSifter has three modules:
a grammar optimizer, an automata generator, and a field extractor. The grammar optimizer
module takes the extraction specification and the corresponding protocol grammar as its
input and outputs an optimized extraction grammar, by which FlowSifter selectively parses
only relevant fields bypassing irrelevant fields. The automata generator module takes the optimized extraction grammar as its input and outputs the corresponding counting automaton.
The field extractor module applies the counting automaton to extract relevant fields from
flows.
FlowSifter achieves low memory consumption by stackless approximate parsing. Processing millions of concurrent flows makes stacks an unaffordable luxury and a vulnerability for
8
Data Flow in
Protocol
Library
User Friendly
Extraction
Specification
Grammar
Optimizer
Optimized
Extraction
Grammar
Automata
Generator
Field
Extractor
Counting
Automata
Extracted
Fields
Data Flow out
Figure 2.1: FlowSifter architecture
DoS attacks. For some application protocols such as XMLRPC [2], an attacker may craft
its flow so that the stack used by the parser go infinitely deep until memory is exhausted. To
achieve a controlled tradeoff between memory usage and parsing accuracy, we use a formal
automata theory model, counting automata (CA), to support approximate protocol parsing.
CAs facilitate stackless parsing by using counters to maintain parsing state information.
Controlling memory size allocated to counters gives us a simple way of balancing between
memory usage and parsing accuracy. With this solid theoretical underpinning, FlowSifter
achieves approximate protocol parsing with welldefined error bounds.
FlowSifter can be implemented in both software and hardware. For hardware implementation, FlowSifter can be implemented in both ASIC and Ternary Content Addressable
Memory (TCAM). For ASIC implementation, FlowSifter uses a small, fast parsing engine
to traverse the memory image of the CA along with the flow bytes, allowing realtime processing of a vast number of flows with easy modification of the protocol to be parsed. For
TCAM implementation, FlowSifter encodes the CA transition tables into a TCAM table,
thus allowing FlowSifter to extract relevant information from a flow in linear time over the
number of bytes in a flow. FlowSifter uses optimization techniques to minimize the number
9
of TCAM bits needed to encode the transition tables. Note that TCAM has already been
installed on many networking devices such as routers and firewalls.
2.1.5
Key Contributions
In this paper, we make the following key contributions:
1. We propose the first L7 field extraction framework that is automated, selective, and
approximate.
2. We propose for the first time to use Counting ContextFree Grammar and Counting
Automata to support approximate protocol parsing. By controlling memory size allocated to counters, we can easily tradeoff between memory usage and parsing accuracy.
3. We propose efficient algorithms for optimizing extraction grammars.
4. We propose an algorithm for the automatic generation of stackless parsers from nonregular grammars.
2.2
Related Work
Prior work on application protocol parsing falls into three categories: (1) handcoded, full,
and precise parsing, (2) handcoded, selective, and precise parsing, and (3) automated, full,
and precise parsing,.
Handcoded, full, and precise parsing: Although application protocol parsers are still
predominantly hand coded [7], handcoded protocol parsing has two major weaknesses in
comparison with automated protocol parsing. First, handcoded protocol parsers are hard to
10
reuse as they are tightly coupled with specific systems and deeply embedded into their working environment [7]. For example, Wireshark has a large collection of protocol parsers, but
none can be easily reused outside of Wireshark. Second, such parsers tend to be errorprone
and lack robustness [7]. For example, severe vulnerabilities have been discovered in several
handcoded protocol parsers [17–22]. Writing an efficient and robust parser is a surprisingly
difficult and errorprone process because of the many protocol specific issues (such as handling concurrent flows) and the increasing complexity of modern protocols [7]. For example,
the NetWare Core Protocol used for remote file access has about 400 request types, each
with its own syntax [7].
Handcoded, selective, and precise parsing: Full protocol parsing is not necessary for many
applications. For example, Schear et al.observed that full protocol parsing is not necessary
for detecting vulnerabilitybased signatures because many protocol fields are not referenced
in vulnerability signatures [5]. Based on such observations, Schear et al.proposed selective
protocol parsing [5], which is three times faster than binpac. However, the protocol parsers
in [5] are handcoded and henceforth suffer from the weaknesses of handcoded protocol
parsing.
Automated, full, and precise parsing: Recognizing the increasing demand for application
protocol parsers, the difficulty in developing efficient and robust protocol parsers, and the
many defects of homebrew parsers, three application protocol parser generators have been
proposed: binpac [7], GAPA [8], and UltraPAC [16]. Most network protocols are designed
to be easily parsed by hand, but this often means their formal definitions turn out complex
in terms of standard parsing representations. Pang et al., motivated by the fact that the
programming language community has benefited from higher levels of abstraction for many
years using parser generation tools such as yacc [23] and ANTLR [24], developed the pro11
tocol parser generator BinPAC. GAPA, developed by Borisov et al., focuses on providing a
protocol specification that guarantees the generated parser to be typesafe and free of infinite
loops. The similarity between BinPAC and GAPA is that they both use recursive grammars
and embedded code to generate context sensitive protocol parsers. The difference between
BinPAC and GAPA is that BinPAC favors parsing efficiency and GAPA favors parsing safety.
BinPAC uses C++ for users to specify code blocks and compile the entire parser into C++
whereas GAPA uses a restricted and memorysafe interpreted language that can be proven
free of infinite loops. UltraPAC improves on BinPAC by replacing BinPAC’s tree parser with
a stream parser implemented using a state machine to avoid constructing the tree representation of a flow. UltraPAC inherits from BinPAC a lowlevel protocol field extraction
language that allows additional grammar expressiveness using embedded C++ code. This
makes optimizing an extraction specification extremely difficult, if not impossible. In contrast, FlowSifter uses highlevel CA grammars without any inline code, which facilitates the
automated optimization of protocol field extraction specifications. When parsing HTTP, for
example, BinPAC and UltraPAC need inline C++ code to detect and extract the ContentLength field’s value whereas FlowSifter’s grammar can represent this operation directly. In
addition, FlowSifter can automatically regularize nonregular grammars to produce a stackless approximate parser whereas an UltraPAC parser for the same extraction specification
must be converted manually to a stackless form using C++ to embed the approximation.
2.3
Protocol and Extraction Specifications
FlowSifter produces an L7 field extractor from two inputs: a protocol specification, and an
extraction specification. Both specifications are specified using a new grammar model called
12
Counting Context Free Grammar (CCFG), which augments rules for contextfree grammars
with counters, guards, and actions. These augmentations increase grammar expressiveness,
but still allows the grammars to be automatically simplified and optimized. In this section,
we first formally define CCFG, and then explain how to write protocol specification and
extraction specification using CCFG.
2.3.1
Counting Context Free Grammar
Formally, a counting contextfree grammar is a fivetuple Γ = (N, Σ, C, R, S) where N, Σ, C,
and R are finite sets of nonterminals, terminals, counters, and production rules, respectively,
and S is the start nonterminal. The terminal symbols are those that can be seen in strings
to be parsed. For L7 field extraction, this is usually a single octet. A counter is a variable
with an integer value, which is initialized to zero. The counters can be used to store parsing
information. For example, in parsing an HTTP flow, a counter can be used to store the value
of the “ContentLength” field. Counters also provide a mechanism for eliminating parsing
stacks.
A production rule is written as guard : nonterminal → body . The guard is a conjunction of unary predicates over the counters in C, i.e., expressions of a single counter that
return true or false. An example guard is (c1 > 2; c2 > 2), which checks counters c1 and c2 ,
and evaluates to true if both are greater than 2. If a counter is not included in a guard, then
its value does not affect the evaluation of the guard. Guards are used to guide the parsing of
future bytes based on the parsing history of past bytes. For example, in parsing an HTTP
flow, the value of the “ContentLength” field determines the number of bytes that will be
included in the message body. This value can be stored in a counter. As bytes are processed,
13
the associated counter is decremented. A guard that checks if the counter is 0 would detect
the end of the message body and allow a different action to be taken.
The nonterminal following the guard is called the head of the rule. Following it, the
body is an ordered sequence of terminals and nonterminals, any of which can have associated
actions. An empty body is written . An action is a set of unary update expressions, each
updating the value of one counter, and is associated with a specific terminal or nonterminal
in a rule. The action is executed after parsing the associated terminal or nonterminal. An
example action in CCFG is (c1 := c1 ∗ 2; c2 := c2 + 1). If a counter is not included in an
action, then the value of that counter is unchanged. An action may be empty, i.e., updates
no counter. Actions in CCFG are used to write “history” information into counters, such as
the message body size.
To make the parsing process deterministic, CCFGs require leftmost derivations; that
is, for any body that has multiple nonterminals, only the leftmost nonterminal that can
be expanded is expanded. Thus, at any time, production rules can be applied in only one
position. We use leftmost derivations rather than rightmost derivations so that updates are
applied to counters in the order that the corresponding data appears in the data flow.
2.3.2
Protocol Specification in CCFG
An application protocol specification precisely specifies the protocol being parsed. We use
CCFG to specify application protocols. For example, consider the Varstring language consisting of strings with two fields separated by a space: a length field, B, and a data field,
V, where the binary encoded value of B specifies the length of V. This language cannot be
specified as a Context Free Grammar (CFG), it can be easily specified in CCFG as shown in
Figure 2.2(a). The CCFG specification of another language called Dyck consisting of strings
14
with balanced parentheses ‘[’ and ‘]’, is shown in Figure 2.2(b). We adopt the convention
that the head of the first rule, such as S in the Varstring and Dyck grammars, is the start
nonterminal.
S →BV
B → ‘0’ (c := c ∗ 2) B
B → ‘1’ (c := c ∗ 2 + 1) B
B→‘ ’
(c > 0) V → Σ (c := c − 1) V
(c = 0) V →
examples: “1 a”, “10 ba”, “101 xyzab”
1
2
3
4
5
6
(a) Varstring Γv
S→
S→IS
I → ‘[’ S ‘]’
“[[]]”, “[][[][]]”
1
2
3
(b) Dyck Γd
Figure 2.2: Two protocol specifications in CCFG
We now explain the six production rules in Varstring. The first rule S→BV means that a
message has two fields, the length field represented by nonterminal B and a variablelength
data field represented by nonterminal V . The second rule B→‘0’ (c := c ∗ 2) B means that
if we encounter character ‘0’ when parsing the length field, we double the value of counter
c. Similarly, the third rule B→‘1’ (c := c ∗ 2 + 1) B means that if we encounter ‘1’ when
parsing the length field, we double the value of counter c first and then increase its value by
1. The fourth rule B→‘ ’ means that the parsing of the length field terminates when we see
a space. These three rules fully specify how to parse the length field and store the length in
c. For example, after parsing a length field with “10”, the value of the counter c will be 2
(= ((0 ∗ 2) + 1) ∗ 2). Note that here “10” is the binary representation of value 2. The fifth
rule (c > 0):V→ Σ (c := c − 1) V means that when parsing the data field, we decrement the
counter c by one each time a character is parsed. The guard allows use of this rule as long as
c > 0. The sixth rule (c = 0):V→ means that when c = 0, the parsing of the variablelength
field is terminated.
15
We now demonstrate how the Varstring grammar can produce the string “10 ba”. Each
row of the table in Figure 2.3 is a step in the derivation. The c column shows the value of
the counter c at each step. The number in parentheses is the rule from Figure 2.2(a) that
is applied to get to the next derivation. Starting with the Varstring’s start symbol, S, we
derive the target string by replacing the leftmost nonterminal with the body of one of its
production rules. When applying rule 5, the symbol Σ is shorthand for any character, so it
can produce ‘a’ or ‘b’ or any other character.
Derivation
S
BV
1BV
10 B V
10 V
10 b V
10 ba V
10 ba
c Rule # Note
0
(1)
Decompose into len and body
0
(3)
Produce ‘1’, c = 0 ∗ 2 + 1
1
(2)
Produce ‘0’, c = 1 ∗ 2
2
(4)
Eliminate B, c is the length of body
2
(5)
Produce ’b’, decrement c
1
(5)
Produce ’a’, decrement c
0
(6)
c = 0, so eliminate V
0
No nonterminals left, done
Figure 2.3: Derivation of “10 ba” from the Varstring grammar in Figure 2.2(a)
2.3.3
Extraction Specification in CCFG
An extraction specification is a CCFG Γx = (Nx , Σ, Cx , Rx , Sx ), which may not be complete.
It can refer to nonterminals specified in the corresponding protocol specification denoted
Γp = (Np , Σ, Cp , Rp , Sp ). However, Γx cannot modify Γp and cannot add new derivations for
nonterminals defined in Γp . This ensures that we can approximate Γp without changing the
semantics of Γx .
The purpose of FlowSifter is to call application processing functions on extracted field
values. Based on the extracted field values, these application processing functions will take
application specific actions such as stopping the flow for security purposes or routing the
16
flow to a particular server for load balancing purposes. Calls to these functions appear in
the actions of a rule. Application processing functions can also return a value back into
the extractor to affect the parsing of the rest of the flow. Since the application processing
functions are part of the layer above FlowSifter, their specification is beyond the scope of
this paper. Furthermore, we define a shorthand for calling an application processing function
f on a piece of the grammar f { body }, where body is a rule body that makes up the field
to be extracted.
We next show two extraction specifications that are partial CCFGs, using the functioncalling shorthand. The first, Γxv in Figure 2.4(a), specifies the extraction of the variablelength field V for the Varstring CCFG in Figure 2.2(a). This field is passed to an application
processing function vstr. For example, given input stream “101 Hello”, the field “Hello”
will be extracted. This example illustrates several features. First, it shows how FlowSifter
can handle variablelength field extractions. Second, it shows how the user can leverage
the protocol library to simplify writing the extraction specification. The second extraction
specification, Γxd in Figure 2.4(b), is associated with the Dyck CCFG in Figure 2.2(b) and
specifies the extraction of the contents of the first pair of square parentheses; this field is
passed to an application processing function named param. For example, given the input
stream [[[]] ][[][]], the [[]] will be extracted. This example illustrates how FlowSifter
can extract specific fields within a recursive protocol by referring to the protocol grammar.
1
X → B vstr{V} 1 X → ‘[’ param{S} ‘]’ S
(a) Varstring Γxv
(b) Dyck Γxd
Figure 2.4: Two extraction CCFGs Γxv and Γxd
17
2.4
Grammar Optimization
In this section, we introduce techniques to optimize the input protocol specification and
extraction specification.
2.4.1
Counting Regular Grammar
Just as parsing with nonregular CFGs is expensive, parsing with nonregular CCFGs is also
expensive as the whole derivation must be tracked with a stack. To resolve this, FlowSifter
converts input CCFGs to Counting Regular Grammars (CRGs), which can be parsed without
a stack, just like parsing regular grammars. A CRG is a CCFG where each production rule
is regular. A production rule is regular if and only if it is in one of the following two forms:
guard : X → α[ action ]Y
(2.1)
guard : X → α[ action ]
(2.2)
where X and Y are nonterminals and α is a terminal. CRG rules that fit equation 2.1 are
the nonterminating rules whereas those that fit equation 2.2 are the terminating rules as
derivations end when they are applied.
For a CCFG Γ = (N, Σ, C, R, S) and a nonterminal X ∈ N, we use Γ(X) to denote the
CCFG subgrammar Γ = (N, Σ, C, R, X) with the nonterminals that are unreachable from X
being removed. For a CCFG Γ = (N, Σ, C, R, S) and a nonterminal X ∈ N, X is regular if
and only if Γ(X) is equivalent to some CRG.
Given the extraction CCFG Γx and L7 grammar Γp as inputs, FlowSifter first generates
a complete extraction CRG Γf = (Nx ∪ Np , Σ, Cx ∪ Cp , Rx ∪ Rp , Sx ). Second, it prunes any
18
unreachable nonterminals from Sx and their corresponding production rules. Third, it partitions the nonterminals in Np into those we can guarantee to be normal and those we cannot.
Fourth, for each nonterminal X that are guaranteed to be normal, FlowSifter regularizes X;
for the remaining nonterminals X ∈ Np , FlowSifter uses counting approximation to produce
a CRG that approximates Γf (X). If FlowSifter is unable to regularize any nonterminal, it
reports that the extraction specification Γx needs to be modified and provides appropriate
debugging information. Last, FlowSifter eliminates idle rules to optimize the CRG. Next,
we explain in detail how to identify nonterminals that are guaranteed to be normal, how to
regularize a normal terminal, how to perform counting approximation, and how to eliminate
idle rules.
2.4.2
Normal Nonterminal Identification
Determining if a CFG describes a regular language is undecidable. Thus, we cannot precisely
identify normal nonterminals. FlowSifter identifies nonterminals in Np that are guaranteed to
be normal using the following sufficient but not necessary condition. A nonterminal X ∈ Np
is guaranteed to be normal if it satisfies one of the following two conditions:
1. Γf (X) has only regular rules.
2. For the body of any rule with head X, X only appears last in the body and for every
nonterminal Y that is reachable from X, Y is normal and X is not reachable from Y .
Although we may misidentify a normal nonterminal as not normal, fortunately, as we will
see in Section 2.4.4, the cost of such a mistake is relatively low; it is only one counter in
memory and some unnecessary predicate checks.
19
2.4.3
Normal Nonterminal Regularization
In this step, FlowSifter replaces each identified normal nonterminal’s production rule with a
collection of equivalent regular rules. Consider an arbitrary nonregular rule
guard : X → body .
We first express the body as Y1 · · · Yn where Yi , 1 ≤ i ≤ n, is either a terminal or a
nonterminal (possibly with an action). Because this is a nonregular rule, either Y1 is a
nonterminal or n > 2 (or both). We handle the cases as follows.
• If Y1 is a nonnormal nonterminal, Γx was incorrectly written and needs to be reformulated.
• If Y1 is a normal nonterminal, we define CRG Γ = (N , Σ, C , R , Y ) to be Γf (Y1 )
where the nonterminals have been given unique names. We use Γ to update the rule set
as follows. First, we replace the rule guard : X → body with guard : X → Y .
Next, for each terminating rule r ∈ R , we create a new rule r where we append
Y2 · · · Yn to the body of r and add r to the rule set; for each nonterminating rule
r ∈ R , we add r to the rule set.
• If Y1 is a terminal and n > 2, the rule is decomposed into two rules: guard : X →
Y1 X and X → Y2 · · · Yn where X is a new nonterminal.
The above regularization process is repeatedly applied until there are no nonregular rules.
For example, consider the Varstring CCFG Γ with nonregular rule S→BV. As both
Γ(B) and Γ(V ) are CRGs, so S is a normal nonterminal. Decomposition regularizes Γ(S) by
replacing S→BV by S→ B and B→ by B → V. We also add copies of all other rules where
20
we use B in place of B. Figure 2.5 illustrates the resulting rule set excluding unreachable
rules. For example, the nonterminal B is no longer referenced by any rule in the new grammar.
For efficiency, we remove unreferenced nonterminals and their rules after each application of
regularization.
1
2
3
4
5
6
S→B
B → 0 (c := c ∗ 2) B
B → 1 (c := 1 + c ∗ 2) B
B → V
(c = 0) V →
(c > 0) V → Σ (c := c − 1) V
Figure 2.5: Varstring after decomposition of rule S→BV.
Theorem 2.4.1
Given a normal nonterminal X in grammar Γ, applying regularization to any rule guard :
¯
X → Y1 · · · Yn in Γ produces an equivalent grammar Γ.
Proof 2.4.1
We define rule r = guard : X → Y1 · · · Yn as the rule in Γ that is replaced by other rules
¯ We consider two cases: Y1 is a terminal, and Y1 is a normal nonterminal.
in Γ.
¯ is that rule
For the case that Y1 is a terminal, the only difference between Γ and Γ
r = guard : X → Y1 · · · Yn is replaced by rules r1 = guard : X → Y1 X and r2 =
X → Y2 · · · Yn to get Y1 · · · Yn . Consider any leftmost derivation with Γ that applies
¯ that replaces the application
the rule r. We get an equivalent leftmost derivation with Γ
of r with the application of rule r1 immediately followed by the application of rule r2 to
¯ that applies rule r1
produce the exact same result. Likewise, any leftmost derivation in Γ
must then immediately apply rule r2 since X is the leftmost nonterminal in the resulting
string. We get an equivalent leftmost derivation with Γ by replacing the application of r1
and r2 with r. Finally, r2 can never be applied except immediately following the application
21
of r1 because rule r1 is the only derivation that can produce nonterminal X . Therefore, Γ
¯ are equivalent.
and Γ
For the case that Y1 is nonterminal, rule r = guard : X → Y1 · · · Yn in Γ is replaced by
¯ Furthermore, we add copies of all rules with head Y1 that
rule r1 = guard : X → Y1 in Γ.
now have head Y1 where all nonterminals are replaced with new equivalent nonterminals.
This also applied to other nonterminals that are in the body of rules in Γ with head Y1 .
Finally, for any terminating rule rt in Γ(Y1 ), we add a rule rt where Y2 · · · Yn is appended
¯
to the body of rt and add rt to Γ.
Consider any leftmost derivation with Γ that applies the rule r. We get an equivalent
¯ as follows. First, we replace the application of r with the applicaleftmost derivation with Γ
tion of r1 . Next, until we reach a terminating rule, we replace each application of a rule with
head Y1 or other nonterminal in Γ(Y1 ) with the equivalent new rule using the new nonterminal names. Finally, we replace the application of terminating rule rt with terminating rule
¯ that applies rule r1 . This leftmost derivation
rt . Now consider any leftmost derivation in Γ
must eventually apply some terminating rule rt . We get an equivalent leftmost derivation in
¯ by replacing r1 with r, r with rt , and intermediate rule applications with their original
Γ
t
¯ can use any of the new rules without first
rule copies. We also note that no derivations in Γ
invoking r1 since that is the only path to reaching the new nonterminals. Therefore, Γ and
¯ are equivalent.
Γ
2.4.4
Counting Approximation
For nonterminals in Np that are not normal, we use counting approximation to produce a
collection of regular rules, which are used to replace these nonnormal nonterminals. For
a nonnormal nonterminal X, our basic idea is to parse only the start and end terminals
22
for Γ(X) ignoring any other parsing information contained within subgrammar Γ(X). This
approximation is sufficient because our extraction specification does not need to precisely
parse any subgrammar starting at a nonterminal in Np . That is, we only need to identify the
start and end of any subgrammar rooted at a nonterminal X ∈ Np . By using the counters
to track nesting depth, we can approximate the parsing stack for such nonterminals . For
nonterminals in the input extraction grammar, we require them to be normal. In practice,
this restriction turns out to be minor, and when a violation is detected, our tool will give
feedback to aid the user in revising the grammar.
Given a CCFG Γf with a nonterminal X ∈ Np that does not identify as normal,
FlowSifter computes a counting approximation of Γf (X) as follows. First, FlowSifter computes the sets of start and end terminals for Γf (X), which are denoted as start and stop.
These are the terminals that mark the start and end of a string that can be produced by
Γ(X). The remaining terminals are denoted as other. For example, in the Dyck extraction
grammar Γxd in Figure 2.4(b), the set of start and end terminals of Γxd (S) are {‘[’} and {‘]’},
respectively, and other has no elements. FlowSifter replaces all rules with head X with the
four rules in Figure 2.6 that use a new counter cnt. The first rule allows exiting X when the
1
2
3
4
(cnt = 0)
(cnt ≥ 0)
(cnt > 0)
(cnt > 0)
X→
X → start (cnt := cnt + 1) X
X → stop (cnt := cnt − 1) X
X → other X
Figure 2.6: General Approximation Structure
recursion level is zero. The second and third increase and decrease the recursion level when
matching start and stop terminals. The final production rule consumes the other terminals,
approximating the grammar while cnt > 0.
23
For example, if we apply counting approximation to the nonterminal S from the Dyck
extraction grammar Γxd in Figure 2.4(b), we get the new production rules in Figure 2.7.
1 (cnt = 0) S →
2 (cnt ≥ 0) S → ’[’ (cnt := cnt + 1) S
3 (cnt > 0) S → ’]’ (cnt := cnt − 1) S
Figure 2.7: Approximation of Dyck S
We can apply counting approximation to any subgrammar Γf (X) with unambiguous
starting and stopping terminals. Ignoring all parsing information other than nesting depth
of start and end terminals in the flow leads to potentially faster flow processing and fixed
memory cost. In particular, the errors introduced by counting approximation do not interfere
with extracting fields from correct locations within protocol compliant inputs. However,
counting approximations do not guarantee that all extracted fields are the result of only
protocol compliant inputs. Therefore, application processing functions should validate any
input that its behavior depends upon for proper operation.
2.4.5
Idle Rule Elimination
The CRG generated as above may have production rules without terminals. When implemented as a parser, no input is consumed when executing such rules. We call such rules idle
rules, and they have the form: X → Y without any terminal α. FlowSifter eliminates idle
rules by hoisting the contents of Y into X, composing the actions and predicates as well.
For a CRG with n variables, to compose a rule
(q1 ∧ · · · ∧ qn ) : Y → α(act)Z(g1 , · · · , gn )
24
into the idle rule
(p1 ∧ · · · ∧ pn ) : X → Y (f1 , · · · , fn ),
we create a new rule
(p1 ∧ · · · ∧ pn ) : X → α(act)Z(f1 , · · · , fn )
where pi = pi ∧ qi and fi = fi ◦ gi for 1 ≤ i ≤ n. That is, we compose the actions associated
with Y in X into Z’s actions and merge the predicates.
2.5
Automated Counting Automaton Generation
The automata generator module in FlowSifter takes an optimized extraction grammar as its
input and generates an equivalent counting automaton (CA) at output. The field extractor
module will use this CA as its data structure for performing field extraction.
One of the challenges of field extraction with a CA is resolving conflicting instructions
from different CRG rules. For example, consider a CA state A with two rules: A → /ab/B
and A → /a/[x := x + 1]C. After processing input character a, the state of the automaton
is indeterminate. Should it increment counter x and transition to state C, or should it wait
for input character b so it can transition to state B?
We solve this problem by using a DFA as subroutines in counting automtata. The DFA
will inspect flow bytes and return a decision indicating which pattern matched. The DFA
will use a priority system to resolve ambiguities and return the highest priorty decision. For
the above example, the DFA will have to lookahead in the stream to determine whether /ab/
25
or /a/ is the correct match; in practice, the lookahead required is small and inexpensive. We
describe this novel integrated CA and DFA model in this section.
2.5.1
Counting Automata
A Counting Automata (CA) is a 6tuple (Q, C, q0 , c0 , D, δ) where Q is a set of CA states,
C is a set of possible counter configurations, q0 ∈ Q is the initial state and c0 ∈ C is the
initial counter configuration. Normally, the transition function δ of the CA is a function that
maps the current configuration (state q ∈ Q and counter configuration c ∈ C along with
input character σ ∈ Σ to a new CA state q along with some action to update the current
counters. We choose a different approach where we use Labeled Priority DFA (LPDFA) as a
subroutine to perform this mapping for a given CA. We leave the formal details of LPDFA to
Section 2.5.2 but describe now how a CA uses LPDFA to define the CA’s transition function
δ. The set D defines the set of LPDFA that the CA may use. The transition function δ then
maps each configuration (q ∈ Q, c ∈ C) to an appropriate LPDFA. That is, δ : Q × C → D.
The LPDFA will deterministically process the flow and return a decision belonging to set
D = (QC ∪ {DON E, F AIL}) × (C → C) where DON E and F AIL are distinct from all
CA states; DON E implies the CA has completed processing of the input stream and F AIL
implies that the input flow does not meet the protocol expectations.
FlowSifter generates a CA (Q, C, q0 , c0 , D, δ) from a CRG Γ = (N, Σg , Cg , R, S) as follows.
We set Q = N and q0 = S. We build C based on Cg but with a bounded maximum
size b where typically b = 2sizeof (int) − 1; counters are initialized to 0. Formally, C =
{(c1 , c2 , . . . , cCg  ) : 0 ≤ ci < b, 1 ≤ i ≤ Cg } and c0 = (0, 0, . . . , 0). If necessary, we can tune
the parsing state size by using different numbers of bits for each counter. We set δ as follows.
For each configuration (q, c), we identify the set of CRG rules R(q, c) that correspond to
26
(q, c) and build the corresponding LPDFA from those rules; the set D is simply the set of
LPDFA we build. We describe LPDFA construction in detail in Section 2.5.2.
For example, in Figure 2.8, for the configuration with CA state 3 and counter c == 0,
rules 5 and 6 are active, so δ(3, c == 0) is the LPDFA constructed from the right hand
sides of those rules: [ (c := c + 1) 3 and (p := token(p)) 2. This LPDFA will return decision
(3, (c := c + 1)) when the flow bytes match [ and (2, p := token(p)) when the flow bytes
match . On the other hand, δ(3, c > 0) is constructed from the right hand side of rules 4,
5, and 7.
1
2
3
4
5
6
7
8
9
10
11
12
13
(c > 0)
(c = 0)
(c > 0)
(c > 0)
(c = 0)
(c > 0)
1 → [ (p := pos()) 4
2 → ][ (c := 1) 5
2→]
3 → ] (c := c − 1) 3
3 → [ (c := c + 1) 3
3 → (p := token(p)) 2
3 → /[^[\]]/ 3
4 → ] (c := 1) 3
4 → (p := token(p)) 2
5 → ] (c := c − 1) 5
5 → [ (c := c + 1) 5
5→
5 → /[^[\]]/ 5
Figure 2.8: CRG for Dyck example from Figures 2.2b and 2.4b
To apply a CA to a flow, we initialize the CA state as (q0 , c0 ). Based on the current
state (qi , ci ), we determine δ(qi , ci ) = df ai , and apply df ai to the flow bytes. This LPDFA
will always return a decision (qi+1 , acti ), and if qi+1 is either DON E or F AIL, parsing
ends. After computing ci+1 = acti (ci ), the CA state becomes (qi+1 , ci+1 ), and we repeat the
process until we are out of flow bytes.
For example, consider the optimized CRG and CA for our running Dyck grammar example depicted in Figures 2.8 and 2.9, respectively; this CA uses a single counter cnt. The CA
27
state labels have been replaced with integers, and we show the terminal symbols as strings
when possible, or as /regex/ when a regular expression is needed, such as for rule 7. The
initial CA configuration is (1, cnt == 0). The only CRG rule for state 1 is CRG rule 1 which
has no conditions associated with it. Thus, the CA invokes the LPDFA that represents the
body of rule 1; this LPDFA matches only the string “[”, and the CA transitions to state 4
after updating p to the result of pos(). If the flow bytes do not start with [, then the LPDFA
returns F AIL and the CA stops processing input because the input does not conform to
the CRG. Suppose later the configuration is (3, cnt > 0). In this case, the CA will invoke an
LPDFA that matches the input against [ to increment cnt, ] to decrement cnt, or any other
input character making no change to cnt; this is based on the counting approximation to
parse nested brackets. In all cases, the LPDFA sets the next state to be 3. If the configuration
is (3, cnt == 0), then the CA will invoke an LPDFA that matches the input against [ to
increment cnt and return to state 3; otherwise no input is consumed and p := param(p) will
be evaluated and the CA will leave state 3 for state 2.
Figure 2.9: Exploded CA for Dyck in Figures 2.2b and 2.4b; each cluster has start CA state
on left, and destination CA state on right
28
The functions pos and param allow the CA to interact with its runtime state and the
outside world. The function pos() is built into the FlowSifter environment and returns the
current parsing offset in the flow. The CA stores this information in a counter so that it can
report the start and end positions of a token when it calls param(p) to report this token
to the layer above. The CA waits for a return value from the called application processing
function so that it can update the counters before it continues processing the input flow.
In many cases, the application processing function never needs to return an actual value to
the CA; in such cases, it immediately returns a null value, and the CA immediately resumes
processing the input flow.
2.5.2
LPDFA
In this section, we formally define LPDFA focusing on nonstandard LPDFA construction
and operation details. A Labeled Priority DFA (LPDFA) is a 7tuple (Q, Σ, δ, q0 , D, DF, π)
where Q is a set of states, Σ is an alphabet, q0 is the initial state, δ : Q × Σ → Q is
the transition function, as normal. The new properties are D, the set of possible decisions,
DF : Q → D, a partial function assigning a subset of the states (the accepting states) a
decision, and π : Q → N a total function assigning every state a priority. An LPDFA works
as a subroutine for a CA, examining the input for various patterns, consuming the highest
priority observed pattern, and returning this pattern’s associated decision. We construct
and run an LPDFA in a manner similar to a DFA with some modifications to return values
instead of just accept/reject and to only consume the correct amount of the input. We focus
on the construction of DF : Q → D, a partial function assigning a subset of the states (the
accepting states) a decision, and π : Q → N a total function assigning every state a priority.
29
As a reminder, D = (QC ∪ {DON E, F AIL}) × (C → C), where DON E and F AIL are
distinct from all CA states.
For each configuration (q ∈ Q and c ∈ C) in the δ function of the CA, we construct
an LPDFA from the bodies of the rules matching the given configuration. Because we start
from a CRG, we can assume the rule bodies are written as (rxi , acti , qi ), a terminal regular
expression, action and nonterminal. If no rule has a regular expression that matches the
empty string , then we add an rule with body ( , (), F AIL) to guarantee the LPDFA will
always return a value. We use the standard process for constructing an automaton from a
collection of regular expressions.
We now describe how we set DF . For any accepting state q, we identify all CRG rules’
whose regular expressions were matched. If more than one match, we choose the rule r with
body (rx, act, q) that has the highest priority (the rule has lowest priority). If r is the
rule, the CA state value is set to F AIL. If q is empty, the CA state value is set to DON E.
Otherwise, the CA state value is set to q. The appropriate action is set to act.
We now describe how the LPDFA operates. We give all states in the LPDFA a priority
which is the highest priority decision that is reachable from that state. As the LPDFA
processes the input flow, it remembers the highest priority decision state encountered. To
prevent the LPDFA from continuing to consume input due to a potential low priority match
such as a low priority /.*/ rule, the LPDFA stops processing the input once it reaches a
state with an equal or lower priority. The LPDFA then returns the appropriate decision and
consumes only the appropriate input even if more has been seen.
We illustrate the importance of prioritizing CRG rules using the following two rules from
the protocol specification for HTTP headers.
HEADER 50 > /(?i:ContentLength):\s*/
30
[bodylength := getnum()];
HEADER 20 > TOKEN /:/ VALUE;
The second rule is given a lower priority (20) than the first rule’s priority (50) to ensure
that the first rule is used when the flow prefix is “ContentLength:”. In such a case, the
rule stores the size of the HTTP body in a counter bodylength for later use. If the priorities
were inverted, the first rule would never be used. We must ensure the relative priorities of
rules are maintained through all optimizations that we apply. Maintaining these priorities is
straightforward but tedious, so we omit these details.
2.5.3
CA Specific Optimizations
We have implemented two key optimizations to speed up our CA implementation. We first
avoid processing many bytes of the flow by having actions modify the flow offset in the
parsing state. Specifically, if our CA is processing an HTTP flow and does not need to parse
within the body, an action can call skip(n) to skip over n bytes of payload. This allows
FlowSifter to avoid stepping through the payload bytebybyte to get to the headers of the
next request in the same flow.
We also eliminate some LPDFA to CA transitions. Suppose the optimized CRG has a
nonterminal X with a single rule with no actions such as X → /rx/Y. We can eliminate
the switch from LPDFA to CA at the end of /rx/ and the switch back to LPDFA at the
beginning of Y by inlining Y into X. This is similar to idle rule elimination, but because our
terminals are regular expressions, we can concatenate two regular expressions into a single
terminal, keeping the grammar in normal form. We also perform this optimization when Y
has a single rule and all of X’s rules that end in Y have no actions. This increases the number
of states in the LPDFA for each nonterminal but improves parsing speed by decreasing the
31
number of context switches between LPDFA and CA. This optimization has already been
performed on the CA in Figure 2.9, specifically in state 2. The pattern “][” is not part of any
of the input regular expressions, but is composed of the closing ‘]’ of the Dyck extraction
grammar and the opening ‘[’ of the S nonterminal following it.
2.6
Counting Automaton Implementation
We now describe how we implement CA. We first describe incremental packet processing,
which is needed because flow data arrives in packets and should be processed as it is received.
The alternative solution of buffering large portions of flow is problematic for two reasons.
First, it may require large amounts of dynamically allocated memory. Second, it will increase
latency in scenarios where undesirable traffic must be blocked.
We then describe two implementations of CA: simulated CA and compiled CA. In a
simulated CA, an automatonindependent process parses a flow by referencing a memory
image of the simulated CA. In a compiled CA, the structure of the CA is encoded in the
process that parses the flow. Typically, compiled CA are more efficient but simulated CA
are easier to deploy and modify.
2.6.1
Incremental Packet Processing
Packet processing is difficult because the flow data arrives packet by packet rather than all at
once. There are two main ways to process packets: incrementally which means processing each
packet as it arrives or buffering which means buffering a number of packets until a certain
amount of flow data is gathered. A DFA supports incremental processing by processing
each byte of the current packet and then saving the active state of the DFA in a parsing
32
state to be used when the next packet of the flow arrives. BinPAC and UltraPAC do not
support incremental packet processing. Instead, they buffer packets until they can guarantee
sufficient flow data is available to match an entire token. This has the two drawbacks of
(i) requiring large amounts of dynamically allocated memory and (ii) increasing latency in
scenarios where undesirable traffic must be blocked.
As FlowSifter is built on automatatheoretic constructs, we present an incremental packet
processing approach similar to DFA, though we do require some buffering of flow data since
an LPDFA may look at more flow data than it consumes. Unlike other buffering solutions,
FlowSifter only buffers input data that the LPDFA needs to determine the highest priority
match; this is typically a small amount of data, much smaller than the amount of data needed
to guarantee a token can be matched. There are three parts of the perflow state that we
must record: the state of the input flow, the state of the CA and the state of the LPDFA.
We record two byte offsets for each flow: an offset within the current packet of the next
byte to be processed, and a flow offset of the current packet, indicating the position in the
flow of the first byte of that packet. We keep both of these in our parsing state to be able
to check for errors in flow reassembly and to aid in supporting the skip() builtin. After
processing a packet, we subtract its size from the packet offset, which normally resets the
packet offset. The skip() action can be implemented by increasing the packet offset by the
number of bytes to be skipped. If the resulting offset is within the current packet, processing
will resume there. If the offset is within the next packet, the subtraction rule will correctly
compute the offset within that next packet to resume processing. If the offset is beyond the
next packet’s data, that packet can be skipped entirely and the subtraction rule will account
for that packet’s bytes being skipped.
33
As both the CA and LPDFA are deterministic, they each have only one active state.
Furthermore, when an LPDFA has an active state, the CA state is not needed, and when a
CA state is active, no LPDFA state is active; thus we only need to store one or the other. In
addition to the active state, the CA also must keep the values of each counter. In addition
to the active state, the LPDFA must track the state id and flow offset of the current highest
priority match, if any. A small buffer may be needed when the LPDFA lookahead occurs
at a packet boundary, to buffer the flow data since the last match was found. This allows
the LPDFA to consume only the bytes needed to reach its decision while allowing the next
consumer of input bytes to resume from the correct location, even if that is in the previous
packet.
Finally, we must address incremental parsing for actions that alter the parsing state.
For example, skip() can change the packet offset. We might create other actions such as
getByte() that can read a current input character into a counter as an unsigned integer;
getByte() is useful in the DNS protocol where the length of a name field is indicated by the
byte preceding it. Instead of using getByte(), we could use an LPDFA that transitions to
a different state for each possible value of that byte and have an action for each value that
would set the counter correctly. Using a function like getByte() makes this much easier to
implement and faster to execute. However, this does introduce a corner case where the byte
we need may be in the next packet. In general, the CA must be able to resume processing
from the middle of an action. Our simulated CA and compiled CA take different approaches
to solving this problem, as described in their sections below.
34
2.6.2
Simulated CA
In an ASIC implementation of CA, we create a fixed simulator that cannot be changed after
construction. The simulator uses a memory image of the CA to be simulated when processing
an input flow. To assess the challenges of a simulated CA implementation and develop a proof
of concept, we implemented a software CA simulator. The most challenging issues we faced
were (i) efficiently implementing the function δ to find the appropriate LPFDA based on the
current CA counter values and (ii) incremental packet processing.
The biggest challenge to implementing δ efficiently is the potentially huge size of the
counter state space. With two 16bit counters and just 10 CA states, a direct lookup table
would have over 40 billion entries, ruling out this solution for the entirety of δ. Instead, we
break the lookup into two steps. In the first step, we use the CA state number as a direct
lookup.
For a CA with C counters, a given CA state may correspond to a CRG nonterminal with n
rules. We must find the correct LPFDA that implements the combination of these rules that
can be applied,based upon the current counter values. We present three different solutions to
solving this LPDFA selection problem. We evaluate each solution based upon two criteria: (i)
space and (ii) time. The space complexity of a solution corresponds to the number of LPDFA
that need to be precomputed and stored. The time complexity of a solution corresponds to
the number of predicate evaluations that must be performed. Each solution differs in how to
index the LPDFA and how to perform the predicate evaluations.
Our first solution indexes LPDFA by rule. That is, we construct 2n LPDFA, one for each
combination of rules and store pointers to these LPDFA in an array of size 2n . Note that
not all combinations of rules are possible. For example, in the Dyck CA, state 3 has 4 rules,
35
so the LPDFA table has 16 entries. However, only two LPDFAs are possible for state 3: one
encoding rules 5 and 6 when cnt == 0, and one encoding rules 4, 5 and 7 when cnt > 0.
We perform an analysis of rule predicates and save space by leaving as many of the LPDFA
array entries empty as possible. To find the current predicate, we evaluate the predicates for
each rule and pack these true/false results as 1/0 bits in an unsigned integer used to index
the array of LPDFA. In the worst case, this requires nC counter predicate evaluations. Of
course, some counters may not need to be evaluated for some rules, and the results from
some counter evaluations may eliminate the need to perform other counter evaluations.
Our second solution indexes LPDFA by predicate, taking advantage of redundancy in
predicates among different rules. For example, using state 3 from the Dyck CA, we previously
observed there are only two relevant LPDFA: one for when cnt == 0 and one for when
cnt > 0. We can store this list of relevant predicates and index the LPDFA by their truth
values. For this example, we would have just these two predicates and two LPFDA. For this
example, we would evaluate both predicates and use these two bits to choose the LPDFA to
execute.
A third solution is to build an optimal decision tree. Each node of the decision tree would
correspond to a counter and the children of that node would either be the LPDFA to execute
or further nodes corresponding to other counters. Each edge would be labeled with the values
of the counter it can be traversed on. In this example, the decision tree would have one node,
with an edge to the rule {5,6} LPDFA labeled 0 and an edge to the rule {4,5,7} LPDFA
on all other values. This solution can reduce the number of counter evaluations required
by optimally exploiting redundancies and optimally using the results of previous predicate
evaluations. We plan on investigating this potential solution further in future work. In our
36
software simulator, we observed that indexing by rule appeared to be more efficient than
indexing by predicate in our experiments.
Our simulator uses closures to handle saving and resuming state. Whenever it runs out of
input, it returns a new function that takes the next packet’s payload as input and continues
processing where it left off. This is possible because OCaml allows us to encapsulate a portion
of our environment of variables inside a newly created function, allowing this new function to
make use of any local state available at the point of its creation. The details of the function
differ depending on where we run out of input. In general, this resume function does some
bookkeeping on the input string to put it into the main shared state record, and then it calls
a function to resume processing at either a CA state (to processes any delayed actions) or
at an LPDFA state, to continue consuming input.
2.6.3
Compiled CA
In this section, we give a compilation from CA to C++. This nonASIC compiled CA implementation can achieve very high performance because modern CPUs are highly optimized
simulators for their machine language. The major difficulty is minimizing the overhead in
the compilation process; the Turing completeness of CPUs guarantees that this is always
possible for any reasonable automata construction, but the efficiency of the result is strongly
dependent on the semantics of the automaton and the specifics of the translation process.
The basic structure of the compiled CA is similar to that of the simulated CA. We briefly
describe all of the details but focus on the key differences. We encapsulate the parsing state
of a flow using a C++ struct that contains unsigned integers for each counter, a few fields
to hold a pointer to the flow data and the offsets relative to the pointer and to the start
of the flow. We also include fields for the LPDFA to store the offset, state, and priority of
37
the highest priority matching state seen so far. Finally, we store the current LPDFA state,
if any, to support incremental packet processing.
We represent each CA state, each LPDFA, and each action update function with a procedure. Each procedure exhibits tail call behavior where it ends by calling some other procedure. For example, after the CA state procedure determines which LPDFA to run, it ends
its operation by calling the appropriate LPDFA procedure. Likewise, an LPDFA procedure
will typically end by calling a series of update actions and then the next CA state procedure. To ensure the stack does not grow to unbounded size, our CA must implement Tail
Call Optimization (TCO). Ideally, the C/C++ compiler will implement TCO. If not, we
manually implement TCO by having a main dispatch loop. Within this loop, we maintain a
function pointer that represents the next procedure to be called. Each procedure then ends
by returning a pointer to the next function to be called rather than calling the next function.
The procedure for each CA state has access to all the flow state, and its job is simply
to determine which LPDFA to run. As with simulated CA, the compiled CA can index the
LPFDA in many different ways such as by rule, by predicate, or by decision tree. Our compiler
produces a CA that indexes by rule where the resulting Boolean values are concatenated into
a single integer. The CA state procedure uses a switch statement on this integer to determine
which LPDFA is initialized and started. If all the rules for a CA state have no predicates,
then there is only one LPDFA that is always called. We can eliminate this CA state procedure
and simply call the corresponding LPDFA procedure instead.
We run each LPDFA using a its own LPDFA simulator procedure. The transitions and
priorities are stored in integer arrays. To make a transition, the current state and input character are used to index the transition array which stores the next state that this automaton
will be in. The simulator only proceeds to the next state q if q’s priority exceeds the priority
38
of the highest priority match found so far. When we have determined the highest priority
match, the simulator runs a switch statement to implement the actions and the transition to
the next CA state dictated by this match. We choose to implement a separate simulator for
each LPDFA to support finely tuned branch prediction for the CPU and to facilitate pausing
and resumption caused by incremental packet processing. The cost of replicating the very
small common simulation code is negligible.
It is possible that multiple CA states may call identical LPDFA. With each new LPDFA
that we create, we compare it to previously created ones and only keep the new one if it is
indeed different than all previous ones.
Finally, we discuss our procedures which implement the actions of a CA. We create a
procedure for each action so that we can pause and resume them at packet boundaries. To
support incremental packet processing, we ensure that each inputconsuming function in an
action is in its own action function. We improve performance by leaving inline actions that
do not consume any input as the CA will never need to resume from such actions.
2.7
Extraction Generator
FlowSifter is unique in its use of protocol and extraction grammars to specify the desired
fields to be extracted. In this section, we describe our simple Extraction Generator GUI tool
that allows a user to easily create an extraction grammar from a given protocol grammar.
It simplifies the process of writing the extraction grammar to exploring expansions of the
protocol grammar and ticking off boxes next to the fields to extract.
A screenshot of this interface is shown in Figure 2.10. At the top is a dropdown dialog
box that lists all the available protocol grammars; in this case, we have chosen the dyck.pro
39
Figure 2.10: Extraction Generator Application Screenshot
grammar for the Dyck protocol. Below that is a TreeView that shows a tree view of the
grammar, which we call the MetaTree. Figure 2.11a shows the MetaTree corresponding to
the tree view in the screenshot. The root of the MetaTree is the start symbol of the grammar.
“Choice” nodes in the tree allow us to show the rules for each nonterminal. The body of
that nonterminal makes up the children of its “Choice” children. Terminal symbols are the
leaves in the MetaTree.
The MetaTree for a grammar contains all possible parse trees as subtrees. For grammars
whose language is infinite, this tree is infinite in size, so we construct it on demand. To
preserve the efficiency of generating the MetaTree, it does not include the predicates or
actions of the grammar and thus does not maintain counter values. This simplification means
that a given MetaTree may include fields that are not producible by the original grammar.
That is, the counter values may prevent an application of a derivation rule, but the MetaTree
will allow it since it does not track the counter values.
The Extraction Generator works by allowing the user to explore the MetaTree and choose
subtrees to extract as fields by selecting nonterminals and rules as follows. Initially, the
40
extraction generator shows the start nonterminal for the protocol grammar. In the example
screenshot, this start nonterminal is called S. A nonterminal is selected by clicking the
adjacent to the nonterminal. Selecting a nonterminal displays the rules that have that
nonterminal as head. As rules typically have no name, we identify these rules using the
notation “Choice #0” to “Choice #k” if there are k + 1 rules. When there is only one rule
for a nonterminal, the “Choice” nodes are omitted. For example, the nonterminal I node in
Figure 2.10 has no Choice node children. A nonterminals and choice nodes are expanded to
show their children by clicking the
adjacent to that . Selecting a rule displays the sequence
of terminals and nonterminals in the body of that rule. Finally, any nonterminal or rule node
can be selected for extraction by clicking the box next to it.
(a) MetaTree for Dyck example; field to extract in grey
1
2
3
XS →
XS → XI S
XI → /\[/ token(S) /]/
(b) Extraction Grammar
Figure 2.11: Intuitive MetaTree and Extraction Grammar for Dyck example
After a MetaTree has been constructed, it can be easily converted into an extraction
grammar. First, the nonterminals with descendant nodes to be extracted are renamed with
unique names. The other nonterminals keep their names and reference the corresponding
rules in the protocol grammar. For any renamed nonterminals, we generate new rules for
41
them replacing any nonterminals with renamed nonterminals as dictated by the MetaTree.
A nonterminal T marked for extraction is encapsulated in a token(T ) in the corresponding
rewritten rule. If a choice node is marked for extraction, the corresponding rule is encapsulated by token().
In the example depicted in Figures 2.10 and 2.11, a single S field is chosen to be extracted
from within the first brackets of the Dyck grammar. The internal nonterminals, the root S
and the I in the root S’s “Choice #1” child, are marked by a thick box in Figure 2.11 and
are renamed to XS and XI in the extraction grammar. We produce two rewritten rules for
XS which are rule numbers 1 and 2 in Figure 2.11b. Rule 1 corresponds to “Choice #0” and
has an empty body. Rule 2 corresponds to “Choice #1” and has a renamed XI in its rule
body. Rule 3 is produced by XI directly as it has only one choice, and its rule body has no
renamed nonterminals as the child S is a leaf. That S nonterminal is placed inside token()
because it will be extracted as a field.
This extraction generator is able to handle a wide range of extraction scenarios, but
it cannot create arbitrary extraction grammars. One limitation is that it cannot generate
extraction grammars with cycles in their nonterminal symbols because the MetaTree representation prevents cycles from being created. If such an extraction grammar is required, it
can be written by hand in some cases. The user must ensure that the resulting grammar is
regular.
2.8
Experimental Results
We evaluate field extractor performance in both speed and memory. Speed is important
to keep up with incoming packets. Because memory bandwidth is limited and saving and
42
loading extractor state to DRAM is necessary when parsing a large number of simultaneous
flows, memory use is also a critical aspect of field extraction.
2.8.1
Methods
2.8.1.1
Traces
Tests are performed using two types of traces, HTTP and SOAP. We use HTTP traffic in
our comparative tests because the majority of nonP2P traffic is HTTP and because HTTP
field extraction is critical for L7 load balancing. We use a SOAPlike protocol to demonstrate
FlowSifter’s ability to perform field extraction on flows with recursive structure. SOAP is
a very common protocol for RPC in business applications, and SOAP is the successor of
XMLRPC. Parsing SOAP at the firewall is important for detecting parameter overflows.
Our trace data format has interleaved packets from multiple flows. In contrast, previous
work has used traces that consist of preassembled complete flows. We use the interleaved
packet format because it is impractical for a network device to preassemble each flow before
passing it to the parser. Specifically, the memory costs of this preassembly would be very
large and the resulting delays in flow transmission would be unacceptably long.
Our HTTP packet data comes from two main sources: the MIT Lincoln Lab’s (LL)
DARPA intrusion detection data sets [25] and captures done in our research lab. This LL
data set has 12 total weeks of data from 1998 and 1999. We obtained the HTTP packet
data by prefiltering for traffic on port 80 with elimination of TCP retransmissions and
delaying outoforder packets. Each day’s traffic became one test case. We eliminated the
unusually small traces (< 25MB) from our test data sets to improve timing accuracy. The
43
final collection is 45 LL test traces, with between 0.16 and 2.5 Gbits of data and between
27K and 566K packets per trace, totaling 17GB of trace data.
The real life packet traces come from two sources: 300 MB of HTTP spidering traces
(HM) and 50GB (roughly 2/3 of which is HTTP) of Internet use (IU) over a period of one
month. We capture the HTTP spidering traces using the HarvestMan [26] web spider on web
directories like dmoz.org and dir.yahoo.com. This method produces mostly smaller requests
for .html files instead of downloading large graphic, sound or video files. HarvestMan also
uses HTTP 1.1’s keepalive pervasively, resulting in many HTTP requests/responses being
included in a single TCP flow. Thus, it is imperative to support the ContentLength header
to identify the end of one request’s data. The IU trace was recorded in 100MB sections. We
use each section as a separate datapoint for testing.
We construct SOAPlike traces by constructing 10,000 SOAPlike flows and then merging
them together into a trace. We create one SOAPlike flow by encapsulating a constructed
SOAP body in a fixed HTTP and SOAP header and footer. We use a parameter n ranging
from 0 to 16 to vary the depth of our flows as follows. The SOAP body is composed of nested
tag; on level l, a child node is inserted with probability (0.8max(0,l−n) ). After inserting a child
node, the generator inserts a sibling node with probability (.6 ∗ .8max(0,l−n) ). The average
depth of a flow with parameter n is about n + 5.
For each value of n, we generate 10 traces for a total of 170 traces. Each trace consists
of 10,000 flows generated using the same parameter n. These 10,000 flows are multiplexed
into a stream of packets as follows. Initially we set no flow as active. During each unit of
virtual time, one new flow is added to the set of active flows. Each active flow generates
data that will be sent in the current unit of virtual time as follows: with equal probability, it
sends 0, rand(50), rand(200), rand(1000) and 1000+rand(500) bytes. If the transmission
44
amount for a flow exceeds its remaining content, that flow sends all remaining data and is
then removed from the set of active flows. All the data sent in one unit of virtual time is
merged and accumulated into a virtual packet flow. The typical trace length is roughly 28K
packets for n = 0 and 106K packets for n = 16. The total length of all 170 traces is 668MB.
2.8.1.2
Field Extractors
We have two implementations of FlowSifter: a compiled FlowSifter (csift) and a simulated
FlowSifter (sift). We have written one FlowSifter package that generates both implementations. This package is written in 1900 lines of Objective Caml (excluding LPDFA generation,
which uses normal DFA construction code) and runs on a desktop PC running Linux 2.6.35
on an AMD Phenom X4 945 with 4GB RAM. The package includes additional optimizations
not documented here for space reasons. We emphasize that in our experiments, the compiled
FlowSifter outperforms the simulated FlowSifter because we are only doing software simulation. If we had an ASIC simulator, the simulator would run much more quickly and likely
would outperform our compiled implementation.
We constructed HTTP field extractors using FlowSifter, BinPAC from version 1.5.1 of
Bro, and UltraPAC from NetShield’s SVN r1928. The basic method for field extractor construction with all three systems is identical. First, a base parser is constructed from an
HTTP protocol grammar. Next, a field extractor is constructed by compiling an extraction
specification with the base parser. Each system provides its own method for melding a base
parser with an extraction specification to construct a field extractor. We used UltraPAC’s
default HTTP field extractor which extracts the following HTTP fields: method, URI, header
name, and header value. We modified BinPAC’s default HTTP field extractor to extract
these same fields by adding extraction actions. FlowSifter’s base HTTP parser was written
45
from the HTTP protocol spec. We then wrote an extraction specification to extract these
same HTTP fields.
For SOAP traffic, we can only test the two implementations of FlowSifter. We again wrote
a base SOAP parser using a simplified SOAP protocol spec. We then made an extraction
specification to extract some specific SOAP fields and formed the SOAP field extractor
by compiling the extraction specification with the base SOAP parser. We attempted to
develop field extractors for BinPAC and UltraPAC, but they seem incapable of easily parsing
XMLstyle recursive structures. BinPAC assumes it can buffer enough flow data to be able
to generate a parse node at once. UltraPAC’s Parsing State Machine can’t represent the
recursive structure of the stack, so it would require generating the counting approximation
by hand.
2.8.1.3
Metrics
For any trace, there are two key metrics for measuring a field extractor’s performance: parsing
speed and memory used. We use the term speedup to indicate the ratio of FlowSifter’s parsing
speed on a trace divided by another field extractor’s parsing speed on the same trace. We
use the term memory compression to indicate the ratio of another parser’s memory used
on a trace divided by FlowSifter’s memory used on the same trace. The average speedup or
average memory compression of FlowSifter for a set of traces is the average of the speedups
or memory compressions for each trace. Parser Complexity is measured by comparing the
definitions of the base HTTP protocol parsers. We only compare the HTTP protocol parsers
since we failed to construct SOAP field extractors for either BinPAC or UltraPAC.
We measure parsing speed as the number of bits parsed divided by the time spent parsing.
We use Linux process counters to measure the user plus system time needed to parse a trace.
46
We record the memory used by a field extractor on a trace by taking the difference
between the memory used by the extractor at the end of processing the trace and the
memory used by the extractor just before processing the trace. This is not the peak memory
across the trace. Instead, it is a somewhat random sample of memory usages for each trace.
In particular, traces end at arbitrary points typically with many active flows. BinPAC,
UltraPAC and compiled FlowSifter use manual memory management, so we get our memory
used values via tcmalloc’s [27] generic.current allocated bytes parameter. This allows
us to precisely identify the exact amount of memory allocated to the extractor and not
yet freed. Although this does not measure stack usage, none of the implementations makes
extensive use of the stack. Simulated FlowSifter runs in a garbage collected environment
that provides an equivalent measure of live heap data.
2.8.2
Experimental Results
We show empirical CDFs for all three field extractors’ memory usage and extraction speed on
each dataset. These show FlowSifter’s memory use dominates both BinPAC and UltraPAC.
There is slight overlap in parsing speed, but FlowSifter clearly has better best case, worst case
and average speed than both BinPAC and UltraPAC. The efficiency curve nearly matches the
speed curve, with FlowSifter having infrequent worst and best efficiency, and still showing
much improvement over BinPAC and UltraPAC.
2.8.2.1
Parsing Speed
As shown in Figure 2.12, compiled FlowSifter (siftc) is significantly faster than simulated
FlowSifter (sift) which is faster than both BinPAC (bpac) and UltraPAC (upac). Each cluster
of points has its convex hull shaded to make it easier to separate the clusters while still
47
(a) LL traces
(b) HM traces
(c) IU traces
Figure 2.12: Comparison of parsers on different traces
48
showing the individual data points. Across the range of traces, compiled FlowSifter’s average
speedup over BinPAC ranged from 11 (LL) to 21 (IU). UltraPAC was able to parse faster
than BinPAC, but compiled FlowSifter still kept an average speedup of between 4 (LL) to
12 (IU). This means that on a reallife dataset, FlowSifter was able to achieve an average
performance of over 12 times that of the previous best field extraction tool. Further speed
improvement is possible with ASIC implementation, which could make one LPDFA transition
per cycle at 1GHz, resulting in up to 80Gbps performance on the IU traces with a single
engine.
Simulated FlowSifter’s LPDFA speed is 1.8Gbps; compiled FlowSifter’s LPDFA speed is
2.6Gbps. These speeds were measured by running a simple LPDFA on a simple input flow.
As shown in Figure 2.12, the FlowSifter implementations can run both faster and slower than
their LPDFA speed. FlowSifter can traverse flows faster by using the CA to perform selective
parsing. For example, for an HTTP flow, the CA can process the ContentLength field into a
number and skip the entire body by ignoring that number of bytes from the input. BinPAC
and UltraPAC improve their performance similarly through their &restofflow flag.
However, the CA introduces two factors that can lead to slower parsing: evaluating expressions and context switching from an LPDFA to a CA and then back to an LPDFA.
Evaluating predicates and performing actions is more costly in the simulated FlowSifter implementation than in the compiled implementation. Context switches are also more costly
for the simulated implementation than the compiled implementation. Further, the compiled
implementation can eliminate some context switches by compiler optimizations. That is, it
can inline the code for a CA state into the decision for the LPDFA so that one LPDFA can
directly start the next one.
49
To test FlowSifter’s approximation performance, we made a SOAP field extractor that
extracts a single SOAP data field two levels deep and then ran it on our 10 traces for each
value of n ranging from 0 to 16. The resulting average parsing speeds with 95% confidence
intervals for each value of n are shown in Figure 2.13a. For both compiled and simulated
implementations, the parsing speed on this grammar is much less than on plain HTTP. The
major reason for this is that with HTTP, there are no fields to extract from the body, so the
body can be skipped. With SOAP traces, we cannot exploit any skip.
Also, as the recursion level increases, the number of CA transitions per DFA transition
increases. This causes FlowSifter to check and modify counters more often, slowing execution.
2.8.2.2
Memory Use
While the graphs in Figure 2.12 show FlowSifter using less memory than BinPAC and UltraPAC, the different number of active flows in each capture make direct comparison harder.
Each point in Figure 2.13b shows the total memory used divided by the number of flows
in progress when the capture was made. This shows FlowSifter uses much less memory per
flow (and thus per trace) than either BinPAC or UltraPAC. On average over our 45 LL
traces, FlowSifter uses 16 times less memory per flow (or trace) than BinPAC and 8 times
less memory per flow (or trace) than UltraPAC.
FlowSifter’s memory usage is consistently 344 bytes per flow for simulated automaton
and 112 bytes for compiled automaton. This is due to FlowSifter’s use of a fixedsize array
of counters to store almost all of the parsing state. BinPAC and UltraPAC use much more
memory averaging 5.5KB and 2.7KB per flow, respectively. This is mainly due to their buffering requirements; they must parse an entire record at once. For HTTP traffic, this means an
entire line must be buffered before they parse it. When matching a regular expression against
50
flow content, if there is not enough flow to finish, they buffer additional content before trying
to match again.
2.8.2.3
Parser Definition Complexity
The final point of comparison is possibly less scientific than the others, but is relevant for
practical use of parser generators. The complexity of writing a base protocol parser for each
of these systems can be approximated by the size of the parser file. We exclude comments
and blank lines for this comparison, but even doing this, the results should be taken as a
very rough estimate of complexity. Figure 2.13c shows a DNS and HTTP parser size for
BinPAC and FlowSifter and HTTP parser size for UltraPAC. UltraPAC has not released
a DNS parser. The FlowSifter parsers are the smallest of all three, with FlowSifter’s DNS
parser being especially small. This indicates that CCFG grammars are a good match for
application protocol parsing.
2.9
Conclusions
In this work, we performed a rigorous study of the online L7 field extraction problem. We
propose FlowSifter, the first systematic framework that generates optimized L7 field extractors. Besides the importance of the subject itself and its potential transformative impact on
networking and security services, the significance of this work lies in the theoretical foundation that we lay for future work on this subject, which is based on wellestablished automata
theory.
With this solid theoretical underpinning, FlowSifter generates highspeed and stackless
L7 field extractors. These field extractors run faster than comparable state of the art parsers,
51
use much less memory, and allow more complex protocols to be represented. The parsing
specifications are even by some measures simpler than previous works. There are further
improvements to be made to make our field extractor even more selective and efficient by
further relaxing the original grammar.
52
(a) Average parsing speed for SOAPlike flows versus
recursion depth n
(b) Memory per flow on LL traces
(c) Complexity of base protocol parsers
Figure 2.13: Various experimental results
53
Chapter 3
Regex Matching
3.1
3.1.1
Introduction
Motivation
Deep Packet Inspection (DPI) is the core operation in a wide range of networking and security services (such as malware detection, data leak prevention, application protocol identification, load balancing, quality of service, differential billing, copyright enforcement, and
usage monitoring) on most networking middleboxes and security devices (such as routers,
firewalls, and Network Intrusion Detection/Prevention Systems (NIDS/NIPS)). In the past,
string matching was used in DPI; nowadays, Regular Expression (RegEx) matching has become the industry standard because RegExes are more expressive than strings. Given a flow
as a stream of bytes, a RegEx matching algorithm needs to determine which RegExes in a
predefined set are matched in that flow. As each packet of each flow needs to go through
RegEx matching, the memory and time efficiency of RegEx matching is critical to achieve
low latency and high throughput. However, it is difficult to achieve both memory and time
efficiency for RegEx matching. The Nondeterministic Finite Automata (NFA) model gives
us the best memory efficiency as memory grows linear with the size of RegExes, but has
the worst time efficiency because we need to maintain multiple active states and perform
one memory lookup per active state for each input character. The Deterministic Finite Au54
tomata (DFA) model, on the other hand, gives us the best time efficiency of one lookup per
character, but has the worst memory efficiency because of the well known state explosion
problem  the number of DFA states can be exponential in the size of RegExes. Memory
efficiency is critical for RegEx matching because it needs to run in onchip SRAM to achieve
high speed as it needs low latency randomaccess memory reads.
To achieve high speed, RegEx matching algorithms need to be Application Specific Integrated Circuit (ASIC) friendly, i.e., they can be implemented in a small and fast ASIC block.
The size of an ASIC block depends on the number of logic gates needed to implement the
algorithm, and its speed (i.e., clock frequency) depends on circuit complexity. Not only chip
fabrication cost is proportional to its die area, but also for networking and security devices
such as IPSes, area efficiency of the circuit board is a critical issue.
3.1.2
Limitations of Prior Art
The software RegEx matching solution that represents the state of the art is XFA [28, 29].
An XFA is a DFA where each state is augmented with a program. The transition from a
source state to a destination state triggers the execution of a program associated with the
destination state. Despite its theoretical elegance, XFA has two fundamental limitations. (1)
XFA does not have a fully automated construction algorithm (with a given RegEx set as
input and a memory image for runtime packet processing as output) fundamentally because
the XFA construction process requires human experts to annotate the given RegEx set [30].
Furthermore, both the nonautomated manual processing and the automated computation
required in XFA construction may take a long time. For manual processing, XFA authors
reported that the manual processing may take one hour even for one RegEx, and it is “prohibitively timeconsuming” for some RegExes from Snort [28]. For automated computation,
55
(a) NFA
(b) HFA
(c) DFA
Figure 3.1: NFA, HFA, and DFA generated from RegEx set: {EFG, X.*Y}
56
the required time to make the transitions deterministic varies for different RegEx sets, with
13% of the test set in [28] taking between 16 minutes and 2 hours. Slow construction renders
XFA unsuitable for applications where RegEx updating is frequent. (2) XFA is not ASIC
friendly because the ASIC implementation of XFA would require much of the complexity of
a general purpose CPU to implement the programs associated with states. Such an ASIC
chip will not only be overly expensive but also too complex to develop and verify due to the
complex interactions among different components. Furthermore, because each XFA state’s
program may take a very different amount of time to process, in the ASIC implementation
of XFA, the processing elements at different stages would be very difficult to pipeline. These
two reasons partially explain why XFA has not been implemented in ASIC since its invention. Nevertheless, XFA is no doubt an important RegEx matching solution as it shows the
power of DFAs with annotated instructions.
3.1.3
Proposed Approach
In this paper, we propose HASIC, a Historybased Finite Automaton (HFA) based RegEx
matching scheme that is both fully automated and ASIC friendly. An HFA is a DFA with
an auxiliary vector of bits (called history bits) where each transition is augmented with
a boolean condition specified in history bits and some actions of setting/clearing history
bits.The current HFA state, input character, and history bits jointly determine a single
outgoing transition, and following this transition the actions to the history bits are applied.
We now explain why history bits can exponentially reduce the state explosion in DFAs.
As a large number of RegExes can be partially matched in parallel, the DFA must track the
progress of every combination of partial matches with a separate state. An exponentially
large number of partial matches may occur leading to an exponential number of states. In
57
reallife RegEx sets, many partially matched patterns are persistent; that is, once they have
been found, they stay active for a long period of time, waiting to continue matching the
rest of their pattern. Such partial matches are often the primary culprit behind the state
explosion in DFAs. The bitvector in HFAs can compactly represent these partial matches
and therefore alleviate state explosion. For example, in the HFA in Figure 3.1b built from
RegEx set {EFG, X.*Y}, we use one bit b to remember the partial match of character X for
RegEx X.*Y, which eliminates the duplication of the NFA states corresponding to RegEx
EFG.
Although both HFA and XFA can exponentially reduce state explosion, they are fundamentally different. First, HFA places conditions and actions on transitions, using the history
value to direct the operation of the automaton. XFA instead places actions on states and
conditions on accepting states. This means that the sequence of states that XFA enters is
not affected by its memory values, only whether or not it reports a match when it reaches
an accept state. Second, they differ on operations. In HFA, the operations only include three
simple bit operations: test, set, and clear, which are very easy to implement in ASIC.
In XFA, the operations can be an arbitrary program, which requires CPU complexity to
implement in ASIC. When limiting primitive XFA operations to a set of so called “efficiently
implementable operations” [28], it is unclear what operations this set should include and
whether some set of efficiently implementable operations can guarantee that an XFA can be
constructed for any given RegEx.
Compared with XFA, HASIC advances the state of the art from two perspectives. First,
HASIC construction can be fully automated (without human annotation). In this paper, we
present efficient algorithms to construct an HFA from a RegEx set and then generate an
optimized memory image for runtime packet processing. Second, HASIC is ASIC friendly.
58
HASIC only involves three simple bit operations: test, set, and clear, and only adds a
small fixedsize bit array to the state of each flow, which makes storing and restoring the
active state of each flow efficient. Furthermore, these operations are combined without the
need for a software layer to tie them together, allowing us to represent the processing logic
with a simple fixed combination of gates.
3.1.4
Challenges and Proposed Solutions
There are two key challenges in HASIC: (1) automated and fast HFA construction (converting
a RegEx set to an HFA), (2) memory image construction (converting an HFA to a memory
image) for fast packet processing.
Automated and Fast HFA Construction: Automatic HFA construction is challenging because of two reasons. First, the number of choices for history bits is exponential in the
number of NFA states; yet, choosing the right set of history bits is most critical for eliminating state explosion. Second, the number of history bits in a high speed implementation
is limited; we must construct an automaton that uses no more than this many history bits.
We solve this by identifying NFA states that stay active over many transitions, and ranking
them so that we can choose the best states to use. Fast HFA construction is challenging because the intermediate DFA is often exponential (in the number of NFA states) to construct
in both time and space; and sometimes the DFA is too big to be practically built. In this
paper, we propose a fast and automated HFA construction algorithm whose time and space
complexity is linear in the size of the final HFA, which is the best possible complexity for
HFA construction. Our key idea is to eliminate the bit states from the NFA before doing
subset construction. We can then modify the generated states to compensate, allowing us to
avoid exploring all the potential DFA states.
59
Memory Image Construction for Fast Packet Processing: The number of memory reads per input character for finding the next state(s) is critical for RegEx algorithms
to achieve high throughput. It is challenging to minimize the number of memory reads per
character in an HFA because there can be multiple HFA transitions for a state and character pair. In this paper, we minimize the number of transitions to be searched by merging
compatible transitions into a single transition with the same net effect. To further reduce
the size of each transition in memory, we compress the actions by using an auxiliary action
mask table to store atomic actions so that the actions for any transition can be represented
as the union of these atomic actions; thus, for each transition, instead of storing its actions,
we only need to store the indexes of the atomic actions in the action mask table.
3.1.5
Key Novelty and Contributions
The key novelty is in proposing automatic HFA construction algorithms and memory image
construction algorithms, where we identify many optimization opportunities. The main contribution is in proposing an automated and ASIC friendly RegEx matching scheme. Specifically, we make the following key contributions. First, we propose an automated and optimized
HFA construction algorithm, and then propose a fast HFA construction algorithm with linear
complexity in the final HFA size. Second, we propose an optimized memory image construction algorithm for fast runtime packet processing. Finally, we conducted experiments using
realworld RegEx sets and various traffic traces. As we cannot construct XFA from our RegEx
sets, we estimate the packet processing speed on our hardware based on the results reported
in [28]. Our results show that HFA runs an average of 3.34 times faster than XFA. In comparison with DFA, for automata construction speed, HFA is orders of magnitude faster than
60
DFA; for memory image size, HFA is an average of 20.4 and 16.5 times smaller than DFA
for memory bus width of 16 and 32 bytes, respectively.
3.2
Related Work
Existing RegEx matching schemes fall into three categories based on their underlying implementation technology: software based, FPGA based, and TCAM based.
Softwarebased schemes are generally based on deterministic automata to achieve high
throughput. The difference between them is in their approach to solving the DFA state
explosion problem. We divide softwarebased schemes based on whether they introduce auxiliary memory to the automaton or not. Schemes that do not introduce auxiliary memory
include D2 FA [31], mDFA [32], and HybridFA [33]. D2 FA achieves significant compression
of the transition table, but does not solve the exponential explosion in the number of states.
mDFA and HybridFA avoid building too large a DFA by either building multiple DFA or
by producing a partiallydeterministic automaton. In both cases, there will be multiple simultaneous active states, causing a large reduction in throughput. Schemes that augment
auxiliary memory include XFA [28,29], extendedFA [34], and HFA [35]. XFA and extendedFA both propose hardware designs that are, in broad strokes, a plain DFA that processes all
traffic plus a much more complex logic that handles the parts of the RegExes that are too
complex to add to the DFA. The complexity of this second layer makes them unsuitable for
ASIC implementation. Additionally, the task of coupling these two layers together to achieve
guaranteed high performance is even more difficult, as the second layer’s processing cost per
input byte is highly variable. In [35], Kumar et al.briefly proposed the theoretical model of
HFA and a manual HFA construction method that requires human experts to design history
61
bits. However, they proposed neither the methods for the automatic construction of HFAs
nor the methods for generating memory images for runtime packet processing. Furthermore,
the manual HFA construction method in [35] requires first constructing a DFA from the
given RegEx set and then constructing the HFA from the DFA. Although the final HFA is
memory efficient, the intermediate DFA is often exponential (in the number of NFA states)
to construct in both time and space; and sometimes the DFA is too big to be practically
built. Note that the solution for handling large counters in [35] can be applied in our scheme
as well.
The field of regular expression matching using FPGA includes a huge breadth of work
[36–43]. These techniques all develop a circuit on FPGA that takes in packet data and reports
match information. Important to their methods is using the reprogrammability of the FPGA
to have the flexibility to handle many pattern sets. These techniques are effective for a fixed
pattern set or for environments where the pattern matching tool can be taken offline without
penalty. Because the resynthesis procedure to update the patterns is complex and requires
taking the FPGA offline, FPGA solutions have issues with practical deployment in many
scenarios. As well, the matching state of many FPGA solutions is large, making it expensive
to save and restore this state when matching a large number of interleaved network flows.
As an example, in Bando et al. [36], the internal state includes 1.5 Kbits of status flags for
string matching modules. This makes the handling of interleaved flows much more complex,
as saving and loading that state is very expensive.
A newer line of research is the use of TCAM technology to encode pattern matching
automata [44, 45]. TCAM are content addressable memory with wildcards, meaning that
a binary string is input, and the set of ternary patterns stored in the memory are checked
to find the first pattern that matches the query. An SRAM associated with the TCAM
62
allows a value to be associated with each pattern in the TCAM. The transition table of
an automaton can be implemented by creating query strings that indicate the current state
and input character and storing the destination state of the transition in the SRAM. The
downside of TCAM is its high power usage, as every query is matched against every pattern.
3.3
3.3.1
Automatic HFA Construction
Basic Construction Method
The original algorithm to construct an HFA from a RegEx set first constructs an NFA and
uses subset construction to create the corresponding DFA. The label of each DFA state is the
subset of NFA states that the DFA state is constructed from. Second, the DFA is “folded”
into an HFA by repeatedly turning a single NFA state into a history bit, removing this label
from all DFA states, and merging those DFA states that now have identical NFA state labels.
We call the removed NFA states “bit states”. To convert an NFA state s into a bit state,
we partition the DFA into two groups: P , which consists of the states that have the NFA
state s in their label, and N , which consists of those that do not. For example, to convert
NFA state 4 in Figure 3.1a into a bit state, we partition the DFA in Figure 3.1c into two
groups: P = {4, 5, 6, 7, 8} and N = {0, 1, 2, 3}, those states that have the NFA state 4 and
those that do not. We use an example state label 0,3,4
to explain the way that we label DFA
7/1
(or HFA) state in this paper: 0, 3, 4 is the set of NFA states that corresponds to this DFA
state, 7 is the DFA state ID, and /1 denotes that this is an accepting state and the RegEx
ID 1 (namely EFG) is matched upon reaching this state. Removing label s from each state in
P will allow us to merge each state and their transitions in group P with the corresponding
state and transitions in N . For example, removing NFA state label 4 from each DFA state
63
in P = {4, 5, 6, 7, 8} will allow us to pairwise merge DFA states 4 and 0, 5 and 1, 6 and 2,
and 7 and 3.
Transitions leaving a state in P can now be taken when the corresponding history bit for
N is set and transitions leaving a state in N can only be taken when that bit is clear. For
example, in Figure 3.1b, the transitions from HFA states 1, 2, and 3 to state 4 on character
Y can only be taken when b is set. The transitions that were going from P to N must clear
the history bit, and from N to P must set it. For example, in Figure 3.1b, the transition
from HFA state 0 to 0 on character X corresponds to the transition from DFA state 0 to
4. Because this transition goes from N to P , it sets bit b. Transitions that stay within a
group do not need an action to modify the history bits. For example, in Figure 3.1b, in the
transitions among HFA states 0, 1, 2, and 3, there is no action to modify bit b. Repeating
the above process constructs an HFA with multiple history bits.
3.3.2
Bit State Selection
Choosing bit states is critical for HFA. The best case for compressing the DFA by converting
an NFA state s into a history bit is when the DFA states that include s in their labels exactly
mirror the DFA states that do not, which allows us to halve the number of automaton states.
For example, in Figure 3.1a, by choosing NFA state 4 to be a bit state, we almost halve the
number of HFA states.
Before presenting our method for choosing the right bit states, we first introduce some
new concepts. The selflooping degree of an NFA state is defined as the percentage of the
number of input characters that the state transitions to itself on. An NFA state is completeselflooping if its selflooping degree is 100%. For example, in Figure 3.1a, both states 0 and
4 are completeselflooping. RegExes with .* cause most completeselflooping states. Once
64
a completeselflooping becomes active, it remains active. An NFA state s1 shadows another
state s2 if and only if every time when state s2 is active, s1 is also active. For example, in
Figure 3.1a, state 0 shadows every other state, and state 4 shadows state 5.
The shadowing relationship reduces state explosion by eliminating the combinations of
states for which we need to generate new DFA states. For example, in Figure 3.1a, we do
not need to generate a new DFA state for the combination of states {0, 5} because whenever
state 5 is active, state 4 is active. Two NFA states s1 and s2 are exclusive if and only if they
cannot be simultaneously active. For example, in Figure 3.1a, states 1, 2, and 3 are mutually
exclusive. Exclusive relationship also reduces state explosion by eliminating the combinations
of states that we need to generate new DFA states for. For example, in Figure 3.1a, we do
not need to generate a new DFA state for the combination of states {1, 2} because they
cannot be simultaneously active. NFA states s1 and s2 are independent if and only if two
conditions are satisfied: (1) there is no shadowing relationship between them, and (2) they
are not exclusive.
Independent states cause state explosion in DFAs. Given an NFA with n independent
persistent states and m other states, using d(m) to denote the size of the DFA constructed
only from the m states, the size of the final DFA is in the order of 2n ∗ d(m). This comes
from the d(m) states being copied for all 2n ways for the independent states to be active.
For an NFA state that is independent from most of other states, if we choose it to be a
bit state, then we almost halve the DFA size. As NFA states with a high selflooping degree
tend to remain active for a long time, we use the NFA states with a high selflooping degree
that are shadowed by completeselflooping states as bit states, as these are likely to be
independent with a large number of other states.
65
3.3.3
HFA Construction without DFA
In constructing an HFA from a RegEx set, the intermediate DFA may be too big to be
practically generated due to state explosion, even if the final HFA is small. Next, we present
our HFA construction algorithm that can directly build the HFA from an NFA without
generating and storing the intermediate DFA. Before we present our algorithm, we first
introduce a new concept of equivalent state classes: given an NFA with a set of bit states
B, for two DFA states that correspond to two sets of NFA states S1 and S2 , if the two NFA
state sets only differ on bit states (i.e., S1 − B = S2 − B), then the two DFA states are
equivalent with respect to bit state set B. This relationship partitions the DFA states into
equivalence classes. Considering the example in Figure 4.3, when choosing NFA state 4 as
a bit state, the DFA states 1 and 5 are in the same equivalence class as they correspond to
NFA state sets {0, 1} and {0, 1, 4}, respectively.
The this HFA construction algorithm is similar to the standard subset construction algorithm as in DFA construction, but it only generates one HFA state per equivalence class.
Let B be the set of bit states. Each time we generate a DFA state that corresponds to a set
of NFA states S, we append its transitions to the HFA state S − B. For each transition from
DFA state S to DFA state D, we add an HFA transition, with a condition and an action,
from S − B to D − B on the same character. The condition corresponds to S ∩ B, meaning
that this transition can only be taken when the history bits corresponding to S ∩ B are set
and the remaining history bits are clear. The action consists of two parts: setting the history
bits corresponding to (D − S) ∩ B and clearing the history bits corresponding to (S − D) ∩ B.
An HFA with a vector H of k history bits, has 5tuple transitions (S, c, C, A, D) which
are the source state, input character, condition, action, and destination state. The source
66
state and destination state of an HFA are HFA states, which are written as a set of NFA
states. The condition of an HFA transition is represented as a vector C of k ternary bits,
denoting the condition ∧k−1
i=0 (C[i] = H[i]). A ternary bit has three possible values: 0, 1, or *.
Note that for each 0 ≤ i ≤ k − 1, when C[i] is *, C[i] = H[i] is true regardless of the value of
H[i]. The action of an HFA transition is represented as a vector A of k bitwise operations.
Each bitwise operation is either set (denoted as s), clear (denoted as c), or donothing
(denoted as n). For each 0 ≤ i ≤ k − 1, A[i] = s means the action assigns 1 (sets) to H[i],
A[i] = c means the action assigns 0 (clears) to H[i], and A[i] = n means the action does
nothing to H[i]. Table 3.1 shows example transitions with 3 history bits for a HFA state
and character pair. The pseudocode of this HFA construction algorithm, called HASICD is
shown in the appendix.
Condition
Action
Destination State
*00
*10
*01
*11
nnn
ncn
nnn
ncn
{1}
{1}
{1, 5}
{1, 5}
Table 3.1: Example transitions before optimization
3.3.4
Transition Table Optimization
The above HFA construction algorithm avoids DFA state explosion by merging many variants
of the same NFA states into one HFA state; however, all it adds each DFA transition to the
HFA. Next, we introduce our HFA transition table optimization algorithm that can efficiently
store transitions while allowing fast transition lookup. We first introduce a new concept called
mergeable bit actions. Given an action a on a ternary bit t, we use a(t) to denote the resulting
value after applying action a on t. Two action and ternary bit pairs (a1 , t1 ) and (a2 , t2 ) are
67
mergeable if and only if there exists an action a3 so that a1 (t1 ) = a3 (t1 ) and a2 (t2 ) = a3 (t2 ).
We call a3 a merged action of (a1 , t1 ) and (a2 , t2 ). For example, (n, 0) and (s, 1) are mergeable
and the merged action is n. One merging may have two merged actions, either n and c or n
and s. For example, action n on bit 0 and action c on bit 0 have two possible merged actions:
n and c. In such cases, we choose n to be the merged action because it has less bit operations
than c or s. Note that the choice of merged actions can never be between c and s because
their results are always different. Table 3.2 shows the merged actions of all possible pairs of
bitaction pair, where  denotes that the two bitaction pairs are not mergeable.
n0
n1
n*
c0
c1
c*
s0
s1
s*
n0 n1 n* c0 c1 c* s0 s1 s*
n n n n c c
n
n n n n
s n s
n n n n
n
n n n n c c
n
c
c c c
c
c c c
s
s s s
n n n n
s n s
s
s s s
Table 3.2: HFA transition mergeability table
Now we introduce another concept called mergeable transitions and discuss how to minimize HFA transitions by identifying and merging such transitions. In an HFA with k history
bits, a transition (S, c, C1 , A1 , D) and a transition (S, c, C2 , A2 , D) are mergeable if and only
if both of the following conditions are satisfied: (1) C1 and C2 differ in only one bit and
(2) for 0 ≤ i ≤ k − 1, (A1 [i], C1 [i]) and (A2 [i], C2 [i]) are mergeable. For these two mergeable transitions, assuming C1 and C2 differ in bit i, we merge them into one rule by both
replacing C1 [i] by ∗ and replacing A1 [i] by the merged action from Table 3.2. For example,
in Table 3.1, the first two transitions and the last two transitions are mergeable. Table 3.3
68
shows the two merged actions. This optimization allows many HFA transitions to be stored
in a small memory while allowing fast packet processing.
Condition
Action
Destination State
**0
**1
ncn
ncn
{1}
{ 1, 5 }
Table 3.3: Table 3.1 transitions after optimization
3.4
Fast HFA Construction
In this section, we propose the first algorithm to generate deterministic automata without
exploring all the possible NFA state interactions. Compared with the HFA construction
algorithm in Section 3.3, this algorithm runs significantly faster at the price of producing a
possibly larger HFA.
3.4.1
Observation and Basic Ideas
In observing the HFAs that our previous algorithm constructed from reallife RegEx sets, we
observe that many HFA states have exactly 2n outgoing transitions for a particular input
character where n is the number of bit states. This occurs when n bit states have departing
transitions to distinct NFA states on the same input character. When we merge all the
transitions from all the DFA states in an equivalence class, our previous algorithm has to
create a conditional transition for each reachable combination of history bits. This constitutes
a significant portion of the HFA construction time. To speed up HFA construction, our key
idea is to simply assume each bit state is independent from all other states; thus, we can
precompute an incoming transition table and an outgoing transition table, which we call the
69
mixin incoming table and mixin outgoing table, respectively. These two tables consist of all
the transitions introduced by all combinations of bit states. Then, we can mix this table
with the transition table of each HFA state. This may introduce some transitions that can
never be taken, but they do not affect the correct execution of the automaton. We choose
the term “mixin” because of its resemblance of the use of mixin classes in objectoriented
programming, where a mixin is a class that provides a certain functionality to be inherited
or reused by derived classes.
To construct an HFA from an NFA using this method, we first identify bit states as
described earlier. Second, we generate a mixin incoming table that is constructed from all
incoming transitions to bit states and a mixin outgoing table that is constructed from all
outgoing transitions from bit states. Third, we remove all bit states and their incoming and
outgoing transitions from the NFA to produce a pruned NFA that has only nonbit states.
Finally, we generate HFA states using subset construction on the pruned NFA, “mixing in”
the transition information from the two mixin tables.
To illustrate this process, we will show step by step conversion of the example NFA in
Figure 3.2 into the HFA in Figure 3.5.
Figure 3.2: Input NFA
70
3.4.2
Bit State Pruning
Given an NFA and its bit states, we produce a pruned NFA by removing the bit states from
the NFA as well as all their incoming and outgoing transitions. The information about these
bit states, which is missing in the pruned NFA, will be kept in two mixin tables. When a
history bit is set, this impacts the action of a transition and/or its destination. The mixin
tables capture both of these effects and allow us to apply them to the full HFA. In the NFA in
Figure 3.2, which is constructed from the regular expressions a.*b and b[^a]*b, we choose
states 1 and 3 as the bit states, which means that we have two history bits that correspond
to NFA states 1 and 3. The pruned HFA is shown in Figure 3.3.
Figure 3.3: Pruned NFA
3.4.3
Mixin Table Generation
We use two tables to store the information about bit states: the mixin incoming table and the
mixin outgoing table. For the mixin incoming table, we generate it directly from the NFA.
For the mixin outgoing table, we first generate an outgoing table for each bit state and then
merge them into one table. Note that these two tables are only used for HFA construction
and they are not part of the final HFA memory image.
71
To generate the mixin incoming table, we simply record all the transitions from nonbit
states to bit states. Figure 3.4 shows the mixin incoming table for the NFA in Figure 3.2,
whose entries are three tuples, (q, c, A). Note that the source field of this table is a single
NFA state. The first entry shown comes from the incoming transition to NFA state 1 on
input character a. The source field of the transition is 0, indicating that this is only available
from HFA states constructed from NFA state 0. The action “sn” means that we should set
the first history bit and do nothing to the second history bit. Similarly, on b, the action “ns”
sets the second history bit.
Src.
input
Act.
0
a
b
sn
ns
Table 3.4: Mixin Incoming Table
The outgoing table for a bit state has entries that are 4tuples, (c, b, A, D), representing
an input character, single history bit, action, and destination state. The input character and
history bit value uniquely determine an action and a destination state. For input characters
that are not shown, the default entry is an action that does nothing and an empty destination
state, marked “else”. To generate the mixin outgoing table entries for a bit state, we first
examine its outgoing transitions that are not looping back to itself. Figure 3.5 shows the mixin
outgoing table of bit state 1, which has a transition on b to state 2. The corresponding entry
in this table means that in the final HFA, whenever bit state 1 is active, on character b, we
take no additional action and we transition to a combination of NFA states containing state
2. For transitions that leave a bit state to go to another bit state, the outgoing table entry sets
the history bit for the destination state instead of putting that state in the destination set.
When processing an input character for which some bit state has a selflooping transition, the
72
corresponding history bit will not change. But we must clear the history bit upon processing
a character lacking a selfloop transition. Figure 3.6 shows an example of this: bit state 3
does not have a transition to itself on character a, thus we need the first entry in its outgoing
table, which clears the history bit for state 3 when it is set.
input
b
else:
*
Cond.
1
0
*
Act.
nn
nn
nn
Dest.
{2}
∅
∅
Act.
nc
nn
nn
nn
nn
Dest.
∅
∅
{4}
∅
∅
Table 3.5: Bit State 1 Outgoing Table
input
a
b
else:
*
Cond.
1
0
1
0
**
Table 3.6: Bit State 3 Outgoing Table
After we generate a mixin outgoing table for each bit state, we merge these individual
outgoing tables into a single mixin outgoing table, which will be used during the subset construction process. The entries in this table are 4tuples, (c, C, A, D), similar to the outgoing
table for a single state, but with kbit conditions. Given multiple individual mixin outgoing
tables, we construct all combinations of rules in the input tables that have the same input
character, σ. The resulting destination of two rules is the union of their destination state
sets. The rule for merging actions is based on NFA semantics, where a state becoming active
takes precedence over that state becoming inactive from the lack of a transition. Thus, the
result of merging two actions is s if either input is s, c if some action is c, otherwise n. For
73
example, the first entry in Figure 3.4 is the result of merging the entry for a in Figure 3.6
with the default entry from Figure 3.5. For actions, we have nn+nc = nc as the merging
result. For destinations, we have ∅ ∪ ∅ = ∅ as the merging result. Since we have two entries
on b in each table, the mixin outgoing table will have 4 entries, as shown in Figure 3.4.
input
a
b
else:
*
Cond.
*1
*0
00
10
01
11
**
Act. Dest.
nc
∅
nn
∅
nn
∅
nn
{2}
nn
{4}
nn {2,4}
nn
∅
Figure 3.4: Mixin Outgoing Table for 1&3
3.4.4
HFA Transition Table Generation
We now construct HFA from the pruned NFA, not the original NFA. We still use subset
construction considering all possible combination of states in the pruned NFA. But note
that the number of all combinations of pruned NFA states is orders of magnitude smaller
than that of the original NFA. This explains why this HFA construction method is much
faster than the one in Section 3.3. The start state of the generated HFA corresponds to the
start state of the NFA without any bit states active. The starting value of the history bits
has all bits cleared except those corresponding to history states in the start state of the NFA.
In generating transitions, we mix in the transitions in the mixin transition table as follows.
The transitions generated from the pruned NFA, which has no bit states, can be represented
as 3tuples: source NFA state set, input character, and destination NFA state set. Let (S, c, D)
denote a transition generated from the pruned NFA. The transitions in the mixin incoming
table can be represented as 3tuples: source NFA state, input character, and action. Let
74
(qi , ci , Ai ) (1 ≤ i ≤ m) denote the m entries of the mixin incoming table with ci = c and
qi ∈ S. We will merge all these actions into the result, so we can write M as the result of
merging all the Ai . Recall that the result of merging two actions is s if either input is s, c
if some action is c, otherwise n. Let (cj , Cj , Aj , Dj ) (1 ≤ i ≤ n) denote the n entries of the
mixin outgoing table with cj = c. In generating the transitions for each NFA state, we merge
the destination set constructed from the pruned NFA with the two mixin tables in a manner
similar to how we merge individual mixin outgoing tables. For each entry (cj , Cj , Aj , Dj ) in
the mixin outgoing table, we create an HFA transition (S, c, Cj , M + Aj , D ∪ Dj ).
In our example, the subset construction process generates an HFA state for NFA state {0}
and constructs the transition ({0}, a, {0}). This is augmented by the mixin tables to become
({0}, a, ∗1, sc, {0}) and ({0}, a, ∗0, sn, {0}), which can be compressed into the transition
from HFA state 0 to itself on input character a, setting the history bit for bit state 1 and
clearing the history bit for bit state 3. The subset construction process also constructs a
similar transition for b: ({0}, b, {0}). This transition is expanded to four transitions, each
with different destinations: ({0}, b, 00, nn, {0}), ({0}, b, 10, nn, {0, 2}), ({0}, b, 01, nn, {0, 4}),
({0}, b, 11, nn, {0, 2, 4}). The newly reachable states from {0} have their transition tables
generated in the same manner until all states are constructed. Applying this algorithm to
the pruned NFA in Figure 3.3, the Mixin Incoming Table in Figure 3.4 and the Mixin
Outgoing Table in Figure 3.4 produces the HFA in Figure 3.5.
The work necessary to produce HFA by this process is not proportional to the size of
the DFA. When we prune the bit states that cause exponential increase in DFA size and
construct HFA using mixin tables, we avoid enumerating exponentially many combinations
of NFA states. Our experimental results show that this fast HFA construction algorithm
75
Figure 3.5: Output HFA
produces HFAs of similar size as our previous HFA construction algorithm at orders of
magnitude higher speed.
3.4.5
Correctness of Fast HFA Construction
Next, we prove the correctness of our fast HFA construction algorithm. We first introduce
some new concepts and notations. The configuration of an automaton means the entirety
of its internal state. For a DFA, this is the single active state, for an NFA, it is the set
of active states, and for HFA it is both the active state and the value of the history bits.
The configuration of an HFA after processing some input can be written (S, H), where S
is the active HFA state, identified by its equivalent NFA state set, and H is the current
history bit vector, also identified with a set of bit states. We can partition the configuration
of an NFA into {S, H}, where S is the set of active nonbit states and H is the set of
c
active bit states. For an automaton (HFA or NFA) A, we can use A : S → D to indicate
that automaton A transitions from configuration S to configuration D on input character
s
c. Further, A ❀ D indicates that D is the final configuration after processing string s
with automaton A, starting from the automaton’s initial state. Theorem 3.4.1 states the
76
correctness of our fast HFA construction algorithm. For an action A and history bit vector
H, we use A(H) to denote the resulting history bit vector after applying action A on H.
Because a history bit vector uniquely defines a set of bit states, we also use H and A(H) to
denote the bit state sets defined by them.
Theorem 3.4.1
For any NFA N with state set Q and alphabet Σ, string s over Σ, character c ∈ Σ, state
sets S1 , H1 , S2 , H2 ⊆ Q, and HFA H constructed by our fast HFA construction algorithm
s
c
s
from N , if N ❀ {S1 , H1 }, N : {S1 , H1 } → {S2 , H2 }, and H ❀ (S1 , H1 ), then we have
c
H : (S1 , H1 ) → (S2 , H2 ).
Proof 3.4.1
We must show that the HFA transition for (S1 , H1 ) on character c has destination state S2
and action A such that A applied to H1 results in H2 . We will do this by decomposing the
NFA transition based on the bit states and showing how each part is mirrored in the HFA
construction. Using S to indicate nonbit states and H to indicate bit states, the HFA must
account for the effects S → S, S → H, H → S and H → H. We divide the NFA transition
c
c
N : {S1 , H1 } → {S2 , H2 } into two partial transitions: N : S1 → {Sa , Ha } consisting of
c
transitions from active nonbit states in S1 and N : H1 → {Sb , Hb } consisting of transitions
from active bit states in H1 . Note that Sa ∪ Sb = S2 and Ha ∪ Hb = H2 .
S → S: Because S1 is reachable, the subset construction process will generate exactly Sa
as the destination on c from S1 .
H → S: The mixin outgoing table will generate exactly Sb as it includes each nonbit
state destination of a transition on c from all the active bit states.
77
S → H: The mixin incoming table contributes M , which is the combined actions of all
NFA transitions on c from S1 to bit states, so the result of these actions will ensure that
M (∅) = Ha .
H → H: The mixin outgoing table contributes an action A that simulates transitions on
c from bit states to bit states (themselves and other bit states), corresponding to H1 → Hb
in the NFA.
The destination state of HFA transitions consists of the NFA states calculated by the
subset construction procedure and the NFA states obtained from the mixin destination
table. Thus, the HFA destination state is Sa ∪ Sb = S2 . The action for HFA transitions is
constructed from the mixin incoming table and the mixin outgoing table. By the semantics of
action merging, their combination M + A has the property (M + A)(H1 ) = M (∅) ∪ A(H1 ) =
Ha ∪ Hb = H2 .
Theorem 3.4.2
The HFA generated by mixin construction simulates the NFA it was generated from correctly.
Proof 3.4.2
We will proceed by induction on the input string using Lemma 3.4.1. Given an input string
s, NFA N and HFA H constructed from N by fast HFA construction algorithm, we will
s
s
show that if N ❀ {S1 , H1 } then H ❀ (S1 , H1 ). Base case: s is the empty string. In this
case, the initial state of the HFA corresponds to the initial state of the NFA by construction.
s
Recursive case: s can be decomposed into s + c, where c is the final character of s. Let N ❀
{S2 , H2 } define the state that N reaches after processing s . From the recursive hypothesis,
s
H ❀ (S2 , H2 ). Using Theorem 3.4.1 to make the final transition on character c, we conclude
s
that H ❀ (S1 , H1 ).
78
3.5
Fast Packet Processing
Having constructed an HFA, we need to create a memory image to represent this HFA so that
it can be used for runtime fast packet processing. In this section, we first examine design
considerations for an HFA memory layout to achieve high throughput with low required
memory bandwidth. Then, we introduce a compact format for an individual HFA transition,
and present a data structure for storing HFA transitions. Finally, we explore optimizations
on transition ordering. The pseudocode of our packet processing algorithm is shown in the
appendix. Note that if RegExes are specified in Unicode, we can convert them into RegExes
over bytes.
3.5.1
Design Considerations
The key optimization metric in designing the memory layout for HFA is to minimize the
average number of memory accesses per input character because the key limiting factor
for packet processing performance in routers/IPSes is the memory bandwidth of random
accesses. Furthermore, we want to bound the computational complexity required to calculate
the next state per input character to be small because simpler chip circuitry leads to higher
clock rate, smaller die area, and reduced power consumption.
3.5.2
Transition Format
To best suit ASIC implementation, we propose two formats to encode state transitions so
that the required processing includes only ASIC friendly operations such as bitwise logical
operations and shifting.For a 128bit memory bus and 32 history bits, we propose the transition format with a total of 128 bits as shown in Figure 3.6. We choose 128 as the number
79
of bits for encoding each transition because memory buses in routers/IPSes are often 128bit
and we want to minimize the number of memory accesses for each input character. Here, the
first 32 bits constitute the condition mask, the second 32 bits constitute the condition value,
and together they represent a ternary condition. Thus, testing whether the current history
bitvector satisfies such a ternary condition requires only simple bitwise logical operations.
The third 32 bits encode the ID of the next state and some flags (where an example flag is
for indicating whether the next state is an accepting state). The last 32 bits encode some
actions (for setting/clearing history bits) that can be executed in parallel upon taking this
transition. The format of each action is shown as “Action Format” below the transition format in Figure 3.6. The first bit in an action indicates whether that action is a valid one that
needs to be taken. The second bit in an action indicates whether the action is for setting
bits or clearing bits. The remaining six bits form the index of the entry that specifies the
target bits that need to be set or cleared in a table that we call the action mask table. The
action mask table consists of 64(= 26 ) entries where each entry has 32 bits and is called an
action mask. The ith bit in an action mask being one indicates that the corresponding ith
bit in the history is a target for this action.
For other memory bus width and number of history bits, we can design transition formats
similarly. For example, the bottom of Figure 3.6 has a transition format for a 64bit memory
bus and 16 history bits. Compared with the 128bit format, this format has only two actions,
no extra flag space and a smaller next state id field, but it is able to represent very complex
automata.
80
Transition Format for 128bit bus and 32bit history
31
23
15
7
0
ConditionvMask
ConditionvValue
NextvStatevID
Actionv1
Flags
Actionv2
Actionv3
7 Action Format
Valid/Invalid
Set/Clear
Actionv4
0
5
Index
31 Action Mask Table
0
⁞
0
01100000000000001000000000000000
⁞
63
Transition Format for 64bit bus and 16bit history
31
16 15
7
0
ConditionvMask
ConditionvValue
NextvStatevID
Flag
Action 1
Actionv2
Figure 3.6: Transition Formats
3.5.3
Action Compression Algorithm
Each HFA has one action mask table whose width is the number of history bits. If there is
no limit on the number of entries that this table can have, then we can store each distinct
action mask used by the HFA. However, we have to limit this number because the index
field in each action has a fixed number of bits. Generating such a fixed size table for an HFA
of arbitrary size becomes technically challenging. Next, we present our action compression
81
Set
Clear
Act 1
S 1,3
1,3
4
Actions for one transition
Set
Clear
Act 1
Act 2
Act 3
Act 4
C4


4
1,3
Act 2
1,2,3
C 1,3
C2
Actions for another transition
Act 3
Act 4


2
Action Mask Table
Figure 3.7: Action Mask Example
algorithm for generating the action mask table for an HFA using the transition format in
Figure 3.6.
Let us first have a deeper understanding of the action mask table generation problem.
The actions taken upon an HFA transition are setting and/or clearing some history bits. We
want to generate the action mask table for an HFA so that the following two conditions hold.
First, performing the actions for each HFA transition can be accomplished by performing no
more than four subactions where each subaction is either setting or clearing the history bits
indicated by an action mask table entry. Second, the number of subactions for each HFA
transition is maximum four because there are a total of four actions in the transition format.
This problem is a special case of the known NPcomplete problem called set basis [46]: given
two sets of finite sets, S and B, we say B is a set basis for S if and only if each set in S can
be represented as the union of some elements of B; the set basis problem asks for a set S
and positive integer k, whether there exists a set basis of size no greater than k. The major
differences lie in that we have a limit of four and there are two types of actions (of set and
clear).
Our action compression algorithm for generating the action mask table of an HFA runs
in a greedy fashion. The input is a set of set action set and clear action set pairs where
each pair corresponds to an HFA transition, each set action set contains all history bits to
82
be set, and each clear action set contains all history bits to be cleared. First, we search for
the smallest action set among all action sets and create an action mask corresponding to
that action set. Second, we subtract this smallest action set from all remaining action sets
that are the supersets of this smallest action set and remove all duplicate action sets. We
repeat the above two steps until all action sets become either empty or the action sets of
each transition cannot be further decomposed. For example, suppose one HFA transition has
the actions of setting bits {1, 3, 5, 7} and clearing bits {2, 4, 6}. If we have created one action
mask entry for {1} and another entry for {3}, then we must create two action mask entries,
one for {5, 7} and one for {2, 4, 6} because any further decomposition of the two sets {5, 7}
and {2, 4, 6} will result in more than four action mask entries for this transition.
Due to the limitation of 64 entries imposed by our transition table format, our algorithm
does not guarantee the successful generation of the action mask table for any HFA, although
we have not encountered such failures in our experiments with even complex RegEx sets.
There are two simple solutions to this problem. One is to increase the number of bits allocated
for encoding each transition. For example, if we allocate 256 bits to each transition instead
of 128 bits, with 64bit history and 4 actions, we can have 12 bits for the index for each
action, which means that the action mask table can have 212 entries instead of 26 entries. Of
course, this will increase the number of memory accesses per input character. This solution
can be adopted in the ASIC design phase in trading off efficiency and capability. Another
solution is to split the input RegEx set into two RegEx sets and build two HFAs and the
corresponding two memory images. For each input character, we will give it to both HFAs.
This solution can be adopted after ASIC has been fabricated.
83
3.5.4
Transition Table Image Construction
Next, we present our data structure for storing all the transitions of an HFA. The complexity
is due to the fact that at any HFA state, for any possible input character, there may be
multiple transitions, one of which will be taken based on the current value of history bits.
Different HFA state and input character pairs may have different number of transitions. For
an HFA with Q states over the alphabet Σ, the natural solution for storing all the transitions
is to use an array T of size Q∗Σ, where for state Qi and character σ, treating σ as an integer
T [i ∗ Σ + σ] stores a pointer that points to a linked list of all the corresponding transitions.
However, linked lists are inefficient for high speed networking and security devices for two
main reasons. First, pointers consume memory. As our transition format does not leave
room for storing such pointers, we have to use extra memory to store pointers, which may
cause more memory access per transition. Second, traversing a linked list leads to random
accesses of noncontiguous memory blocks, which is less efficient than sequential memory
reads. Based on the fact that for any HFA state and any possible input character, there is
always a transition that the current value of history bits matches, we can store all transitions
of an HFA in one array S where the transitions for each state and input character pair are
stored as one block in the array. Thus, for any state Qi and character σ, T [i ∗ Σ + σ] simply
stores the index of their first transition in S. When processing character σ at state Qi , we
first fetch transition S[T [i ∗ Σ + σ]] and check whether the current history matches the
condition of the transition. If it matches, then we consume the character and move to the
next state; otherwise, we fetch transition the next transition S[T [i ∗ Σ + σ] + 1] and repeat
the above process.
84
By the above design of the ragged array T , for any state Qi and character σ, there is
always one extra memory access for fetching T [i ∗ Σ + σ] in addition to sequentially fetching
their transitions for S. To reduce memory accesses, we move the first k ≥ 1 transitions for
each state Qi and character σ to T [i ∗ Σ + σ], where T is a new two dimensional array T
with a height of Q ∗ Σ (the same as T ) and a width of k. If state Qi and character σ have
less than k transitions, then we leave unused transitions empty in T . Thus, when processing
character σ at state Qi , we first sequentially fetch the k transitions from T [i ∗ Σ + σ] until
we hit a match; if the k transitions result in no matches, we follow the above process of
fetching transitions starting from S[T [i ∗ Σ + σ]] until we hit a match. There is a design
tradeoff for the value of k. The larger the k is, the more empty transitions T has. The
smaller the k is, the higher chance that the extra memory access for reading the index from
T occurs. Figure 3.8 illustrates this tradeoff. In our experiments, k = 1 provides the best
tradeoff between memory size and throughput for real life HFAs.
Figure 3.8: Effects of Table Width
As we sequentially read the transitions for a statecharacter pair until a match is found
and different transitions have different probabilities of matching the history, we can reduce
85
memory accesses by ordering the transitions in the decreasing order of hit probabilities. Note
that for any statecharacter pair, we can freely order its transitions because they are all nonoverlapping. The matching probability of each transition can be estimated using past traffic.
The transitions in both T and S can be dynamically reordered at runtime to adapt to traffic
conditions.
3.6
Hardware Design
The HFA constructed from our improved algorithm, the corresponding memory layout, and
the DPI engine have been designed to allow efficient hardware implementation in ASIC or
Field Programmable Gate Arrays (FPGA). This is ensured with a structured memory layout
that systematically organizes the state transitions, conditions and actions of the HFA, and
minimum logical complexity involved in the inspection engine that evaluates the transition
conditions and performs appropriate actions. In this section, we present a reference hardware
design of a HFAbased DPI engine that can be coupled with a variety of memory technologies
or onchip cache hierarchies to store the DFA. This design accommodates wide variations in
the memory bandwidth and latencies, providing the flexibility to be deployed in a variety of
platforms and achieve optimal DPI throughput.
The DPI engine receives the input character stream and performs memory operations
for state transitions. It consists of two interfaces: a host interface to receive DPI requests
and return the match responses, and a standard memory interface to load state transition
information. The host interface receives DPI requests that contain the payload data stream to
be inspected and the initial state of the automaton from where the matching operation should
start. Notice that the initial state information is important to enable stateful inspection
86
across multiple packets of a network flow. The engine responds with a list of matches to
indicate which RegExes are matched, the corresponding offsets to indicate the position in
the payload where the match occurred, and the final DFA state where the inspection ended,
which is needed to resume inspection of the next packet of the traffic flow. To perform
the RegEx matching operation, the engine uses a standard memory read interface to access
DFA transitions and an auxiliary readwrite access to a small internal memory to record
the internal states of the engine. The internal memory has two parts. The first part is used
to store the input data and track the current byte location in input data and the current
automaton state while the matching or automaton traversal is in progress. The second part
accumulates the matches found during the inspection, forming the list of match results for
the response. Upon receiving a new request, the internal auxiliary memory is initialized with
the state information in that request. Using the current active automaton state and the next
input character, the finite automaton memory is read to load the the next active state and
a match flag. If the match flag is set, the match id and current byte position are appended
to the response buffer. If there are additional input characters to process, the address of
the next memory load to fetch the next transition data is computed using the next state
ID from the loaded transition and the next character value. If there are no additional input
characters left to process, the match responses collected in the response buffer are returned
through the host interface.
The core engine logic can be pipelined over n clock cycles to achieve higher clock frequency. In this case, the data received from automaton memory and auxiliary memory will
be presented at the input of the pipeline and the results will be available after n cycles,
which will used to process the next input character and perform next set of memory operations. Additionally, there may be multiple clock cycles of latency between issuing a request
87
to memory and receiving the load data. To compensate for the latency of the engine pipeline
and the memory accesses, multiple inspection contexts should be maintained. Each context
can process a separate request and different contexts can be issued one after other every cycle in the pipeline to keep the pipeline and memory busy. To illustrate, considering a simple
example in which the core engine is not pipelined (i.e., it performs computations in a single
cycle), it takes 6 cycles to access the memory that keeps the automaton image. Without
using multiple contexts, the peak throughput would be one transition per 7 cycles, as we
have to wait for the memory load to fetch the current transition data before we can process
the next character. Using multiple processing contexts, we can process multiple requests for
different traffic flows simultaneously. The contexts are like lightweight threads, each thread
has its own internal state to track their its matching progress and multiple threads are issued
independently whenever they are ready to make progress. With 6 threads, each can issue
one load every cycle, keeping the memory at full utilization. If the core engine logic is also
pipelined at the circuit level to achieve higher clock frequency, the effective memory latency
will further increase the number of threads needed to cover the latency. This level of circuit
pipelining will increase the throughput at which loads will be issued to memory, and will be
useful only when the memory subsystem can support such throughput levels. Notice that
when multiple contexts are used, the internal auxiliary memory must be sized accordingly
so that it can store the internal information of all threads.
The reference DFA engine design is easily expanded to support HFA operations. Figure 3.9 shows the schematic diagram of the HASIC engine design to support HFA. The key
distinctions between HFA and DFA, along with the associated modifications in the hardware
design, are described below.
88
Transition
Data
NewNThread
Loads
NextNTransition
Loads
MemoryNSubsystem
Figure 3.9: Hardware design for HFA module
89
Response
matchesN
withNoffsets
OutputNMatchNBuffers
Action
Processing
Transmit
Matches,
Completion
ThreadNState
OHistoryNbits,NpayloadB
Condition
Matching
History
Update
Process
Requests
withNpayload
Receive
First, HFAs may have a variable number of outgoing transitions for a single state and
input character. The HFA construction algorithm and the memory layout are geared towards
one memory access per input character on average, although the hardware circuit is designed
to accommodate the need to fetch additional transitions for the same character. The thread
issue logic is revised accordingly to allow these operations. The offset of the current input
character being inspected is incremented only when the transition condition is matched,
unlike the DFA design in which it is incremented at every load. Second, HFA transitions
are more advanced than DFA transitions as they contain a set of conditions and actions
on the history bits. Our algorithm represents the condition as two fixed length bit vectors,
which simplifies the logic design to test conditions. If the condition matches, the action
needs to be decoded and history bits are updated accordingly before proceeding to the next
input character and loading its associated transitions. If the condition does not match, we
load the next transition on this state and character. This process repeats until hitting the
match Finally, in addition to tracking a single automaton state in a DFA based system, HFA
based systems must also track the state of the history bits associated with a traffic flow.
The initial history bits are provided as part of the incoming requests and are stored in the
internal auxiliary memory to be read and updated as transitions are processed. Finally, the
history bits are returned with the standard return data of the DFA design when the request
completes.
3.7
Experimental Results
We demonstrate the capability of HASIC by comparing it with DFA for speed because
DFA is the fastest (although the biggest), with NFA for memory size because NFA is the
90
smallest (although the slowest), and with XFA because XFA represents the stateoftheart.
We compare the construction time of the direct HFA construction algorithm in Section 3.3.3,
denoted HASICD , and the mixinbased HFA construction algorithm in Section 3.4, denoted
HASICM , with that of DFA and NFA. We cannot compare with XFA construction time as
it cannot be automated and XFA construction code is not available.
3.7.1
Data Set
The RegEx sets that we use come from a variety of sources. Sets C7, C8, and C10 are
proprietary and come from a large networking vendor. Sets S24, S31, S34, and B217 are
public sets from Snort [47] and Bro [48] that has been used in prior literature [49]. Within
C7, S31, and B217, there are a number of RegExes with .*s that have been commented off.
We further created RegEx sets C7p, S31p, and B217p by restoring the RegExes containing
.*s from these three sets respectively. As C7⊂C7p, S31⊂S31p, B217⊂B217p, we focus the
collection of RegEx sets C7p, C8, C10, S24, S31p, S34, and B217p, denoted as BCS.
The 217 RegExes in Bro217 are almost entirely string matching, although the additional
RegExes that we restored have one to three wildcard closures (.*) each. The Snort sets
have a higher density of wildcard and nearwildcard (e.g., [^\r\n]*) closures, and the C7,
C8, C10 sets have a very high density of wildcard and wildcard closures, with as many
or more closures than the number of RegExes. The density of these closures makes their
corresponding DFA much larger. To determine the scaling properties of HASIC, we created
an additional set Scale by combining all the distinct RegExes from S24, S31p, and S34.
Table 3.7 summarizes the properties of these RegEx sets with their corresponding numbers
of NFA states, DFA states, HFA states (by HASICM ), and history bits. Note that the HFAs
91
generated by HASICM and HASICD for the same RegEx set have almost the same number
of states.
Set RegExes NFA St. DFA St. HASICM St. Hist. Bits
B217p
224
2553
16527
10
C7p
11
295
244366
616
12
C8
8
99
3786
117
7
C10
10
123
19508
238
9
S24
24
702
10257
925
7
S31p
40
1436
39977
2323
18
S34
34
1003
12486
1362
8
Scale
77
2631
593810
7401
30
Table 3.7: RegEx set Properties
We evaluated our solution using both synthetic and reallife traffic traces. The synthetic
trace comes from Becchi et al.’s flow generator [50], which is a useful tool for generating
synthetic streams of various degrees of maliciousness based on given RegExes. The degree of
maliciousness depends on parameter pM where a higher value indicates more malicious traffic.
More specifically, the trace is generated such that with probability pM , each input character
transitions the automaton away from the start state, which will cause a large amount of the
DFA states to be visited, forcing a small cache hit rate and thus many accesses to main
memory. Our synthetic traces are generated with the default pM values of 0.35, 0.55, 0.75,
0.95 as specified in the tool. Furthermore, we generate a trace called rand consisting of purely
random data.
To test the overhead of handling interleaved packets from many simultaneous flows, we
also use real traffic traces. This allows us to test not only raw processing speed but also the
ability to save and load the matching state for each flow. We use three sources for realistic
trace data: (1) the DARPA intrusion detection data set, generated by the MIT Lincoln
Laboratory (LL) [25], (2) traces captured during the 2009 InterService Academy Cyber
92
Defense Competition (CDX) [51], and (3) a locally gathered traffic trace of a student lab
PC (RL). For LL, we process the ten traces from week 5 for a total of 5.8GB. For CDX, we
process the Border Data Capture traces 3 through 8, for a total of 550M. For RL, we capture
10 traces, the size of each is 0.1GB.
3.7.2
Metrics & Experimental Setup
To compare these algorithms, we measure automaton construction time, memory image size,
and packet processing throughput. Memory image size is measured by the amount of contiguous memory needed to store an automaton. Throughput is measured by the RegEx matching
time per byte in processing packets. Error bars in graphs represent standard deviation. The
experiments are carried out on a server with a Sandy Bridge Core(I72600K@3.4GHz) and
8GB RAM. The image construction code is 1.5K lines of OCaml, and the pattern matching
engine is 300 lines of C++. For all experiments, we use a single thread running on a single
core. To remove disk overhead from our measurements and make results more consistent,
the entire trace file on disk is read into memory and payloads are preextracted from pcap
files, although flows are not preassembled.
3.7.3
Automaton Construction: Time & Size
Figure 3.10 shows the automaton construction time for each of the BCS sets divided by
the number of NFA states. This normalization reduces the variation in construction time
due to the underlying complexity of the RegEx set, and allows for easier comparison of
the construction methods. Comparing HASICM and HASICD , we observe that HASICM
is much faster in construction (peaking at 4700 times faster than HASICD for C7p), and
93
has a more consistent construction time per NFA state than HASICD . Comparing DFA and
HASICD , we observe that they take almost the same construction time for each RegEx set.
For B217p, which is too complex to be generated by both HASICD and DFA, it can be
generated by HASICM in 25.2 seconds.
Time(ms)/NFA state
●
DFA
HASICD
HASICM
NFA
●
1000
●
●
10
●
●
●
1/10
B217p
C10
C7p
C8
S24
S31p
S34
Ruleset
Figure 3.10: Construction Time BCS Sets
To evaluate how well HASICM construction scales with the complexity of the input, we
use the Scale RegEx set. We generated DFA and HFA for the first rule, then the first two
rules, then the first three, etc. We stopped generating DFA when the DFA generation time
exceeded 2 minutes, while each of the 77 HFAs were generated in under 2.2 seconds. The
results in Figure 3.11 show that HASICM scales linearly with the number of RegExes while
the DFA has exponential construction cost.
Table 3.12 shows memory image sizes for the various rulesets. The DFA memory images
use 4byte transitions, and the HFA memory images uses 16byte transitions. Although HFA
has much larger memory size per transition than DFA, for complex RegEx sets, the memory
image size of HFA is orders of magnitude smaller than DFA, because HFA has orders of
magnitude fewer states. The memory image size of the HFAs constructed by HASICM is on
average 33% larger than that of HASICD .
94
Construction time(s)
200
DFA
●
HASICM
150
100
50
0
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
0
20
40
60
77
Number of RegEx
Figure 3.11: Construction Time Scale Sequence
Ruleset
NFA
DFA
HASICD
HASICM
B217p
C7p
C8
C10
S24
S31p
S34
Scale
0.5
0.1
0.1
0.1
0.2
0.4
0.3
0.8
250
4
20
10
41
13
608
4
0.7
2
5
9
6
32
108
4
0.8
2
6
16
9
54
Figure 3.12: Memory Image Sizes
95
3.7.4
Packet Processing Throughput
Figure 3.13 shows 6 categories synthetic traces of increasing degree of maliciousness and
their impact on the processing throughput of different automata. Note that malicious traces
cause RegEx matching performance to drop significantly because they cause the automaton
to access a wide variety of its states, which causes the cache hit rate to drop, requiring more
accesses to slower main memory. We observe that HFA throughput is 2.9 to 3.6 times, with
an average of 3.3x, faster than XFA. XFA performance is estimated from the measurements
in [28], which shows that XFA is 75.6/11.6 = 6.5 times slower than DFA. Applying this
proportion to our DFA results gives estimated XFA throughput.
Throughput(Gbps)
●
3
●
DFA
HFA
●
NFA
XFA (est.)
●
●
●
2
1
0
rand
pM = 0.35
pM = 0.55
pM = 0.75
pM = 0.95
Malicious Factor
Figure 3.13: Throughput Synthetic Traces
For real network traces, the results are similar to random traces, as shown in Figure 3.14.
Each marker indicates the throughput of a RegEx engine on a single trace. For all automaton
types, RegEx sets, and traces, the results were nearly identical to random traces because
these traces have a low number of matches on our RegEx sets. The real traces had additional
overhead of switching between flow states as packets of different flows arrived and were
analyzed, but this did not reduce their performance significantly as compared to the random
traces.
96
Throughput(Gbps)
●
3
DFA
HFA
NFA
XFA(Est.)
●
●●
●●
●●
●
●●
●
●
●●
●
●●
●●
●●● ●
●●
●
● ●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●
●
●●
●●
●
●
●●●
●●●
●
●
●●
●
●
●
●●
●
●
●●●
●●
●●
●
●●●
●
● ●● ●●
CDX
LL
RL
●
●●
●
2
1
0
rand
Trace Set
Figure 3.14: Throughput Real Traces
Finally, we evaluate the impact of the HFA optimization technique of ordering transitions based on hit probability of each transition. Figure 3.15 shows the average number of
HFA transitions examined for each input character without and with this optimization for
synthetic traces generated with pM being 0.35 and 0.75, respectively. The results show that
this optimization technique greatly reduces the average number of HFA transitions examined
for each input character and improves packet processing throughput. The optimization is performed using the first 10% of the trace, and then the number of transitions examined on the
rest of the trace is shown. In practice, the average number of HFA transitions examined for
each input character with this optimization on a real packet trace will vary depending the
#Transitions/character
similarity between past traffic and current traffic.
Un−optimized
Optimized
1.4
1.2
1.0
pM = 0.35
pM = 0.55
pM = 0.75
pM = 0.95
Figure 3.15: Transition Order Optimization
97
The hardware implementation of HASIC has throughput dependent on two factors. The
more important factor, and the one that we have control over, is the memory subsystem it
uses. The randomaccess throughput of the memory subsystem determines the number of
transitions per second it can process. The second factor is the number of transitions we need
to examine for an input character. Figure 3.15 shows that we can get the average number
of transitions per character down to less than 1.1 after optimization. Table 3.16 shows the
throughput based on an estimate of 1.1 transitions per character and IBM 32nm eDRAM
technology.
Memory speed
#Read
Ports
Read
Latency
# HASIC
Engines
Throughput
(Gbps)
eDRAM @ 1GHz
eDRAM @ 1GHz
eDRAM @ 1GHz
1
2
4
2
2
2
1
2
4
7.3
14.5
29.1
Figure 3.16: HASIC hardware implementation throughput
3.8
Conclusions
As the core problem of many security and networking applications, RegEx matching has
received much work; however, it has remained an unsolved problem as it is inherently difficult
to achieve high speed with low memory. This work significantly pushes forward the direction
of ASIC friendly RegEx matching with high speed and low memory. Using only a few history
bits, our algorithms are able to achieve DFAlike matching speed with NFAlike memory. Our
algorithms are not only fast in its software implementation but also easy to implement in
ASIC due to the simplicity of the RegEx matching process and memory image.
98
Chapter 4
Firewall Compression
4.1
4.1.1
Introduction
Background and Motivation
Packet classification is the core mechanism that enables many networking devices, such as
routers and firewalls, to perform services such as packet filtering, virtual private networks
(VPNs), network address translation (NAT), quality of service (QoS), load balancing, traffic
accounting and monitoring, differentiated services (Diffserv), etc. A packet classifier is a
function that maps packets to a set of decisions, allowing packets to be classified according
to some criteria. These classifiers are normally written as a sequence of rules where each rule
consists of a predicate that specifies what packets match the rule and a decision for packets
that match the rule. For convenience in specifying rules, more than one predicate is allowed
to match a given packet. In such cases, the decision used comes from the first rule in the
sequence whose predicate matches the packet. Table 4.1 shows a simplified example classifier
with three rules. This packet classifier’s predicates examine 5 fields of the packet, and has
decision set {accept, discard}, as might be used on a firewall. Note that 1.2.0.0/16 denotes
the IP prefix 1.2.*.*, which represents the set of IP addresses from 1.2.0.0 to 1.2.255.255.
The final rule, r3 , is the default rule; it matches all packets, guaranteeing a decision for each
packet.
99
Rule
r1
r2
r3
Source IP Dest. IP Source Port Dest. Port Protocol
Action
1.2.0.0/16 192.168.0.1 [1,65534]
*
*
*
*
*
*
accept
discard
accept
[1,65534]
6881
*
TCP
TCP
*
Table 4.1: An example packet classifier
Packet classification is often the performance bottleneck for Internet routers as they need
to classify every packet. Current generation fiber optic links can operate at over 40 Gb/s, or
12.5 ns per packet for processing. With the explosive growth of Internetbased applications,
efficient packet classification becomes more and more critical. The de facto industry standard
for high speed packet classification uses Ternary Content Addressable Memory (TCAM).
Since 2003, most packet classification devices shipped were TCAMbased [52]. Although a
large body of work has been done on softwarebased packet classification [53], due to its
parallel search capability, TCAM remains the fastest and most scalable solution for packet
classification because it is constant time regardless of the number of rules.
The high speed that TCAM offers for packet classification does not come for free. First,
a TCAM chip consumes a large amount of power and generates lots of heat. This is because
every occupied TCAM entry is tested on every query. The power consumption of a TCAM
chip is about 1.85 Watts per Mb [54], which is roughly 30 times larger than an SRAM
chip of the same size [55]. Second, TCAM chips have large die area on line cards  6 times
(or more) board space than an equivalent capacity SRAM chip [56]. Area efficiency is a
critical issue for networking devices. Third, TCAMs are expensive  often costing more than
network processors [57]. This high price is mainly due to the large die area, not their market
size [56]. Finally, TCAM chips have small capacities. Currently, the largest TCAM chip has
72 megabits (Mb). TCAM chip size has been slow to grow due to their extremely high circuit
density. The TCAM industry has not been able to follow Moore’s law in the past, and it
100
is unlikely to do so in the future. In practice, smaller TCAM chips are commonly used due
to lower power consumption, heat generation, board space, and cost. For example, TCAM
chips are often restricted to at most 10% of an entire board’s power budget; thus, even a 36
Mb TCAM may not be deployable on many routers due to power consumption reasons.
Furthermore, the well known range expansion problem exacerbates the problem of limited
capacity TCAMs. That is, converting rangebased packet classification rules to ternary
format typically results in a much larger number of TCAM entries. In a typical packet
classification rule, the three fields of source and destination IP addresses and protocol type
are specified as prefixes which are easily written as ternary strings. However, the source and
destination port fields are specified in ranges (i.e., integer intervals), which may need to
be expanded to one or more prefixes before being stored in a TCAM. This can lead to a
significant increase in the number of TCAM entries needed to encode a rule. For example, 30
prefixes are needed to represent the single range [1, 65534], so 30 × 30 = 900 TCAM entries
are required to represent the single rule r1 in Table 4.1.
4.1.2
Problem Statement
Formally, a classifier C is a function from binary strings of length w to some decision set D;
that is, C : {0, 1}w → D. While the classifier function may be specified as a combination of
range and ternary predicates on a number of fields, the underlying function takes a binary
string and returns a classification decision. In firewalls, the classifier commonly takes 104
bits of packet header and returns either Accept or Reject. A TCAM classifier T is an ordered
list of rules r1 , r2 , . . . , rn . Each rule ri has a ternary predicate p ∈ {0, 1, ∗}w and a decision
d ∈ D. A ternary predicate p = p0 . . . pw matches a binary string b if all nonstar entries in
101
p match the corresponding entries in b, that is,
w
(pj = bj ∨ pj = ∗).
j=1
The decision of a TCAM classifier T for an input p ∈ {0, 1}w T (p) is the decision of the
first matching rule in T ; that is, TCAMs use firstmatch semantics. A TCAM classifier T
implements a classifier C if T (p) = C(p) for all p ∈ {0, 1}w , that is, if all packets are classified
the same by both.
This paper focuses on the fundamental TCAM Classifier Compression problem: given
a classifier C, construct a minimum size TCAM classifier T that implements C. TCAM
classifier compression helps to address all the aforementioned TCAM limitations  small
sizes, high power consumption, high heat generation, and large die area. Note that even for
the same TCAM chip, storing fewer rules will consume less power and generate less heat
because the unoccupied entries can be disabled in blocks.
4.1.3
Limitations of Prior Art
Prior TCAM Classifier Compression algorithms fall into two categories: list based algorithms
and tree based algorithms. List based algorithms (e.g., Bit Weaving [55], Redundancy Removal [58], Dong’s scheme [59]) take as input a list of TCAM rules and search for optimization
opportunities between rules. These algorithms are sensitive to the representation of their input ruleset which means they can produce very different results for equivalent inputs. Tree
based algorithms (e.g., TCAM Razor [60] and Ternary Razor [55]) convert the input rules
into a decision tree and search for optimization opportunities in the tree. Tree based algorithms typically produce better results because they can find optimization opportunities
102
based on the underlying classifier without being misled by a specific instantiation of that
classifier. A key limitation of prior tree based algorithms is that at each tree level, they only
try to optimize the current dimension and therefore miss some optimization opportunities.
4.1.4
Proposed Approach
In this paper, we propose the Ternary Unification Framework (TUF) for TCAM classifier
compression, which consists of three basic steps. First, TUF converts the given classifier
to a BDD, a binary tree representation that puts all decisions at the leaves. Second, TUF
collapses the BDD, converting leaves into sets of equivalent ternary data structures and
combining these at internal nodes to produce a set of ternary data structures that represent
the classifier. Finally, TUF converts the ternary data structures to TCAM rules and chooses
the smallest as the final result. Broadly, the two decisions that define a specific TUF algorithm are (1) the ternary data structure to represent the intermediate classifiers and (2) the
procedure to combine intermediate classifiers.
TUF advances the state of the art on TCAM classifier compression from two perspectives.
First, it is a general framework, encompassing prior tree based classifier compression algorithms as special cases. Because of the structure that TUF imposes on tree based classifier
compression algorithms, it allows us to understand them better and to easily identify optimization opportunities missed by those algorithms. Second, this framework provides a fresh
look at the TCAM classifier compression problem and allows us to design new algorithms
along this direction.
103
4.1.5
Key Contributions
We make five key contributions in this paper. First, we give a general framework for optimizing ternary classifiers. The framework allows us to find more optimization opportunities and
design new TCAM classifier compression algorithms. More specifically, the choices of which
ternary data structures to use and how to combine them give new flexibility in designing
such algorithms. Second, by designing novel ternary data structures and heuristic combining
procedures, we develop three concrete compression algorithms for three types of classifier
compression. Third, we implemented our algorithms and conducted sidebyside comparison with the prior algorithms on both realworld and synthetic classifiers. The experimental
results show that our new algorithms outperform the best prior algorithms by increasing
amounts as classifier size and complexity increases. In particular, on our largest real life
classifiers, the TUF algorithms improve compression performance over prior algorithms by
an average of 13.7%. Fourth, we give a series of problem variants that we have needed to
solve and give their solutions in terms of TUF. Fifth, we developed tools to allow practical
redundancy removal on ternary classifiers.
4.2
Related Work
Several papers have addressed the problem of optimizing TCAM packet classifiers. Some
major categories of this research include techniques for redundancy removal, compressing
one and twodimensional packet classifiers and techniques for compressing higherdimensional
packet classifiers.
Redundancy removal techniques identify rules within a classifier whose removal does not
change the semantics. Since firewall policy is specified indirectly, redundant rules are com104
monly introduced into real life classifiers. Discovering these redundant rules and removing
them reduces the storage requirements of the classifier in TCAM, as well as potentially simplifying maintenance of the classifier. This technique has been investigated by Liu [61, 62],
using FDD variants to efficiently identify a maximal (not necessarily optimal) set of redundant rules. More recently, Acharya and Gouda [63] have shown a correspondence between
redundancy testing and firewall verification and used this to build a novel redundancy removal algorithm not based on trees that achieves the same compression results as FDDbased
redundancy removal. Redundancy removal is an important component of our algorithms,
but alone it misses many opportunities to combine or rerepresent policy using new rules.
While the problem of efficiently producing minimum prefix classifiers for a onedimensional
classifier has been solved [64,65] optimally, there are still ongoing efforts to produce optimal
prefix classifiers in two or more dimensions. Suri et al. [65] solved the case of optimizing a
twodimensional classifier where any two rules are either disjoint or nested, and Applegate
et al. [66] solved the special case for strip rules where all rules must span one dimension
as well as providing approximation guarantees for the general twodimensional case. These
solutions do not generalize to higher dimensions, so they provide little assistance with typical
fivedimensional classifiers.
Dong et al. [59] proposed the first heuristic fivedimensional prefix classifier minimization
algorithm. Meiners et al. [60] improved upon this in their heuristic TCAM Razor algorithm.
Meiners et al. [55] then developed two heuristic ternary classifier minimization algorithms
Bit Weaving and Ternary Razor. Ternary Razor adds the bit merging algorithm from Bit
Weaving into TCAM Razor. McGeer et al. [67] demonstrated an optimal algorithm for
finding the minimum representation of a given classifier, but this algorithm is impractical
for all but the smallest classifiers due to its exponential runtime. Kogan et al. [68] also
105
target optimal compression by reducing problem to MinDNF for nonoverlapping rulesets or
MinAC0d for overlapping rulesets, but no efficient solutions to these NPcomplete problems
are likely. Rottenstreich et al. [69] attack the optimal encoding problem from a different
perspective, producing optimal encodings of individual rules and then combining these using
a novel TCAM architecture to perform pattern classification.
There are many papers that attack the larger problem of packet classification by using
nonstandard TCAM architectures [70–72] or by reencoding packet fields [73–79]. While this
type of work may have theoretical elegance, the cost of engineering new TCAM architectures
or reencoding makes these works less practical. Algorithms like TUF that work within the
constraints of the standard TCAM architecture have the advantage that they can be deployed
immediately on existing hardware.
4.3
TUF Framework
In this section, we outline our Ternary Unification Framework (TUF), which gives a general
structure for ternary classifier compression algorithms. Section 4.3.1 gives the basic structural
recursion to compress a TCAM classifier. Section 4.3.2 specifies how TUF facilitates efficient
merging of partial solutions in the structural recursion step.
4.3.1
TUF Outline
We first present the basic steps of TUF. TUF takes a TCAM classifier as input and returns
an optimized TCAM classifier as output. Because TCAM classifiers are written as rule lists,
determining a simple property such as whether a TCAM classifier has a decision for every
input is NPcomplete. The first step of TUF is to represent the classifier as a binary tree
106
where every internal node has two children and the decisions are in the leaves, called a BDD.
An example BDD with three leaves is shown in Figure 4.1a. Converting to BDD resolves
the priorities of overlapping classifier rules and gives exact decisions for any input value or
range. Note that the construction is dependent on the order of the bits in the BDD, and
the resulting classifier can be different with different orderings. By considering multiple bit
orderings, typically organized by packet header fields, better compression can be achieved.
The second step of TUF converts the leaves of the BDD into instances of a ternary
data structure. As each leaf represents some collection of input values assigned a single
decision, we can convert it into an equivalent ternary data structure, such as a trie or a
TCAM classifier. This is demonstrated in Figure 4.1b, where the BDD leaves are replaced
by TCAM classifiers. The predicate for each classifier depends on the height of the leaf; in
this case, the bottommost leaves are zerobit classifiers, while the upper c leaf is a onebit
classifier as its height is 1.
The third step, the core of TUF, merges these ternary data structures to form ternary
data structures that encode larger sections of the input space. Figure 4.1c shows the result of
merging the left subtree of Figure 4.1b. It is in the merging process that compression is possible; similarities in the two halves can be identified and a smaller merge result constructed.
After all BDD nodes have been merged, a ternary data structure that represents the entire
classifier is created. If the ternary data structure used is not a TCAM classifier, then it is
converted to a TCAM classifier as the final step. The TCAM classifier for this example is
shown in Figure 4.1d.
TUF can use a number of different ternary data structures such as tries, nested tries (tries
of tries), and TCAM classifiers. To support a particular ternary data structure, TUF requires
that the data structure support two operations: Singleton and LRMerge. Singleton converts
107
(a) Input
(b) Leaves
(c) Merge
(d) Output
Figure 4.1: Structural Recursion Example, converting a BDD to a TCAM Classifier
a BDD leaf to a ternary data structure and LRMerge joins two ternary data structures A and
B into one, A + B. Pseudocode for the TUF Core recursive merging is given in Algorithm 1.
Algorithm 1: TUFCore(c)
Input: A 2tree c representing a classifier
Output: An equivalent ternary data structure
switch c do
case Leaf dec do
return Singleton (dec);
case Node(left,right) do
LeftSol := TUFCore (left);
RightSol := TUFCore (right);
return LRMerge (LeftSol,RightSol);
To support classifier compression, the LRMerge operation should find commonalities between the two halves and use ternary rules to represent any found commonalities only once
in the ternary data structure. This may be a complex operation, spending significant effort
to both find and then efficiently represent such commonalities. We next describe how we can
simplify the task required by LRMerge by tracking backgrounds.
4.3.2
Efficient Solution Merging
The goal of LRMerge is to combine two ternary data structures into one ternary data structure representing both. Using just a single ternary data structure at each step can cause
overspecialization. The minimumsize solution for a small part of the input space is often
108
(a) Concat
(b) LRMerge
(c) LRMerge with EB
Figure 4.2: TUF operations w/ backgrounds
not the best representation for it in the context of the complete classifier. By keeping multiple representations at each step, the right representation of a subtree can be used to create
the final solution. Spending more time to keep extra representations of each subtree allows
smaller ternary classifiers to be constructed. Taking this idea to the logical extreme, an algorithm could keep all combinations of subsolutions every time two solution sets are merged.
This could cause an exponential amount of work to be spent creating and combining them,
leading to an impractical algorithm. The rest of this section explores the use of backgrounds
as a way to limit the number of combinations that are created, allow pruning of useless
solutions from the solution set and improve the speed of merging solutions.
As ternary classifiers can have multiple matching decisions for an input, they have a
precedence structure. For TCAM classifiers, earlier rules in the list have higher precedence
than rules later in the list. Using this relationship, a ternary classifier can be divided into
two classifiers: a foreground of higher precedence and a background of lower precedence.
The operation Concat(A, Z), denoted A, Z , joins a foreground and background ternary
classifier into a single ternary classifier, as shown in Figure 4.2a. Intuitively, the joined
109
classifier searches A first, and if no decision is found, the result from searching Z is used.
If the background is nonempty, the foreground classifier should be an incomplete classifier
so that some inputs do not have a decision. We denote the classifier that has no decision
for any input as ∅ and note that it is the identity element for the Concat operation, e.g.
x, ∅ = x = ∅, x .
F to represent a classifier split into separate foreground, F , and background,
We write B
B, ternary data structures. For each BDD leaf, we will create a set of solutions that encodes
that leaf in different ways. This solution set has two split classifiers, one that encodes the
decision in the foreground, and one that encodes it in the background. The solution set for
a BDD leaf with decision d is
Singleton(d)
∅
,
∅
Singleton(d)
(4.1)
One of these two solutions will be used in the final classifier to represent this part of the
input having decision d. The first solution will be used if the decision d is sufficiently rare in
this subtree that it is best to encode this part of the classifier function as its own rule. The
second solution will be used if d is sufficiently common in this will be common in this subtree,
and that this decision will be encoded as part of a rule with a more general predicate.
TUF maintains the invariant that every solution set will have a solution with an empty
background. Because the empty background solution will be handled differently from the
other pairs, we give it the name EB, for Empty Background. In (4.1), the first solution is
the EB as its background is empty. The EB has the best complete encoding of the classifier
represented by a BDD subtree, while the other solutions are the best encoding that assumes
some background.
110
TUF uses multiple solutions and backgrounds to efficiently merge its two input sets of
solutions into a new set of solutions. TUF will create a new solution for every distinct
background in either input set. For a background found in both input sets, TUF merges the
two associated foregrounds together to make the solution for that background. For ternary
A with B , the result is A+B , as shown in Figure 4.2b.
data structures A, B, and Z, to merge Z
Z
Z
When one child has a solution with a background that the other lacks, TUF substitutes the
EB for the missing foreground. This will produce correct results because the EB is a complete
classifier, so can be placed over any background without changing the semantics, as shown
in Figure 4.2c. An example merge of two solution sets can be written as
A B C
, ,
X Y ∅
+
D E
,
X ∅
=
A+D B+E C +E
,
,
X
Y
∅
.
This is implemented as SetMerge(l, r) in Algorithm 2.
Backgrounds simplify LRMerge’s search for commonalities by allowing LRMerge to focus
on merging sibling ternary data structures that have the same background. The use of
backgrounds also simplifies the merging process by producing the most useful solutions;
instead of trying to merge all pairs of solutions (for merge with m and n solutions on left
and right, O(mn) merges), we instead merge solutions with the same background (O(m + n)
merges). Finally, the use of backgrounds allow a set of solutions to be simplified.
To simplify a set of solutions, TUF incorporates a cost function Cost(C) which returns
the cost of any ternary classifier. Let A be the foreground of the EB in a solution set. For
X
any solution X
Y , if Cost(A) ≤ Cost(X), then TUF removes Y from the set. It is a useful
simplification because substituting A for X in future merges will supply LRMerge with lower
cost inputs, which should produce a lower cost result while maintaining correctness. TUF
111
can also replace the EB by
X,Y
∅
if there is a solution X
Y for which Cost(A) > Cost( X, Y ).
The result of these simplifications is that the EB will have the highest cost foreground of
any solution, and the difference in cost between the EB and any other solution must be less
than that solution’s background cost.
So far, we have treated the classifier as an unstructured string of bits. In actual usage, the
bits being tested are made up of distinct fields, and there is structure to the classifier rules
related to these fields. For example, ACL rulesets often have 5 fields: Protocol, Source IP,
Source Port, Destination IP, and Destination Port. Once we have developed a good ternary
classifier for a section of one field, it is often beneficial to simply store that classifier without
modification as we extend it to consider bits from other fields. To support this, we use a
function Encap(d) that creates a field break by encapsulating the ternary data structure
as a decision of another 1dimensional classifier. The LRMerge operation is not required to
respect this field break, although doing so will reduce the complexity of merging, as it will
have fewer bits to consider. While encapsulating, we also promote the EB to be a background
to make it easy to find as a commonality.
Pseudocode for this enhanced TUF Core is given in Algorithm 3. In it, the pair (X, Y )
is used to represent X
Y.
4.4
Prefix Minimization using Tries
The TUF framework can be used to create a multidimensional prefix classifier compression
algorithm by using tries. Prefix classifiers have prefix predicates where all stars follow all 0’s
and 1’s. TUF will represent multidimensional prefix rules with tries of tries.
112
Algorithm 2: SetMerge (l,r)
Input: solution sets l and r
Output: merged solution set
1 Out = empty solution set;
2 NullLeft = foreground of ∅ in l;
3 NullRight = foreground of ∅ in r;
4 foreach bg in l.backgrounds ∪ r.backgrounds do
5
ForeLeft = foreground of bg in l or NullLeft;
6
ForeRight = foreground of bg in r or NullRight;
7
Out.add(LRMerge (ForeLeft,ForeRight),bg);
8
return Out
In this paper, tries are binary trees with nodes optionally labeled with decisions. As with
BDDs, the binary search key determines a path from the root, taking the left (right) child
if the next bit is 0 (1). The decision of a trie for a search key is the last label found on the
path for that key. Each labeled node corresponds to a prefix rule; the path to it from the
root matches a prefix of the search key, and all other bits are ignored. Tries are commonly
used to represent Internet routing tables where the longest matching prefix determines the
route taken. To handle multidimensional prefix classifiers, the solution is to use tries where
the decision is another trie.
We now describe 1dimensional and multidimensional prefix classifier minimization in
TUF.
4.4.1
1Dimensional Prefix Minimization
Adapting tries into the TUF framework is very natural. The empty classifier, ∅, is a trie
node with no decision. To implement Singleton(d) and produce a classifier where all inputs
have decision d, simply create a trie node with decision d. The Cost(t) of a trie t is the
number of nodes with a decision, which corresponds to the number of TCAM rules needed
113
Algorithm 3: TUFCore (b) with backgrounds
Input: A BDD b
Output: A solution set of (fg,bg) pairs
1 switch b do
2
case Leaf d do
3
return {(Singleton (d),∅),
4
(∅,Singleton (d))};
5
6
7
8
9
10
11
12
13
14
15
16
/* BDD Leaf w/ decision d */
case Node(left,right) do
LeftPairs := TUFCore (left);
RightPairs := TUFCore (right);
MergedPairs := SetMerge (LeftPairs, RightPairs);
Solutions := Concat() to all of MergedPairs;
BestSol := lowest Cost() solution in Solutions;
MergedPairs.removeIf(Cost (x) ≥ Cost (BestSol));
MergedPairs.add(BestSol, ∅);
if at field boundary then
Encap (MergedPairs);
MergedPairs.add(∅, Encap (BestSol));
/* Internal Node */
return MergedPairs;
for that trie. To LRMerge(l,r) two tries, we create a new trie node with no decision and
assign the left child as l and the right child as r. The Concat(f,b) operation only has to
handle the case where the foreground has a root node without decision and the background
is a singleton trie node with a decision. This is because backgrounds are always singleton
tries and because Concat is applied immediately after LRMerge which produces a foreground
trie where the root has no decision. In this situation, Concat just moves the decision of the
background to the root node of the foreground.
Figure 4.3 illustrates the compression of an example classifier into a trie using these
operations. The input classifier assigns decision a to binary input value 01 and b to binary
input values 00, 10 and 11. The BDD representation of this classifier is Figure 4.3a, which
has a leaf labeled a in position 01 (left, right), and leaves with decision b elsewhere. At
each BDD leaf, two solutions are created, one with the decision in the foreground and one
114
b
b
a
(a) Input
BDD
(b) Leaves
converted
(d) Second merge
(c) First merge
(e) Set
optimization
Figure 4.3: TUF Trie compression of simple classifier
with the decision in the background, shown in Figure 4.3b. As backgrounds will always be a
singleton trie, we will show them as the decision of that trie over a shaded background. The
merging step combines solutions that have the same background. In the case where the other
solution set is missing a solution with the same background, a solution is combined with the
EB of the other solution set. We will apply this merging step twice to our example BDD,
as shown in Figures 4.3c and 4.3d. The first merge produces three solutions: the EB, one
solution with a as background, and one solution with b as background. To produce the EB,
the foregrounds of the existing EBs are merged. To produce the solution with background
a or b, the corresponding foreground is merged with the EB of the opposite solution set.
Note that LRMerge is not symmetric, producing differently shaped tries. The second merge
is done similarly, with the two b solutions merged, the two EBs merged, and the a solution
merged with the EB.
After we finish merging two solution sets, we optimize the result. Optimization has no
effect after the first merge in our example, but it does improve the solution set after the
115
second merge. Recall that if any solution has total (foreground + background) cost less than
the EB, then the EB can be replaced by a Concat of that solution. In the example, the total
cost of the solution in Figure 4.3d with background b is 2; the foreground cost is 1 and the
background cost is 1. The EB has a higher cost of 3, which is greater than the total for
b, so we replace the EB. The result of Concat is to put the background decision into the
root node of the trie, as shown in the final solution set, where the EB has decision b in its
root node. Next, recall that if any solution has foreground cost no smaller than the EB, it
is replaceable by the EB and can be removed. In this case, we removed the solution with
background a because its foreground cost of 2 is the same cost as that of the new EB. As
we have finished reducing our BDD to a single solution set, the result of compressing this
classifier is that newly created EB. The final result is shown in Figure 4.3e.
4.4.2
Multidimensional prefix minimization
To represent a multidimensional prefix classifier, tries of tries are the natural solution. In
a trie of tries, each decision of an ndimensional trie is an (n − 1)dimensional trie. The
lowest level 1dimensional tries have the final decisions of the classifier in them. Tries of
tries are turned into decisions for the next level trie at field boundaries using an Encap
function which is run on both the foreground and background classifiers. In this case, Encap
simply takes the existing classifier and sets it as the decision of a singleton trie, producing a
(n + 1)dimensional trie from an ndimensional trie. For the case of an empty trie such as the
background of the EB, Encap returns an empty trie. This is analagous to how leaf solution
sets are created for tries.
The result of encapsulating the solution set in Figure 4.3e is shown in Figure 4.4. The two
existing solutions have been encapsulated as decisions of 2dimensional tries, which is shown
116
Figure 4.4: Encapsulation at a field boundary
with the existing trie inside a new, large trie node. For the second solution, the background
is just an empty background as encapsulation of the empty background is a null operation.
One new solution is added at this step (the rightmost solution in Figure 4.4) where the EB
is set as a background decision to an empty foreground.
The number of solutions at any stage is bounded by the number of different backgrounds
that can be possible. In the 1dimensional case, the bound is simply D, the number of
classification decisions plus one for the EB. Once the input is 2dimensional (or higher), the
encapsulation operation to cross a field boundary creates backgrounds out of the solution
tries that result from compressing the child (already compressed) dimensions at that point.
Worst case, this might be O(D2wtail ), where wtail is the number of bits classified by those
dimensions. This bound is unlikely, as the filtering and promotion of solutions to BestSol
places bounds on the costs of these solutions relative to the best solution. When the BestSol
has cost c, solutions with foreground cost at least c will be eliminated. Intuitively, if the
backgrounds have high cost, more complicated solutions (foreground + background cost)
will be tolerated by the algorithm as they may combine with other solutions with same
background to produce a better global solution.
4.5
Ternary Minimization using Terns
Additional compression can be achieved by exploiting the full ternary nature of TCAMbased classifiers rather than limiting ourselves to prefix classifiers. To facilitate this goal, we
117
develop a novel data structure called a ternary trie or tern that overcomes this restriction
and allows us to represent any TCAM rule in tree form. With the additional capability of
terns, we extend the LRMerge operation to find additional commonalities between its inputs
and compress them further in its output.
TCAM rulesets allow ternary rules, where each bit of the rule can be a 0, 1 or *. Tries
are capable of representing prefix rules, with leading 0 and 1 bits represented by the path to
a node with the decision and trailing stars implied by the height of the node. With terns, we
change the structure to explicitly allow * bits along the path to a decision. Instead of having
only a left and right child, corresponding to 0 and 1 bits, the nodes of a tern have three
children: 0, 1 and *. This allows us to represent any TCAM rule, so we can attain higher
levels of compression. Details of converting a tern into a TCAM ruleset are covered later in
this section.
With terns, we update our LRMerge algorithm to better identify and merge commonalities.
Our updated LRMerge’s input is two terns representing the left and right children of the new
tern we will create. The output is a single tern with the commonalities from the inputs
moved into the * child. Logically, given two terns l and r (for “left” and “right”), we want
to produce a new “star” tern s = l ∩ r.
This leads to a simple LRMerge algorithm for tries which simply traverses the two inputs
in parallel, looking for the same decision in the same position. When one is found, that
decision is moved from the two side terns to the star tern. The most interesting thing about
this algorithm is how the two results are folded back together. After merging the left children,
we get a single tern node, but the children of that node will not end up as siblings in the
final result, but as cousins. That is, they will become the left child of the left, right and
118
star children of their grandparent node, in positions 00, *0 and 10 relative to the root we’re
compressing. Pseudocode for this merging step is shown as Algorithms 4 and 5.
Algorithm 4: ReChild
Input: Three ternary tries, l, s and r, all nonnull
Output: One ternary trie with children striped from the input
1 left := Tern(l.l, s.l, r.l);
2 star := Tern(l.s, s.s, r.s);
3 right := Tern(l.r, s.r, r.r);
4 return Tern(left, star, right);
Algorithm 5: LRMerge for Terns
Input: Two ternary tries, l for left and r for right
Output: A ternary trie with children l , s, r , where s = l ∩ r, l = l − s, r = r − s
1 if isEmpty (l) or isEmpty (r) then
2
return Tern(l,empty,r);
3 else
4
lefts := LRMerge (l.l, r.l);
5
stars := LRMerge (l.s, r.s);
6
rights := LRMerge (l.r, r.r);
7
out := ReChild (lefts, stars, rights);
8
if l.dec == r.dec then
9
out.s.dec = l.dec;
10
else
11
out.l.dec = l.dec;
12
out.r.dec = r.dec;
13
return out;
To illustrate the power of using terns as compared to tries, consider the example classifier
consisting of three rules shown in Figure 4.5a. The equivalent trie for this classifier, shown in
Figure 4.5b, has three labeled nodes which means that no merging of rules was found. Thus,
the output classifier will still have the same three rules. On the other hand, using TUF with
terns, we produce the tern shown in Figure 4.5c. It has merged the two rules with decision a
into a single node that corresponds to the rule *11:a. The resulting classifier will have two
119
Pred.
Decision
011
111
1**
a
a
d
(a) Rules
d
1
1
a
1
*
1
0
1
1
1
a
(b) Trie Classifier
d
1
a
(c) Tern Classifier
Figure 4.5: Example Tern compression
rules. Such compression is impossible in tries because it tries can only produce prefix rules,
and this tern is encoding a nonprefix rule.
To use a compressed tern in a TCAM, we must convert it to a TCAM classifier. As with
tries, each labeled node becomes a rule with the path to the node indicating the predicate
and the label of the node becoming the decision of the rule. It remains to order the rules
to achieve the correct priority. For a trie, doing a postorder traversal will emit rules in the
correct order, as the only consideration is that a parent be emitted after its children. For
a tern, there are ordering requirements between rules in different subtrees, so we must use
a different strategy. The * branch of the tern must have lower priority then the 0 and 1
branches (otherwise it will override decisions there), but nodes deeper in the tree should
have priority over shallower nodes. In our example, a postorder traversal would emit the
rule 1**:d first, because at the root, the 1 subtree is generated before the * subtree. This
results in the input 111 having decision d, which is not the semantics of our input.
To convert a tern into a TCAM classifier, we instead use a bottomtotop, levelorder
traversal of the tree. This results in the combination of two decisions with height h still
shadowing decisions higher in the tree. Correctness follows because for terns generated in
TUF, all rules within the same level are nonoverlapping. This is because the rules are
produced from nonoverlapping BDD prefixes; furthermore, our tern merging operation will
120
not produce overlapping prefixes. Thus, the order of rules within a level does not change the
semantics of the resulting ruleset.
4.6
Ternary Minimization using ACLs
In this section, we use TCAM classifiers as the ternary data structure in the TUF algorithm.
Using tree structures makes the LRMerge operation very natural to represent, but limits the
variety of TCAM classifiers that can be produced. Using TCAM classifiers as the ternary data
structure makes the Concat operation trivial, but has additional complexity in implementing
a compressing LRMerge operation.
Merging TCAM classifiers without factoring out any commonalities can be done by prefixing all rules in the left input by 0 and those from the right input by 1 and concatenating
them: A + B = 0A, 1B . As there is no overlap between the two groups of rules, the order
of concatenation doesn’t matter, and the two prefixed inputs can be interleaved arbitrarily.
Factoring out commonalities is superficially easy; find a rule that appears in both inputs and
put a copy of that rule prefixed by * into the result instead of the two rules prefixed by 0
and 1. This simple method does not take into account the ordering of rules in the inputs, so
it produces incorrect results. A rule that appears in both inputs may be needed to shadow
other rules. We must take this into account when merging.
To preserve correctness in all cases, we must ensure that rules combined in this way
stay in the same order relative to other rules in both inputs. Graphically, this is illustrated
in Figure 4.6. This leads to a recursive procedure to merge two TCAM classifiers. After
identifying a common rule, split both ACLs into the part before the common rule and the
part after the common rule, called the tops and bottoms, respectively. Next, merge the two
121
0 Top1
0 Top1 + 1 Top2
1 Top2
0 Rule + 1 Rule
0 Bot.1
1 Bot.2
* Rule
0 Bot.1 + 1 Bot.2
Figure 4.6: Recursive merging left and right ACLs
A
B
C
D
E
D
F
C
A
B
(a) All pairs
A
B
C
D
E
D
F
C
A
B
A
B
C
D
E
(b) Invalid
pairing
D
F
C
A
B
(c) Maximal
valid pairing
A
B
C
D
E
D
F
C
A
B
(d) Maximum
valid pairing
Figure 4.7: ACL pairing example
tops and the two bottoms, recursively and join the pieces back together. We can write this
algebraically as
T1 , x, B1 + T2 , x, B2 = (T1 + T2 ), ∗x, (B1 + B2 ) .
Given two rulesets, we can maximize the number of rules merged by examining the pattern
of which rules could be merged. Figure 4.7a shows an abstracted pair of ACLs, with letters
representing rules. Each pair of rules that can be merged is connected by a line. Two pairs
of rules conflict if after merging one pair, the other pair cannot be merged. Graphically, two
pairs of rules conflict if their corresponding lines intersect. We define a pairing to be a subset
of the pairs of rules that can be merged without conflict. Figure 4.7b shows an invalid pairing,
as splitting the rules for one pair prevents the other pair from merging. A maximal pairing
is a valid pairing in which no pairs can be added without it becoming invalid. Figure 4.7c
shows a maximal pairing; when we split the rulesets into tops and bottoms, we can see there
are no further pairings. A maximum pairing is a pairing with the property that no other
122
pairing has more pairs. Figure 4.7d shows a maximum pairing; there is no larger set of pairs
that has no conflict.
The problem of finding a maximum pairing can be reduced to the maximum common
subsequence problem [80]. This problem is NPcomplete for an arbitrary number of input
sequences, but has polynomialtime solutions for the case of two input sequences. In our experiments on the reallife rulesets used in Section 4.9, we observe that there is little difference
in the number of pairings identified between the optimal solution and our greedy solution.
4.7
Revisiting prior schemes in TUF
TUF provides a new perspective for classifier compression that leads to new opportunities
for compression and more efficient implementations. We illustrate this feature by studying
previously developed onedimensional and multidimensional prefix classifier minimization
algorithms from the perspective of TUF. Specifically, we examine an existing algorithm for
onedimensional prefix classifier minimization [65] and a nonoptimal but effective algorithm
for multidimensional prefix classifier minimization [60]. Both Suri et al.’s onedimensional
algorithm and Meiners et al.’s multidimensional algorithms can be viewed as instantiations of the TUF framework. Furthermore, when viewed within TUF, we immediately find
improvements to both algorithms.
Suri et al. first build a BDD out of the input trie, then apply a union/intersection operation to label interior nodes with sets of decisions, and finally traverse the tree once more
from the root to give each tree node its final decision. The TUF Trie algorithm presented
in Section 4.4 follows a very similar structure. The set of solutions generated at each step
follows the same pattern of union/intersection. Because of the simple background structure,
123
all foregrounds are always equal cost, except for the EB, which has a cost that is greater by
one. When the children of an internal node have no matching backgrounds, all the merge
results will have the same cost and will be preserved for the next step; this corresponds to
the union case. When there are matching backgrounds with the same decision, the resulting
solution will replace the EB and eliminate all nonpaired solutions; this corresponds to the
intersection case.
There is one important difference between the algorithms which is how they manage the
tree. The existing algorithms are inplace algorithms that modify an existing tree in multiple
passes. This requires the whole BDD to be generated and kept in memory. The TUF Trie
algorithm does a reduction over the structure of the BDD, but does not require it all to be
generated, or in memory at once. The BDD can be lazily generated, and only a few solution
sets need be in memory at once. In this way, TUF Trie can be more efficient than existing
algorithms for very large classifiers.
We next consider the multidimensional TCAM Razor algorithm developed by Meiners
et al.. TCAM Razor builds a multidimensional algorithm from Suri et al.’s onedimensional
algorithm. It first builds an FDD from the input classifier and compresses the 1D classifiers
at the leaves using a weighted version of 1D compression. It then treats the results of this
compression as a single decision with weight equal to the number of rules in the result,
effectively reducing the dimension by one. By repeating this process until all the dimensions
of the FDD have been compressed, Razor produces a single classifier as output.
Looking at TCAM Razor from the perspective of TUF, we identify ways to improve
compression speed. Razor’s weighted onedimensional compression keeps a full solution for
each decision. As the last dimension is compressed, its decisions are compressed tries for
the other dimensions. There may be tens or hundreds of these, and it is wasteful to do
124
Figure 4.8: Razor hoisting the null solution as a decision
merges for all these solutions, when few of them will be used. A TUFbased solution can
remove solutions that are strictly inferior and start with a small set of solutions at the leaves
to greatly decrease the amount of work done. These changes give an average of 20x speed
improvement on the classifiers that take more than 1/4 second to compress.
The compression level achievable by TCAM Razor can also be improved. Looking at both
algorithms from the perspective of solution sets, we see that when Razor finishes compressing
a field, it keeps only a best solution (the EB). When the next level is processed, only this
one solution is used, and it is treated as an opaque value. The resulting transformation
is illustrated in Figure 4.8, which can be compared with Figure 4.4. The classifier with
background b is discarded, and the EB is shown encapsulated on the far right and promoted
as a new background in the middle.
The Multidimensional TUF Trie improves compression by encapsulating more solutions
at field boundaries which can lead to the discovery of more potential similarities. Figures 4.9a
and 4.9b give example outputs of TCAM Razor and the simple multidimensional prefix TUF
algorithm described in section 4.4 for a 2dimensional prefix compression problem. TCAM
Razor treats the rulesets from already compressed fields as opaque values, so once the last
field is processed, the results have no flexibility. Because of TUF’s ability to keep multiple
possible solutions for already compressed fields, it is able to apply the default rule across the
field boundary, resulting in better compression.
125
F1
F2
Dec.
0*
0*
1*
10*
***
***
:a
:d
:d
F2
0*
1*
X (cost 2)
Y (cost 1)
F2
Dec.
0*
**
10*
***
:a
:d
(b) Result of TUF
Trie
(a) Result of TCAM
Razor
F1
F1
Dec.
(c) Razor Midpoint
F1
F2 + Dec.
0*
1*
10*:a, Bg: d
Bg: d
(d) TUF Midpoint
Figure 4.9: TUF and Razor comparison on same input
The critical step in Razor compression is shown in Figure 4.9c, where sections of field 2
have been compressed into rulesets X and Y (with cost shown), and Razor is compressing
field 1. At this point, Razor only knows that X and Y are different, so it must represent them
separately in the compression of the first dimension. Only when X and Y are expanded to
reconstruct the final TCAM classifier is the similarity revealed. Most of the time that this
happens, the final redundancyremoval pass eliminates these redundancies, but in this case,
there are no redundant rules, only rules that should have been merged. On the other hand,
Figure 4.9d shows part of TUF’s view of the problem at this point. The partial solutions for
X and Y have a common background, so they can be merged into a simpler solution
4.8
Ternary Redundancy Removal
The final step in TUF is to remove any redundant rules from the resulting TCAM classifier.
Existing algorithms for redundancy removal [61–63] are designed for range rules, taking advantage of the low dimensionality of most classifiers to efficiently identify redundant rules.
These algorithms can be applied to fully ternary classifiers, but the runtime explodes as
126
mapping a ternary classifier into the lower dimension space of range rules produces an exponential number of nonadjacent regions for a single rule. As a running example, we consider
the problem of compressing the following abstract classifier in Table 4.2 The rule r1 expands
to over 2 billion range rules because the first field matches every even number between 0 and
232 . Here, we explore two ways to perform redundancy removal on ternary classifiers that
allow practical removal of redundant rules.
Rule
Field 1
Field 2
Action
r1
r2
r3
*******************************0
11111111111111*11111111111111110
********************************
0*******
1*******
********
accept
discard
accept
Table 4.2: A classifier equivalent to over 2B range rules
The first way to mitigate this expansion problem is to map the rules to a higher dimensional range format. A 32bit field can be replaced by four 8bit fields, where the concatenation of the smaller fields equals the larger field. Table 4.3 shows an equivalent classifier with
5 fields that expands to only 131 range rules. In this example, we can see that replacing a
wide field by multiple smaller fields can greatly reduce the number of ranges that this rule
expands to. This is because only the field containing the 0bit needs to be expanded into
multiple ranges. Care must be taken not to split fields too finely, as the running time of
redundancy removal algorithms is usually exponential in the dimension of the classifier.
Rule
Field 1.1
Field 1.2
Field 1.3
Field 1.4
Field 2
Action
r1
r2
r3
********
11111111
********
********
111111*1
********
********
11111111
********
*******0
11111110
********
0*******
1*******
********
accept
discard
accept
Table 4.3: An equivalent classifier equivalent to 131 range rules
Another way to mitigate the expansion from ternary to range rules is by reordering the
bits within a field. Permuting the order of the input bits does not change which rules are
127
redundant, but can simplify conversion to range rules. Table 4.4 shows our original classifier
with bits permuted to minimize the number of ranges. Rearranging bits to maximize the
number of ‘*’ bits on the right edge of the field gives a classifier that converts to only 3 range
rules. To maintain correctness, we must permute the predicate bits of all rules identically,
and this prevents us from always converting classifiers to prefix format in this way. The
permutation that converts one rule to prefix form may greatly increase the expansion cost
for another rule. A simple heuristic ordering is to sort the ternary bit positions based on
how many rules have a ‘*’ in that position. Putting ‘*’s on the right edge of a field tends to
reduce the number of ranges represented by a rule.
Rule
Field 1
Field 2
Action
r1
r2
r3
0*******************************
0111111111111111111111111111111*
********************************
0*******
1*******
********
accept
discard
accept
Table 4.4: An equivalent classifier equivalent to 3 range rules
In Section 4.9, we present experimental results using these transformations to reduce the
complexity of redundancy removal.
4.9
4.9.1
Experimental Results
Evaluation Metrics
The critical metric of a classifier compression algorithm is the number of rules in the output
TCAM classifier. As the input classifiers are in range form, they may contain rules that
must be rewritten to be stored in TCAM. When computing compression ratios, we compare
against the result of direct range expansion. This means we replace each nonternary rule
with a collection of prefix rules that compose the same predicate. We denote the result of
128
this process Direct(C) for a classifier C. We denote the size of the result of running algorithm
A on classifier C as A(C). For example, Razor(C) is the number of rules after running
TCAM Razor on a classifier C. Then, the compression ratio for an algorithm A on a classifier
C is
CR(A, C) =
A(C)
.
Direct(C)
A smaller compression ratio indicates that the algorithm produced a smaller output and thus
needs less TCAM space. To measure an algorithm’s performance on a set of classifiers, we
use Average Compression Ratio (ACR). For a set of classifiers S, the ACR of algorithm A
is the mean compression ratio across those classifiers;
ACR(A, S) =
1
S
CR(A, C).
C∈S
We evaluate how much TUF advances TCAM classifier compression using the following
improvement metric. First, for any classifier C, we define CRprior (C) to be the best possible
compression ratio for C using any algorithm excluding TUF algorithms. We then define
CRnew (C) to be the best possible compression ratio for C using any algorithm including
TUF algorithms. In both cases, we use the best possible field permutation order for each
CRnew
algorithm for the given classifier. We define the percent improvement of TUF as 1 − CR
.
prior
In this case, a higher Improvement percentage means that TUF performs better and saves
more TCAM space.
129
4.9.2
Results on realworld classifiers
We test these algorithms on a collection of reallife classifiers in three categories. The categories are based on the number of nonredundant rules and difficulty converting the rules to
ternary format. Table 4.5 gives a breakdown and statistics on these categories.
Cat.
Small
Large
Med.
Count
Avg #
NonRed.
Avg. #
Prefix Exp.
13
8
17
9
3221
209
1578
7525
641
Table 4.5: Classifier categories
Classifiers with an expansion ratio over 20 are categorized as Small RL. These 13 classifiers
have an average of 9 nonredundant rules each, yet their prefix expansions have 1600 ternary
rules. The remaining classifiers with more than 800 nonredundant rules are categorized
as Large RL. The remaining 17 classifiers have neither extreme expansion ratios nor are
extremely large, so we categorize them as Medium RL.
Razor Compression
10000
1000
Size
●
Small
Med.
100
●
●
●
●
●
10
●
Large
●
●
●●
●
●
10
100
1000
10000
Prefix Expansion+RR
Figure 4.10: Razor vs. Redundancy Removal, for classifier grouping purposes
130
40
40
●
% Improvement
% Improvement
●
30
20
10
●
30
●
20
●
●
●
10
0
Small
●●
●●
●●
Medium
●●
●● ●
0
●
●
●
●●
●●●●●●●●●●●●●●●●●●●●● ● ●●
● ●●
●
●●
●
●
●●●●●●●●●●● ●●●●●●●●
Large
Small
Medium
Ruleset
Large
Ruleset
(a) Prefix Compression
(b) Ternary Compression
% Improvement
40
30
●
20
●
●
10
●●●●
●
●●
●
●
●●
●
●● ●
0
●
●
●
●
●●●●●●●●●●●● ●
●●
Small
Medium
Large
Ruleset
(c) Ternary Compression, Unique decisions
Figure 4.11: Improvement of TUF over the state of the art for real life classifiers
These groupings are visually distinguishable by plotting Direct(RR(C)), the prefix
expansion of nonredundant rules, against the compressed size using TCAM Razor, as shown
in Figure 4.10.
To test the sensitivity of algorithms to the number of decisions, we also compressed a
variant of each of these classifiers. These variants are modified to have a unique decision for
each rule. One practical reason for considering unique decisions is that we may need to know
which rule matched; for example, we may be tracking hit counts for each rule.
131
Algorithm
TUF Trie
Orig.
TUF ACL
TUF Trie
Uniq.
TUF ACL
All
26.2
22.8
45.3
43.3
Large Med. Small
%
30.8 41.8
0.8
22.8 38.4
0.7
56.9 70.2
2.4
50.9 68.8
2.1
Table 4.6: ACR on real world classifiers
Table 4.6 shows the results of compressing the reallife classifiers and their unique variants
with TUF Trie and TUF ACL. Both variants are very effective at compressing classifiers,
but TUF ACL does outperform TUF Trie by roughly 13% on all classifiers and 26% on the
Large classifiers where there are more compression opportunities to be exploited by a full
ternary compression algorithm.
4.9.2.1
Sensitivity to number of unique decisions
When the input classifier has a unique decision for each rule, less compression is possible
because there is less potential to apply a background that applies to multiple rules. As a
result, TUF ACL and TUF Trie both have reduced compression performance, needing two
to three times as many TCAM rules to represent the classifiers. TUF ACL still outperforms
TUF Trie but by a smaller amount, roughly 4.5% on all classifiers and 10.5% on Large
classifiers.
4.9.2.2
Comparison with stateoftheart results
We present a direct comparison between TUF algorithms and the previous best algorithms
in Figure 4.11. For prefix compression, Cprior (C) uses only TCAM Razor [60] and Cnew (C)
uses the best of TCAM Razor and TUF Trie. For ternary compression, Cprior (C) uses
the best of Ternary Razor and BitWeaving [55] and Cnew (C) uses the best of Ternary
Razor, BitWeaving, and TUF ACL. Each graph shows the percent improvement of the TUF
132
algorithm over the comparable state of the art for each of our reallife classifiers. The xaxis
of each graph is broken into three parts corresponding to the Small, Medium and Large
classifiers. Within each group, classifiers are sorted in order of increasing improvement from
left to right.
We first consider prefix compression. In Figure 4.11a, we can see that adding TUF Trie
improves performance by an average of 1.9 % on all classifiers. The improvement is small but
does increase as we move from Small to Medium to Large classifiers from 0% to 2.6% to 3.0%.
Furthermore, while the improvement is generally small, the percentage of classifiers where
adding TUF Trie improves performance increases as we move to larger classifiers. Adding
TUF Trie improves performance on 0 of the 13 Small classifiers (0%), 8 of the 17 Medium
classifiers (47%), and 7 of the 8 Large Classifiers (87.5%). There is one notable outlier where
TUF trie outperforms TCAM Razor by 34%.
We next consider ternary compression. In Figure 4.11b, we see how much adding TUF
ACL improves compression over using only Ternary Razor and BitWeaving on our set of real
life classifiers. We see that the improvement is greater than for prefix compression. Specifically, adding TUF ACL improves performance by an average of 5.4 % on all classifiers. As
with prefix compression, the improvement does increase as we move from Small to Medium
to Large classifiers from 0.6% to 4.9% to 13.7%. As with prefix compression, the number of
classifiers where adding TUF ACL improves performance increases as we move to larger classifiers. Specifically, adding TUF ACL improves performance on 1 of the 13 Small classifiers
(7.7%), 11 of the 17 Medium classifiers (64.7%), and 8 of the 8 Large Classifiers (100%).
For prefix compression with unique decisions, TUF Trie offers almost no improvement
over the state of the art, giving a maximum of 1.7% improvement and only improving 3 of
the classifiers. We omit the plot for this uninteresting result.
133
For ternary compression with unique decisions, from Figure 4.11c, we see that TUF
ACL improves performance but the pattern of improvement is quite different. Adding TUF
ACL improves performance by an average of 6.2% on all classifiers with unique decisions,
but now the best performance is on the Small Classifiers followed by the Large classifiers,
with almost no improvement for the Medium classifiers. As we move from Small to Medium
to Large classifiers, the average improvement goes from 11.4% to 1.8% to 10.3%. For the
Medium classifiers, many of them are already in prefix form, so with unique decisions, the only
optimization possible is removing redundant rules. The remainder are nearly in prefix form, so
there is little opportunity for compression not already found by prior algorithms, although we
still find some. For the Small classifiers, TUF ACL achieves improved performance by better
ordering backgrounds to minimize the effects of prefix expansion. For the Large classifiers,
TUF ACL is still better able to find global commonalities than previous algorithms.
TUF Tern
10 9 14 13 13 13 12 8 12 5 7 21
Ternary Razor 10 9 14 13 14 13 12 8 12 5 7 21
Table 4.7: Small Classifiers Compressed # of Rules
TUF Tern
92 331 89
Ternary Razor 138 331 91
34
34
9 250 135 100 269
9 271 138 95 270
TUF Tern
110 518 20 140 12
Ternary Razor 113 520 20 140 14
6
6
33 108
33 108
Table 4.8: Medium Classifiers Compressed # of Rules
TUF Tern
723 1733 1189 1079 2253 1833 4227 2624
Ternary Razor 724 1784 1312 1049 2184 1868 4307 2770
Table 4.9: Large Classifiers Compressed # of Rules
The compression results of TUF using the ternary trie data structure are compared with
Ternary Razor in Tables 4.74.9. For small classifiers, TUF Tern performs almost identically
134
to Ternary razor, producing output classifiers with the same number of rules. For medium
classifiers, TUF Tern generally outperforms Ternary Razor, with up to 33% fewer rules to
represent the same classifier. For large classifiers, TUF Tern performs similarly to Ternary
Razor, having up to 9% fewer rules, but also sometimes needing 3% more rules. TUF Tern
rarely outperforms TUF ACL, producing an average of 2% and a worst case of 21% more
rules.
4.9.3
Efficiency
For prefix compression, TUFbased TCAM Razor is much faster than the original TCAM
Razor. For small classifiers, TUFbased Razor can be more than 1000 times faster than the
original TCAM razor. For larger classifiers, the speed difference is an average of twenty times
faster, achieving the exact same level of compression in much less time.
Ternary compression algorithms take longer than prefix algorithms, as their search for
commonalities throughout the classifier is more difficult. On a desktop PC with an AMD
3GHz Phenom CPU, these algorithms usually complete in under one minute. For some of
our classifiers and some of our algorithms and some field permutations, the running time
occasionally exceeds one minute; fortunately, these cases can be ignored as they almost always
result in poor compression. That is, we can set a reasonable limit such as five minutes for
running a compression algorithm and terminate the algorithm and discard the result for the
given field permutation if it exceeds the given time limit. Furthermore, for many applications,
a slow worst case compression time is acceptable as updates are performed offline.
Figure 4.12 shows a time comparison for TUFbased TCAM Razor on complete classifiers
and on incomplete classifiers. For these tests, we took the same collection of classifiers and
removed the last rule from each, making them into incomplete classifiers. We then compared
135
Normalized runtime
100
●
●
●●
●
10
●●
●
●
●●
1
●
●
●●●●
●●●●●
Small
●●●●●●
Medium
●
●●
●●●●●
●
Large
Ruleset
Figure 4.12: Incomplete compression time comparison
the time it takes to run Razor on the original classifiers with that of running the incomplete
variant of TCAM razor on the incomplete classifiers. Each point in Figure 4.12 represents the
running time of the incomplete Razor divided by the running time of complete Razor on the
same classifier. For small classifiers, incomplete compression can take many times the time of
complete classifier compression, as the number of rules used to represent a partial classifier
can greatly increase as prefix expansion in one dimension is compounded by prefix expansion
of other rules in the next dimension. For medium and large classifiers, the compression time
is similar, with incomplete Razor having a smaller runtime, due to having fewer possible
solutions to merge at each step of compression.
4.9.4
Ternary Redundancy Removal
To evaluate the variety of redundancy removal algorithms, we tested a variety of algorithms on a variety of classifiers. The algorithms tested are Allmatch redundancy removal
(“Allmatch”), and 4 variants of completenessbased Projection/Division redundancy removal (PD). The three variants of PD (PD, PDw, PDwo) use a lazy FDD [81] construction/traversal to identify completeness. The difference between these variants is the transfor136
mation applied to their field structure. PD has no transformations applied to it; PDw splits
wide fields larger than 16 bits to 16bit units, and PDwo does both wide field splitting and
bit ordering by stars. We chose 16bit width as a reasonable compromise between creating
too many fields and limiting the number of ranges needed to represent a ternary field. The
classifiers we use as input are the raw results of TUF ACL, before its final pass of redundancy
removal. We use these classifiers as they are representative of the classifiers that need this
kind of ternary compression.
Cat. Rules
Med.
Med.
Med.
Med.
Med.
Med.
Med.
Med.
Med.
Large
Large
Large
Large
Large
Large
Large
Range
Allmatch
2660
3293
323
378
339
343
924
1063
442
486
200
219
2268 3.8 ∗ 109
28
33
155
180
1728 122236
1505 1541792
2444 1542837
2159
4136
2257
2649
5344
8590
3795
4780
0.82
0.04
0.01
0.03
0.02
0.01
115.18
10.00
0.07
11.16
86.97
timeout
0.09
0.09
2.38
0.25
PD PDw PDwo
0.36
0.01
0.01
0.06
0.02
0.01
0.26
0.00
0.01
80.87
0.11
0.28
0.24
0.28
1.20
0.66
0.38
0.02
0.01
0.06
0.02
0.01
0.29
0.00
0.01
2.04
0.13
0.31
0.27
0.29
1.39
0.72
0.42
0.02
0.02
0.08
0.03
0.01
0.35
0.00
0.02
16.70
0.16
0.35
0.29
0.35
1.40
0.75
Table 4.10: Runtimes in seconds of Red. Rem. algorithms
Table 4.10 gives runtimes of those classifiers that took the Allmatch redundancy removal
algorithm more than 1/100s to process. It also gives the category of the original classifier, the
number of ternary rules in the redundant classifier and the number of range rules that these
are equivalent to. We note that the runtime of Allmatch is generally good, but there are a
few outliers that take inordinate amounts of time to process, notably the Med. 2268, Large
1505 and Large 2444 classifiers. Switching to the basic Projection/divisionbased algorithm
137
(PD) greatly reduces the time taken for Med. 2268, Large 1505 and Large 2444, but greatly
increases the time taken for Large 1728. The classifier transformations of splitting fields
reduces runtime significantly, and reordering bits does and reordering bits does not seem to
improve results significantly beyond the splitting baseline.
4.10
Problem Variants
There are a variety of problems very similar to the original problem where TUF leads to a very
natural solution. Recall that the basic problem was to represent a classifier C : {0, 1}w → D
as a TCAM classifier. This problem statement implies that C is a proper function, assigning
a decision to every input. It is often the case that we don’t care what the decision of the
classifier is on some inputs. Alternately, we may need the classifier to not return a result for
some inputs, i.e. for the classifier function to be partial. Finally, we may have the situation
for which some inputs can be left without a decision, but if they are given a decision, it must
be some particular d ∈ D. We will discuss how to solve problems that have these components
by using the TUF framework.
Don’t care: To compress a classifier for which some inputs can be assigned any decision,
we modify the leaf construction of TUF for the affected inputs to produce a solution set that
represents that any input can be assigned. One way to solve this is by adding a new decision
to denote these regions and compressing as normal, then removing all rules with that new
decision. It can also be done without the postfilter by changing how TUF creates leaves. To
convert a “don’t care” BDD leaf into a solution set, a EB with empty foreground is used,
giving to the solution set
∅
∅
. This solution set can be processed as normal, with the EB
combining with any background despite its foreground being incomplete. This produces a
138
small TCAM classifier with a decision for each “don’t care” input chosen to minimize the
overall classifier size.
Incomplete: The second variant to explore is compressing incomplete classifiers, that
is, classifiers that don’t have a decision for some inputs. The obvious approach of treating
“incomplete” as a decision, compressing the classifier as normal, and then removing rules
with “incomplete” decisions does not work. The rules with “incomplete” decision may be
shadowing other rules, and when those other rules are exposed, their decision will be used,
giving an incorrect decision. By adjusting TUF in a manner similar to how “don’t care” is
handled, a correct solution can be achieved. The solution set constructed for a leaf representing an “incomplete” decision is constructed to indicate that no decision can be given:
∅
Inc.
. In this solution set, Inc. is a new decision that is special only in that it must remain
in the background. When merging a solution set for a complete subtree with the solution set
for an incomplete subtree, the incomplete solution can only merge with the EB solution:
A B C
, ,
X Y ∅
+
D
Inc.
=
C +D
.
Inc. I
This eliminates other solutions from the other set, as they have nothing to merge with.
Merging two incomplete solutions proceeds as normal. In this way, the incompleteness of the
solution propagates while still allowing LRMerge to simplify the foregrounds.
Another solution for compressing incomplete classifiers involves using a weighted compression algorithm. The sections of the input space with no decision are given a decision
with very large weight, to force the construction of a solution where the default rule has the
’incomplete’ decision. Removing this default rule from that solution will correctly produce
an incomplete classifier as desired. The TUF solution shortcircuits the process of creating
139
other solutions with different backgrounds, greatly reducing the amount of work performed.
It also seems a more natural solution to this problem variant than hoping that an overlylarge
weight for a particular decision will construct a ruleset with the expected structure.
Incomplete w/ dec.: The final problem variant is for scenarios when a particular input
must have either a particular decision or have no decision. This can be useful to compress
an extremely large classifier by compressing the first n rules, then compressing the next n
rules, etc. The compressed version of the whole classifier is then the concatenation of the
compressed partial classifiers. Compressing the last part is straightforward as it is a complete
classifier, but the other parts are incomplete, so would normally need to be compressed using
the incomplete classifier compression described above. In this scenario, we have additional
information that can be used to attain better results, namely the decision from the full
classifier.
Incomplete classifier compression is greatly hampered by the requirement that certain
inputs have no decision. Requiring that each part but the last encode exactly the classifier
specified by the rules making that part would result in reduced compression, whereas allowing
some specific decision to be used allows some overlapping of rules. To accomplish this, we
again modify TUF’s leaf creation to convert a leaf marked as “d or nothing” and convert it
to the solution set
Singleton(d) ∅
∅
,
,
Singleton(d)
∅
Inc.
.
The first two values in this solution set correspond to the ways to encode this leaf having value
d, while the third encodes that its value can be left unspecified. Note that the incomplete
solution can be eliminated if its cost is greater than the EB solution. Whichever of these
solutions works best is the one that will be kept by TUF, without any further modification to
140
the algorithm. Further, these variants can be mixed within a single classifier, with different
parts of the input receiving different treatment simply by changing how the leaf nodes are
generated.
4.11
Conclusions
In this paper, we propose a new framework for compressing TCAMbased packet classifiers
and three new algorithms for implementing this framework. This framework allows us to
look at the classifier compression problem from a fresh angle, unify many seemingly different
algorithms, and discover many unknown compression opportunities. Our experimental results
show that TUF gives insights that (1) significantly improve the speed of the existing TCAM
Razor algorithm with no loss in compression and (2) lead to new algorithms that compress
better than prior algorithms. More importantly, this framework opens a new direction for
further work on TCAMbased packet classifier compression.
141
Chapter 5
Hashing
5.1
Introduction
Networking devices depend increasingly on hashing methods for performance. Hash tables
are an important building block for forwarding plane requirements. The MAC table in core
routers and the flow table in GGSN routers are commonly implemented as large hash tables.
Hash is also essential for various proposals which use probabilistic approaches such as Bloom
Filters for IP lookup, packet classification, loadbalancing, routing protocols and network
measurements. ( [82,83] and etc). The pathological cases of these algorithms can have severe
impact on the visible behavior of that device, and widespread abuse of these failure conditions
could have severe consequences.
As networking applications emphasize worst case over average performance, using hash
functions for performance is asking for trouble. As Knuth wrote in [84],
. . . we need a great deal of faith in probability theory when we use hashing methods, since they are efficient only on the average, while their worst case is terrible.
Hash tables with fixedsize buckets and Bloom Filters are two common building blocks in
networking applications. The “terrible” worst cases for them are severely impaired capacity
(could be close to 0) and severely increased false positive rate (could be 100%). Although the
extreme worst case might never occur even for hash functions with low quality output, they
142
would are more likely to cause unexpected performance degradation, such as reduced capacity
of hash tables and increased false positive rates. This is because all performance guarantees
for hashbased algorithms are built upon the assumption of a hash function whose output
is randomlooking. Moreover, the choice of hash function can make it easier for attackers to
trigger these behaviors [85].
The concept of Universal Hashing [86] is proposed to remedy the uncertainty of the hash
function. With the help of universal hashing, many probability tail bounds could be strictly
established. However, the randomness of universal hashing comes from the randomness of
picking function from the family, specifically from the randomness of the seed. For a practical implementation, the random seed would be baked into the hardware. Some wellknown
universal hashing schemes such as the H3 would perform poorly with seed hardwired.
People would have much more confidence when using hash functions designed for cryptopurpose. However, those functions are out of reach of the resource limitations and timing
requirements of network processors. This paper explores what is possible within these constraints.
In this paper, we present two contributions to the state of the art. We provide a statisticsbased evaluation framework that tests the quality of these resourcelimited hash functions to
differentiate the performance of hash functions. We also demonstrate a family of hardware
hash functions with performance similar to hash functions with much higher implementation
cost.
5.1.1
Networking Hash Trends
We see trends towards small, wide, fast and general purpose hardware hash functions. Even
with billions of gates fitting on a single chip, having a hash function take a small number of
143
gates is still important. The input and output should be wide, able to consume many bytes
each cycle and return a large hash value. Speed is always a requirement for networking, but
many applications of hash functions in networking require low latency as well. Finally, this
hash function should be general purpose, not making any assumptions about its input or
output, but doing reasonably well in a broad range of situations.
Wide Increasing traffic speeds combined with clock rates hitting their limit means the
amount of work per cycle needs to increase. One dimension this increase happens is in bus
size, with data buses growing to 128 bits and higher. With the wider deployment of IPV6,
the packet header fields which are most commonly hashed have grown in size, thus our hash
function should take a large input per cycle. At the same time, larger hash tables and other
hashbased data structures such as Bloom Filter require a large hash output. A large input
and output is important for our hash function to interact well with the rest of a networking
system. For this paper, we will work towards a hash function with 128 bit input and 128 bit
output.
Small Small hash functions also increase the system’s ability to handle traffic. Parallel hash
functions also allow more work per clock cycle. Networking processors are scaling to use tens
to hundreds of MIPS, ARM or custom cores on the same die. To show how important size
is, MIPS ships a whole CPU core as small as 0.65 mm2 [87]. In the case that each processor
gets its own hashing block, this level of duplication means the hash function’s size is even
more important. An area equivalent to about 4K gates is small enough to be duplicated
everywhere it’s needed, so we will use that as our target size.
144
Fast The timing of a hardware design affects both latency and throughput. High latency
designs can often be pipelined to gain throughput, but this requires large flipflops to carry
state from one cycle to the next, increasing the size of the hash function. For some applications, in some of those kind of proposals, the hash function is not only on the critical
path of forwarding, it is even in the feedback path of some statemachine. The state machine
needs to run in a high speed with the result from hashing in every cycle. For those types of
application scenarios, a low latency hash function is critical. Even in other situations, lower
latency is important to avoid complex scheduling. We will focus our efforts on designs that
take only one or two cycles.
General Purpose A general purpose hash function will be a more useful addition than
a specialized one. Some network devices are custom built for a specific purpose, but more
and more devices are engineered to be software controlled and hardware accelerated. This
allows them to support new algorithms and protocols, extending the lifetime of the device. A
specialized hash function designed for a hash table keyed by MAC addresses can be designed
to be smaller and more efficient than a general purpose hash, but it will perform poorly
when used by other algorithms. We should not make assumptions on either the structure of
the input nor the use of the hash output, so our hash function is general purpose.
To sum up, industry is heading towards higher expectations for hash functions. In this
paper, we will aim to produce a circuit with about 4K gates that hashes 128 bits into
128 bits in a single cycle. Sometimes the implementation cost of hash functions does not
prevent their use in products, but using better methods to build hardware hashes will reduce
implementation costs and enable better hashbased algorithms to be used in networking.
145
5.2
Related Work
Looking at existing solutions, we do not find any hash functions that satisfy both the quality
requirements of coming network algorithms as well as the size and speed requirements of
ASIC designs. Cryptographic hash functions cannot meet our gate count or timing restrictions. Other hardware hash functions are low quality or can’t scale to larger bus sizes. High
quality software hash functions cannot meet our size restrictions and are inefficient on a
network processor.
5.2.1
Cryptographic hashes
Much work has been spent optimizing hardware implementations of cryptographic hashes.
Satoh and Inoue [88] summarize many implementations and give improvements. All of the
implementations shown are much over 4000 standard gates and nowhere near one cycle to
run. These hash functions were designed to be efficient in both hardware and software. As
such, they do not fully take advantage of the parallelism available in a hardwareonly design.
We will compare our design with MD5, but only to show how close to its output quality we
can come with much more limited resources.
5.2.2
Noncryptographic Hardware Hashes
Commonly used hardware hashes are either low quality or scale poorly in bus width. To our
knowledge, only three general hash functions are widely deployed: CRC32, PearsonHash [89]
and BuzHash [90]. CRC was not designed as a hash function, but instead for detecting bit
errors in transmitted data. As pointed out in many papers such as [85], the linear structure
of hash functions such as CRC and H3 [86] with fixed seed makes them vulnerable to at
146
tacks. Attackers could invert the hash function and forge attack traffic to trigger “terrible”
worst case behavior. Although some people think the design of router benchmarks should
specifically avoid causing these behaviors [91], most would agree that all else being equal, a
design that makes it difficult to trigger worst case behavior is a better design.
Pearson Hash and BuzHash are both designed for string hashing, taking input one byte
at a time. Pearson Hash extends an 8bit to 8bit hash function (usually implemented as a
lookup table) to 8 ∗ n bit inputs by XORing the input byte with the internal state, hashing
that to form a new internal state and repeating on the next byte. BuzHash works similarly,
using an 8bit to wordsize hash function table to hash each input byte and then mixing these
using rotations and XORs. Pearson Hash requires the result of mixing before being able to
do the next table lookup, giving a hash that takes 128bit words as input and latency of at
least 16 memory operations + 16 mixing operations. BuzHash can accept wider input by
using parallel lookup tables, but as each needs 8Kb of fast memory or 16Kb if we want 64bit
output, this quickly exceeds our size budgets. The final cost of a wide, fast implementation
of these kind of hash functions exceeds our intended size constraints very quickly.
5.2.3
Software Hashes
Another source for inspiration in hardware hashes is software hash functions. Software hashes
with high throughput and output quality include Bob Jenkin’s hash [92] and the Murmur
hash [93]. Jenkin’s hash uses a sequence of rotations, additions and subtractions on three
input values at a time to mix the input with its internal state. Murmur hash uses fast multiplications, rotations and XOR to do the core mixing. The commonly used rotate operations
take no gates to implement, as they’re just wire patterns. The rest of the operations seem
reasonable in hardware, but further investigation reveals that the ALUs (Arithmetic Logic
147
Units) used in desktop CPUs are optimized for speed. These optimizations increase the size
of the logic past our area constraints. Without these optimizations, the resulting hash function doesn’t meet our latency constraints. This problem eliminates these successful software
hash functions from hardware implementation in our context.
5.2.4
Existing Evaluation Methods
In the literature that discusses selection of hash functions, few tests are provided for general
purpose, noncryptographic hash functions. Special purpose statistical tests are used in some
cases [94], but are not appropriate for selecting a general purpose hash. Here we review the
most general tests, the Avalanche test and the Chisquared uniformity test.
Avalanche test A strong test of hash functions is the avalanche test, first introduced
in [95]. Intuitively, a small change in the input should trigger a large change in the output,
as a small rock falling can trigger an avalanche. Given a hash function H and input x, denote
H(x) = y. For x different from x in one bit, a hash function that passes the avalanche test
will have H(x ) very different from y. Optimally, every pair of inputs with hamming distance
1 (one bit difference) will hash to nbit outputs with hamming distance n/2.
In practice, we test this by sampling the input space and for each sample, inspecting the
change in output for each onebit change in input. By tabulating which bits change, one can
build up a model of bit interrelationships and the strength of the avalanche property. For
example, if whenever the first input bit is flipped (no matter the value of the other bits),
the first output bit always flips, that combination of bits has an avalanche measurement of
100%. Similarly, if a particular output bit doesn’t ever change when a particular input bit
is flipped, that pair of bits has 0% avalanche. A good avalanche result would be if the first
148
output bit changed 50% of the time when the first input bit was flipped, depending on the
values of the other input bits as to whether or not it would flip. The closer all percentages
are to 50%, the better the avalanche provided by the hash function.
Uniformity testing The Chi Squared (χ2 ) test can quantify the uniformity of hash output
values. For a hash function with output ranging from 0 to n, we hash m input values and
count how often each output value occurs, calling these counts ci , 0 ≤ i ≤ n. The Chi Squared
test compares these counts with the counts that would occur if the output had a uniform
distribution: cu = m/(n + 1). The statistic
n
χ2 =
i=0
(ci − cu )2
cu
(5.1)
measures how close the ci counts are to uniform  a smaller value indicates a more uniform
distribution of ci ’s. As we want hash results with a uniform distribution for packet sampling,
this test is especially important in that context [96].
The Chi Squared test for uniformity is impractical for directly testing hash functions.
Since most hash functions have a very large range of outputs a proper test would require
computing the hash value of a very large number of inputs, otherwise most buckets would
be empty due to lack of input. To resolve this problem, the test is usually performed on a
variation of the hash function with reduced output range.
This variation is usually to treat the original output as an integer and to apply a final
division and use the remainder as the hash value. This means that the test results often don’t
apply directly to the hash function to be tested. Dividing by a prime number takes many
nonuniform distributions and converts them to a much more uniform distribution, which is
149
why many software hash tables use prime sizes. Dividing by a power of 2 throws away the
top bits of the result, only testing part of the hash output. Test results on the modified hash
functions are optimistic, so the user of this test must be aware that uniform results on the
reduced output variation do not imply uniformity of the original function.
5.3
Design Methodology
Section 5.3.1 presents the major considerations in building a resource constrained, general
purpose hardware hash function. This is followed by the specifics of the design in four parts.
Section 5.3.2 outlines the general framework of the family of hash functions we explored.
The following two sections give two implementations of the mixing primitive: XOR stages
and Sbox stages. Finally, Section 5.3.5 explains the complexities of routing bit permutations
and gives a lowercost design.
5.3.1
Design considerations
We identify the following rules of thumb in designing a good hash. They are not the only
way to build a good hash function. They are design principles that led us to our current
design, so understanding them will improve understanding of our design.
As a framework for this discussion, model the input and output of a hash function as
vectors of bits. For a given type of input, each of the input bits has a fixed entropy, or
unpredictability. For example, MAC addresses use a bit to indicate whether this address is
globally unique or locally administered. In most situations, this bit will always be set for
globally unique, so it has low entropy because it is very predictable. Since we don’t know
150
the use for the output values, we should strive to have them each maximally unpredictable
so that the output resembles uniformly distributed random numbers.
The first guideline is to thoroughly “mix” each input bit into all the output bits. A hash
function with nbit output can be viewed as n binary functions of the input value, one binary
function producing each output bit. If some of the output bits don’t depend on some of the
input bits, that output bit is more predictable than if it depends on the value of every input
bit. Thus, we want to use every input bit as part of the “recipe” for producing each output
bit.
A perfect hash function with nbit input and nbit output values would be a 2n → 2n
permutation that assigns each input value a distinct, random output value. In this situation,
the output would be predictable only knowing both the input value and the hash function,
and having partial information on either or both would make the output very unpredictable.
If any input values mapped to the same output value, that output value would be more
common than others, and some output value would not be mapped to, giving a more uneven
result and a less random looking output.
Cryptographers call the device that performs this kind of permutation a Sbox. As large
Sboxes are impossible to build, we emulate them with a sequence of operations that “shuffle”
the input values. The permutation built into the Sbox can be decomposed into a sequence
of simpler mappings that are each feasible to compute.
Within this context, our second guideline is to prefer invertible mappings. It is important
to map similar inputs to very different outputs, as realworld input data often has inputs
with small differences. If the first stage uses a noninvertible simple mapping, similar inputs
are likely to collide in this stage and be mapped to identical outputs by later stages. After a
series of invertible stages, similar inputs should be mapped to very different values without
151
any collisions. If the hash function produces output smaller than its input, the required
noninvertible transformation should be delayed until the last stage to reduce the chance
of similar input collisions caused by the simplicity of this transformation. Using invertible
mappings avoids output collisions of similar inputs.
Building a hardware hash function is quite different from a software hash. The level of
control of individual bits is much higher, but arithmetic operations like integer multiplication
are more expensive. Even the measures of performance are different.
Hardware hash performance is measured in area and timing. The area of a circuit depends
on how much wire and how many gates are needed. The timing of a circuit depends only
on the longest path from a source bit to an output bit. The timing of a circuit does not
depend on how many operations are performed in total, but on how long it takes for all the
output bits to be valid. This is very different from software hash functions, which execute
instructions mostly sequentially, whose timing is determined by the number of operations
performed. Parallel operations are common in hardware, with the time of a complex operation
measured by the longest path through logic gates needed to produce any output bit.
Hardware hashes have direct control of values at the bit level, and have access to simpler
building blocks. In hardware, bits are just signals on wires, so shuffling the bits of a value
is simply routing the wires representing that value to new locations. Bitwise operations
like XOR are cheap, and can be done in parallel on many bits simultaneously. Arithmetic
operations, even addition, are much more expensive, because a single carry can ripple through
all the bits being added, creating a very long path from the input to output bits. This
can be optimized, but at the cost of much more area. All this considered, hardware hash
functions will have a very different design from hash functions designed for efficient software
implementation.
152
Figure 5.1: Framework of hash function
5.3.2
The framework
We propose a modular hash function design, with multiple stages of alternating mixing and
permutation, as shown in Figure 5.1. This design is similar to a cryptographic substitutionpermutation network without key bits being merged at each stage. Each component is designed to be invertible, so we automatically avoid bias and cannot have any collisions. We
attempt to maximize the efficiency, and use many stages to produce a high quality result.
Instead of building each stage as one large mixing function, much area is saved by using
many small mixing functions to mix adjacent bits. By permuting the bits in between stages,
the input bits’ influence will cascade to affect every bit of the interstage state. To mix the
input bits into the entire nbit interstage state, at least logb n stages are needed, where b
output bits are affected by an input bit in each stage. Using many stages that mix bits in
small clusters reduces cost significantly.
We suggest two possibilities for mixing stages, a linear XOR stage and a nonlinear Sbox
stage. For the permutation stages, we provide an efficient hardware implementation of very
large bit permutation functions.
153
Figure 5.2: One stage of XOR Arrays
1
1
...
1
1 1
...
1
1
...
1 1 ...
Y =
... ... .
... 1 1
...
1 1
X
Figure 5.3: Equivalent matrix equation
5.3.3
XOR stage
To produce an invertible stage, we start with a structured transformation. The easiest way
to make an invertible function is by a linear function, representable by a bit matrix. Each
row of this matrix specifies which input bits to XOR together to produce an output bit. If
the matrix is invertible, the linear function it represents is also invertible. The cost of this
function is proportional to the number of ones in the bit matrix, so we want a function whose
matrix is also sparse.
Our XOR stage implements a very sparse, invertible, matrix multiplication as the equation in Figure 5.3. This is implemented by XORing adjacent bits as shown in Figure 5.2.
154
We include one 3input XOR gate in each stage so the resulting mapping is invertible. If
we include only 2input XOR gates, any combination we produce will be noninvertible.1 If
we use a 1input XOR gate (also known as a wire), that input bit will only affect one output
bit. We would need to be careful that its poor mixing doesn’t carry to the last stage.
Figure 5.4: One stage of Sboxes
This design is very inexpensive and mixes bits efficiently. Using 3input XOR gates would
allow more mixing, but the gate size would go up by about 2x and the gate delay as well
by about 60% [97]. The complexity of routing wires would be increased as well, as we would
likely use more nonadjacent bits. We would get similar cost and better mixing from two
smaller stages of 2input XOR gates.
1 Using rowoperations on the matrix cannot change the property that each row has an even number of
1’s. Thus GaussJordan elimination cannot terminate, and the matrix is noninvertible.
155
5.3.4
Sbox stage
Linear functions can provide good mixing, but we also need nonlinear mappings to add additional complexity to the output to reduce its predictability. Without nonlinear mappings,
there will be no avalanche, and each output bit will be a simple XOR of some of the input
bits. By including nonlinear mappings, the output bits can each depend on all of the input
bits, each in different ways.
Our solution to this is to use an array of Sboxes as shown in Figure 5.4. These Sboxes
will be implemented using direct combinatorial logic. This implementation is area efficient
for small Sboxes and produces its result much faster than a lookup table. The smallest
Sbox that provides a nonlinear result is a 3 × 3 Sbox, which takes three bits of input and
returns three bits of output. Among all possible Sboxes, we select the following one (and its
isomorphic equivalents):
P ermutation
[a, b, c] −−−−−−−−−−→ [Qa , Qb , Qc ]
(3−bit S−box)
{0, 1, . . . , 7} → {1, 4, 7, 5, 2, 0, 3, 6}
Qa = a
¯b + a
¯c + bc
Qb = ab + a¯
c + b¯
c
Qc = a
¯b + a
¯c¯ + b¯
c
The Sbox we select provides the nonlinear property that with randomly distributed
input, flip any one bit or two bits, all three output bits would be flipped with probability
50%. Undesirably, when all three input bits are inverted, all three output bits are also
156
inverted. This is a downside of using such a small Sbox, but it still provides the necessary
nonlinearity in our framework.
A0
A1
B0
B1
Y
C0
C1
Figure 5.5: AOI222 gate
To implement the formula a
¯b + a
¯c + bc, we use a single AOI222 gate shown in Figure 5.5.
This gate is about the same size as a 2input XOR gate. Using this gate, the Sbox shown
above could be implemented by only 3 gates and 3 inverters. Overall, the cost of a 3 × 3
Sbox stage is only a little larger than an XOR stage.
For comparison, an example of 4 × 4 Sbox is the following:
Qa = a¯b¯
cd + a
¯b + a
¯c¯d¯ + bc
Qb = a¯bd + a
¯cd¯ + b¯
c
Qc = a¯bc + a¯bd¯ + a
¯bd¯ + a
¯cd¯ + bcd
Qd = a¯bc + a
¯b¯
c + abd¯ + a
¯cd
This 4 × 4 Sbox produces a more consistent nonlinear transformation but the cost is
much higher. AOI2222 gates exist in some standard cell libraries, but even this wouldn’t be
able to compute Qc above using a single large gate. The only option for larger Sboxes is
157
to use multiple gates, which results in a much larger cost in area and timing. Using 3 × 3
Sboxes as the basic element seems the best tradeoff between unit cost and mixing ability.
5.3.5
Permutation stage
The optimal permutation stage is a 128bit Pbox, which permutes all 128 bits arbitrarily.
The cost of wiring is usually ignored in ASIC design, but it turns out that this construction’s
cost is not ignorable. With a wide data path, the distance from bit one to bit 128 is long
enough to add to our total delay, impacting timing as well.
Even more than wire delay, the real difficulty in doing arbitrary permutations of the
input bits is the cost of crossing wires. As integrated circuits are constructed in layers, in
order to swap two wires, one must connect vertically across layers. Laying out an arbitrary
permutation in silicon requires much more area than just connecting straight through from
one set of gates to the next. A constrained permutation can still allow full mixing with much
lower cost.
Figure 5.6: Example permutation with 6 layer separated and then combined
The following solution with constant wire crossing cost, is illustrated in Figure 5.6. We
start with the input points and output points in parallel columns and produce a matching
across these points as follows. All input points are divided into k groups. The output connec
158
tions are divided into the same groups, ensuring the same number of input as output wires
in each group. Within a group, the first input point is randomly connected to an output
point, and the next input point to the next output point, returning to the first output point
after reaching the last. If we label the input and output points with integers 0 to m − 1,
we choose r = rand(m) and connect input point i to output point o = (i + r)
mod m,
0 ≤ i < m. Doing this for each group requires two layers per group, one for the wires i < o
and the other for i > o. Figure 5.6 shows a 3group permutation with red, blue and black
rotations shown in separate layers, together with the combined permutation resulting from
their combination. In some of our tests, we use two rounds of this kind of permutation, and
will discuss the effects of this in Section 5.6.
5.4
Evaluation
In this section, we propose three hash testing criteria: generalized uniformity test, generalized universality test and generalized Avalanche test. For convenience of discussion, we also
present evaluation results in this section.
In our experiments, we use data from CAIDA’s Anonymized 2010 Internet Traces Dataset
[98]. We use the Src IP, Dst IP, Src port, Dst port and sequence numbers from these traces as
hash input, totaling 128 bits per packet. It should be noted that we removed the duplicated
tuples which are generally caused by TCP retransmission. Failing to do this would introduce additional nonuniformity into the results, with duplicate inputs leading to duplicate
outputs. TCP sequence numbers are designed to be unpredictable as they’re required to resist data insertion attacks, so including these adds significantly to the entropy of the input.
Moreover, although CAIDA’s anonymization is prefixpreserving, this might introduce more
159
randomness into the set of input values than would be expected on a production router.
This gives poor hash functions an advantage, as having more input variation requires less
processing to produce widely varying output. Even with this entropy advantage, our tests
can still differentiate levels of output quality.
5.4.1
Hash functions for Comparison
The hash functions to be compared against our function are the following: CRC32 (IEEE
802.3), two H3 functions [86] with pregenerated random matrices, Pearson hash [89], BuzHash function [90], Bob Jenkins’s hash [92], Hsieh’s SuperFastHash [99] and MD5 [100]. As
discussed in Section 5.2, only CRC and H3 come close to fitting our size and latency restrictions, and all other hash functions would not meet our design restrictions. The following
comparisons are only on output quality without considering implementation cost.
All functions were configured to take 128 bits of input. Some functions could only return
32 bits of result, hence for a fair comparison, we only compare the lower 32 bits. Since our
constructions produce equally unpredictable output bits, using only a subset of them still
fairly represents the original.
For convenience of comparison to a uniform distribution, we also included a random
number generator (MT19937), which generates uniform numbers independent of the input.
The results of its tests should show optimal results to within experimental error.
As the methodology to build our hash functions is quite flexible, we present a family
of hash function constructions. The general methodology is to cascade XOR and 3x3 Sbox
stages. For convenience of notation, we use x to denote XOR stage and t to denote the 3x3
Sbox stage (t for three). If more than one stage of the same type are cascaded, we put the
repetition count after the stage identifier. For example, the configuration name tx4t means
160
one SBox stage, four XOR stages, and another Sbox stage, with a permutation block in
between each stage. In Section 5.3.5, we have discussed the method of using more regular
connections between stages to simplify the routing. The suffix “L6” denotes a hash function
using the (3 group) 6layer permutation block demonstrated in Figure 5.6 and “RL” indicates
a completely random permutation of all 128 bits.
We select tx4tx3t,tx4tx4t, tx3tx3tx3t and tx5tx5t as possible configurations of our hash
function based on size and output quality. The configurations t15 and x15 are included only
for comparison purposes, despite their impracticality.
5.4.2
Generalized Uniformity Test
We could characterize the problem of evaluating hash functions in the following way: For
an arbitrary set S, we want to test whether h(S) is uniformly distributed across all output
values. This means we want to see whether h(S) is statistically the same as independently,
identically distributed numbers generated from a uniform distribution with the same range.
However, the range of output values is too large for any direct test. The natural solution
is to group the output values and then perform the ballsandbins test on the groups. We can
think of this as projecting a sparse mass in a highdimensional space into a lowdimension
space. The mapping proposed by us is to randomly select a subset of the output bits, and
group hash outputs based on their value of these bits. Still using the analogy of projection,
the mapping proposed by us is to select some axes on the space and collapse all other
dimensions. Then we use those mapped results to perform the uniformity test.
Another reason for this method of grouping is that we want to test the dependence among
bits and want to reveal possible correlations. This kind of projection would reveal the bias
or correlation among the chosen output bits. For example, if the 13th bit is correlated with
161
18th bit, when we keep those two axes, we’ll find a nonuniformity in the resulting data, as
all four combinations of those two bits won’t appear equally often. As people assume that
hash function output is indistinguishable from random numbers, they often divide the result
into multiple values. This testing is important to know whether those values depend on each
other.
To perform our generalized uniformity test, for each of M rounds, fix K output bits by
sampling without replacement. Each round consists of hashing N input values, projecting
out the K output bits for that round and counting the frequency of each of the 2K possible
outcomes. We use M on the order of 10 to 100, K from 8 to 12 and depending on K, N
from 400 to 106 , with larger N for larger K.
Compared to the previous methods proposed in [96] and the common ones that could be
found in the Internet, the major difference here is that our number of bins is a power of 2
so we’re not mixing any more during the test, and we choose bits other than the low order
bits.
For each round, we use the standard Chisquare test of statistics (Eq. 5.1) as follows:
2K
S=
i=1
(#Balls in Bini − ExpectedLoad)2
,
ExpectedLoad
where ExpectedLoad = N/2K .
−1
For each round, we could have a pvalue α = F −1
2 (S). Here F 2 is the inverse cumulative
χ
χ
distribution function of χ2 distribution with 2K − 1 degree of freedom. Then 1 − α is the tail
probability to observe S if the hash values are truly uniformly distributed. If the hash values
are sampled from a uniform distribution, α of each round should be uniformly distributed
within [0, 1].
162
Since there are multiple rounds, we need to summarize the results of all rounds for
comparison purposes. Here we use the average and the maximum of α values to get two
values: αmean and αmax , which serve as the final metrics of a hash function. The pvalues
for αmean and αworst can be derived as follows:
αmean =
−1 (
Fus
1
M
M
αi ),
i=1
where Fus is the distribution of the sum of M uniformly distributed numbers on [0, 1] and
αworst = (max{αi })M .
The two formulas above hold only when the data sources of each round are independent.
It is worth pointing out that here we use the onesided pvalue. A smaller α indicates more
uniform hash values. If α is too close to 0, it means the hash values are more “uniformly
distributed” than what we would expect from independent random choices. Such a hash
function would be suspicious, but should not be rejected by a uniformity test. Hence we only
use the oneside uppertail pvalue 1 − α across all statistics.
It is also worth pointing out that multiple rounds should use different nonoverlapping
segments of data sources, otherwise the hash results would be correlated. This would compromise our ability to summarize the results of multiple rounds, no matter which function
was used.
An example of testing result is shown in Figure 5.7 (a) and (b). This boxandwhisker
plot shows the α values generated in 100 rounds with 105 balls per round hashed into 1024
bins. For convenience of comparison, the numbers are shown as log10 (1 − α). The numbers in
163
100 rounds with fixed selection of output bits on CAIDA data
−0.04
MD5
−0.46(1.47,0.25)
−0.09
Buzhash
−0.34(1.44,0.24)
−0.55
Bob
−1.31(1.47,0.25)
−1.03
Hsieh
−0.56(1.55,0.25)
−0.39
Pearson
−0.00(1.44,0.25)
−0.62
rand
−0.30(1.47,0.25)
−0.34
CRC
−0.07(1.48,0.26)
−0.13
H3 s1
−0.01(1.50,0.26)
H3 s2 −4.20(not shown)
−4.31(1.41,0.26)
tx3tx3t L6 −1.89(not shown)
−1.51(1.43,0.25)
−1.41
tx3tx3t RL
−0.30(1.53,0.24)
−0.49
tx4tx4t L6
−0.27(1.43,0.24)
−0.05
tx4tx4t RL
−0.28(1.47,0.24)
−0.84
tx5tx5t L6
−0.18(1.46,0.25)
−0.91
tx5tx5t RL
−0.27(1.45,0.26)
−0.60
tx3tx3tx3t L6
−1.07(1.43,0.25)
−0.28
tx3tx3tx3t RL
−0.35(1.43,0.26)
−0.68
t15 RL
−0.32(1.51,0.24)
−0.58
x15 RL
−0.00(1.45,0.25)
−5
−4
−3
−2
−1
0
log10(1−α)
(a) Fixed output bits
100 rounds with different selections of output bits on CAIDA data
−0.57
MD5
−0.15(1.51,0.24)
−0.14
Buzhash
−0.69(1.46,0.25)
−0.32
Bob
−1.48(1.51,0.26)
Hsieh −12.6(not shown)
−21.9(1.63,0.24)
−0.34
Pearson
−0.40(1.56,0.24)
−0.30
rand
−0.29(1.48,0.25)
CRC −7.55(not shown)
−3.42(1.45,0.25)
−0.67
H3 s1
−0.06(1.46,0.24)
H3 s2 −12.9(not shown)
−0.14(1.63,0.24)
−0.16
tx3tx3t L6
−2.10(1.53,0.25)
−0.18
tx3tx3t RL
−0.33(1.47,0.24)
−0.19
tx4tx4t L6
−0.19(1.44,0.26)
−0.24
tx4tx4t RL
−0.80(1.47,0.25)
−0.10
tx5tx5t L6
−0.32(1.46,0.24)
−0.39
tx5tx5t RL
−1.19(1.41,0.25)
−0.47
tx3tx3tx3t L6
−1.14(1.46,0.25)
−0.46
tx3tx3tx3t RL
−0.39(1.45,0.25)
−0.46
t15 RL
−1.37(1.51,0.25)
x15 RL −4.87(not shown)
−0.43(1.47,0.24)
−5
−4
−3
−2
−1
0
log10(1−α)
(b) Dynamic bit selection
Figure 5.7: Boxandwhisker plot of the results of Uniformity tests on real trace
Along the right axis is αmean (maximum load ratio, minimum load ratio) across all rounds. The
load ratio is load of some bucket divided by average load. Inside the figure is the αworst .
164
(a) are generated by a fixed selection of bits, while the αs in (b) are generated by a changed
selection for each round.
We also printed the two pvalues that summarize the results of all rounds, i.e. log10 (1 −
αworst ) and log10 (1 − αmean ), inside the figure and on its right axis separately. 2 It is
customary to declare the observations as nonuniform only when the pvalue is less than 5%
or 1%.
If we fix our choice of bits across rounds, we get the results shown in Figure 5.7 (a),
showing that most functions do well. The exceptions are H3 (s2) and tx3tx3tL6. H3 (s2) has
one worst case 10−6.2 (cut off by the figure boundary) with probability 10−4.2 to observe
this kind of worst case happens in 100 rounds, although on average it performs well. Our
hash function tx3tx3t L6 is slightly bad, having a worst case α of 10−1.89 ≈ 1.29%. All other
numbers indicate no significant flaws in the hash functions.
However, if we dynamically change the selection of outputs of hash function for each
round, as shown in Figure 5.7 (b), we see that some combinations of bits aren’t as uniform
as others. The Hsieh’s SuperFastHash starts to show poor worst case results (some points
out of boundary are not shown in the figure) and its average α is also the worst. Testing CRC
and H3 with two different fixed seeds, we find many occurrences of very low performance,
while their average cases are still good.
We also print the value of the maximum/minimum load ratio of the bins across all rounds
at the right of the figure, inside parentheses. We observe that it is hard to summarize the
2 It is worth noting that the α
mean and αworst are not as simply related as the mean and the minimum
of a dataset. The tail probability of an “average performance” is αmean . If it is very small, it means the
mean α is uncommon and the hash output is often of low quality. The tail probability of a “worstcase
performance” is αworst . If it is very small, it means the worst α observed is very extreme, meaning that the
quality of this hash function is at least volatile (it might be consistently of very low quality, or sometimes
very low, sometimes acceptable). With this in mind, it is possible for either of these two values to be larger
than the other. They are just two scores from slightly different perspectives.
165
results of those maximum/minimum load ratios since they are almost all in the same range.
There are only two exceptions: the results of the Hsieh’s and H3 (s2) in Figure 5.7 (b), which
are 1.63, whose αworst is very bad. This shows that Chisquare test is powerful for testing
uniformity of the output values. Low chisquare results might not necessarily lead to a bad
maximum load of hash bins. The results for H3 (s2) in Figure 5.7 (a), show a poor chisquare
result while its max/min load is still good. A hash function with a more uniform distribution
should still be more useful for other purposes than a hash table.
Figure 5.8: Uniformity results for “+10” series
We also run experiments using linear input sequences, which is also a common way to
test networking devices [91]. Figure 5.8 shows the results of having input data of the form
x + 10i for a constant x and 0 ≤ i < N , with different output bits chosen each round of
testing. The linear hash functions CRC and H3 degrade in performance significantly on this
test, while most other hash functions tested show consistent performance.
166
1
MD5
rand
CRC
H3 s1
H3 s2
tx3tx3t L6
tx3tx3t RL
tx4tx4t L6
tx4tx4t RL
tx5tx5t L6
tx5tx5t RL
t15 RL
x15 RL
0.9
α of each round (Sorted)
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
20
40
60
Rounds
80
100
Figure 5.9: QQ plot for “+10” series Many functions which perform well are removed from
the figure.
We also show the qq plot of the α values for the linear sequence test. By sorting all the
pvalue αs of each round of each hash function in Section 5.9, we can visualize how often
a test had a certain range of results. For an ideal hash function, this plot would show a
diagonal line, since α would follow uniform distribution in [0, 1]. As most functions meet this
expectation, we remove many functions from the figure for clarity. The two worst performing
functions on the top of the plot are t15 and tx3tx3tL6. They have mostly large α values,
indicating regularly nonuniform results. The outliers on the bottom of the plot are the hash
functions with linear structure, including CRC and H3 . These give unusually frequently
uniform results and too frequent nonuniform output, indicating uneven output bit quality.
167
5.4.3
Avalanche Test
Our Avalanche test is similar to the ones used in [101]. Instead of always testing on random
input data, we use lower entropy input sets. We do this because we believe strong hash
functions should have the avalanche property even for test sequences such as the “+1” test
sequences.
A different variation on the Avalanche test is given by Castro et. al [102], where functions
are tested to see if the hamming distances between hash values follow a binomial distribution.
Given H(x, y) as the hamming distance from x to y and nbit output values they test whether
1
H(f (x), f (y)) ∼ Bin( , n)
2
This gives a single value for how accurately the given hash function passes the avalanche test,
but obscures the valuable information of where the bit dependencies are. This information is
valuable for improving a hash function in development, as changes to the mixing functions
can remove these dependencies.
In Table 5.1, we present results of testing Avalanche Properties on two input sets. The
first is composed of 105 random numbers and the second is the integers from 1 to 105 in
order. For each input value, we test the 128 hash outputs generated by inverting each input
bit. In total we have 128 × 32 inputoutput pairs and we track how frequently each output
bit inverts as a result of each input bit being inverted. The ±1% and ±5% columns are the
percentage of pairs that fall into [49%, 51%] and [45%, 55%] categories. The max ±% column
is the largest deviation from 50% among all the pairs.
Because of the size of our input, even the random number generator had some bits that
weren’t evenly distributed, as shown in the max column 0%. The hash function closest to
168
Hash
MD5
Bob
Buzhash
Hsieh
Pearson
rand
tx3tx3t L6
tx3tx3t RL
tx4tx4t L6
tx4tx4t RL
tx5tx5t L6
tx5tx5t RL
tx3tx3tx3t L6
tx3tx3tx3t RL
t15 RL
Real Trace
% in % in max
±2% ±5% ±%
Sequential input
% in % in max
±2% ±5% ±%
100.0 100.0
96.6 99.2
28.6 72.6
99.2 99.6
95.2 98.1
100.0 100.0
18.9 18.9
41.1 41.1
71.2 71.2
94.5 94.5
97.3 97.3
100.0 100.0
96.5 99.6
100.0 100.0
60.3 97.8
100.0 100.0
96.1 98.7
3.7
8.7
99.0 99.7
36.0 75.0
100.0 100.0
28.8 30.1
50.8 52.2
74.8 75.4
96.0 96.1
97.9 97.9
100.0 100.0
95.2 99.4
100.0 100.0
71.4 93.4
0.6
12.5
12.9
14.0
15.5
0.6
50.0
50.0
50.0
25.2
12.9
6.6
7.9
0.9
12.4
0.7
14.1
50.0
14.2
18.8
0.6
50.0
50.0
50.0
25.0
12.9
6.1
6.7
1.5
25.6
Table 5.1: Avalanche with real trace and +1 sequence
random is MD5, then is our tx3tx3tx3tRL. It is also interesting to observe that, although
the performance of most hash functions are consistent across the input sets, Buzhash and
Pearson differ between random and sequential input. This is likely caused by the way they
process input bytebybyte.
5.4.4
Universality Test
In practice, people usually want a hash function that gives very different results by changing
its seed. A common way to accomplish this is to design one hash function H(x), and reserve
part of the input as the seed. We write seeded hash functions as H(x, s) = y. 3
Formally, for any arbitrary seed s1 , s2 and hash value x, we want to test whether H(s1 , x)
and H(s2 , x) are independent. We will assume that hash values H(s1 , x) and H(s2 , x) are
3 It should be noted that, when you want to have the 128bit input and 16bit seed, the only correct way
is to build it based on a 144input hash function. It is easily shown to be nearly equivalent if the seed is
simply XORed with the input.
169
uniformly distributed in the space, i.e. H is a good function. Then we can test for the
uniformity of H(s1 , x)&H(s2 , x), where & is the concatenation of the two hash outputs.
Since two random variables X1 and X2 uniformly distributed in [0, d − 1] are independent if
and only if X1 × d + X2 is uniformly distributed in [0, d2 − 1], this test will determine the
independence of the hash function with different seeds. The space to be tested is very large,
so we recommend using the projection method from 5.4.2 to reduce the test complexity.
The testing procedure follows the following steps: We first divide up the input bits into
three groups: fixed, seed and variable. The fixed bits we keep at an arbitrary, fixed value
through testing we call f . Each round, we use two different seed values for the seed bits,
s1 and s2 . The remaining bits give the variation; within a round we test all combinations
of those bits, denoted x. Each output value is reduced to p bits by removing the same bits
from each. On each round, we select s1 and s2 and count the frequencies of all 22p outcomes
possible from the projection of H(f &s1 &x)&H(f &s2 &x) for all possible x. This process is
summarized in Figure 5.10. One important detail of this process is that if seed pairs (s1 , s2 )
and (s1 , s3 ) have been tested, (s2 , s3 ) should not be tested, as the results of these tests
wouldn’t be independent.
This test could be formulated with a larger seed instead of holding bits fixed. We use
the current configuration of test since big changes in seed would give the hash function
more unpredictability on its inputs to produce unpredictable outputs with. We reduce the
variation in seed values to stress the hash function by changing fewer bits in seeds to better
determine its output quality.
For each round, we observe the largest and lowest load of the bins, and perform the
chisquare test for uniformity on the distribution of the load of bins. We summarize multiple
rounds using the method in Section 5.4.2.
170
00100011101 1011011 000000
f
hash
s1
00100011 10110110 11000000
x
11110000 merge
11110000 10101010
p=8 output bits kept
hash
bin
00101001 10101001 11100100
10101010
00100011101 1011101 000000 0
f
s2
216
x
Figure 5.10: Procedure for testing universality
Before we come to present the results of this test, we connect this test with the idea of
universal hashing. Can we test whether a function employed in practice has the universality
property? We know that commonly used hash functions aren’t proven to be in a universal
hashing function family. There may be a statistical method to test if a function is “approximately” universal.
The 2universal hashing is defined for a hash function f with seed space S as
∀x1 , x2 , y1 , y2 , P [f (x1 ) = y1 (and)f (x2 ) = y2 ] ≈
1
.
S2
The natural way to statistically test for universality is to select a fixed pair of x1 and x2 ,
traverse all possible S, and get the distribution of the pair of y1 and y2 . Since the space of
y and S is large, bit projections are needed to make the computation feasible. Surprisingly,
the procedure of this test is the same as the universality test already described. For this
reason, we call our test as Universality Test. The nature of this test is to discover more into
the “2D” structure of the hash values.
Here we present an example of the result of this test in Figure 5.11. We use 18bits of x,
5bits of seed, and project out 6 bits from each hash result, resulting in 1M balls in 64K bins.
171
Figure 5.11: Universality testing results
Actually all results are similar and we don’t select this configuration for a specific reason.
The selections of the 18bits x and 5bits s from the input bits are randomly preconfigured.
In total we could have 25 − 1 pairs of seeds, since pairs should share at most one common
seed as discussed above. Hence there are 31 rounds in total.
The results are shown in Figure 5.11, the meaning of all numbers are the same as the ones
in generalized uniformity test. It is clear that much more functions fail this test. Not surprisingly, CRC and the linear families fail the test because of their linear structure. BuzHash
also performs poorly on this test, with the worst possible max/min load ratios. Looking more
closely at BuzHash’s bin size distribution, we observe that BuzHash still has more variation
in its bin size than linear hash functions, despite the same reported load ratios.
It’s also interesting to observe that even the Pearson hash fails this test both in having a
very nonuniform seed pairing as well as having very poor worst case bucketing. Many of our
172
constructions fail this test, with only those using full random permutations passing. Even
many of those that failed still had reasonable worst case bucket sizes.
5.5
Future Work
It may be possible to improve the output quality with a different staging pattern. Since the
3 × 3 Sbox stage is only a little bigger than an XOR stage (although the delay is much
larger), we may be able to improve hash quality by trading XOR stages for Sbox stages.
Although we have only done a small exploration of the possibilities of this framework, we are
sure that both are important, play different roles in our architecture, and should be deployed
in a balanced way. Using only XOR stages, x15, or only Sbox stages, t15, results in poor
performance in most of the tests. As combining XOR stages produces a more complex but
still linear XOR stage, the results for x15 are unsurprising.
There is more that can be done to discover better hash functions within our general
framework, mainly in the realm of increasing the performance per unit cost of the permutation rounds. We expect our evaluation methods to be useful in guiding this optimization
and leading to better hash functions.
5.6
Conclusion
In the light of current networking trends, we investigated existing hash functions and found
them lacking in either output quality or implementation cost or latency in hardware. Also,
we found that current evaluation methodologies are ineffective for the quality range we were
investigating. In this paper, we developed a generalized evaluation framework to quantify the
randomness of hash functions for those applications and evaluated various hash functions. We
173
also investigated strategies for building hash functions and present a family of hash functions
that is not only small and fast to implement, but also scores high in our evaluations.
To summarize the hash quality results above, Bob Jenkin’s hash is the only previously
known noncryptographic hash that passes all our tests. Among our candidate hash functions,
only tx3tx3tL6 fails any uniformity test. Our function tx3tx3tx3tRL performs nearly as well
on the Avalanche test as M D5. The constructions using RL permutations (full random
mixing) do well on the universality test.
Regarding the implementation costs of our proposed hash functions, we have performed
a preliminary synthesis. This synthesis uses 45nm technologies with clock speeds at 1GHz,
meaning the logic costs for these designs are much higher than the designs in Satoh and
Inoue [88]. Based on our current synthesis results, the random permutation uses around
5% more cell area than the 6layer simplified permutation. Constraining the synthesis so
that our hash function completes in a single cycle, a 128 × 128 random XOR matrix, needs
around 3.3K cell areas. A comparable construction is the pattern tx3tx3tRL, which needs
around 3.9K cell areas. This 9stage hash function gives good uniformity results, moderate
avalanche, and good universality results, and is able to hash data at a rate of 16GB/s with
a single cycle of latency.
Achieving more than this with single cycle latency increases the cell count significantly,
as long chains of dependent gates have to be replaced by more complex constructions with
shorter dependency length. The 13stage tx5tx5tRL needs over 8K cell areas to complete in
a single cycle. We provide a methodology to strike a tradeoff between hashing quality and
cost, not only optimized results for the 4K gate count.
Hardware hash design for networking systems is polarized. One style of hash design is
exemplified by MD5 and SHA1. This world focuses on very high output quality crypto174
graphic hashes that are mandated by external requirements and used for security purposes.
The opposite end of hash design is not constrained by external standards, and has mediocre
output quality in exchange for simple implementations. We expect this work to open new exploration in the field of hardware hash functions to bridge the gap with high output quality,
low cost hash functions. Success in this area will help new high speed routers keep pace with
bandwidth demands with a minimum of pathological cases for a highly reliable Internet.
175
REFERENCES
176
REFERENCES
[1] “Simple object access protocol (soap), www.w3.org/tr/soap/.”
[2] “XMLRPC, http://www.xmlrpc.com/spec.”
[3] H. J. Wang, C. Guo, D. R. Simon, and A. Zugenmaier, “Shield: vulnerabilitydriven
network filters for preventing known vulnerability exploits,” in Proceedings SIGCOMM,
2004.
[4] Z. Li, L. Wang, Y. Chen, and Z. Fu, “Networkbased and attackresilient length signature generation for zeroday polymorphic worms,” IEEE International Conference
Network Prorocols (ICNP), pp. 164–173, 2007.
[5] N. Schear, D. R. Albrecht, and N. Borisov, “Highspeed matching of vulnerability
signatures,” in Proceedings International Symposium on Recent Advances in Intrusion
Detection (RAID), 2008.
[6] D. Brumley, J. Newsome, D. Song, H. Wang, and S. Jha, “Towards automatic generation of vulnerabilitybased signatures,” IEEE Symposium Security and Privacy, 2007.
[7] R. Pang, V. Paxson, R. Sommer, and L. Peterson, “binpac: a yacc for writing application protocol parsers,” in Proceedings ACM Internet Measurement Conference (IMC),
2006.
[8] N. Borisov, D. J. Brumley, and H. J. Wang, “A generic applicationlevel protocol
analyzer and its language,” in Proceedings Network and Distributed System Security
Symposium (NDSS), 2007.
[9] C. Ciressan, E. Sanchez, M. Rajman, and J.C. Chappelier, “An FPGAbased coprocessor for the parsing of contextfree grammars,” in Proceedings IEEE Symposium on
FieldProgrammable Custom Computing Machines (FCCM), 2000, p. 236.
[10] ——, “An FPGAbased syntactic parser for reallife almost unrestricted contextfree
grammars,” in Proceedings 11th International Conference on FieldProgrammable Logic
and Applications (FPL), 2001, pp. 590–594.
[11] A. Koulouris, N. Koziris, T. Andronikos, G. Papakonstantinou, and P. Tsanakas, “A
parallel parsing VLSI architecture for arbitrary context free grammars,” in Proceedings
International Conference on Parallel and Distributed Systems (ICPADS), 1998, p. 783.
177
[12] Y. H. Cho and W. H. MangioneSmith, “Highperformance contextfree parser for
polymorphic malware detection,” in Proceedings Advanced Networking and Communications Hardware Workshop, 2005.
[13] J. Moscola, Y. H. Cho, and J. W. Lockwood, “Reconfigurable contextfree grammar
based data processing hardware with error recovery,” in Proceedings 20th International
Parallel and Distributed Processing Symposium (IPDPS), 2006.
[14] J. Moscola, J. W. Lockwood, and Y. H. Cho, “Reconfigurable contentbased router using hardwareaccelerated language parser,” ACM Transactions on Design Automation
of Electronic Systems (TODAES), vol. 13, no. 2, pp. 1–25, 2008.
[15] Y. H. Cho, J. Moscola, and J. W. Lockwood, “Contextfreegrammar based token tagger in reconfigurable devices,” in Proceedings ACM/SIGDA 14th International symposium on Field programmable gate arrays (FPGA), 2006, pp. 237–237.
[16] Z. Li, G. Xia, H. Gao, Y. Tang, Y. Chen, B. Liu, J. Jiang, and Y. Lv, “NetShield:
Massive semanticsbased vulnerability signature matching for highspeed networks,”
in Proceedings SigCOMM, 2010.
[17] “Ethereal
OSPF
protocol
dissector
buffer
overflow
vulnerability.
http://www.idefense.com/intelligence/vulnerabilities/display.php?id=349.”
[18] “Snort TCP stream reassembly integer overflow exploit,
http://www.securiteam.com/exploits/5bp0o209ps.html.”
[19] “tcpdump
ISAKMP
packet
delete
http://xforce.iss.net/xforce/xfdb/15680.”
payload
buffer
[20] “Symantec multiple firewall NBNS response processing stack
http://research.eeye.com/html/advisories/published/ad20040512a.html.”
[21] C. Shannon and D. Moore,
“The spread
http://www.caida.org/research/security/witty/.”
of
the
witty
overflow.
overflow.
worm.
[22] A. Kumar, V. Paxson, and N. Weaver, “Exploiting underlying structure for detailed
reconstruction of an internetscale event,” in Proceedings ACM Internet Measurement
Conference (IMC), 2005.
[23] S. C. Johnson, “Yacc  yet another compilercompiler,” Bell Laboratories, Technical
Report 32, 1975.
[24] T. T. J. Parr and R. R. W. Quong, “Antlr: A predicatedll(k) parser generator,”
Software, Practice and Experience, vol. 25, 1995.
178
[25] “Darpa intrusion detection evaluation data set,” www.ll.mit.edu/mission/
communications/ist/corpora/ideval/data/1998data.html, 1998.
[26] “Harvestman,” http://code.google.com/p/harvestmancrawler/, 2013.
[27] “Tcmalloc,” http://googperftools.sourceforge.net/doc/tcmalloc.html, 2011.
[28] R. Smith, C. Estan, and S. Jha, “Xfa: Faster signature matching with extended automata,” in Proceedings IEEE Symposium on Security and Privacy, 2008, pp. 187–201.
[29] R. Smith, C. Estan, S. Jha, and S. Kong, “Deflating the big bang: fast and scalable
deep packet inspection with extended finite automata,” in Proceedings SIGCOMM,
2008, pp. 207–218.
[30] L. Yang, R. Karim, V. Ganapathy, and R. Smith, “Fast, memoryefficient regular
expression matching with NFAOBDDs,” Computer Networks, vol. 55, no. 55, pp.
3376–3393, 2011.
[31] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner, “Algorithms to accelerate multiple regular expressions matching for deep packet inspection,” in Proceedings
SIGCOMM, 2006, pp. 339–350.
[32] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz, “Fast and memoryefficient
regular expression matching for deep packet inspection,” in Proceedings ACM/IEEE
Symposium on Architecture for Networking and Communications Systems (ANCS),
2006, pp. 93–102.
[33] M. Becchi and P. Crowley, “A hybrid finite automaton for practical deep packet inspection,” in Proceedings of ACM CoNEXT. ACM, 2007.
[34] ——, “Extending finite automata to efficiently match perlcompatible regular expressions,” in Proceedings CoNEXT, 2008, pp. 1–12.
[35] S. Kumar, B. Chandrasekaran, J. Turner, and G. Varghese, “Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia,” in Proceedings
ACM/IEEE ANCS, 2007, pp. 155–164.
[36] M. Bando, N. Artan, and H. Chao, “Scalable lookahead regular expression detection
system for deep packet inspection,” Networking, IEEE/ACM Transactions on, vol. 20,
no. 3, pp. 699 –714, june 2012.
[37] J. Koˇrenek and V. Koˇsaˇr, “Nfa split architecture for fast regular expression matching,”
in Proceedings ANCS. New York, NY, USA: ACM, 2010, pp. 14:1–14:2. [Online].
Available: http://doi.acm.org/10.1145/1872007.1872024
179
[38] Y.H. E. Yang, W. Jiang, and V. K. Prasanna, “Compact architecture for
highthroughput regular expression matching on fpga,” in Proceedings ANCS, 2008,
pp. 30–39. [Online]. Available: http://doi.acm.org/10.1145/1477942.1477948
[39] C.H. Lin, C.T. Huang, C.P. Jiang, and S.C. Chang, “Optimization of regular
expression pattern matching circuits on fpga,” in Proceedings DATE, 2006, pp. 12–17.
[Online]. Available: http://dl.acm.org/citation.cfm?id=1131355.1131359
[40] I. Bonesana, M. Paolieri, and M. D. Santambrogio, “An adaptable fpgabased system
for regular expression matching,” in Proceedings DATE, 2008, pp. 1262–1267. [Online].
Available: http://doi.acm.org/10.1145/1403375.1403681
[41] A. Mitra, W. Najjar, and L. Bhuyan, “Compiling pcre to fpga for accelerating snort
ids,” in Proceedings ANCS. New York, NY, USA: ACM, 2007, pp. 127–136. [Online].
Available: http://doi.acm.org/10.1145/1323548.1323571
[42] C. R. Clark and D. E. Schimmel, “Efficient reconfigurable logic circuits for matching complex network intrusion detection patterns,” in Proceedings FieldProgrammable
Logic and Applications, 2003, pp. 956–959.
[43] R. Sidhu and V. K. Prasanna, “Fast regular expression matching using fpgas,” in
Proceedings IEEE Symposium on FieldProgrammable Custom Computing Machines
FCCM, 2001, pp. 227–238.
[44] C. Meiners, J. Patel, E. Norige, E. Torng, and A. Liu, “Fast regular expression matching using small tcams for network intrusion detection and prevention systems,” in
Proceedings 19th USENIX Security, 2010.
[45] K. Peng, S. Tang, M. Chen, and Q. Dong, “Chainbased dfa deflation for
fast and scalable regular expression matching using tcam,” in Proceedings ANCS.
Washington, DC, USA: IEEE Computer Society, 2011, pp. 24–35. [Online]. Available:
http://dx.doi.org/10.1109/ANCS.2011.13
[46] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory
of NPCompleteness. W. H. Freeman, 1979.
[47] M. Roesch, “Snort: Lightweight intrusion detection for networks,” in Proceedings 13th
Systems Administration Conference (LISA), USENIX Association, November 1999,
pp. 229–238.
[48] V. Paxson, “Bro: a system for detecting network intruders in realtime,”
Computer Networks, vol. 31, no. 2324, pp. 2435–2463, 1999. [Online]. Available:
citeseer.ist.psu.edu/paxson98bro.html
180
[49] M. Becchi and P. Crowley, “An improved algorithm to accelerate regular expression
evaluation,” in Proceedings ACM/IEEE ANCS, 2007.
[50] M. Becchi, M. Franklin, and P. Crowley, “A workload for evaluating deep packet
inspection architectures,” in Proceedings IEEE IISWC, 2008.
[51] “Us army itoc research cdx 2009 trace,” http://www.itoc.usma.edu/research/dataset/
index.html, 2009.
[52] “A
guide
to
search
engines
http://www.linleygroup.com/pdf/NMv4.pdf.
and
networking
memory,”
[53] D. E. Taylor, “Survey & taxonomy of packet classification techniques,” ACM Computing Surveys, vol. 37, no. 3, pp. 238–275, 2005.
[54] B. Agrawal and T. Sherwood, “Modeling TCAM power for next generation network
devices,” in Proceedings IEEE International Symposium on Performance Analysis of
Systems and Software, 2006, pp. 120– 129.
[55] C. R. Meiners, A. X. Liu, and E. Torng, “Bit weaving: A nonprefix approach to
compressing packet classifiers in TCAMs,” IEEE/ACM Transactions on Networking,
vol. 20, no. 2, pp. 488–500, 2012.
[56] C. Lambiri, “Senior staff architect IDT, private communication,” 2008.
[57] P. C. Lekkas, Network Processors  Architectures, Protocols, and Platforms. McGrawHill, 2003.
[58] A. X. Liu and M. G. Gouda, “Complete redundancy removal for packet classifiers in
tcams,” IEEE Transactions on Parallel and Distributed Systems (TPDS), in press.
[59] Q. Dong, S. Banerjee, J. Wang, D. Agrawal, and A. Shukla, “Packet classifiers in
ternary CAMs can be smaller,” in Proceedings ACM Sigmetrics, 2006, pp. 311–322.
[60] A. X. Liu, C. R. Meiners, and E. Torng, “TCAM razor: A systematic approach towards
minimizing packet classifiers in TCAMs,” IEEE/ACM Transactions on Networking,
vol. 18, no. 2, pp. 490–500, April 2010.
[61] A. X. Liu and M. G. Gouda, “Complete redundancy detection in firewalls,” in
Proceedings 19th Annual IFIP Conference on Data and Applications Security, LNCS
3654, August 2005, pp. 196–209. [Online]. Available: http://www.cs.utexas.edu/
users/alex/publications/Redundancy/redundancy.pdf
[62] A. X. Liu, C. R. Meiners, and Y. Zhou, “Allmatch based complete redundancy removal
for packet classifiers in TCAMs,” in Proceedings 27th Annual IEEE Conference on
Computer Communications (Infocom), April 2008.
181
[63] H. Acharya and M. Gouda, “Firewall verification and redundancy checking are equivalent,” in INFOCOM, 2011 Proceedings IEEE. IEEE, 2011, pp. 2123–2128.
[64] R. Draves, C. King, S. Venkatachary, and B. Zill, “Constructing optimal IP routing
tables,” in Proceedings IEEE INFOCOM, 1999, pp. 88–97.
[65] S. Suri, T. Sandholm, and P. Warkhede, “Compressing twodimensional routing tables,” Algorithmica, vol. 35, pp. 287–300, 2003.
[66] D. A. Applegate, G. Calinescu, D. S. Johnson, H. Karloff, K. Ligett, and J. Wang,
“Compressing rectilinear pictures and minimizing access control lists,” in Proceedings
ACMSIAM Symposium on Discrete Algorithms (SODA), January 2007.
[67] R. McGeer and P. Yalagandula, “Minimizing rulesets for tcam implementation,” in
Proceedings IEEE Infocom, 2009.
[68] K. Kogan, S. Nikolenko, W. Culhane, P. Eugster, and E. Ruan, “Towards efficient
implementation of packet classifiers in sdn/openflow,” in Proceedings of the second
ACM SIGCOMM workshop on Hot topics in software defined networking. ACM,
2013, pp. 153–154.
[69] O. Rottenstreich, I. Keslassy, A. Hassidim, H. Kaplan, and E. Porat, “Optimal in/out
tcam encodings of ranges,” 2014.
[70] T. Mishra and S. Sahni, “Petcam–a power efficient tcam architecture for forwarding
tables,” Computers, IEEE Transactions on, vol. 61, no. 1, pp. 3–17, 2012.
[71] O. Rottenstreich, R. Cohen, D. Raz, and I. Keslassy, “Exact worstcase tcam rule
expansion,” IEEE Transactions on Computers, 2012.
[72] E. Spitznagel, D. Taylor, and J. Turner, “Packet classification using extended TCAMs,”
in Proceedings 11th IEEE International Conference on Network Protocols (ICNP),
November 2003, pp. 120– 131.
[73] R. Wei, Y. Xu, and H. Chao, “Block permutations in boolean space to minimize tcam
for packet classification,” in INFOCOM, 2012 Proceedings IEEE, march 2012, pp. 2561
–2565.
[74] Y. Chang, C. Lee, and C. Su, “Multifield range encoding for packet classification in
tcam,” in INFOCOM, 2011 Proceedings IEEE. IEEE, 2011, pp. 196–200.
[75] A. BremlerBarr, D. Hay, D. Hendler, and B. Farber, “Layered interval codes for
TCAM based classification,” in Proceedings of IEEE Infocom, 2009.
182
[76] A. BremlerBarr and D. Hendler, “Spaceefficient TCAMbased classification using
gray coding,” in Proceedings 26th Annual IEEE Conference on Computer Communications (Infocom), May 2007.
[77] K. Kogan, S. Nikolenko, O. Rottenstreich, W. Culhane, and P. Eugster, “Saxpac (scalable and expressive packet classification),” in Proceedings of the 2014 ACM conference
on SIGCOMM. ACM, 2014, pp. 15–26.
[78] K. Kogan, S. Nikolenko, P. Eugster, and E. Ruan, “Strategies for mitigating tcam space
bottlenecks,” in HighPerformance Interconnects (HOTI), 2014 IEEE 22nd Annual
Symposium on. IEEE, 2014, pp. 25–32.
[79] O. Rottenstreich, M. Radan, Y. Cassuto, I. Keslassy, C. Arad, T. Mizrahi, Y. Revah,
and A. Hassidim, “Compressing forwarding tables,” in INFOCOM, 2013 Proceedings
IEEE. IEEE, 2013, pp. 1231–1239.
[80] D. Maier, “The complexity of some problems on subsequences and supersequences,”
J. ACM, vol. 25, no. 2, pp. 322–336, Apr. 1978. [Online]. Available: http:
//doi.acm.org/10.1145/322063.322075
[81] M. G. Gouda and A. X. Liu, “Firewall design: consistency, completeness and
compactness,” in Proceedings 24th IEEE International Conference on Distributed
Computing Systems (ICDCS04), March 2004, pp. 320–327. [Online]. Available:
http://www.cs.utexas.edu/users/alex/publications/fdd.pdf
[82] A. Z. Broder and M. Mitzenmacher, “Survey: Network applications of bloom filters:
A survey,” Internet Mathematics, vol. 1, no. 4, 2003.
[83] J. Xu, “Tutorial on network data streaming,” ACM Sigmetrics, 2008.
[84] D. Knuth, The Art of Computer Programming 3: Sorting and Searching.
Wesley, 1968.
Addison
[85] S. Goldberg and J. Rexford, “Security vulnerabilities and solutions for packet sampling,” in Sarnoff Symposium, 2007 IEEE. IEEE, 2008, pp. 1–7.
[86] J. L. Carter and M. N. Wegman, “Universal classes of hash functions,”
Journal of Computer and System Sciences, vol. 18, no. 2, pp. 143 –
154, 1979. [Online]. Available:
http://www.sciencedirect.com/science/article/
B6WJ04B55K9JD/2/036439eff8b0d54d7974c2d5d6997669
[87] MIPS Technologies, Inc., “Mips32 4ke family,” Nov 2010, http://www.mips.com/
products/cores/3264bitcores/mips324ke/.
183
[88] A. Satoh and T. Inoue, “Asichardwarefocused comparison for hash functions md5,
ripemd160, and shs,” Integration, the VLSI Journal, vol. 40, no. 1, pp. 3–10, 2007.
[89] P. K. Pearson, “Fast hashing of variablelength text strings,” Commun. ACM, vol. 33,
pp. 677–680, June 1990. [Online]. Available: http://doi.acm.org/10.1145/78973.78978
[90] R. Uzgalis and M. Tong, “Hashing myths,” Technical Report 97, Department of Computer Science University of Auckland, July 1994.
[91] D. Newman and T. Player, “Hash and Stuffing: Overlooked Factors in Network
Device Benchmarking,” RFC 4814 (Informational), Internet Engineering Task Force,
Mar. 2007. [Online]. Available: http://www.ietf.org/rfc/rfc4814.txt
[92] R. Jenkins, “The hash,” Nov 2010, http://burtleburtle.net/bob/hash/doobs.html.
[93] A. Appleby, “Murmurhash 2.0,” Nov 2010, http://sites.google.com/site/murmurhash/.
[94] C. Henke, C. Schmoll, and T. Zseby, “Empirical evaluation of hash functions for multipoint measurements,” ACM SIGCOMM Computer Communication Review, vol. 38,
no. 3, pp. 39–50, 2008.
[95] A. F. Webster and S. E. Tavares, “On the design of sboxes,” in CRYPTO ’85: Advances in Cryptology. London, UK: SpringerVerlag, 1986, pp. 523–534.
[96] M. Molina, S. Niccolini, and N. Duffield, “A comparative experimental study of hash
functions applied to packet sampling,” in International Teletraffic Congress (ITC19),
Beijing. Citeseer, 2005.
[97] Artisan Components, “Tsmc 0.18mm process 1.8volt sagextm standard cell
library databook,” Nov 2010, http://www.ece.virginia.edu/∼mrs8n/cadence/
SynthesisTutorials/tsmc18.pdf.
[98] P. H. kc claffy, Dan Andersen, “The caida anonymized 2010 internet traces  mar 25,
2010,” http://www.caida.org/data/passive/passive 2010 dataset.xml.
[99] P. Hsieh, “Hash functions,” Nov 2010, http://www.azillionmonkeys.com/qed/hash.
html.
[100] R. Rivest, “The MD5 MessageDigest Algorithm,” RFC 1321 (Informational), Internet
Engineering Task Force, Apr. 1992, updated by RFC 6151. [Online]. Available:
http://www.ietf.org/rfc/rfc1321.txt
[101] B. Mulvey, “Hash functions,” Nov 2010, http://bretm.home.comcast.net/∼bretm/
hash/.
184
[102] J. Castro, J. Sierra, A. Seznec, A. Izquierdo, and A. Ribagorda, “The strict avalanche
criterion randomness test,” Mathematics and Computers in Simulation, vol. 68, no. 1,
pp. 1–7, 2005.
185