ALGORITHMS FOR DEEP PACKET INSPECTION

By

Jignesh D. Patel

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2012

ABSTRACT

ALGORITHMS FOR DEEP PACKET INSPECTION

By

Jignesh D. Patel

The core operation in network intrusion detection and prevention systems is Deep Packet Inspection (DPI), in which each security threat is represented as a signature, and the payload of each data packet is matched against the set of current security threat signatures. DPI is also used for other networking applications such as advanced QoS mechanisms and protocol identification. In the past, attack signatures were specified as strings. Today most DPI systems use Regular Expressions (REs) to represent signatures. RE matching for networking applications is difficult for several reasons. First, the DPI application is usually implemented in network devices, which have limited computing resources. Second, as new threats are discovered, the size of the signature set grows over time. Last, the matching needs to be done at network speeds, the growth of which outpaces improvements in computing speed; so there is a need for novel solutions that can deliver higher throughput. As a result, RE matching for DPI is a very important and active research area. We study existing methods proposed for RE matching, identify their limitations, and propose new methods to overcome these limitations.

RE matching remains a fundamentally challenging problem due to the difficulty of compactly encoding Deterministic Finite state Automata (DFA). While the DFA for any one RE is typically small, the DFA that corresponds to the entire set of REs is usually too large to be constructed or deployed. To address this issue, many alternative automata implementations that compress the size of the final automaton have been proposed. We improve upon previous research in three ways. First, we propose a more efficient "Minimize then Union" framework for constructing compact alternative automata, which minimizes smaller automata before combining them. Previously proposed automata construction algorithms employ a "Union then Minimize" framework in which the automata for each RE are joined before minimization occurs. This leads to expensive minimization on a large automaton and a large intermediate memory footprint. Our minimize-then-union approach requires much less time and memory, allowing us to handle a much larger RE set. Second, we propose the first hardware-based RE matching approach that uses Ternary Content Addressable Memory (TCAM). Prior hardware-based RE matching algorithms typically use FPGA. The main drawback of FPGA is that resynthesizing and updating FPGA circuitry to handle RE updates is slow and difficult. In contrast, TCAM supports easy RE updates, and we show that we can achieve very high throughput. Furthermore, TCAMs are widely used in modern networking devices for tasks such as packet classification, so no major architectural modifications are needed to implement our approach in existing networking devices. Finally, we propose new overlay automata models that effectively address the replication of DFA states that occurs when multiple REs are combined. The idea is to group the replicated DFA structures together instead of repeating them multiple times. The result is a final automaton whose size is close to that of an NFA (which is linear in the size of the RE set), while simultaneously achieving the fast, deterministic matching speed of a DFA.
ACKNOWLEDGMENTS

I would like to take this opportunity to thank all the people who have helped me during my graduate career and made this dissertation possible. First and foremost, I would like to thank my advisor, Dr. Eric Torng, for his constant guidance, support and encouragement. I would like to express my earnest gratitude to my thesis committee members, Dr. Richard Enbody, Dr. Alex Liu and Dr. Peter Magyar, for being there for me whenever I needed them. I would also like to thank the staff of the CSE department for all their help and support. Finally, I would like to thank my friends and family for all their support and encouragement.

TABLE OF CONTENTS

List of Tables
List of Figures

Chapter 1  Introduction
    1.1 Problem Statement
    1.2 Research Problems
    1.3 Research Goals

Chapter 2  Related Work

Chapter 3  Background
    3.1 DFA for RE Matching
    3.2 Understanding DFA Space Explosion
        3.2.1 Transition Sharing
        3.2.2 State Replication
    3.3 The D²FA
        3.3.1 D²FA Definition
        3.3.2 Original D²FA Algorithm
        3.3.3 Limiting Deferment Depth in Original D²FA Algorithm
        3.3.4 Backpointer D²FA Algorithm
    3.4 Classifiers
        3.4.1 Classifier Definition
            3.4.1.1 Prefix Classifier
            3.4.1.2 Ternary Classifier
            3.4.1.3 Weighted Classifier
        3.4.2 Classifier Minimization
    3.5 TCAM Introduction

Chapter 4  Software Implementation
    4.1 Introduction/Motivation
        4.1.1 Solution Goals
        4.1.2 Summary and Limitations of Prior Art
        4.1.3 Summary of Our Approach
            4.1.3.1 Advantages of Our Algorithm
    4.2 Minimum State PMDFA Construction
    4.3 Efficient D²FA Construction
        4.3.1 Improved D²FA Construction for One RE
        4.3.2 D²FA Merge Algorithm
        4.3.3 Direct D²FA Construction for an RE Set
        4.3.4 Optional Final Compression Algorithm
    4.4 D²FA Merge Algorithm Properties
        4.4.1 Proof of Correctness
        4.4.2 Limiting Deferment Depth
        4.4.3 Deferment to a Lower Level
        4.4.4 Algorithmic Complexity
    4.5 Experimental Results
        4.5.1 Methodology
            4.5.1.1 Data Sets
            4.5.1.2 Metrics
            4.5.1.3 Measuring Space
            4.5.1.4 Correctness
        4.5.2 D²FAMERGE versus ORIGINAL
        4.5.3 Assessment of Final Compression Algorithm
        4.5.4 D²FAMERGE versus ORIGINAL with Bounded Maximum Deferment Depth
        4.5.5 D²FAMERGE versus BACKPTR
        4.5.6 Scalability Results

Chapter 5  TCAM Implementation
    5.1 Introduction/Motivation
        5.1.1 TCAM Architecture for RE Matching
        5.1.2 Reducing TCAM Size
            5.1.2.1 Transition Sharing
            5.1.2.2 Table Consolidation
        5.1.3 Increasing Matching Throughput
        5.1.4 Comparison of Transition Sharing with D²FA
    5.2 Transition Sharing
        5.2.1 Character Bundling
        5.2.2 Shadow Encoding
            5.2.2.1 Observations
            5.2.2.2 Determining Table Order
            5.2.2.3 Shadow Encoding Algorithm
            5.2.2.4 Choosing Transitions
    5.3 Table Consolidation
        5.3.1 Observations
        5.3.2 Computing a k-decision Table
        5.3.3 Choosing States to Consolidate
            5.3.3.1 Greedy Matching
        5.3.4 Effectiveness of Table Consolidation
    5.4 Variable Striding
        5.4.1 Observations
        5.4.2 Eliminating State Explosion
        5.4.3 Controlling Transition Explosion
            5.4.3.1 Self-Loop Unrolling Algorithm
            5.4.3.2 k-var-stride Transition Sharing Algorithm
        5.4.4 Variable Striding Selection Algorithm
    5.5 Implementation and Modeling
    5.6 Experimental Results
        5.6.1 Methodology
        5.6.2 Results on 1-stride DFAs
        5.6.3 Results on 7-var-stride DFAs

Chapter 6  Overlay Automata
    6.1 Introduction
        6.1.1 Limitations of Prior Automata Models
        6.1.2 Summary of Overlay Automata Approach
            6.1.2.1 Overlay DFA
            6.1.2.2 Overlay D²FA
            6.1.2.3 Building OD²FA
            6.1.2.4 Implementing OD²FA
    6.2 Overlay DFA
    6.3 Overlay D²FA
        6.3.1 OD²FA Multiplicative Compression
        6.3.2 Effectiveness of OD²FA on Ideal RE Set
    6.4 OD²FA Construction
        6.4.1 OD²FA Construction from One RE
        6.4.2 OD²FA Construction from 2 OD²FAs
        6.4.3 Direct OD²FA Construction from 2 OD²FAs
    6.5 Building Super-state Transitions
        6.5.1 Combining State Transitions
            6.5.1.1 Computing State Transitions
        6.5.2 Creating Overlay Classifier
        6.5.3 Minimizing Overlay Classifier
            6.5.3.1 Pre-merging Bits
            6.5.3.2 Bit Merging Algorithm
        6.5.4 Overlay Discussion
            6.5.4.1 Restricting Overlay Count to Power of 2
            6.5.4.2 Eliminating Overlay Bits
    6.6 OD²FA Software Implementation
        6.6.1 Implementing OD²FA
        6.6.2 Overlay Classifier Storage and Lookup
        6.6.3 Space Requirement
    6.7 OD²FA Implementation in TCAM
        6.7.1 Generating Super-state IDs and Codes
        6.7.2 Implementing Super-state Transitions
        6.7.3 TCAM Table Generation
        6.7.4 Variable Striding
            6.7.4.1 Self-loop Unrolling
            6.7.4.2 Full Variable Striding
    6.8 Experimental Results
        6.8.1 Effectiveness of OverlayCAM
        6.8.2 Results on 7-var-stride
            6.8.2.1 Self-loop Unrolling
            6.8.2.2 Full Variable Striding
        6.8.3 Scalability of OverlayCAM

Chapter 7  Conclusion

Appendix
    Glossary
    Acronyms
    Notation

Bibliography

LIST OF TABLES

Table 4.1  Performance data of ORIGINAL and D²FAMERGE
Table 4.2  Comparing D²FAMERGE and D²FAMERGEOPT with ORIGINAL
Table 4.3  Performance data of D²FAMERGEOPT
Table 4.4  The D²FA size and D²FA average deferment depth ψ for ORIGINAL and D²FAMERGE on our eight primary RE sets, given maximum deferment depth bounds of 1, 2 and 4
Table 4.5  Comparing D²FAMERGE with ORIGINAL given maximum deferment depth bounds of 1, 2 and 4
Table 4.6  Performance data for both variants of BACKPTR and D²FAMERGE with the back-pointer property
Table 4.7  Comparing D²FAMERGE with both variants of BACKPTR
Table 5.1  TCAM size and latency
Table 5.2  TCAM size and throughput for 1-stride DFAs
Table 6.1  Experimental results of OverlayCAM on 8 RE sets in comparison with RegCAM-TC and RegCAM+TC
Table 6.2  Number of TCAM rules for RegCAM-TC and OverlayCAM for 1-stride, with self-loop unrolling and with 7-var-stride
Table 6.3  Average stride values for self-loop unrolling and 7-var-stride for RegCAM-TC and OverlayCAM for pM = 0, 50 and 95

LIST OF FIGURES

Figure 3.1  Example of DFA and state replication
Figure 3.2  D²FA example
Figure 4.1  Edge weight distribution in a typical SRG
Figure 4.2  Example showing a D²FA with non-self-looping root states
Figure 4.3  D²FA merge example
Figure 4.4  Algorithm D2FAMerge(D1, D2) for merging two D²FAs
Figure 4.5  Memory and time required to build the D²FA versus the number of Scale REs used, for ORIGINAL's D²FA and D²FAMERGE's D²FA
Figure 5.1  A DFA with its TCAM table
Figure 5.2  TCAM table with shadow encoding
Figure 5.3  D²FA, SRG, and deferment tree of the DFA in Figure 5.1
Figure 5.4  Shadow encoding example
Figure 5.5  Shadow encoding algorithm
Figure 5.6  3-decision table for 3 states in Figure 5.1
Figure 5.7  Consolidating two trees
Figure 5.8  Algorithm for consolidating trees
Figure 5.9  D²FA for RE set {/abc/, /abd/, /e.*f/}
Figure 5.10  3-var-stride transition table for s0
Figure 5.11  States s1 and s2 share transition aa
Figure 5.12  Uncompressed 2-var-stride transition tables for the D²FA in Figure 5.3(a) (a = 97, o = 111)
Figure 5.13  TCAM entries per DFA state (a) and compute time per DFA state (b) for Scale 26 through Scale 34
Figure 5.14  Consolidation times for Scale 26 through Scale 34 for the optimal and greedy consolidation algorithms
Figure 5.15  The throughput and average stride length of RE sets
Figure 6.1  Relationship of automata models
Figure 6.2  Example of DFA, state replication and overlay DFA
Figure 6.3  OD²FA example
Figure 6.4  OD²FA construction from one RE
Figure 6.5  D²FA and OD²FA for RE /cd[^n]*pr/
Figure 6.6  Merged OD²FA construction example
Figure 6.7  Algorithm OD2FAMerge(D1, D2) for merging two OD²FAs
Figure 6.8  Algorithm DirectOD2FAMerge(D1, D2) for merging two OD²FAs
Figure 6.9  Overlay classifier and corresponding super-state transitions for the super-states in the OD²FA in Figure 6.6(c)
Figure 6.10  Algorithm CreateOverlayClassifier(Dec, Reqd)
Figure 6.11  Minimizing overlay classifier example
Figure 6.12  Algorithm MinimizeOverlayClassifier(C)
Figure 6.13  Overlay padding example
Figure 6.14  TCAM rules for RegCAM and OD²FA
Figure 6.15  Root super-state self-loop unrolling example for the TCAM rules in Figure 6.14
Figure 6.16  Variable stride transitions generated for super-state 0 from the 1-stride transitions in Figure 6.9
Figure 6.17  Algorithm BuildVarStrideOD2FA(D) to build k-var-stride rules
Figure 6.18  (a) TEF vs. number of NFA states for OverlayCAM and RegCAM; (b) SEF vs. number of NFA states for OverlayCAM
Chapter 1

Introduction

1.1 Problem Statement

Deep Packet Inspection (DPI) is the core component of many networking devices on the Internet, such as Network Intrusion Detection (or Prevention) Systems (NIDS/NIPS), firewalls, and layer 7 switches. In DPI, in addition to examining the packet headers, the entire contents of each packet are compared against a set of signatures to check whether any signature is found in the packet. For instance, for security applications, each individual virus or attack threat is represented using one signature.
The payload of each packet passing through the network device is compared against the set of signatures, and a match indicates that the corresponding threat is found. Necessary action to neutralize the threat can then be taken. Application level signature analysis is also used for providing advanced QoS mechanisms, detecting peer-to-peer traffic, and for general application protocol identification.

In the past, DPI typically used string matching as the core operation, in which signatures are specified as simple strings. Today, DPI typically uses Regular Expression (RE) matching as the core operation, in which signatures are specified as REs. REs are used instead of simple string patterns because REs are fundamentally more expressive and thus are able to describe a wider variety of attack signatures [43]. Most open source and commercial intrusion detection and prevention systems such as Snort [2,39], Bro [37], HP TippingPoint and Cisco networking appliances use RE matching. Likewise, some operating systems such as Cisco IOS and Linux [1] have built RE matching into their layer 7 filtering functions. So the problem we are trying to solve is as follows: given a set of REs, R, and an input stream, we want to quickly find all occurrences of each RE from R in the input stream.

1.2 Research Problems

There are several challenges in implementing RE matching parsers for network applications. First, for many DPI applications, the signature set size grows rapidly over time. For example, for security applications, new attack threats are regularly discovered, so the signature set keeps growing. The current release of the Snort rules has close to 2000 REs in it. So the DPI engine should be able to handle a large RE set, and it also needs to be scalable. Second, since each packet needs to be scanned in real time as it is processed, the DPI engine needs to be able to process packets at a fast and deterministic rate. As network speed increases, this becomes an increasingly difficult and important problem to solve. Finally, the DPI engine is typically implemented in a network device, like a router, which usually has limited memory and processing power. So the DPI engine needs to achieve high throughput using limited hardware resources.

As both traffic rates and signature set sizes are rapidly growing over time, fast and scalable RE matching is now a core network security issue. As a result, there has been a lot of recent work on implementing high speed RE parsers for network applications. The straightforward approach to performing RE matching is to convert the RE set into an equivalent automaton and use the packet payloads as input strings for the automaton. Two standard choices are Deterministic Finite state Automata (DFA) and Nondeterministic Finite state Automata (NFA). The DFA has the advantage of maintaining only a single active state at any time. Thus processing each input character requires only a single lookup, so the throughput achieved is fast and deterministic. However, DFAs experience state explosion, where the number of states in the DFA can be exponential in the number of REs. Thus, DFAs can require too much memory to store. The NFA has the advantage of small size: the number of states in the NFA is typically linear in the number of REs, hence requiring little memory. However, the NFA has no limit on the number of active states, which means that the number of lookups needed to process each input character is high and unpredictable. So NFAs cannot achieve high and deterministic throughput.
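To make the lookup-cost contrast concrete, the following minimal Python sketch (ours, not from the dissertation; the table layouts are assumed) shows the per-character work in each model:

```python
# Per-character work: a DFA does exactly one lookup; an NFA does one lookup
# per active state, and the size of the active set is input dependent.

def dfa_step(delta, state, ch):
    # delta: dict mapping (state, ch) -> the unique next state.
    return delta[(state, ch)]

def nfa_step(moves, active, ch):
    # moves: dict mapping (state, ch) -> set of possible next states.
    nxt = set()
    for s in active:                       # work grows with |active|
        nxt |= moves.get((s, ch), set())
    return nxt
```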
1.3 Research Goals

As high and deterministic throughput is the primary requirement on networking devices, high speed RE matching is typically based on the DFA. But the high memory requirement of DFAs limits the number of REs in the ruleset that can be parsed simultaneously. In this thesis, we propose algorithmic solutions to implement RE matching based on the DFA that simultaneously achieve high throughput and a low memory requirement. Storing a DFA requires a large amount of memory because (1) the number of states grows exponentially with the number of REs, and (2) more states imply more transitions need to be stored, since each state needs to store 256 = 2^8 transitions.

The first research goal was to develop efficient algorithms that reduce the number of DFA transitions that need to be stored. The Delayed Input DFA (D²FA) proposed by Kumar et al. [26] reduces the number of stored transitions by exploiting redundancy among the transitions. This and other previous techniques employ a "union then minimize" framework, in which they first build a large automaton corresponding to all the REs in the ruleset, and then perform an expensive minimization on the large automaton. We develop algorithms that use a "minimize then union" framework to build the D²FA. In this approach we first minimize the automata corresponding to each individual RE in the ruleset, which is an inexpensive step because the automata are very small. We then use a fast algorithm to union the minimized automata together in such a way that the minimization is not lost. The D²FA can be used for a software implementation of a DPI engine. The compressed transition table is stored in RAM, and the processor does a RAM lookup for each transition of the automaton. The drawback of implementing a D²FA in software is that the throughput is reduced (we explain this in Section 3.3.3).

The second research goal was to find an efficient implementation of RE matching in networking device hardware. To this end, we develop techniques to implement the D²FA for RE matching using Ternary Content Addressable Memory (TCAM). TCAMs are already widely used in networking devices for header based packet forwarding, so our techniques can be implemented on current TCAM hardware without requiring major modifications. We also develop techniques to increase throughput by processing more than one input character in each cycle.

While the D²FA is much smaller than a DFA, its memory requirement is still proportional to the number of DFA states, which grows exponentially with the number of REs. The ultimate goal for RE matching is to develop an automata model that achieves throughput close to that of a DFA but only requires space close to that of an NFA. Our final research goal was to develop such an automata model. For this, we have developed two new automata models, the Overlay Deterministic Finite state Automaton (ODFA) and the Overlay Delayed Input DFA (OD²FA), as well as algorithms to implement OD²FA automata in both software and hardware. Our hardware OD²FA implementation achieves the speed of a DFA and the memory requirement of an NFA for many RE sets.

The rest of this dissertation is organized as follows. In Chapter 2 we discuss related problems and research. Background about the DFA, D²FA and TCAM is presented in Chapter 3. Our research related to the D²FA and to implementing RE matching in TCAM is presented in Chapters 4 and 5, respectively. Chapter 6 presents our research on the OD²FA automata model and its implementation. Finally, Chapter 7 ends the dissertation with concluding remarks.
Chapter 2

Related Work

In the past, DPI typically used string matching (often called pattern matching) as a core operator; string matching solutions have been extensively studied [4,5,44,46,48,49,52]. Several TCAM-based solutions have been proposed for string matching [5,12,46,52], but they do not generalize to RE matching because they only deal with independent strings and do not use DFAs. Sommer and Paxson [43] first proposed using REs instead of strings to specify attack signatures. Today most DPI engines use RE matching as a core operator because strings are not adequate to precisely describe attack signatures.

There are two main approaches in previous work to developing RE matching solutions. One is to start with a DFA and compress it. The second is to start with an NFA and develop methods for coping with multiple active states.

We first review DFA compression work. Much work has been done in reducing the number of transitions stored per DFA state, such as the D²FA [6,8,17,26,27]. These techniques exploit transition redundancy between states to compress the size of the DFA. We present a novel "minimize then union" approach of building the D²FA incrementally. Our approach can build much larger D²FAs in a fraction of the time compared to the previous solutions. This work is presented in [36]. Recently and independently, Liu et al. proposed to construct DFAs by hierarchical merging [29]. That is, they essentially propose the "minimize then union" framework for DFA construction. They consider merging multiple DFAs at a time rather than just two. However, they do not consider the D²FA, and they do not prove any properties about their merge algorithm, including that it results in minimum state DFAs.

Another approach to reducing the number of transitions stored per DFA state is alphabet encoding. In this approach the input characters are mapped to a new alphabet such that input characters which are always treated identically in the DFA are combined into one new character, thus reducing the size of the alphabet [8,9,13,22]. This work is orthogonal to our techniques, and the two can be used together to improve the results.

In [32] we present our current RE matching solution using TCAMs. Here we exploit both inter-state and intra-state transition redundancy to minimize the number of transitions stored per DFA state. There has been work to increase throughput by creating multi-stride DFAs and NFAs that scan multiple characters per transition [9,13]. This work primarily applies to FPGA NFA implementations, since multiple character SRAM based DFAs have only been evaluated for a small number of REs. The ability to increase stride has been limited by the constraint that all transitions must be increased in stride; this leads to excessive memory explosion for strides larger than 2. In [32] we present the technique of variable striding, in which we increase stride selectively on a state by state basis while carefully controlling the increase in required space. Alicherry et al. have explored variable striding for TCAM-based string matching solutions [5], but not for DFAs that apply to arbitrary RE sets. Our techniques in [32] achieve very high transition compression, requiring close to just 1 transition per state. However, that might still not be practical if the number of states grows exponentially with the number of REs.
Some work has attempted to address the state explosion that occurs due to extensive state replication. One approach is to simply partition the REs into groups, building an automaton for each group [7,42,51]. With this approach, at run time, each automaton must process all packet payloads; that is, similar to an NFA, multiple active states must be maintained. The one advantage this approach has compared to an NFA is that the number of active states at any given time is known in advance, so a system can be designed to accommodate the increased bandwidth requirements for processing packet payloads. This approach is usually combined with any of the RE matching techniques when all REs cannot be compiled into a single automaton. Our goal is to conquer state explosion so that such partitioning is not needed. If we cannot fully achieve our goal, our work should at least reduce the number of partitions required. In particular, because our techniques achieve greater compression of DFAs than previous software-based techniques, less partitioning of REs will be required.

A second approach is to use "scratch memory" to manage state replication and avoid state explosion [10,25,41]. However, there are several issues with this approach. First, the size of the required scratch memory may itself be significant. Second, the processing required to update the scratch memory after each transition may be significant. Finally, many of these approaches are not fully automated. For example, as Yang et al. write in [50] about XFA, "... prior work on improved signature representations has required manual analysis of REs (e.g., to identify and eliminate ambiguity [41]) ...".

Liu et al. developed a new method for RE matching that was the first to introduce relative state addressing through the use of offset transitions [28]. In their work, they significantly reduce the number of stored transitions by exploiting state replication and transition sharing without using TCAM. However, they do require the use of bitmaps for each DFA state, which means they still require at least one bit per DFA state, so they ultimately do not address the state explosion problem. The current best approach for coping with state explosion is that of Peng et al. [38], though they do not offer an automata model. We propose a new automata model, the ODFA, which facilitates reasoning about state replication and provides a systematic way of handling it. Some preliminary results indicate that our technique requires significantly fewer TCAM entries than the technique in [38].

Much of the NFA work has exploited the parallel processing capabilities of FPGA technology to cope with the multiple active states that arise from NFAs [7,9,14,15,33,34,40,45]. However, it is not clear that FPGAs can cope with the large number of active states required when processing large signature sets. Furthermore, FPGAs cannot be quickly reconfigured when the RE sets change, and they have relatively slow clock speeds. Also, FPGAs are not commonly embedded in network processors as TCAMs are. One recent work in this direction is that of Yang et al. [50], where they use ordered binary decision diagrams to facilitate updating a set of active states in one operation. This is an intriguing idea that merits further study and comparison with DFA compression approaches.

Chapter 3

Background

In this chapter, we discuss the background material for the research presented in the later chapters.
3.1 DFA for RE Matching

Most RE parsers use some variant of the Deterministic Finite state Automata (DFA) representation of REs. Any set of REs can be converted into an equivalent DFA with the minimum number of states [19,20]. Traditionally, a DFA is defined as a 5-tuple D = (Q, Σ, q0, A, δ), where Q is the set of states, Σ is the alphabet, q0 ∈ Q is the start state, A ⊆ Q is the set of accepting states, and δ : Q × Σ → Q is the transition function.

DFAs have the property of needing a constant number of memory accesses per input symbol, and hence result in predictable and fast throughput. The main problem with DFAs is space explosion: a huge amount of memory is needed to store the transition function δ, which has |Q| × |Σ| entries. Specifically, the number of states can be very large (state explosion), and the number of transitions per state is large (|Σ| = 256). A straightforward approach to implementing DFAs is to store the transition function δ in a two dimensional (|Q| by |Σ|) array. However, |Q| is very large (typically ten thousand or larger) and |Σ| = 2^(8k), where k ≥ 1, for k-stride DFAs that process k 8-bit characters per transition. Thus, although a |Q| by |Σ| array is fast in theory, it is not in practice, because it consumes so much memory (hundreds of megabytes) that it has to be stored in DRAM instead of SRAM, and DRAM is an order of magnitude slower than SRAM.

In a standard DFA, each state is only marked as either accepting or non-accepting. Given the set of REs R, reaching an accepting state only tells us that some RE in R matched, but does not tell us specifically which RE in R matched. However, in DPI applications we must keep track of which REs in R have been matched. For example, each RE may correspond to a unique security threat that requires its own processing routine. This leads us to define the Pattern Matching Deterministic Finite State Automaton (PMDFA). The key difference between a PMDFA and a DFA is that for each state q in a PMDFA, we cannot simply mark it as accepting or rejecting; instead, we must record which REs from R are matched when we reach q.

Definition 1 (Pattern Matching DFA (PMDFA)). Given as input a set of REs R, a PMDFA is a 5-tuple (Q, Σ, q0, M, δ), where the term M is defined as M : Q → 2^R. For each state q in the DFA, M gives the set of REs from R that are matched when we reach q. All the other terms are defined in the same way as in a DFA.

In a PMDFA, there can be many pairs of states that are equivalent except for the sets of REs accepted by the two states. In a DFA, such a pair of states would be merged since they would be completely equivalent. Because of this, the resulting minimum state PMDFA is typically larger than the minimum state DFA. Since we always use a PMDFA, in the rest of this dissertation we use the term DFA to mean a PMDFA.
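The following sketch (our own illustrative layout, not code from the dissertation) shows a PMDFA stored as a |Q| × |Σ| table together with the match function M, and the one-lookup-per-byte scan loop that gives the DFA its deterministic throughput:

```python
SIGMA = 256  # byte alphabet

class PMDFA:
    """Illustrative PMDFA: delta is a |Q| x |Sigma| next-state table and
    match[q] is M(q), the set of RE IDs matched on reaching state q."""

    def __init__(self, num_states, start=0):
        self.delta = [[0] * SIGMA for _ in range(num_states)]
        self.match = [set() for _ in range(num_states)]
        self.start = start

    def scan(self, payload):
        # Exactly one table lookup per input byte; report every RE match.
        s = self.start
        for pos, b in enumerate(payload):
            s = self.delta[s][b]
            for re_id in self.match[s]:
                yield (pos, re_id)
```

The |Q| × 256 table also makes the space problem visible: memory grows linearly with |Q|, and |Q| itself can grow exponentially with the number of REs.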
The source states for these transitions are denoted as \From [x..y]" which represents the set 14 From [0..4] 3/1 From [1..4] a fail 0 a c b 1 2 d 4/2 (a) DFA for RE set f/abc/, /abd/g. From [0..4] From [1..4] a fail d a b 1 c 8/1 9/2 From [6..10] From [1..4] a e a b 6 fail From [5..10] f 10/3 4/2 2 e 5 3/1 d 0 c 7 f From [6..10] (b) DFA for RE set f/abc/, /abd/, /e. f/g. Figure 3.1: Example of DFA and state replication. (For interpretation of the references to color in this and all other gures, the reader is referred to the electronic version of this dissertation.) of states with state IDs in the range [x..y]. For example, we represent four transitions starting in states 1 through 4 that end in state 1 on character `a' using double arrows beneath \From [1..4]" and an `a' next to the double arrow. When the text next to a double arrow is \fail", this represents all character transitions not explicitly shown in the 15 gure. For example, the \fail" transition in Figure 3.1(a) represents all transitions out of state 0 for characters that are not `a', all transitions out of state 1 for characters that are not 'b', and so on. Finally, in an accepting state, the number(s) following the `/' represents σ the ID(s) of the RE matched by that accepting state. We also use the notation s1 − s2 → to denote the transition δ(s1 , σ) = s2 . We de ne a self-looping state as a state which has more than Σ/2 (= 128) of its outgoing transitions going back to itself. Self-looping states are the \failure states" on which the DFA stays when the current input character does not advance the (partial) matching of any of the REs in the RE set. For example in Figure 3.1(b) states 0 and 5 are self-looping states. The transitions in a DFA can be categorized into three types: 1. Failure transitions are those that go to the self-looping states. It indicated that the current input character does not advance (or start) the matching of any RE. In Figure 3.1(a), all the incoming transitions of state 0 are failure transitions. 2. Restartable transitions are those that go to a state at a lower level than the current state, usually a non self-looping state. It indicates that the current partial matches are lost but there is a new partial match of another (possibly the same) RE. In Figure 3.1(b), the incoming transitions of state 5 on character `e' from states [1..4] e are restartable transitions. For instance the transitions 2 − 5 means that we had a → partial match (ab) of REs/abc/ and /abd/ (since the current state is 2), and the current input `e' does not advance the match of either of these REs, but it starts the matching of a new RE /e. f/. 16 3. Forward transitions are the those that go from one state to the next in a chain of states that identify a RE. These transitions advance the current partial match of the RE by one character. In Figure 3.1(b), the outgoing transition of state 0 on characters `a' and `e' are forward transitions. 3.2.1 Transition Sharing We say two transitions are shared when, out of the three values in a transition (source state, input character, destination state), they di er in only one value. Two shared transitions can only possibly di er in either the input character or the source state (since a DFA has only one transition per source state and input character pair). This gives us two causes of transition sharing: character redundancy and state redundancy. Character redundancy is when two shared transitions di er in only the input character value. 
3.2.1 Transition Sharing

We say two transitions are shared when, out of the three values in a transition (source state, input character, destination state), they differ in only one value. Two shared transitions can only differ in either the input character or the source state (since a DFA has only one transition per source state and input character pair). This gives us two causes of transition sharing: character redundancy and state redundancy.

Character redundancy is when two shared transitions differ in only the input character value. That is, for a state q ∈ Q, we often have δ(q, σ1) = δ(q, σ2) for characters σ1 and σ2 in Σ. A DFA has a lot of character redundancy since, for most states, most of their transitions are failure transitions going to the same self-looping state. Only a few transitions of most states are either restartable or forward transitions. In addition, if an RE has a character range (like '[a-z]') in it, then it leads to character redundant forward transitions. For example, in Figure 3.1(a), 254 of the 256 transitions for state 1 go to the same state 0.

State redundancy is when two shared transitions differ in only the source state value. That is, for a character σ ∈ Σ, we have δ(p, σ) = δ(q, σ) for states p and q in Q. The cause of the large amount of state redundancy is failure and restartable transitions, because both of these types of transitions go to the same next state for many different states in the DFA. For example, in Figure 3.1(a), for all the states in the DFA, their failure transitions go to state 0, and their transitions on input character 'a' go to state 1.

3.2.2 State Replication

When an NFA is converted to an equivalent DFA, the number of states typically increases exponentially. This happens because most of the states in the NFA are replicated many times in the DFA. To understand this, consider the DFAs in Figure 3.1. Figure 3.1(a) shows the DFA for the RE set {/abc/, /abd/}, and Figure 3.1(b) shows the DFA after the RE /e.*f/ is added to this RE set. As we can see, the entire DFA in Figure 3.1(a) is repeated twice in the DFA in Figure 3.1(b). Each state is replicated twice because of the wildcard closure '.*' in the new RE that is added. In general, when building the DFA for an RE set where some REs contain '.*'s, the states in the DFAs that correspond to individual REs are replicated multiple times. And when a state is replicated, we automatically get replication of the transitions of that state, causing transition replication.

3.3 The D²FA

The Delayed Input DFA (D²FA) was proposed by Kumar et al. [26] to compress the size of the DFA transition function δ by exploiting state redundancy. The basic idea of the D²FA is that in a typical DFA for a real world RE set, given two states u and v, δ(u, σ) = δ(v, σ) for many symbols σ ∈ Σ. We can remove all the transitions for v from δ for which δ(u, σ) = δ(v, σ) and make a note that v's transitions were removed based on u's transitions. When the D²FA is later processing input and is in state v and encounters input symbol σ, if δ(v, σ) is missing, the D²FA can use δ(u, σ) to determine the next state. We can do the same thing for most states in the DFA, and this results in tremendous transition compression. Kumar et al. observe an average decrease of 97.6% in the amount of memory required to store a D²FA when compared to its corresponding DFA.

In more detail, to build a D²FA from a DFA, we just do the following two steps:

1. For each state u ∈ Q, pick a deferred state, denoted by F(u). (We can have F(u) = u.)

2. For each state u ∈ Q for which F(u) ≠ u, remove all the transitions for u for which δ(u, σ) = δ(F(u), σ).

When traversing the D²FA, if on current state u and current input symbol σ the transition δ(u, σ) is missing (i.e. has been removed), we can use δ(F(u), σ) to get the next state. Of course, δ(F(u), σ) might be missing too, in which case we then use δ(F(F(u)), σ) to get the next state, and so on.
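The two construction steps and the deferred lookup translate directly into code. The sketch below is ours (a dict stands in for the partial transition function), not the dissertation's implementation:

```python
def build_rho(delta, F):
    """Step 2 above: keep a transition of u only when u is a root
    (F[u] == u) or the transition differs from the deferred state's."""
    rho = {}
    for u, row in enumerate(delta):
        for c, t in enumerate(row):
            if F[u] == u or t != delta[F[u]][c]:
                rho[(u, c)] = t
    return rho

def next_state(rho, F, u, c):
    # Deferred lookup: follow deferment pointers until a stored transition
    # is found. The input character is not consumed while deferring (hence
    # "delayed input"). Root states keep all transitions, so this halts.
    while (u, c) not in rho:
        u = F[u]
    return rho[(u, c)]
```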
Figure 3.2(a) shows a DFA for the RE set {/.*a.*bcb/, /.*c.*bcb/}, and Figure 3.2(c) shows the D²FA built from the DFA. The dashed lines represent deferred states. The DFA has 13 × 256 = 3328 transitions, whereas the D²FA only has 1030 actual transitions and 9 deferred transitions.

[Figure 3.2: D²FA example. (a) DFA for RE set {/.*a.*bcb/, /.*c.*bcb/}. (b) SRG for the DFA; edges with weight ≤ 1 are not shown, and unlabeled edges have weight 255. (c) The corresponding D²FA; dashed edges represent deferment.]

3.3.1 D²FA Definition

We formally define a D²FA and introduce some notation here.

Definition 2 (D²FA). Let D = (Q, Σ, q0, M, δ) be a DFA. A corresponding D²FA is defined as a 6-tuple D′ = (Q, Σ, q0, M, ρ, F). The first four terms here are defined the same way as in the DFA. The function F : Q → Q defines a unique deferred state for each state in Q, and ρ : Q × Σ → Q is a partially defined transition function. Together, the deferment function F and the partial transition function ρ are equivalent to the DFA transition function δ.

We use dom(ρ) to denote the domain of ρ, i.e. the values for which ρ is defined. The key property of the D²FA D′ that corresponds to DFA D is as follows:

∀(s, σ) ∈ Q × Σ : (s, σ) ∈ dom(ρ) ⟺ (F(s) = s ∨ δ(s, σ) ≠ δ(F(s), σ))

That is, for each state, ρ only has those transitions that differ from those of its deferred state in the underlying DFA. When defined, ρ(s, σ) = δ(s, σ).

The function F defines a directed graph on the states of Q, which we call the deferment forest. A D²FA is well defined if and only if there are no cycles of length > 1 in the deferment forest (i.e. there are no cycles except self-loops). The total transition function δ′ for the D²FA (derived from ρ) is defined as

δ′(s, σ) = ρ(s, σ) if (s, σ) ∈ dom(ρ), and δ′(s, σ) = δ′(F(s), σ) otherwise.

It is easy to see that δ′ is well defined and equal to δ if the D²FA is well defined. We need the restriction that the deferment forest cannot have a cycle other than a self-loop because otherwise all states on a cycle might have their transitions on some σ ∈ Σ removed, and there would be no way of finding the next state.

We also use the term deferment pointer to refer to the deferred state of a state. That is, if F(u) = v ∧ u ≠ v, we say the deferment pointer of state u is set to state v. If F(u) = u, we say the deferment pointer for state u is not set. States that defer to themselves (i.e. whose deferment pointer is not set), which we call root states, must have all their transitions defined. Each connected component of the deferment forest is called a deferment tree. It is easy to see that each deferment tree has exactly one root state in it, and the deferment pointers of all the other states in the deferment tree are set towards the root state.

We use u → v to denote F(u) = v, i.e. u directly defers to v. In this case, we say state u is a child of state v, and state v is the parent of state u, in the deferment forest. We use u ⇝ v to denote that there is a path from u to v in the deferment forest defined by F. In this case we say state u is a descendant of state v, and state v is an ancestor of state u, in the deferment forest.
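Checking the well-definedness condition is a simple pointer chase; the helper below (ours) rejects any deferment cycle of length greater than 1:

```python
def is_well_defined(F):
    # F[u] == u marks a root. From every state, following deferment
    # pointers must reach a root without revisiting any state.
    for start in range(len(F)):
        seen = set()
        u = start
        while F[u] != u:
            if u in seen:
                return False   # cycle of length > 1: lookup could loop forever
            seen.add(u)
            u = F[u]
    return True

assert is_well_defined([0, 0, 1])    # 2 -> 1 -> 0 (root): a deferment tree
assert not is_well_defined([1, 0])   # 0 <-> 1 is an illegal cycle
```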
The deferment depth of state u, denoted ψ(u), is the distance, in the deferment tree containing u, of state u from the root state of that deferment tree. The (maximum) deferment depth of a D²FA D′, denoted Ψ(D′), is the maximum deferment depth among all the states in D′. We use ψ(D′) to denote the average deferment depth among all the states in D′. We use u ∩ v to denote the number of transitions in common between states u and v; i.e. u ∩ v = |{σ | σ ∈ Σ ∧ δ(u, σ) = δ(v, σ)}|.

We only consider D²FAs that correspond to minimum state DFAs, though the definition applies to all DFAs.

3.3.2 Original D²FA Algorithm

In this section we explain the original D²FA construction algorithm proposed by Kumar et al. [26]. They first build a DFA for the given RE set. The amount of transition compression achieved by the D²FA depends on the number of common transitions between each (non-root) state and its deferred state. So next, in order to maximize transition compression, they essentially solve a maximum weight spanning tree problem on the following weighted graph, which they call a Space Reduction Graph (SRG). The SRG is a complete graph with the DFA states, Q, as its vertices. The weight of any edge (u, v) in the SRG is equal to the number of common transitions between DFA states u and v. They use Kruskal's algorithm [23] to construct the maximum weight spanning tree. Edges with weight ≤ 1 are not considered (selecting an edge with weight 1 does not reduce the transition function, since it would result in the removal of one actual transition and the addition of the deferment pointer transition). For this reason the maximum weight spanning tree construction might result in a forest. Once the spanning forest is constructed, (one of) the state(s) in the center of each tree is selected as the root for that tree, and all edges are directed towards the root. These directed edges give the deferred state for each state.
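A straightforward rendering of this construction (ours; the diameter-aware tie breaking and center-based root selection described below are omitted for brevity) makes its cost visible: building the SRG alone examines all Θ(|Q|²) state pairs, which is one of the scalability problems addressed in Chapter 4.

```python
def srg_edges(delta):
    # SRG edge weight = number of transitions two states have in common;
    # edges of weight <= 1 cannot reduce the transition table, so skip them.
    n, edges = len(delta), []
    for u in range(n):
        for v in range(u + 1, n):
            w = sum(1 for c in range(256) if delta[u][c] == delta[v][c])
            if w > 1:
                edges.append((w, u, v))
    edges.sort(reverse=True)            # Kruskal: heaviest edges first
    return edges

def max_spanning_forest(n, edges):
    parent = list(range(n))             # union-find over DFA states
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    forest = []
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                    # adding (u, v) creates no cycle
            parent[ru] = rv
            forest.append((u, v))
    return forest                       # undirected; roots are chosen afterwards
```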
propose the following method to generate D2 FA with deferment depth within the bound: during the maximum weighted spanning tree construction, an edge is only added to the spanning tree if it does not cause the tree diameter to go over 2 × Ω. Since the tree center is chosen as the root state, this guarantees that Ψ(D ) ≤ Ω. 3.3.4 Backpointer D2 FA Algorithm The level of a state u in a DFA is the length of the shortest string that takes the DFA from the start state to state u. Becchi and Crowley [8] propose an algorithm to build the D2 FA based on the following idea: each state in the DFA should defer to a state that is at a lower level than itself. Because of this, every deferred transition followed will decrease the level of the current state by at least 1. Any actual transition taken can only increase the level of the current state by 1. Therefore, when processing any input string of length n, at most n − 1 deferred transitions will be followed. So this method guarantees an amortized cost of at most 2 lookups per input character. To build the D2 FA, they build the DFA for the given RE set rst. Next, for each state u, among all the states at a lower level than u, they set F(u) to be the state which shares the most transitions with u. Since each state defers to a state at a lower level than itself, the deferment forest can never have a cycle, so the D2 FA is well de ned. The resulting D2 FA is typically a bit larger in size than the D2 FA built using the algorithm proposed by Kumar et al.. 26 3.4 Classifiers In this section we de ne a classi er, related terminology and describe a classi er minimization problem. A classi er is essentially a mapping function from the source domain to the target domain. In a d-dimensional classi er, the input value is composed of d elds. A classi er is traditionally de ned for the (header based) packet classi cation problem. The input value is the packet header, which has ve elds: Protocol type, Source IP address, Source port number, Destination IP address and Destination port number. The output is the decision or action to be taken for the packet, which typically has values like accept, discard, accept and log, discard and log etc.. So the classi er is de ned as a 5-dimensional classi er, with the set of possible packet headers as the source domain, and set of possible actions as the target domain. For each possible packet header, the classi er gives the action to be taken. 3.4.1 Classifier definition We now formally de ne a d-dimensional classi er and related terminology. A eld Fi is a nite width variable. The domain of eld Fi of w bit width is dom(Fi ) = [0..2w − 1]. The domain of a d-dimensional classi er, f, de ned over the d elds F1 , . . . , Fd is dom(f) = dom(F1 ) × · · · × dom(Fd ). A packet is a d-tuple (p1 , . . . , pd ), where, for 1 ≤ i ≤ d, pi ∈ dom(Fi ). A rule has the form predicate → decision . A rule predicate is a d-tuple (S1 , . . . , Sd ), 27 where, for 1 ≤ i ≤ d, Si ⊆ dom(Fi ); and it covers the set of packets S1 ×· · ·×Sd ⊆ dom(f). A packet p matches rule r if and only if the predicate of r covers p. The set of possible rule decisions is denoted by H. The classi er f = r1 , . . . , rn is speci ed as a sequence of rules. For packet p, the rst rule in the sequence that p matches is said to be the binding rule for p. If p does not match any rule in f, then p does not have any binding rule (or is unbound). For a bound packet p, the output of the classi er, f(p), is given by the decision of the binding rule for p. 
For unbound packets, p, f(p) is unde ned. The cost of a classi er f, denoted Cost(f), is the number of rules in f. The Cover of a classi er f, denoted Cover(f), is de ned as the set of packets in dom(f) that have a binding rule in f (i.e. set of packets that match at least one rule in f.) A classi er f, is said to be a complete classi er if Cover(f) = dom(f), otherwise f is said to be an incomplete classi er. Clearly, two rules in a classi er can be overlapping (i.e. at least one packet matches both rules), as well as con icting (i.e. overlapping and having di erent decisions). But that is ok, since the classi er output for a bound packet is uniquely de ned by its binding rule. 3.4.1.1 Prefix Classifier A pre x {0, 1}k {∗}w−k with k leading bits (i.e. 0s or 1s), for a eld of width w, denotes the range of values [{0, 1}k {0}w−k , {0, 1}k {1}w−k ]. A rule is said to be a pre x rule if and 28 only if every Si in the rule predicate (S1 , . . . , Sd ) is represented as a pre x. A classi er f is said to be a pre x classi er if and only if every rule in f is a pre x rule. 3.4.1.2 Ternary Classifier A ternary value for a eld of width w is of the form {0, 1, ∗}w , and denotes the set of values obtained by replacing the ∗'s with 0's and 1's in all possible combinations (if there are k ∗'s, there are 2k ways to replace the ∗'s with 0's and 1's.) A rule is said to be a ternary rule if and only if every Si in the rule predicate (S1 , . . . , Sd ) is represented as a ternary value. A classi er f is said to be a ternary classi er if and only if every rule in f is a ternary rule. A pre x classi er is a special case of a ternary classi er, since every pre x is also a ternary value. 3.4.1.3 Weighted Classifier In a weighted classi er, each decision in H has a weight associated with it. The cost of a classi er f is then equal to the sum of the weights of decisions of all the rules in f. The unweighted classi er is a special case of weighted classi er with weights of all the decisions set to 1. 29 3.4.2 Classifier Minimization Two classi ers f1 and f2 are equivalent, denoted f1 ≡ f2 , if and only if Cover(f1 ) = Cover(f2 ) and ∀p ∈ Cover(f1 ), f1 (p) = f2 (p). For a classi er f, we use {f} to denote the set of all classi ers that are equivalent to f. The classi er minimization problem is then de ned as follows. Definition 3 (Classi er Minimization Problem). Given a classi er f1 , nd a pre x classi er f2 ∈ {f1 } such that for any pre x classi er f ∈ {f1 }, Cost(f2 ) ≤ Cost(f). Multi-dimensional classi er minimization has been shown to be NP-hard. An optimal solution for 1-dimensional complete classi er minimization was proposed by Suri et al. [47]. Meiners et al. [30, 31] proposed algorithms for 1-dimensional complete weighted classi er minimization and 1-dimensional incomplete weighted classi er minimization. 3.5 TCAM Introduction In any regular memory, the input is the memory address location, and the output is the contents of the memory at that location. In a Ternary Content Addressable Memory (TCAM), as the name suggests, it is the exact opposite. The input to a TCAM is binary value, and the output of the TCAM is the address of the location, if any, at which the given value occurs. The ternary refers to the fact that the contents of the memory are ternary bits, i.e. 0, 1 or ∗ (don't care). The ∗ matches both a 0 and a 1. 30 If more than one location matches the given (binary) value, then the address of the rst location that matches the value is returned. We call this the rst match semantics of TCAM. 
A defining feature of TCAMs is that a lookup completes in constant time. A TCAM contains massively parallel hardware that compares the given input against all stored entries at once and returns the address of the first match. For this reason, TCAM chips have very limited capacity: the largest available chip is about 72 Mb, and typical sizes range from 1 Mb to 8 Mb. TCAM chips also consume much more energy than regular memory. A TCAM chip is usually paired with a corresponding SRAM that stores output values; the matching address from the TCAM is used as the input to the SRAM to retrieve the output value.

TCAM chips are widely used in networking devices for packet classification. A ternary classifier for packet classification can be naturally implemented in a TCAM: all the rule predicates are stored, in order, in the TCAM, and the corresponding rule decisions are stored in the SRAM. The packet header is then used as the lookup key for the TCAM, and the matching SRAM value gives the decision for the packet.

Chapter 4

Software Implementation

In this chapter we present our work on the software implementation of RE matching. A software solution typically uses a DFA to achieve deterministic throughput. The software solution can be implemented on general purpose processors or on customized ASIC chips.

4.1 Introduction/Motivation

The straightforward way to implement a DFA in software is to store the DFA transition table δ in a two-dimensional Q × Σ array. But DFAs suffer from space explosion when multiple REs are combined, making them impractical even for moderately sized RE sets. D2FAs are very effective at dealing with the space explosion problem of the DFA. In particular, D2FAs exhibit tremendous transition compression, reducing the size of the DFA by a huge factor. This makes D2FAs much more practical for a software implementation of RE matching than DFAs. In our work we focus on the D2FA.

4.1.1 Solution Goals

For a software implementation of RE matching, given as input a set of REs R, we need to be able to build a compact D2FA as efficiently as possible that also supports frequent updates. Efficiency is important because RE matching solutions are typically implemented in networking devices, which usually have very limited computing resources. Current methods for constructing D2FAs may be so expensive in both time and space that they may not be able to construct the final D2FA even if that D2FA is small enough to be deployed in networking devices with limited computing resources. Such issues become doubly important when we consider the frequent updates (typically additions) to R that occur as new security threats are identified.

4.1.2 Summary and Limitations of Prior Art

Given the input RE set R, any solution that builds a D2FA for R has to perform the following two operations: (a) union the automata corresponding to each RE in R, and (b) minimize the automata, both in terms of the number of states and the number of edges. Previous solutions [8, 26] (discussed in Section 3.3) employ a "Union then Minimize" framework in which: (1) they first build automata for each RE within R and perform union operations on these automata to arrive at one combined automaton for all the REs in R, and (2) they then minimize the resulting combined automaton. In particular, previous solutions first construct the combined NFA for the RE set.
Then they perform a computationally expensive NFA to DFA subset construction on the large combined NFA, followed by, or composed with, DFA minimization (for states). Last, they perform the D2FA minimization (for edges).

There are three fundamental limitations of prior solutions, due to which they do not meet our goals. First, they perform the minimization on the large combined automaton, which is expensive in both time and space. Second, prior methods build the corresponding minimum state DFA before constructing the final D2FA. This is very costly in both space and time. The D2FA is typically 50 to 100 times smaller than the DFA, so even if the D2FA would fit in available memory, the intermediate DFA might be too large, making it impractical to build the D2FA. This is exacerbated in the case of the Kumar et al. algorithm, which needs the SRG; the SRG ranges from about the size of the DFA itself to over 50 times the size of the DFA. The resulting space and time required to build the DFA and SRG impose serious limits on the D2FAs that can be practically constructed. We do observe that the method proposed in [8] does not need to create the SRG. Furthermore, as the authors have noted, there is a way to go from the NFA directly to the D2FA, but implementing such an approach is still very costly in time because many transition tables need to be repeatedly recreated in order to realize these space savings. In addition, this direct NFA to D2FA construction would still need to perform the expensive subset construction on the large combined NFA. Third, none of the previous methods support updating the D2FA when a new RE is added to R; the whole D2FA has to be rebuilt whenever the RE set is updated.

4.1.3 Summary of Our Approach

To address the limitations of prior solutions, we propose a "Minimize then Union" framework. Specifically, we first minimize the small automata corresponding to each RE from R, and then union the minimized automata together. In particular, given R, we first build a DFA and D2FA for each individual RE in R. The heart of our technique is the D2FA merge algorithm that performs the union: it merges two smaller D2FAs into one larger D2FA such that the merged D2FA is equivalent to the union of the RE sets that the two input D2FAs were equivalent to. Starting from the initial D2FAs for each RE, we use this D2FA merge subroutine to merge two D2FAs at a time until we are left with just one final D2FA. The initial D2FAs are each equivalent to their respective REs, so the final D2FA is equivalent to the union of all the REs in R.

A key property of our D2FA merge algorithm is that it automatically produces a minimum state D2FA without explicit state minimization. Likewise, it creates efficient state deferment in the merged D2FA using state deferment information from the input D2FAs. Together, these optimizations lead to a vastly more efficient D2FA construction algorithm in both time and space. The D2FA produced by our merge algorithm can be larger than the minimal D2FA produced by the Kumar et al. algorithm. This is because the Kumar et al. algorithm does a global optimization over the whole DFA (using the SRG), whereas our merge algorithm efficiently computes state deferment in the merged D2FA based on state deferment in the two input D2FAs. In most cases, the D2FA produced by our approach is sufficiently small to be deployed.
However, in situations where more compression is needed, we offer an efficient final compression algorithm that produces a D2FA very similar in size to that produced by the Kumar et al. algorithm. This final compression algorithm uses an SRG; we improve efficiency by using the deferment already computed in the merged D2FA to greatly reduce the size of this SRG and thus significantly reduce the time and memory required to do this compression.

4.1.3.1 Advantages of our algorithm

One of the main advantages of our algorithm is a dramatic increase in time and space efficiency. These efficiency gains are partly due to our use of the Minimize then Union framework instead of the Union then Minimize framework. More specifically, our improved efficiency comes from the following four factors. First, other than for the initial DFAs that correspond to individual REs in R, we build D2FAs bypassing DFAs. The initial DFAs are very small (typically < 50 states), so the memory and time required to build the initial DFAs and D2FAs is negligible. The D2FA merge algorithm directly merges the two input D2FAs to get the output D2FA without creating the DFA first. Second, other than for the initial DFAs, we never have to perform the NFA to DFA subset construction. Third, other than for the initial DFAs, we never have to perform DFA state minimization. Fourth, when setting deferred states in the D2FA merge algorithm, we use deferment information from the two input D2FAs. This typically involves performing only a constant number of comparisons per state rather than a number of comparisons per state that is linear in the number of states, as required by previous techniques. All told, our algorithm has a practical time complexity of O(n|Σ|), where n is the number of states in the final D2FA and |Σ| is the size of the input alphabet. In contrast, Kumar et al.'s algorithm [26] has a time complexity of O(n^2(log(n) + |Σ|)), and Becchi and Crowley's algorithm [8] has a time complexity of O(n^2|Σ|) just for setting the deferment state for each state, ignoring the cost of the NFA subset construction and DFA state minimization. Section 4.4.4 has a more detailed complexity analysis. Because of these advantages in time and space complexity, given the same limited resources, our algorithm can build much larger D2FAs than are possible with previous methods.

Besides being much more efficient at constructing a D2FA from scratch, our algorithm is very well suited for frequent RE updates. When an RE needs to be added to the current set, we just need to merge the D2FA for the new RE into the current D2FA using our merge routine, which is a very fast operation.

4.2 Minimum State PMDFA construction

Before we present our algorithm for efficient D2FA construction, we consider the problem of constructing a minimum state DFA for a given RE set. Given a set of REs R, we can build the corresponding minimum state DFA using the standard Union then Minimize framework: first build a combined NFA for all the REs in R, then convert the NFA to a DFA, and finally minimize the DFA. This method can be very slow, mainly due to the subset construction in the NFA to DFA conversion, which often results in an exponential growth in the number of states. Instead, we propose a more efficient Minimize then Union framework. Let R1 and R2 denote any two disjoint subsets of R, and let D1 and D2 be their corresponding minimum state DFAs. We use the standard union cross product construction for DFAs to construct a minimum state DFA D3 that corresponds to R3 = R1 ∪ R2.
Specifically, suppose we are given the two DFAs D1 = (Q1, Σ, q01, M1, δ1) and D2 = (Q2, Σ, q02, M2, δ2). The union cross product DFA of D1 and D2, denoted UCP(D1, D2), is given by D3 = UCP(D1, D2) = (Q3, Σ, q03, M3, δ3) where

Q3 = Q1 × Q2
q03 = ⟨q01, q02⟩
∀qi ∈ Q1, ∀qj ∈ Q2: M3(⟨qi, qj⟩) = M1(qi) ∪ M2(qj)
∀σ ∈ Σ, ∀qi ∈ Q1, ∀qj ∈ Q2: δ3(⟨qi, qj⟩, σ) = ⟨δ1(qi, σ), δ2(qj, σ)⟩

Each state in D3 corresponds to a pair of states, one from D1 and one from D2. For notational clarity, we use ⟨ and ⟩ to enclose an ordered pair of states. The transition function δ3 simply simulates both δ1 and δ2 in parallel. Many states in Q3 might not be reachable from the start state q03; thus, while constructing D3, we only create states that are reachable from q03.

We now argue that this construction is correct. This is a standard construction, so the fact that D3 is a DFA for R3 = R1 ∪ R2 is straightforward and covered in standard automata theory textbooks (e.g. [20]). We now show that D3 is also a minimum state DFA for R3 assuming R1 ∩ R2 = ∅. Recall that we are using DFA to mean a PMDFA (see Section 3.1). For a traditionally defined DFA, the UCP construction is not guaranteed to produce a minimum state DFA.

Theorem 1. Given two RE sets, R1 and R2, and equivalent minimum state DFAs, D1 and D2, the union cross product DFA D3 = UCP(D1, D2), with only reachable states constructed, is the minimum state DFA equivalent to R3 = R1 ∪ R2 if R1 ∩ R2 = ∅.

Proof. First, since only reachable states are constructed, D3 cannot be trivially reduced. Now assume D3 is not minimum. That would mean there are two distinct states in D3, say ⟨p1, p2⟩ and ⟨q1, q2⟩, that are indistinguishable. This implies that

∀x ∈ Σ*, M3(δ3(⟨p1, p2⟩, x)) = M3(δ3(⟨q1, q2⟩, x)).

Working on both sides of this equality, we get

∀x ∈ Σ*, M3(δ3(⟨p1, p2⟩, x)) = M3(⟨δ1(p1, x), δ2(p2, x)⟩) = M1(δ1(p1, x)) ∪ M2(δ2(p2, x))

as well as

∀x ∈ Σ*, M3(δ3(⟨q1, q2⟩, x)) = M3(⟨δ1(q1, x), δ2(q2, x)⟩) = M1(δ1(q1, x)) ∪ M2(δ2(q2, x)).

This implies that

∀x ∈ Σ*, M1(δ1(p1, x)) ∪ M2(δ2(p2, x)) = M1(δ1(q1, x)) ∪ M2(δ2(q2, x)).

Now, since R1 ∩ R2 = ∅, this gives us

∀x ∈ Σ*, M1(δ1(p1, x)) = M1(δ1(q1, x)) and ∀x ∈ Σ*, M2(δ2(p2, x)) = M2(δ2(q2, x)).

This implies that p1 and q1 are indistinguishable in D1 and that p2 and q2 are indistinguishable in D2. Since ⟨p1, p2⟩ ≠ ⟨q1, q2⟩, we have p1 ≠ q1 ∨ p2 ≠ q2, implying that at least one of D1 or D2 is not a minimum state DFA, which is a contradiction, and the result follows.

Our efficient construction algorithm works as follows. First, for each RE r ∈ R, we build an equivalent minimum state DFA D for r using the standard method, resulting in a set of DFAs D. Then we repeatedly merge two DFAs from D at a time using the above UCP construction until there is just one DFA left in D. The merging is done in a greedy manner: in each step, the two DFAs with the fewest states are merged together. Note that the condition R1 ∩ R2 = ∅ is always satisfied in every merge, so Theorem 1 ensures that we always have a minimized DFA.

In our experiments, our Minimize then Union technique runs exponentially faster than the standard Union then Minimize technique because we only apply the NFA to DFA subset construction to the NFAs that correspond to each individual RE rather than to the combined NFA for all the REs. This makes a significant difference even when we have a relatively small number of REs. For example, for the C7 RE set, which contains 7 REs, the standard technique requires 385.5 seconds to build the DFA, but our technique builds the DFA in only 0.66 seconds.
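The following Python sketch illustrates the UCP construction with only reachable states created. It is a minimal sketch under an encoding we assume for illustration: a DFA is given by a nested dict delta (delta[state][char] yields the next state), a map M from states to sets of matched RE ids, and a start state.

from collections import deque

def ucp(delta1, M1, q01, delta2, M2, q02, alphabet):
    # Union cross product of two DFAs, building only states reachable
    # from the start pair <q01, q02>.
    q03 = (q01, q02)
    delta3, M3 = {}, {}
    queue = deque([q03])
    while queue:
        state = queue.popleft()
        if state in delta3:          # already constructed
            continue
        q1, q2 = state
        M3[state] = M1[q1] | M2[q2]  # match set is the union of match sets
        delta3[state] = {}
        for c in alphabet:
            nxt = (delta1[q1][c], delta2[q2][c])  # simulate both DFAs in parallel
            delta3[state][c] = nxt
            if nxt not in delta3:
                queue.append(nxt)
    return delta3, M3, q03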
4.3 Efficient D2FA Construction

In this section, we describe how we extend the Minimize then Union technique to directly build the D2FA, bypassing DFA construction. We first build the D2FA for each individual RE in the RE set, and then merge these D2FAs together to get the combined D2FA for the entire RE set.

4.3.1 Improved D2FA Construction for One RE

To build the initial D2FA for each RE in R, we can use the original D2FA algorithm proposed in [26]. However, we propose several improvements to the original algorithm that facilitate our D2FA merge algorithm, our techniques for hardware implementation of RE matching presented in Chapter 5, and the overlay automata approach presented in Chapter 6.

[Figure 4.1: Edge weight distribution in a typical SRG (count, on a log scale, versus edge weight).]

Figure 4.1 shows the typical distribution of the weights of the edges in the SRG. The distribution is typically bimodal: edge weights are either very high (> 128) or very low (< 20). The reason is that, for all state pairs in which both states have their failure transitions going to the same self-looping state, the two states have most of their transitions in common, and hence produce a very high weight edge in the SRG. Likewise, for all state pairs in which the two states have their failure transitions going to different self-looping states, the states have none (or very few) of their transitions in common, and hence produce a very low weight edge in the SRG. If we remove the low weight edges from the SRG, we get a natural partitioning of the states based on the self-looping state they fail to. Let us call this partitioning of states P. Each partition in P has at most one self-looping state.

Multiple deferment trees: We remove the low weight (< 20) edges from the SRG before building the maximum weight spanning tree. As a result, the deferment forest has multiple deferment trees, one tree for each partition in P. This causes only a small increase in the number of transitions in the resulting D2FA, since the edges removed from the SRG have very low weight. For each partition in P, the unique self-looping state (if any) within the partition is chosen as the root of the corresponding deferment tree.

Handling non-self-looping roots: A partition in P may not contain any self-looping state; in such cases a non-self-looping state is selected as the root for the partition. This happens for REs that have a `.' (or a large range like [^a]) without the closure `*'. For example, consider the D2FA shown in Figure 4.2(a) for the RE /a.*b..c/. The deferment forest has 4 root states: 0, 1, 2 and 3. States 0 and 1 are self-looping. However, states 2 and 3 are not self-looping, and are only root states because they have no transitions in common with other states. In such cases, we make these states non-root states and set their deferment as follows. We look at the deferment of the next state that the transition on the `.' goes to. If there is more than one consecutive `.', we use the state that the last `.' transitions to. In our example, the next state of the last `.' is state 4.
We follow the deferment chain of this state until we reach its root, and select that root as the deferred state of the non-self-looping roots. In our example, the deferment chain of state 4 ends in state 1, so state 1 is chosen as the deferred state for both states 2 and 3. Figure 4.2(b) shows the resulting D2FA.

[Figure 4.2: Example showing a D2FA with non-self-looping root states: (a) the D2FA for RE /a.*b..c/ with non-self-looping roots; (b) the D2FA after setting the deferment for the non-self-looping roots.]

Setting the deferment of non-self-looping roots in this manner does not reduce the size of the D2FA, since these states have no transitions (or very few transitions) in common with their deferred states. However, it yields a better structure for the deferment forest, and it ensures the condition that all root states are self-looping states and vice versa.

Improved edge weight tie breaking: Recall that during the construction of the maximum weight spanning tree using Kruskal's algorithm, at any time there are usually many edges with the current maximum weight. We use the following tie breaking strategy (made concrete in the sketch at the end of this subsection). For each state u, we store a value deg*(u), which is initially set to 0. During Kruskal's algorithm, when an edge e = (u, v) is added to the current spanning tree, deg*(u) is incremented by 2 if level(u) ≤ level(v); otherwise it is incremented by 1. Recall that level(u) is the length of the shortest string that takes the DFA from the start state to state u. We update deg*(v) similarly. We then use the following tie breaking order among edges having the current maximum weight.

1. Edges that have a self-looping state as one of their endpoints are given the highest priority.

2. Next, priority is given to edges with a higher sum of the deg* values of their endpoints.

3. Next, priority is given to edges with a higher difference between the levels of their endpoints.

The sum of the deg* values of the endpoints is used for tie breaking in order to prioritize states that are already highly connected. However, we also want to prioritize connecting to states at lower levels, so we use deg* instead of just the degree. Using the difference between the levels of the endpoints for tie breaking also prioritizes states at a lower level. This helps reduce the deferment depth and the D2FA size for RE sets whose D2FAs have a higher average deferment depth.

These improvements have several benefits.

1. Having the self-looping states in the center helps minimize the average height of the deferment trees. Also, prioritizing edges with well connected endpoints increases the fanout, which again reduces tree height. The result is a D2FA that has a much lower deferment depth.

2. The state partitioning P identifies a natural partitioning of states such that all replications of one NFA state are in different partitions. So typically all partitions in P have sizes close to each other, and because of our tie breaking strategy, all the deferment trees have very similar structure. This property improves the effectiveness of our D2FA merge algorithm explained in the next section, and of our table consolidation technique explained in Section 5.3.

3. Having self-looping states as roots improves the effectiveness of our variable striding technique, which we describe in Section 5.4. And the condition that all root states are self-looping states and vice versa is needed for our overlay automata approach described in Chapter 6.
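The following Python sketch makes the tie breaking order concrete. It is our own illustration under assumed inputs: precomputed level and deg* maps, and a set of self-looping states; edges are processed in decreasing order of this key.

def edge_key(edge, weight, deg_star, level, self_looping):
    # Priority key for an SRG edge during Kruskal's algorithm: weight
    # first, then the three tie-breaking rules of this subsection.
    u, v = edge
    return (
        weight,                                      # primary: SRG edge weight
        (u in self_looping) or (v in self_looping),  # rule 1: self-looping endpoint
        deg_star[u] + deg_star[v],                   # rule 2: sum of deg* values
        abs(level[u] - level[v]),                    # rule 3: level difference
    )

def update_deg_star(deg_star, level, u, v):
    # Called when edge (u, v) is added to the spanning forest.
    deg_star[u] += 2 if level[u] <= level[v] else 1
    deg_star[v] += 2 if level[v] <= level[u] else 1

Sorting the current maximum weight edges by edge_key in decreasing order yields the processing order described above.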
4.3.2 D2FA Merge Algorithm

The UCP construction merges two DFAs together. We extend the UCP construction to merge two D2FAs together as follows. To build a D2FA from a DFA, we basically just need to set the deferment pointer F(u) for each state u. During the UCP construction, as each new state u is created, we define F(u) at the same time. We then define ρ to include only those transitions of u that differ from the transitions of F(u).

To help explain our algorithm, Figure 4.3 shows an example execution of the D2FA merge algorithm. Figures 4.3(a) and 4.3(b) show the D2FAs for the REs /.*a.*bcb/ and /.*c.*bcb/, and Figure 4.3(c) shows the merged D2FA. We use the following conventions when depicting a D2FA. Dashed lines point to the deferred state of a given state. For each state in the merged D2FA, the pair of numbers above the line refers to the states in the original D2FAs that correspond to the state in the merged D2FA, and the number below the line is the state ID in the merged D2FA. The number(s) after the `/' in an accepting state give the id(s) of the pattern(s) matched. Figure 4.3(d) shows how the deferred state is set for a few states in the merged D2FA D3. We explain the notation in this figure as we give our algorithm description.

[Figure 4.3: D2FA merge example: (a) D1, the D2FA for RE /.*a.*bcb/; (b) D2, the D2FA for RE /.*c.*bcb/; (c) D3, the merged D2FA; (d) illustration of setting the deferment for some states in D3.]

For each state u ∈ D3, we set the deferred state F(u) as follows. While merging D2FAs D1 and D2, let state u = ⟨p0, q0⟩ be the new state currently being added to the merged D2FA D3. Let p0 → p1 → · · · → pl be the maximal deferment chain DC1 in D1 starting at p0 (i.e. pl defers to itself), and let q0 → q1 → · · · → qm be the maximal deferment chain DC2 in D2 starting at q0. For example, in Figure 4.3(d), we see the maximal deferment chains for u = 5 = ⟨0, 2⟩, u = 7 = ⟨2, 2⟩, u = 9 = ⟨4, 2⟩, and u = 12 = ⟨4, 4⟩. For u = 9 = ⟨4, 2⟩, the top row is the deferment chain of state 4 in D1 and the bottom row is the deferment chain of state 2 in D2. We will choose some state ⟨pi, qj⟩ with 0 ≤ i ≤ l and 0 ≤ j ≤ m to be F(u). In Figure 4.3(d), we represent these candidate F(u) pairs with edges between the nodes of the deferment chains. For each candidate pair, the number on the top is the corresponding state number in D3 and the number on the bottom is the number of transitions in D3 that the pair has in common with state u. For example, for u = 9 = ⟨4, 2⟩, the two candidate pairs represented are state 7 (⟨2, 2⟩), which shares 256 transitions with state 9, and state 4 (⟨1, 1⟩), which shares 255 transitions with state 9. Note that a candidate state pair is considered only if it is reachable in D3. In Figure 4.3(d) with u = 9 = ⟨4, 2⟩, three of the candidate pairs, corresponding to ⟨4, 1⟩, ⟨2, 1⟩, and ⟨1, 2⟩, are not reachable, so no edge is included for these candidate pairs. Ideally, we want i and j to be as small as possible, though not both 0. For example, our best choices are typically ⟨p0, q1⟩ or ⟨p1, q0⟩.
In the first case, ⟨p0, q1⟩ agrees with u = ⟨p0, q0⟩ in the first component, and we already have the deferment q0 → q1 in D2; in the second case, ⟨p1, q0⟩ agrees with u in the second component, and we already have the deferment p0 → p1 in D1. In Figure 4.3(d), we set F(u) to ⟨p0, q1⟩ for u = 5 = ⟨0, 2⟩ and u = 12 = ⟨4, 4⟩, and we use ⟨p1, q0⟩ for u = 9 = ⟨4, 2⟩. However, it is possible that both of these states are not reachable from the start state in D3. This leads us to consider other possible pairs ⟨pi, qj⟩. For example, in Figure 4.3(d), both ⟨2, 1⟩ and ⟨1, 2⟩ are not reachable in D3, so we use the reachable state ⟨1, 1⟩ as F(u) for u = 7 = ⟨2, 2⟩.

We consider a few different algorithms for choosing ⟨pi, qj⟩. The first algorithm, which we call the first match method, finds a pair of states (pi, qj) for which ⟨pi, qj⟩ ∈ Q3 and i + j is minimum. Stated another way, we find the minimum z ≥ 1 such that the set of states Z = {⟨pi, qz−i⟩ | (max(0, z − m) ≤ i ≤ min(l, z)) ∧ (⟨pi, qz−i⟩ ∈ Q3)} is nonempty. From the set of states Z, we choose the state that has the most transitions in common with ⟨p0, q0⟩, breaking ties arbitrarily. If Z is empty for every z, then we just pick ⟨p0, q0⟩, i.e. the deferment pointer is not set (the state defers to itself). The idea behind the first match method is that the number of transitions ⟨p0, q0⟩ shares with ⟨pi, qj⟩ tends to decrease as i + j increases. In Figure 4.3(d), all the selected F(u) correspond to the first match method.

A second, more complete algorithm for setting F(u) is the best match method, where we always consider all (l+1) × (m+1) − 1 candidate pairs and pick the pair that is in Q3 and has the most transitions in common with ⟨p0, q0⟩. The rationale for the best match method is that it is not always true that ⟨p0, q0⟩ shares at least as many transitions with ⟨px, qy⟩ as with ⟨px+i, qy+j⟩ for i + j > 0. For instance, p0 may share fewer transitions with p2 than with p3, which would mean ⟨p0, q0⟩ shares fewer transitions with ⟨p2, q0⟩ than with ⟨p3, q0⟩. In such cases, the first match method will not find the pair along the deferment chains with the most transitions in common with ⟨p0, q0⟩. In Figure 4.3(d), all the selected F(u) also correspond to the best match method; it is difficult to create a small example where first match and best match differ.

When adding the new state u to D3, it is possible that some state pairs along the deferment chains that were not in Q3 while finding the deferred state for u are added to Q3 later on. This means that after all the states have been added to Q3, the deferment for u can potentially be improved. Thus, after all the states have been added, we find a deferred state for each state again. If the new deferred state is better than the old one, we reset the deferment to the new deferred state.

Algorithm 4.4 shows the pseudocode for the D2FA merge algorithm with the first match method for choosing a deferred state. Note that we use u and ⟨u1, u2⟩ interchangeably to indicate a state in the merged D2FA D3, where u is a state in Q3, and u1 and u2 are the states in Q1 and Q2, respectively, that state u corresponds to.

4.3.3 Direct D2FA construction for RE set

As with efficient DFA construction, we first build the D2FA for each RE in R. We then need to merge the D2FAs together using the D2FAMerge algorithm from the previous section. We considered a variety of methods for merging the D2FAs together, including a greedy "Huffman" approach in which, at each step, the two smallest D2FAs are merged together. The best approach we have found experimentally is to merge all the D2FAs in a balanced binary tree fashion, because a binary tree minimizes the worst-case number of merges that any RE experiences. We use two different variations of our D2FAMerge algorithm while merging D2FAs.
For all merges except the final merge, we use the first match method for setting F(u). When doing the final merge to get the final D2FA, we use the best match method for setting F(u). It turns out that using the first match method results in a better deferment forest structure in the D2FA, which helps when the D2FA is further merged with other D2FAs; the local optimization achieved by the best match method only helps when used in the final merge.

Input: A pair of D2FAs, D1 = (Q1, Σ, ρ1, q01, M1, F1) and D2 = (Q2, Σ, ρ2, q02, M2, F2), corresponding to RE sets R1 and R2, with R1 ∩ R2 = ∅.
Output: A D2FA corresponding to the RE set R1 ∪ R2.

1   Initialize D3 to an empty D2FA;
2   Initialize queue as an empty queue;
3   queue.push(⟨q01, q02⟩);
4   while queue not empty do
5       u = ⟨u1, u2⟩ ← queue.pop();
6       Q3 ← Q3 ∪ {u};
7       foreach c ∈ Σ do
8           nxt ← ⟨δ1(u1, c), δ2(u2, c)⟩;
9           if nxt ∉ Q3 ∧ nxt ∉ queue then queue.push(nxt);
10          Add (u, c) → nxt transition to ρ3;
11      M3(u) ← M1(u1) ∪ M2(u2);
12      F3(u) ← FindDefState(u);
13      Remove those transitions of u from ρ3 that are in common with F3(u);
14  foreach u ∈ Q3 do
15      newDptr ← FindDefState(u);
16      if (newDptr ≠ F3(u)) ∧ (newDptr shares more transitions with u than F3(u) does) then
17          F3(u) ← newDptr;
18          Reset all transitions for u in ρ3 and then remove the ones in common with F3(u);
19  return D3;
20  Function FindDefState(⟨v1, v2⟩)
21      Let p0 = v1, p1, . . . , pl be the states on the deferment chain from v1 to its root in D1;
22      Let q0 = v2, q1, . . . , qm be the states on the deferment chain from v2 to its root in D2;
23      for z = 1 to (l + m) do
24          S ← {⟨pi, qz−i⟩ | (max(0, z − m) ≤ i ≤ min(l, z)) ∧ (⟨pi, qz−i⟩ ∈ Q3)};
25          if S ≠ ∅ then
26              return the state in S that shares the most transitions with ⟨v1, v2⟩;
27      return ⟨v1, v2⟩;

Figure 4.4: Algorithm D2FAMerge(D1, D2) for merging two D2FAs.
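To complement the pseudocode, here is a minimal Python rendering of the FindDefState subroutine with the first match method. It is our own illustration under an assumed encoding: F1 and F2 are deferment-pointer maps (a root defers to itself), Q3 is the set of merged states created so far, and common_transitions is an assumed helper that counts the transitions two states share.

def defer_chain(v, F):
    # Maximal deferment chain v, F(v), F(F(v)), ... ending at a root.
    chain = [v]
    while F[chain[-1]] != chain[-1]:
        chain.append(F[chain[-1]])
    return chain

def find_def_state(v1, v2, F1, F2, Q3, common_transitions):
    p = defer_chain(v1, F1)   # p[0] = v1, ..., p[l] = root in D1
    q = defer_chain(v2, F2)   # q[0] = v2, ..., q[m] = root in D2
    l, m = len(p) - 1, len(q) - 1
    for z in range(1, l + m + 1):
        # All reachable candidate pairs whose chain indices sum to z
        # (line 24 of Algorithm 4.4).
        S = [(p[i], q[z - i])
             for i in range(max(0, z - m), min(l, z) + 1)
             if (p[i], q[z - i]) in Q3]
        if S:
            # Among candidates at this depth, keep the one sharing the
            # most transitions with <v1, v2>.
            return max(S, key=lambda s: common_transitions((v1, v2), s))
    return (v1, v2)  # no candidate found: the state defers to itself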
4.3.4 Optional Final Compression Algorithm

When there is no bound on the deferment depth (see Section 4.4.2), the original D2FA algorithm proposed in [26] produces a D2FA of the smallest possible size because it runs Kruskal's algorithm on a large SRG. Our D2FA merge algorithm produces a slightly larger D2FA because it uses a greedy approach to determine deferment. We can further reduce the size of the D2FA produced by our algorithm by running the following compression algorithm on the D2FA produced by the D2FA merge algorithm. We construct an SRG and perform a maximum weight spanning tree construction on the SRG, but we only add edges to the SRG that have the potential to reduce the size of the D2FA. More specifically, let u and v be any two states in the current D2FA, and let w(x, y) denote the number of transitions that states x and y have in common. We only add the edge e = (u, v) to the SRG if its weight w(u, v) is at least min(w(u, F(u)), w(v, F(v))), where F(u) is the deferred state of u in the current D2FA. As a result, very few edges are added to the SRG, so we only need to run Kruskal's algorithm on a small SRG. This saves both space and time compared to previous D2FA construction methods. However, this compression step does require more time and space than the D2FA merge algorithm alone, because it constructs an SRG and then runs Kruskal's algorithm on it.

4.4 D2FA Merge Algorithm Properties

We now discuss some properties of the D2FA merge algorithm itself and of the resulting D2FA.

4.4.1 Proof of Correctness

The D2FA merge algorithm exactly follows the UCP construction to create the states, so the correctness of the underlying DFA follows from the correctness of the UCP construction. Theorem 2 shows that the merged D2FA is also well defined (there are no cycles in the deferment forest). In the following, u ⇝ v denotes that state u (transitively) defers to state v, i.e. v lies on the deferment chain starting at u.

Lemma 1. In the D2FA D3 = D2FAMerge(D1, D2), ⟨u1, u2⟩ ⇝ ⟨v1, v2⟩ ⇒ u1 ⇝ v1 ∧ u2 ⇝ v2.

Proof. If ⟨u1, u2⟩ = ⟨v1, v2⟩, then the lemma is trivially true. Otherwise, let ⟨u1, u2⟩ → ⟨w1, w2⟩ ⇝ ⟨v1, v2⟩ be the deferment chain in D3. When selecting the deferred state for ⟨u1, u2⟩, D2FAMerge always chooses a state that corresponds to a pair of states along the deferment chains of u1 and u2 in D1 and D2, respectively. Therefore, we have ⟨u1, u2⟩ → ⟨w1, w2⟩ ⇒ u1 ⇝ w1 ∧ u2 ⇝ w2. By induction on the length of the deferment chain and the fact that the ⇝ relation is transitive, we get our result.

Theorem 2. If the D2FAs D1 and D2 are well defined, then the D2FA D3 = D2FAMerge(D1, D2) is also well defined.

Proof. Since D1 and D2 are well defined, there are no cycles in their deferment forests. Now assume that D3 is not well defined, i.e. there is a cycle in its deferment forest. Let ⟨u1, u2⟩ and ⟨v1, v2⟩ be two distinct states on the cycle. Then we have

⟨u1, u2⟩ ⇝ ⟨v1, v2⟩ ∧ ⟨v1, v2⟩ ⇝ ⟨u1, u2⟩.

Using Lemma 1, we get

(u1 ⇝ v1 ∧ u2 ⇝ v2) ∧ (v1 ⇝ u1 ∧ v2 ⇝ u2),

i.e.

(u1 ⇝ v1 ∧ v1 ⇝ u1) ∧ (u2 ⇝ v2 ∧ v2 ⇝ u2).

Since ⟨u1, u2⟩ ≠ ⟨v1, v2⟩, we have u1 ≠ v1 ∨ u2 ≠ v2, which implies that at least one of D1 or D2 has a cycle in its deferment forest, a contradiction.

4.4.2 Limiting Deferment Depth

Since no input is consumed while traversing a deferred transition, in the worst case the number of lookups needed to process one input character is given by the deferment depth of the D2FA. As proposed in [26], we can guarantee a worst case performance by limiting the deferment depth of the D2FA. Recall that ψ(u) denotes the deferment depth of state u and Ψ(D) denotes the deferment depth of the D2FA D.

Lemma 2. In the D2FA D3 = D2FAMerge(D1, D2), ∀⟨u1, u2⟩ ∈ Q3, ψ(⟨u1, u2⟩) ≤ ψ(u1) + ψ(u2).

Proof. Let ψ(⟨u1, u2⟩) = d. If d = 0, then ⟨u1, u2⟩ is a root and the lemma is trivially true. So we consider d ≥ 1 and assume the lemma is true for all states with ψ < d. Let ⟨u1, u2⟩ → ⟨w1, w2⟩ ⇝ ⟨v1, v2⟩ be the deferment chain in D3. Using the inductive hypothesis, we have

ψ(⟨w1, w2⟩) ≤ ψ(w1) + ψ(w2).

Given ⟨u1, u2⟩ ≠ ⟨w1, w2⟩, we assume without loss of generality that u1 ≠ w1. Using Lemma 1, we get u1 ⇝ w1, and therefore ψ(w1) ≤ ψ(u1) − 1. Combining the above, we get

ψ(⟨u1, u2⟩) = ψ(⟨w1, w2⟩) + 1 ≤ ψ(w1) + ψ(w2) + 1 ≤ (ψ(u1) − 1) + ψ(u2) + 1 ≤ ψ(u1) + ψ(u2).

Lemma 2 directly gives us the following theorem.

Theorem 3. If D3 = D2FAMerge(D1, D2), then Ψ(D3) ≤ Ψ(D1) + Ψ(D2).

For an RE set R, if the initial D2FAs have Ψ = d, in the worst case the final merged D2FA corresponding to R can have Ψ = d × |R|. Although Theorem 3 gives the worst case value of Ψ, in practical cases Ψ(D3) is very close to max(Ψ(D1), Ψ(D2)); thus the deferment depth of the final merged D2FA is usually not much higher than d.

Let Ω denote the desired upper bound on Ψ. To guarantee Ψ(D3) ≤ Ω, we modify the FindDefState subroutine in Algorithm 4.4 as follows: when selecting candidate pairs for the deferred state, we only consider states with ψ < Ω. Specifically, we replace line 24 with

S ← {⟨pi, qz−i⟩ | (max(0, z − m) ≤ i ≤ min(l, z)) ∧ (⟨pi, qz−i⟩ ∈ Q3) ∧ (ψ(⟨pi, qz−i⟩) < Ω)}.

When we do the second pass (lines 14-18), we may increase the deferment depth of nodes that defer to nodes whose deferment we readjust. We record the affected nodes and then do a third pass to reset their deferment states so that the maximum depth bound is satisfied. In practice, this happens very rarely.
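Under the same assumed encoding as the earlier FindDefState sketch, the depth bound amounts to a single extra filter on the candidate set of line 24; psi (mapping each merged state to its deferment depth) and Omega are our illustrative names.

def bounded_candidates(p, q, z, l, m, Q3, psi, Omega):
    # Depth-bounded variant of the candidate set on line 24 of
    # Algorithm 4.4: keep only reachable pairs whose deferment depth
    # is strictly below the bound Omega.
    return [(p[i], q[z - i])
            for i in range(max(0, z - m), min(l, z) + 1)
            if (p[i], q[z - i]) in Q3
            and psi[(p[i], q[z - i])] < Omega]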
When constructing a D2FA with a given bound Ω, we first build the D2FAs without the bound, and only apply the bound Ω when performing the final merge of two D2FAs to create the final D2FA.

4.4.3 Deferment to a Lower Level

Becchi and Crowley [8] propose a D2FA algorithm in which each state defers to a state at a lower level than itself (see Section 3.3.4). More formally, they ensure that for all states u, level(u) > level(F(u)) if F(u) ≠ u. We call this property the back-pointer property. If the back-pointer property holds, then every deferred transition taken decreases the level of the current state by at least 1. Since a regular transition on an input character can increase the level of the current state by at most 1, fewer deferred transitions than regular transitions are taken on the entire input string. This gives an amortized cost of at most 2 transitions taken per input character.

Unfortunately, if D2FAs D1 and D2 have the back-pointer property, the merged D2FA D3 = D2FAMerge(D1, D2) is not guaranteed to have the back-pointer property. A simple counterexample arises when merging the D2FAs corresponding to the REs /(aaa)+/ and /(aaaa)+/. Typically, for practical cases, if the initial D2FAs have the back-pointer property, then almost all of the states in the final merged D2FA have the back-pointer property. In order to guarantee that the D2FA D3 has the back-pointer property, we modify the FindDefState subroutine in Algorithm 4.4 in a similar manner as when limiting the maximum deferment depth: when selecting candidate pairs for the deferred state, we only consider states at a lower level. Specifically, we replace line 24 with

S ← {⟨pi, qz−i⟩ | (max(0, z − m) ≤ i ≤ min(l, z)) ∧ (⟨pi, qz−i⟩ ∈ Q3) ∧ (level(⟨v1, v2⟩) > level(⟨pi, qz−i⟩))}.

For states for which no candidate pairs are found, we simply search through all states in Q3 at a lower level for the deferred state. In practice, this search through all states needs to be done for very few states because, if D2FAs D1 and D2 have the back-pointer property, then almost all the states in D3 have the back-pointer property. As with limiting the maximum deferment depth, we only apply this restriction when performing the final merge of two D2FAs to create the final D2FA.

4.4.4 Algorithmic Complexity

The time complexity of the original D2FA algorithm proposed in [26] is O(n^2(log(n) + |Σ|)): the SRG has O(n^2) edges, O(|Σ|) time is required to add each edge to the SRG, and O(log(n)) time is required to process each edge in the SRG during the maximum weight spanning tree routine. The time complexity of the D2FA algorithm proposed in [8] is O(n^2|Σ|): each state is compared with O(n) other states, and each comparison requires O(|Σ|) time.

The time complexity of our new D2FAMerge algorithm for merging two D2FAs is O(nΨ1Ψ2|Σ|), where n is the number of states in the merged D2FA and Ψ1 and Ψ2 are the maximum deferment depths of the two input D2FAs. When setting the deferment for any state u = ⟨u1, u2⟩, in the worst case the algorithm compares ⟨u1, u2⟩ with all the pairs along the deferment chains of u1 and u2, which are at most Ψ1 and Ψ2 in length, respectively; each comparison requires O(|Σ|) time. In practice, the time complexity is O(n|Σ|), as each state needs to be compared with very few states, for the following three reasons. First, the maximum deferment depth Ψ is usually very small.
The largest value of Ψ among our 8 primary RE sets in Section 4.5 is 7. Second, the length of the deferment chains for most states is much smaller than Ψ; the largest value of the average deferment depth ψ among our 8 RE sets is 2.54. Finally, many of the state pairs along the deferment chains are not reachable in the merged D2FA; among our 8 RE sets, the largest value of the average number of comparisons needed is 1.47.

When merging all the D2FAs together for an RE set R, the total time required in the worst case is O(nΨ1Ψ2|Σ| log(|R|)). The worst case happens when the RE set contains strings and there is no state explosion; in this case, each merged D2FA has a number of states roughly equal to the sum of the sizes of the D2FAs being merged. When there is state explosion, the last D2FA merge is the dominating factor, and the total time is just O(nΨ1Ψ2|Σ|).

When modifying the D2FAMerge algorithm to maintain back-pointers, the worst case time is O(n^2|Σ|), because we may have to compare each state with O(n) other states if none of the candidate pairs is found at a lower level than the state. In practice, this search needs to be done for very few states, typically less than 1%.

The worst case time complexity of the final compression step is the same as that of Kumar et al.'s D2FA algorithm, O(n^2(log(n) + |Σ|)), since both involve computing a maximum weight spanning tree on the SRG. However, because we only consider edges which improve upon the existing deferment forest, the actual size of the SRG in practice is typically linear in the number of nodes. In particular, for the real-world RE sets that we consider in the experiments section, the SRG generated by our final compression step is on average 100 times smaller than the SRG generated by Kumar et al.'s algorithm. As a result, the optimization step requires much less memory and time than the original algorithm.

4.5 Experimental Results

In this section, we evaluate the effectiveness of our algorithms on real-world and synthetic RE sets. We consider two variants of our D2FA merge algorithm: the main variant D2FAMERGE, which just merges the D2FAs, and D2FAMERGEOPT, which applies our final compression algorithm after running D2FAMERGE. We compare our algorithms with the original D2FA construction algorithm proposed in [26], denoted ORIGINAL, which optimizes transition compression, and the D2FA construction algorithm proposed in [8], denoted BACKPTR, which enforces the back-pointer property described in Section 4.4.3.

4.5.1 Methodology

4.5.1.1 Data Sets

Our main results are based on eight real RE sets: four proprietary RE sets, C7, C8, C10, and C613, from a large networking vendor, and four public RE sets, Bro217, Snort24, Snort31, and Snort34. We partition these into three groups, STRING, WILDCARD, and SNORT, based on their RE composition. For each RE set, the number indicates the number of REs in the set. The STRING RE sets, C613 and Bro217, contain mostly string matching REs. The WILDCARD RE sets, C7, C8 and C10, contain mostly REs with multiple wildcard closures `.*'. The SNORT RE sets, Snort24, Snort31, and Snort34, contain a more diverse set of REs, roughly 40% of which have wildcard closures.

To test scalability, we use Scale, a synthetic RE set consisting of 26 REs of the form /.*c_u0123456.*c_l789!#%&/, where c_u and c_l range over the 26 uppercase and lowercase alphabet letters, respectively. Even though all the REs are nearly identical, differing only in the characters after the two `.*'s, we still get the full multiplicative effect, where the number of states in the corresponding minimum state DFA roughly doubles for every RE added.
4.5.1.2 Metrics

We use the following metrics to evaluate the algorithms. First, we measure the resulting D2FA size (number of transitions) to assess transition compression performance. Our D2FAMERGE algorithm typically performs almost as well as the other algorithms even though it builds the D2FA incrementally rather than compressing the final minimum state DFA. Second, we measure the maximum deferment depth (Ψ) and the average deferment depth (ψ) of the D2FA to assess how quickly the resulting D2FA can perform regular expression matching. Smaller Ψ and ψ mean that fewer deferred transitions, which process no input characters, need to be traversed when processing an input string. Our D2FAMERGE significantly outperforms the other algorithms on these metrics. Finally, we measure the space and time required to build the final automaton. Again, our D2FAMERGE significantly outperforms the other algorithms.

When comparing the performance of D2FAMERGE with another algorithm A on a given RE or RE set, we define the following quantities: transition increase is (D2FAMERGE D2FA size − A D2FA size) divided by A D2FA size; transition decrease is (A D2FA size − D2FAMERGE D2FA size) divided by A D2FA size; average (maximum) deferment depth ratio is A's average (maximum) deferment depth divided by D2FAMERGE's average (maximum) deferment depth; space ratio is A's space divided by D2FAMERGE's space; and time ratio is A's build time divided by D2FAMERGE's build time.

4.5.1.3 Measuring Space

When measuring the space required by an algorithm, we measure the maximum amount of memory required at any point in time during the construction and final storage of the automaton. This is a difficult quantity to measure exactly; we approximate the required space for each algorithm as follows.

For D2FAMERGE, the dominant data structure is the D2FA. For a D2FA, the transitions for each state can be stored as pairs of input character and next state ID, so the memory required to store a D2FA is calculated as (#transitions) × 5 bytes. However, the maximum amount of memory required while running D2FAMERGE may be higher than the final D2FA size for two reasons. First, when merging two D2FAs, we need to maintain the two input D2FAs as well as the output D2FA. Second, we may create an intermediate output D2FA that has more transitions than needed; these extra transitions are eliminated once all D2FA states have been added. We keep track of the worst case required space for our algorithm during D2FA construction; this maximum typically occurs when merging the final two intermediate D2FAs to form the final D2FA.

For ORIGINAL, we measure the space required by the minimized DFA and the SRG. For the DFA, the transitions for each state can be stored as an array of size |Σ| with each array entry requiring four bytes to hold the next state ID. For the SRG, each edge requires 17 bytes, as observed in [8]. This leads to a required memory for building the D2FA of |Q| × |Σ| × 4 + (#edges in SRG) × 17 bytes.

For D2FAMERGEOPT, the space required is the size of the final D2FA resulting from the merge step plus the size of the SRG used by the final compression algorithm; the sizes are computed as in the cases of D2FAMERGE and ORIGINAL.
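Since these estimates are simple arithmetic, the following small Python sketch (our own helpers, with an illustrative example) computes the two main ones in bytes.

def d2fa_bytes(num_transitions):
    # Each stored D2FA transition = (input character, next state ID) = 5 bytes.
    return num_transitions * 5

def original_bytes(num_states, alphabet_size, srg_edges):
    # Minimized DFA: 4 bytes per next-state array entry;
    # SRG: 17 bytes per edge, as observed in [8].
    return num_states * alphabet_size * 4 + srg_edges * 17

# Hypothetical example: a 25,000-state DFA over a 256-character alphabet
# with a 1,000,000-edge SRG needs about 40.6 MB just to be built.
print(original_bytes(25_000, 256, 1_000_000) / 2**20)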
For BACKPTR, we consider two variants. The first variant builds the minimized DFA directly from the NFA and then sets the deferment for each state. For this variant, no SRG is needed, so the space required is the space needed for the minimized DFA, which is |Q| × |Σ| × 4 bytes. The second variant goes directly from the NFA to the final D2FA; this variant uses less space but is much slower, as it stores incomplete transition tables for most states. Thus, when computing the deferment state for a new state, the algorithm must recreate the complete transition tables of the candidate states to determine which one has the most transitions in common with the new state. For this variant, we assume the only space required is the space to store the final D2FA, which is (#transitions) × 5 bytes, even though more memory is definitely needed at various points during the computation. We also note that both implementations must perform the NFA to DFA subset construction on a large NFA, which means that even the faster variant runs much more slowly than D2FAMERGE.

4.5.1.4 Correctness

We tested the correctness of our algorithms by verifying that the final D2FA is equivalent to the corresponding DFA. Note that we can only perform this check for the RE sets for which we were able to compute the corresponding DFA; thus, we only verified correctness of the final D2FA for our eight real RE sets and the smaller Scale RE sets.

4.5.2 D2FAMERGE versus ORIGINAL

We first compare D2FAMERGE with ORIGINAL, which optimizes transition compression, when both algorithms have unlimited maximum deferment depth. These results are shown in Table 4.1 for our 8 primary RE sets, and Table 4.2 summarizes them by RE set group.

                    ORIGINAL                                              D2FAMERGE
RE set    # States   # Trans  Avg. ψ  Max. Ψ  RAM (MB)  Time (s)   # Trans  Avg. ψ  Max. Ψ  RAM (MB)  Time (s)
Bro217      6533       9816    3.42      8      179.3     119.4      11737    2.15      5      0.13      3.2
C613       11308      21633    8.43     16     1039.5     326.0      26709    2.69      7      0.23      9.7
C7         24750     205633   19.18     30       47.4     397.7     207540    1.14      3      1.07      0.9
C8          3108      23209    8.95     13        4.9      14.5      23334    1.14      2      0.14      0.2
C10        14868      96793   13.68     27       25.5     141.0      97296    1.18      3      0.52      0.6
Snort24    13886      38485    9.53     20      861.2     299.2      39409    1.56      4      0.32      0.2
Snort31    20068      70701   11.41     23      298.5     244.3      92284    2.00      6      1.29      2.6
Snort34    13825      40199    9.99     17      795.4     309.9      43141    1.38      5      0.27      1.8

Table 4.1: The D2FA size, average (ψ) and maximum (Ψ) deferment depths, space estimate, and time required to build the D2FA for ORIGINAL and D2FAMERGE.

                 D2FAMERGE vs. ORIGINAL                          D2FAMERGEOPT vs. ORIGINAL
RE set     Trans     Def. depth ratio   Space    Time     Trans     Def. depth ratio   Space    Time
group      increase   Avg.    Max.      ratio    ratio    increase   Avg.    Max.      ratio    ratio
All        10.8%       7.5     5.2     1499.8    154.5     0.4%       7.4     5.4      113.1     9.4
STRING     21.5%       2.4     1.9     2994.8     35.4     0.0%       2.1     1.6      103.5     0.8
WILDCARD    1.0%      12.1     8.5       42.8    246.6     1.0%      12.1    10.0       16.8    10.8
SNORT      13.3%       6.3     4.1     1960.3    141.8     0.0%       6.1     3.3      215.8    13.7

Table 4.2: Average values of transition increase, deferment depth ratios, space ratios, and time ratios for D2FAMERGE and D2FAMERGEOPT compared with ORIGINAL.

We make the following observations.

(1) D2FAMERGE uses much less space than ORIGINAL. On average, D2FAMERGE uses 1500 times less memory than ORIGINAL to build the resulting D2FA. This difference is most extreme when the SRG is large, which is true for the two STRING RE sets and for Snort24 and Snort34: for these RE sets, D2FAMERGE uses between 1422 and 4568 times less memory than ORIGINAL. For the RE sets with relatively small SRGs, such as those in WILDCARD and Snort31, D2FAMERGE uses between 35 and 231 times less space than ORIGINAL.

(2) D2FAMERGE is much faster than ORIGINAL.
On average, D2FAMERGE builds the D2FA 155 times faster than ORIGINAL. This time difference is largest when the deferment chains are shortest. For example, D2FAMERGE requires an average of only 0.05 msec and 0.09 msec per state for the WILDCARD and SNORT RE sets, respectively, so D2FAMERGE is, on average, 247 and 142 times faster than ORIGINAL for these RE sets. For the STRING RE sets, the deferment chains are longer, so D2FAMERGE requires an average of 0.67 msec per state and is, on average, 35 times faster than ORIGINAL.

(3) D2FAMERGE produces D2FAs with much smaller average and maximum deferment depths than ORIGINAL. On average, D2FAMERGE produces D2FAs whose average deferment depths are 7.5 times smaller and whose maximum deferment depths are 5.2 times smaller than ORIGINAL's. In particular, the average deferment depth for D2FAMERGE is less than 2 for all but the two STRING RE sets, where the average deferment depths are 2.15 and 2.69. Thus, the expected number of deferred transitions traversed when processing a length n string is less than n. One reason D2FAMERGE works so well is that it eliminates low weight edges from the SRG, so the deferment forest has many shallow deferment trees instead of one deep tree. This is particularly effective for the WILDCARD RE sets and, to a lesser extent, the SNORT RE sets. For the STRING RE sets, the SRG is fairly dense, so D2FAMERGE has a smaller advantage relative to ORIGINAL.

(4) D2FAMERGE produces D2FAs with only slightly more transitions than ORIGINAL, particularly on the RE sets that need transition compression the most. On average, D2FAMERGE produces D2FAs with roughly 11% more transitions than ORIGINAL does. D2FAMERGE works best when state explosion from wildcard closures creates DFAs composed of many similar repeating substructures, which is precisely when transition compression is most needed. For example, for the WILDCARD RE sets, which experience the greatest state explosion, D2FAMERGE has only 1% more transitions than ORIGINAL. On the other hand, for the STRING RE sets, D2FAMERGE has, on average, 22% more transitions. For this group, ORIGINAL built a very large SRG and thus used much more space and time to achieve the improved transition compression. Furthermore, transition compression is typically not needed for such RE sets, as all string matching REs can be placed into a single group and the resulting DFA can be built directly.

In summary, D2FAMERGE achieves its best performance relative to ORIGINAL on the WILDCARD RE sets (except for the space used to construct the D2FA) and its worst performance relative to ORIGINAL on the STRING RE sets (again except for construction space). This is desirable, as the space and time efficient D2FAMERGE is most needed on RE sets like those in WILDCARD, because those RE sets experience the greatest state explosion.

4.5.3 Assessment of Final Compression Algorithm

We now assess the effectiveness of our final compression algorithm by comparing D2FAMERGEOPT to ORIGINAL and D2FAMERGE. The results are shown in Table 4.3 for our 8 primary RE sets.
RE set    # States   # Trans  Avg. ψ  Max. Ψ  RAM (MB)  Time (s)
Bro217      6533       9816    2.44      7      2.64      99.2
C613       11308      21633    3.04      8      7.48     940.4
C7         24750     207540    1.14      3      2.49      45.7
C8          3108      23334    1.14      2      0.32       1.0
C10        14868      97296    1.17      2      1.61      14.8
Snort24    13886      38601    1.57      4      2.67      19.9
Snort31    20068      70780    2.17      8     15.61      59.1
Snort34    13825      40387    1.42      8      2.60      14.2

Table 4.3: The D2FA size, average (ψ) and maximum (Ψ) deferment depths, space estimate, and time required to build the D2FA for D2FAMERGEOPT.

Table 4.2 summarizes these results by RE group. As expected, D2FAMERGEOPT produces a D2FA that is almost as small as that produced by ORIGINAL; on average, the number of transitions increases by only 0.4%. There is a very small increase for WILDCARD and SNORT because ORIGINAL also considers all edges with weight > 1 in the SRG, whereas D2FAMERGEOPT does not use edges with weight < 10. There is a significant benefit to not using these low weight SRG edges: the deferment depths are much higher for the D2FA produced by ORIGINAL than for the D2FA produced by D2FAMERGEOPT.

The final compression algorithm of D2FAMERGEOPT does require more resources than D2FAMERGE alone. In some cases, this may limit the size of the RE set that D2FAMERGEOPT can be used for. However, as explained earlier, D2FAMERGE performs best on WILDCARD (which has the most state explosion) and worst on STRING (which has little or no state explosion), so the final compression algorithm is only needed for, and is most beneficial for, RE sets with limited state explosion. Finally, we observe that D2FAMERGEOPT requires on average 113 times less RAM than ORIGINAL and, on average, runs 9 times faster than ORIGINAL.

4.5.4 D2FAMERGE versus ORIGINAL with Bounded Maximum Deferment Depth

We now compare D2FAMERGE and ORIGINAL when they impose a maximum deferment depth bound Ω of 1, 2, or 4. Because time and space do not change significantly, we focus only on the number of transitions and the average deferment depth. These results are shown in Table 4.4. Note that for these data sets, the resulting maximum depth Ψ is typically identical to the maximum depth bound Ω (the only exception is D2FAMERGE with Ω = 4); thus we omit the maximum deferment depth from Table 4.4.

                   ORIGINAL # Trans             ORIGINAL avg. ψ       D2FAMERGE # Trans            D2FAMERGE avg. ψ
RE set      Ω=1        Ω=2       Ω=4       Ω=1   Ω=2   Ω=4      Ω=1      Ω=2      Ω=4       Ω=1   Ω=2   Ω=4
Bro217     698229    296433     52628      0.62  1.18  2.09     50026    15087    11757      1.00  1.83  2.15
C613      1204831    507613    102183      0.62  1.17  2.16    154548    51858    27735      1.00  1.94  2.64
C7        2044171    597544    206814      0.71  1.24  2.07    215940   208044   207540      0.97  1.13  1.14
C8         206897     40411     23261      0.77  1.32  2.51     24090    23334    23334      0.98  1.14  1.14
C10       1105160    325536     97137      0.75  1.31  2.39    101556    97326    97296      0.98  1.18  1.18
Snort24   1376779    543378    106211      0.66  1.25  2.39     68906    42176    39409      0.99  1.47  1.56
Snort31   2193679   1102693    405785      0.62  1.11  2.08    208136   119810    95496      1.00  1.52  1.97
Snort34   1357697    559255     85800      0.66  1.19  2.17     57187    44607    43231      1.00  1.34  1.38

Table 4.4: The D2FA size and average deferment depth ψ for ORIGINAL and D2FAMERGE on our eight primary RE sets, given maximum deferment depth bounds of 1, 2 and 4.

Table 4.5 summarizes the results by RE group, highlighting how much better or worse D2FAMERGE does than ORIGINAL on the two metrics of number of transitions and average deferment depth ψ.
                  Ω = 1                     Ω = 2                     Ω = 4
RE set      Trans    Avg. def.       Trans    Avg. def.       Trans    Avg. def.
group       decr.    depth ratio     decr.    depth ratio     decr.    depth ratio
All         91.3%       0.7          79.4%       0.9          42.5%       1.5
STRING      90.0%       0.6          92.5%       0.6          75.5%       0.9
WILDCARD    89.3%       0.8          59.0%       1.1           0.0%       2.0
SNORT       94.0%       0.7          91.0%       0.8          63.0%       1.4

Table 4.5: Average values of transition decrease and average deferment depth ratios for D2FAMERGE compared with ORIGINAL for our RE set groups, given maximum deferment depth bounds of 1, 2 and 4.

Overall, D2FAMERGE performs very well when given a bound Ω. In particular, the average increase in the number of transitions for D2FAMERGE with Ω equal to 1, 2 and 4, relative to D2FAMERGE with unbounded maximum deferment depth, is only 131%, 20% and 1%, respectively. Stated another way, when D2FAMERGE is required to have a maximum deferment depth of 1, this results in only slightly more than twice the number of transitions in the resulting D2FA. The corresponding values for ORIGINAL are 3121%, 1216% and 197%. These results can be partially explained by examining the average deferment depth data. Unlike in the unbounded maximum deferment depth scenario, here D2FAMERGE has a larger average deferment depth ψ than ORIGINAL, except for WILDCARD when Ω is 1 or 2. This means that D2FAMERGE has more states that defer to at least one other state than ORIGINAL does, which leads to the lower number of transitions in the final D2FA. Overall, for Ω = 1, D2FAMERGE produces D2FAs with roughly 91% fewer transitions than ORIGINAL for all RE set groups. For Ω = 2, D2FAMERGE produces D2FAs with roughly 59% fewer transitions than ORIGINAL for the WILDCARD RE sets and roughly 92% fewer transitions for the other RE sets.

4.5.5 D2FAMERGE versus BACKPTR

                            BACKPTR                                                   D2FAMERGE with back-pointer
RE set    # Trans  Avg. ψ  Max. Ψ  RAM (MB)  Time (s)  RAM2 (MB)  Time2 (s)   # Trans  Avg. ψ  Max. Ψ  RAM (MB)  Time (s)
Bro217     11247    2.61      6      6.38      88.08      0.05      273.95      13567    2.33      6      0.13      6.24
C613       26222    2.50      5     11.04      55.91      0.13      971.45      33777    2.30      5      0.25     10.78
C7        217812    5.94     13     24.17     277.80      1.04     1950.00     219684    1.15      4      1.12      4.51
C8         34636    2.44      8      3.04      12.61      0.17       27.76      35476    1.20      4      0.19      0.69
C10       157139    2.13      7     14.52      96.86      0.75      476.54     158232    1.21      4      0.80     11.94
Snort24    46005    8.74     17     13.56      70.95      0.22     1130.00      58273    1.62      8      0.41     47.77
Snort31    82809    2.87      8     19.60     109.56      0.39     1110.00     124584    1.74      6      1.29      3.61
Snort34    46046    7.05     14     13.50      94.19      0.22      983.98      51557    1.42      5      0.30      6.06

Table 4.6: The D2FA size, average (ψ) and maximum (Ψ) deferment depths, space estimate, and time required to build the D2FA for both variants of BACKPTR (RAM/Time for the first variant, RAM2/Time2 for the second) and for D2FAMERGE with the back-pointer property.

We now compare D2FAMERGE with BACKPTR, which enforces the back-pointer property described in Section 4.4.3; we adapt D2FAMERGE to also enforce this property. The results for all our metrics are shown in Table 4.6 for our 8 primary RE sets. We consider the two variants of BACKPTR described in Section 4.5.1.3, one which constructs the minimum state DFA corresponding to the given NFA and one which bypasses the minimum state DFA and goes directly from the NFA to the D2FA. We note that the second variant appears to use less space than D2FAMERGE. This is partially true, since BACKPTR creates a smaller D2FA than D2FAMERGE. However, we underestimate the actual space used by this BACKPTR variant by simply assuming that its required space is the final D2FA size; we ignore, for instance, the space required to store intermediate complete transition tables or to perform the NFA to DFA subset construction. Table 4.7 summarizes these results by RE group, displaying ratios for many of our metrics that highlight how much better or worse D2FAMERGE does than BACKPTR.
RE set      Trans      Def. depth ratio   Space   Time    Space2   Time2
group       increase   Avg.    Max.       ratio   ratio   ratio    ratio
All         17.9%      2.9     1.9        30.4    19.3    0.7      142.5
STRING      25.0%      1.1     1.0        47.3     9.7    0.5       67.0
WILDCARD     1.3%      3.0     2.3        18.5    29.3    0.9      170.8
SNORT       29.7%      4.0     2.1        31.1    15.8    0.5      164.5

Table 4.7: Average values of transition increase, deferment depth ratios, space ratios, and time ratios for D2FAMERGE compared with both variants of BACKPTR for our RE set groups.

Similar to D2FAMERGE versus ORIGINAL, we find that D2FAMERGE with the back-pointer property performs well when compared with both variants of BACKPTR. Specifically, with an average increase in the number of transitions of roughly 18%, D2FAMERGE runs on average 19 times faster than the fast variant of BACKPTR and 143 times faster than the slow variant of BACKPTR. For space, D2FAMERGE uses on average almost 30 times less space than the first variant of BACKPTR and on average roughly 42% more space than the second variant of BACKPTR. Furthermore, D2FAMERGE creates D2FA with average deferment depth 2.9 times smaller than BACKPTR and maximum deferment depth 1.9 times smaller than BACKPTR. As was the case with ORIGINAL, D2FAMERGE achieves its best performance relative to BACKPTR on the WILDCARD RE sets and its worst performance relative to BACKPTR on the STRING RE sets. This is desirable because the space and time efficient D2FAMERGE is most needed on RE sets like those in the WILDCARD group, which experience the greatest state explosion.

4.5.6 Scalability results

Finally, we assess the improved scalability of D2FAMERGE relative to ORIGINAL using the Scale RE set, assuming that we have a maximum memory size of 1GB. For both ORIGINAL and D2FAMERGE, we add one RE at a time from Scale until the space estimate to build the D2FA exceeds the 1GB limit. For ORIGINAL, we are only able to add 12 REs; the final D2FA has 397,312 states and requires over 71 hours to compute. As explained earlier, we include the SRG edges in the RAM size estimate. If we exclude the SRG edges and only include the DFA size in the RAM size estimate, we would only be able to add one more RE before reaching the 1GB limit. For D2FAMERGE, we are able to add 19 REs; the final D2FA has 80,216,064 states and requires only 77 minutes to compute. This data set highlights the quadratic versus linear running times of ORIGINAL and D2FAMERGE, respectively. Figure 4.5 shows how the space and time requirements grow for ORIGINAL and D2FAMERGE as REs from Scale are added one by one until 19 have been added.

Figure 4.5: Memory and time required to build the D2FA versus the number of Scale REs used for ORIGINAL's D2FA and D2FAMERGE's D2FA. (Two log-scale plots: RAM (MB) and build time (s) versus the number of REs.)

Chapter 5

TCAM Implementation

In this chapter we present our work on the hardware implementation of RE matching using TCAM, which we call RegCAM.

5.1 Introduction/Motivation

Previous hardware solutions for RE matching have been based on FPGAs. Although FPGA-based solutions can be modified, resynthesizing and updating FPGA circuitry in a deployed system to handle RE updates is slow and difficult. This makes FPGA-based solutions difficult to deploy in many networking devices (such as NIDS/NIPS and firewalls) where the REs need to be updated frequently. We propose the first TCAM based RE matching solution.
TCAMs are prevalent in networking devices because TCAM-based packet classification is the de facto industry standard for high-speed packet classification, i.e., header-based filtering. We show that TCAMs are also very effective for high-speed DPI, i.e., payload-based filtering.

5.1.1 TCAM Architecture for RE matching

We first explain the straightforward implementation of RE matching using TCAM without any compression. Given an RE set, we first construct an equivalent minimum state DFA. Second, we build a two column TCAM lookup table where each column encodes one of the two inputs to δ: the source state ID and the input character. Third, for each TCAM entry, we store the destination state ID in the same entry of the associated SRAM. Figure 5.1 shows an example DFA, its TCAM lookup table, and its SRAM decision table. We illustrate how this DFA processes the input stream "01101111, 01100011". We form a TCAM lookup key by appending the current input character to the current source state ID; in this example, we append the first input character "01101111" to "00", the ID of the initial state s0, to form "0001101111". The first matching entry is the second TCAM entry, so "01", the destination state ID stored in the second SRAM entry, is returned. We form the next TCAM lookup key "0101100011" by appending the second input character "01100011" to this returned state ID "01", and the process repeats.

Figure 5.1: A DFA with its TCAM table. ((a) An example DFA with states s0, s1, and s2; (b) the corresponding two-column TCAM table, with 11 entries grouped by source state, and its SRAM decision table.)

Directly encoding a DFA in a TCAM using one TCAM entry per transition is infeasible. For example, consider a DFA with 25,000 states that consumes one 8 bit character per transition. Each state has 2^8 = 256 outgoing transitions, and each transition needs 8 bits for the character and ⌈log2 25000⌉ = 15 bits for the source state ID. Thus, we would need a total of 140.38 Mb (= 25000 × 2^8 × (8 + ⌈log2 25000⌉)). This is infeasible given that the largest available TCAM chip has a capacity of only 72 Mb. To address this challenge, we use two techniques that minimize the TCAM space for storing a DFA: transition sharing and table consolidation.

5.1.2 Reducing TCAM size

Recall that the two sources of DFA space explosion are transition sharing and state replication (Section 3.2). We propose two techniques to reduce the size of the TCAM required to implement a DFA: Transition Sharing, which exploits transition sharing, and Table Consolidation, which exploits state replication. The basic idea is to combine multiple transitions into one such that we use the ternary nature and first-match semantics of TCAMs to encode multiple DFA transitions using one TCAM entry.

5.1.2.1 Transitions Sharing

The two reasons for transition sharing are character redundancy and state redundancy.

Character redundancy: Prior work exploits character redundancy mainly by alphabet encoding, where the alphabet Σ is mapped to a smaller alphabet Σ′. Alphabet encoding cannot fully leverage all the compression opportunities presented by character redundancy, as it can only exploit global character redundancy that is common to all states in the DFA. Specifically, alphabet encoding can map two characters σ1 and σ2 in Σ to the same character σ′ in Σ′ if and only if ∀q ∈ Q, δ(q, σ1) = δ(q, σ2).
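To make this all-states requirement concrete, the following Python sketch groups characters into the classes that a global alphabet encoding could merge (the layout is an assumption for illustration: delta[q][c] holds δ(q, c) for states q in Q and characters 0..255). Two characters land in the same class only when every state sends them to the same destination.

from collections import defaultdict

def alphabet_classes(delta, Q):
    # Characters c and c' may share a code in the smaller alphabet only
    # when the two columns of the transition table are identical, i.e.,
    # delta(q, c) = delta(q, c') for every state q.
    classes = defaultdict(list)
    for c in range(256):
        column = tuple(delta[q][c] for q in Q)
        classes[column].append(c)
    return list(classes.values())

A single state pair that distinguishes σ1 from σ2 forces them into different classes; this per-state redundancy is exactly what the character bundling technique below can still capture.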
To exploit character redundancy at each state, we propose the technique of character bundling. In character bundling, we leverage the ternary nature and first-match semantics of TCAMs on the input character field to represent multiple characters, and thus multiple transitions that share the same source and destination states, with a single entry.

State redundancy: Prior work exploits state redundancy mainly by deferred transitions, where one state p might defer most of its transitions to another state q. Existing deferred transition based solutions cannot fully exploit state redundancy because of the speed penalty, i.e., traversal of a deferred transition consumes no input. Thus, to alleviate this speed penalty, such solutions often choose deferred transitions in a way that does not fully compress the transition table. To exploit state redundancy, we propose the technique of shadow encoding. In shadow encoding, we leverage the ternary nature and first-match semantics of TCAMs on the source state ID field to encode many incoming transitions of a state from different states using only one TCAM entry.

5.1.2.2 Table Consolidation

We get state explosion in a DFA because each NFA state may be replicated multiple times in the DFA. Table Consolidation exploits state replication in a DFA based on the following observation: two DFA states that are replications of the same NFA state will usually have transitions remaining in the D2FA (i.e., non-deferred transitions) on the same set of input characters (although the corresponding transitions in the two states might go to different states). In this case, the TCAM tables for the two states will be exactly the same except for the state IDs. If the corresponding transitions go to different next states, then the SRAM tables for the two states will be different. The idea is that we can merge the TCAM tables for the two states into one TCAM table and store both SRAM tables side by side. This results in a reduction in TCAM size at the cost of possibly increasing SRAM size, which is acceptable since TCAM size is much more critical than SRAM size.

5.1.3 Increasing Matching Throughput

Another challenge that we address is improving RE matching speed and thus throughput. One way to improve the throughput by up to a factor of k is to use k-stride DFAs that consume k input characters per transition. However, this leads to an exponential increase in both the state and transition spaces. For example, a k-stride DFA requires 2^(8k) transitions per state, so the transition space grows exponentially in k. Previous multi-stride DFAs suffer from such a significant increase in the number of states and the number of transitions that only 2-stride DFAs are achieved in practice [9, 13]. To avoid this space explosion, we use the novel idea of variable striding. The basic idea is to use transitions with variable strides, i.e., different transitions can consume different numbers of input characters. This allows us to increase the average number of characters consumed per transition while ensuring that all the transitions fit within the allocated TCAM space. This idea is based on two key observations. First, for many states, we can capture many but not all k-stride transitions using relatively few TCAM entries, whereas capturing all k-stride transitions requires prohibitively many TCAM entries. Second, with TCAMs, we can easily store transitions with different strides in the same TCAM lookup table. Variable striding would be very difficult to implement without TCAMs, and thus it is not surprising that variable striding has not been considered before.
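The following minimal Python sketch pins down the variable-stride lookup semantics. It is illustrative only: states and characters are plain bit strings, a linear scan emulates TCAM first-match semantics, and the table layout is a simplification of the encodings developed later in this chapter.

def ternary_match(bits, pattern):
    # '*' in the pattern matches either bit value
    return all(p in ('*', b) for b, p in zip(bits, pattern))

def var_stride_run(table, state, text_bits, k):
    # table: list of (source ID pattern, k-character pattern, next state
    # ID, stride), searched in order (first-match semantics).
    # text_bits: list of 8-bit strings, one per input character.
    i = 0
    while i < len(text_bits):
        window = ''.join(text_bits[i:i + k]).ljust(8 * k, '0')  # pad tail
        for src, chars, nxt, stride in table:
            if ternary_match(state, src) and ternary_match(window, chars):
                state, i = nxt, i + stride   # consume 1 to k characters
                break
        else:
            raise ValueError('incomplete transition table')
    return state

Because the decision carries both the next state and the stride, a single table can freely mix 1-stride and multi-stride entries.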
5.1.4 Comparison of Transition Sharing with D2FA

The observation behind the transition sharing technique, namely that many states share a large number of outgoing transitions, is the same observation behind deferred transitions in a D2FA. We use a D2FA as the starting point for transition sharing, and transition sharing can be viewed as a way of implementing a D2FA in TCAM. But there are several differences between transition sharing and a D2FA: (1) The transitions stored at each state are given by the D2FA, but our character bundling technique achieves further compression, so the total number of TCAM rules is significantly less than the number of transitions in the D2FA. (2) A D2FA suffers from a speed penalty, as no input is consumed when a deferred transition is taken; the number of lookups needed in the worst case is given by the deferment depth of the current state. Because of our shadow encoding technique, there is no speed penalty in transition sharing: only one TCAM lookup is needed for each character, irrespective of the deferment depth of the current state. (3) Because of the speed penalty in the D2FA, for a practical implementation, the deferment depth of the D2FA is bounded, which significantly increases the number of transitions in the D2FA. For transition sharing, we build the D2FA without any limit on the deferment depth, achieving maximum transition compression.

We now explain each of our techniques in detail.

5.2 Transition Sharing

The basic idea of transition sharing is to combine multiple transitions into a single TCAM entry. We propose two transition sharing ideas: character bundling and shadow encoding. Character bundling exploits intra-state optimization opportunities and minimizes TCAM tables along the input character dimension. Shadow encoding exploits inter-state optimization opportunities and minimizes TCAM tables along the source state dimension.

5.2.1 Character Bundling

Character bundling exploits character redundancy by combining multiple transitions from the same source state to the same destination into one TCAM entry. Character bundling consists of four steps. (1) Assign each state a unique ID of ⌈log2 |Q|⌉ bits. (2) For each state, enumerate all 256 transition rules where, for each rule, the predicate is a transition's label and the decision is the destination state ID. (3) For each state, treating the 256 rules as a 1-dimensional packet classifier and leveraging the ternary nature and first-match semantics of TCAMs, minimize the number of transitions using the optimal 1-dimensional TCAM minimization algorithm (Section 3.4.2). (4) Concatenate the |Q| 1-dimensional minimal prefix classifiers together by prepending each rule with its source state ID. The resulting list can be viewed as a 2-dimensional classifier where the two fields are the source state ID and the transition label, and the decision is the destination state ID. Figure 5.1 shows an example DFA and its TCAM lookup table built using character bundling. The three chunks of TCAM entries encode the 256 transitions for s0, s1, and s2, respectively. Because each TCAM entry matches one or more input characters, we need only 11 total TCAM entries instead of the 256 × 3 = 768 entries required by the naive implementation.
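As a concrete but deliberately simplified illustration of step (3), the sketch below bundles one state's 256-entry transition column into prefix rules by recursive range splitting. It is not optimal: the chapter's actual step uses the optimal 1-dimensional minimization algorithm of Section 3.4.2, which can additionally exploit a final default rule.

def bundle_state(column):
    # column[c] = destination state for character c (0..255).
    # Emits (8-bit ternary pattern, decision) prefix rules: one rule per
    # maximal aligned character range with a single destination.
    rules = []
    def emit(lo, hi, bits):
        if len(set(column[lo:hi])) == 1:       # range is consistent
            rules.append((bits + '*' * (8 - len(bits)), column[lo]))
        else:                                  # split into two half-ranges
            mid = (lo + hi) // 2
            emit(lo, mid, bits + '0')
            emit(mid, hi, bits + '1')
    emit(0, 256, '')
    return rules

Prepending the state's binary ID to each emitted rule, step (4), yields the two-field classifier that is stored in the TCAM.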
5.2.2 Shadow Encoding

Whereas character bundling encodes multiple transitions with the same source and destination states using one TCAM entry, shadow encoding encodes multiple transitions with the same character label and destination state using one TCAM entry. This technique is based upon the observation of state redundancy. More specifically, character bundling uses ternary codes in the input character field to encode multiple input characters, and shadow encoding uses ternary codes in the source state ID field to encode multiple source states.

5.2.2.1 Observations

We use our running example in Figure 5.1 to illustrate shadow encoding. We observe that all transitions with source states s1 and s2 have the same destination state except for the transitions on character c. Likewise, source state s0 differs from source states s1 and s2 only in the character range [a..o]. This implies there is a lot of state redundancy. The table in Figure 5.2 shows how we can exploit state redundancy to further reduce the required TCAM space. First, since states s1 and s2 are most similar, we give them the state IDs 00 and 01, respectively. State s2 uses the ternary code 0∗ in the state ID field of its TCAM entries to share transitions with state s1. We give state s0 the state ID 10, and it uses the ternary code ∗∗ in the state ID field of its TCAM entries to share transitions with both states s1 and s2. Second, we order the state tables in the TCAM so that state s1's table is first, state s2's is second, and state s0's is last. This facilitates the sharing of transitions among different states, where earlier states have incomplete tables deferring some transitions to later tables. Specifically, s1 has an incomplete table with only a single TCAM entry to specify the transitions it does not share with s2, and s2 has an incomplete table with only 3 TCAM entries to specify the transitions it (and s1) does not share with s0.

Figure 5.2: TCAM table with shadow encoding. (Seven entries: one for s1, three for s2, and three for s0, with shadow codes 00, 0∗, and ∗∗ in the source state ID field.)

Implementing shadow encoding requires solving three key problems: (1) Find the best order of the state tables in the TCAM (any order is allowed). (2) Choose binary IDs and ternary codes for each state given the state table order. (3) Identify entries to remove from each state table.

5.2.2.2 Determining Table Order

We first describe how we compute the order of tables within the TCAM. In order to exploit inter-state transition sharing, we first build a D2FA for the given RE set. If p ≺ q (i.e., state p is a descendant of state q in the deferment forest), we say that state p is in state q's shadow. We use the partial order of the deferment forest of the D2FA to determine the order of state transition tables in the TCAM. Specifically, state q's transition table must be placed after the transition tables of all states in state q's shadow. That is, the state order is given by a post-order depth-first traversal of the deferment forest.

Figure 5.3 shows the D2FA, SRG, and the deferment tree, respectively, for the DFA in Figure 5.1.

Figure 5.3: D2FA, SRG, and deferment tree of the DFA in Figure 5.1. ((a) The D2FA, with default transitions from s1 to s2 and from s2 to s0; (b) the SRG, with edge weights 242, 243, and 255; (c) the deferment tree rooted at s0.)
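A minimal sketch of this ordering step, assuming children maps each state to its children in the deferment forest and roots lists the tree roots (both hypothetical names): a post-order walk emits every state after all states in its shadow.

def table_order(children, roots):
    # Descendants (the shadow) come before their ancestors, so each
    # incomplete table can defer to a table that appears later in the TCAM.
    order = []
    def visit(s):
        for c in children.get(s, []):
            visit(c)
        order.append(s)
    for r in roots:
        visit(r)
    return order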
5.2.2.3 Shadow Encoding Algorithm

We now describe our shadow encoding algorithm, which takes as input a deferment forest F and outputs the state IDs and shadow codes. We also use the term nodes to refer to states in the description of the algorithm. To ensure that proper sharing of transitions occurs, we need to compute a shadow encoding for the given deferment forest. In a valid shadow encoding, each state q is assigned a binary state ID (ID(q)) and a ternary shadow code (SC(q)). Binary state IDs are used in the destination state ID field (in the SRAM) of transition rules. Ternary shadow codes are used in the source state ID field (in the TCAM) of transition rules. The shadow length of a shadow encoding is the common length of every state ID and shadow code. A valid shadow encoding for a given deferment forest F must satisfy the following four Shadow Encoding Properties (SEP):

1. Uniqueness Property: For any two distinct states p and q, ID(p) ≠ ID(q) and SC(p) ≠ SC(q).

2. Self-Matching Property: For any state p, ID(p) ∈ SC(p) (i.e., ID(p) matches SC(p)).

3. Deferment Property: For any two states p and q, p ≺ q (i.e., q is an ancestor of p in the given deferment forest) if and only if SC(p) ⊂ SC(q).

4. Non-interception Property: For any two distinct states p and q, p ≺ q if and only if ID(p) ∈ SC(q).

Lemma 3. Given a valid shadow encoding for deferment forest F, for any state q and all states p in q's shadow, ID(p) ∈ SC(q).

Proof. The deferment property implies that SC(p) ⊂ SC(q). The self-matching property implies that ID(p) ∈ SC(p). Thus, the result follows.

Lemma 4. Given a valid shadow encoding for deferment forest F, for any state q and all states p not in q's shadow, ID(p) ∉ SC(q).

Proof. This follows immediately from the non-interception property.

Intuitively, q's shadow code must match the state ID of all states in q's shadow and cannot match the state ID of any state not in q's shadow.

Theorem 4. Given a DFA, a deferment forest F, a valid shadow encoding for F, and a TCAM classifier C that uses only binary state IDs for both source and destination state IDs in transition rules and that orders the state tables according to F, the TCAM classifier formed by replacing each source state ID in C with the corresponding shadow code and each destination state ID in C with the corresponding state ID is equivalent to C.

Proof. This follows from the first-match nature of TCAMs, the fact that the state tables are ordered according to F, and Lemmas 3 and 4.
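As a sanity check, the SEP can be verified mechanically. The sketch below tests all four properties by brute force; the data layout is assumed for illustration: IDs as '0'/'1' strings, shadow codes as '0'/'1'/'*' strings, and ancestors[p] the set of p's proper ancestors in the deferment forest.

def matches(bid, code):
    # a binary ID matches a ternary code if they agree on every fixed bit
    return all(c in ('*', b) for b, c in zip(bid, code))

def covers(cp, cq):
    # the keys matched by cp form a subset of the keys matched by cq
    return all(q == '*' or p == q for p, q in zip(cp, cq))

def check_sep(ID, SC, ancestors):
    for p in ID:
        assert matches(ID[p], SC[p]), 'self-matching'
        for q in ID:
            if p == q:
                continue
            assert ID[p] != ID[q] and SC[p] != SC[q], 'uniqueness'
            below = q in ancestors[p]                # p is in q's shadow
            strict = covers(SC[p], SC[q]) and SC[p] != SC[q]
            assert strict == below, 'deferment'
            assert matches(ID[p], SC[q]) == below, 'non-interception'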
We give a shadow encoding algorithm for the case where the deferment forest is a single deferment tree DT. We handle deferment forests by simply creating a virtual root node whose children are the roots of the deferment trees in the forest and then running the algorithm on this tree. Our algorithm uses the following internal variables for each node v: a local binary ID denoted L(v), a global binary ID denoted G(v), and an integer weight denoted W(v) that is the shadow length we would use for the subtree of DT rooted at v. Intuitively, the state ID of v will be G(v)|L(v), where | denotes concatenation, and the shadow code of v will be the prefix string G(v) followed by the required number of ∗'s; some extra padding characters may be needed. We use #L(v) and #G(v) to denote the number of bits in L(v) and G(v), respectively.

Our algorithm works as follows. For all v, we initially set L(v) = G(v) = ∅ and W(v) = 0. Our algorithm works recursively in a bottom-up fashion. We mark nodes red when they have been processed. We begin by marking each leaf node of DT as processed. We process an internal node v when all its children v1, · · · , vn are marked red. Once a node v is processed, its weight W(v) and its local ID L(v) are fixed, but we will prepend additional bits to its global ID G(v) when we process its ancestors in DT.

While processing v, we assign v and each of its n children a variable-length binary code, called an HCode, that is prefix free (i.e., no HCode is a prefix of another HCode). One option is to assign each of the (n + 1) nodes a binary number from 0 to n using ⌈log2(n + 1)⌉ bits. To minimize the shadow length W(v), we instead use a Huffman coding style algorithm to compute the HCodes and W(v). This algorithm uses two data structures: a binary encoding tree T with n + 1 leaf nodes, one for v and each of its children, and a min-priority queue PQ, initialized with n + 1 elements (one for v and each of its children), that is ordered by node weight. While PQ has more than one element, we remove the two elements x and y with the lowest weights from PQ, create a new internal node z in T with two children x and y, set weight(z) = max(weight(x), weight(y)) + 1, and then put element z into PQ. When PQ has only one element, T is complete. The HCode assigned to each leaf node is the path in T from the root node to that leaf, where left edges have value 0 and right edges have value 1.

We update the internal variables of v and its descendants in DT as follows. We set L(v) to be its HCode and W(v) to be the weight of the root node of T; G(v) is left empty. For each child vi, we prepend vi's HCode to the global ID of every node in the subtree rooted at vi, including vi itself. We then mark v as red. This continues until all nodes in DT are red.

We now set the state IDs and shadow codes. The shadow length is k, the weight of the root node of DT. We use {∗}^m to denote a ternary string of m ∗'s and {0}^m to denote a binary string of m 0's. For each node v, we compute v's state ID and shadow code as follows:

ID(v) = G(v) | L(v) | {0}^(k − #G(v) − #L(v)),
SC(v) = G(v) | {∗}^(k − #G(v)).

We illustrate our shadow encoding algorithm in Figure 5.4. Figure 5.4(a) shows all the internal variables just before v1 is processed. Figure 5.4(b) shows the Huffman style binary encoding tree T built for node v1 and its children v2, v3, and v4, and the resulting HCodes. Figure 5.4(c) shows each node's final weight, global ID, local ID, state ID, and shadow code.

Figure 5.4: Shadow encoding example. ((a) The deferment tree just before v1 is processed: leaves v2, v5, v6, v7 have W = 0; v3 has L = 0, W = 1, with child v5 (G = 1); v4 has L = 00, W = 2, with children v6 (G = 01) and v7 (G = 10). (b) The Huffman style tree built while processing v1: leaves v1, v2, v3, v4 with weights 0, 0, 1, 2 receive HCodes 000, 001, 01, and 1; the root weight is 3 = max(2, 2) + 1. (c) The final assignment: v1: ID = 000, SC = ∗∗∗; v2: ID = 001, SC = 001; v3: ID = 010, SC = 01∗; v5: ID = 011, SC = 011; v4: ID = 100, SC = 1∗∗; v6: ID = 101, SC = 101; v7: ID = 110, SC = 110.)

The pseudo-code for the shadow encoding algorithm is given in Figure 5.5.

Input: Deferment forest DF with n states s1, . . . , sn.
Output: ID[1..n] and SC[1..n] for each state.

1   Add state s0 to DF with all the tree roots as its children;
2   Set all ID[1..n] and SC[1..n] to the empty string;
3   ShadowEncode(s0);
4   return ID[1..n] and SC[1..n];

5   Function ShadowEncode(s)
    // Base case
6   if s has no children then return 0;
    // Recursive case
7   r ← number of children of s;
8   CHILD[1..r] ← list of children of s;
9   for i = 1 to r do
10      W[i] ← ShadowEncode(CHILD[i]);
11  W[0] ← 0;
12  G ← HCode(W);
13  l ← max over 0 ≤ i ≤ r of (|G[i]| + W[i]);
14  for i = 1 to r do
15      Append 0's at the end of G[i] to make |G[i]| + W[i] = l;
16      Attach G[i] in front of ID and SC for each state in the subtree of CHILD[i];
17  ID(s) ← {0}^l; SC(s) ← {∗}^l;
18  return l;

19  Function HCode(W[0..r])
20  Initialize Q as a min-priority queue of binary tree nodes;
21  for i = 0 to r do
22      Insert leaf node ni in Q with value V[ni] ← W[i];
23  while |Q| > 1 do
24      nl ← pop(Q); nr ← pop(Q);
25      Insert node n in Q with nl and nr as left and right children, and value V[n] ← max(V[nl], V[nr]) + 1;
26  n ← pop(Q);
27  Generate the codes based on the Huffman tree rooted at n;
28  return the codes assigned to the leaf nodes;

Figure 5.5: Shadow Encoding Algorithm.
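For concreteness, here is a compact runnable Python sketch of the same bottom-up procedure. It follows the prose description (each node's ID is G(v)|L(v) padded with 0's, rather than the pseudo-code's all-zero ID for internal nodes), and children is an assumed dict mapping each node to its deferment-tree children.

import heapq
from itertools import count

def hcodes(weights):
    # Prefix-free binary codes chosen Huffman-style so that
    # max_i(len(code_i) + weights[i]) is minimized: repeatedly merge the
    # two lightest trees; the merged weight is the max of the two plus one.
    tie = count()
    heap = [(w, next(tie), i) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    kids, fresh = {}, count(len(weights))
    while len(heap) > 1:
        wl, _, l = heapq.heappop(heap)
        wr, _, r = heapq.heappop(heap)
        n = next(fresh)
        kids[n] = (l, r)
        heapq.heappush(heap, (max(wl, wr) + 1, next(tie), n))
    codes = [''] * len(weights)
    def walk(n, code):
        if n in kids:
            walk(kids[n][0], code + '0')
            walk(kids[n][1], code + '1')
        else:
            codes[n] = code
    walk(heap[0][2], '')
    return codes

def subtree(children, v):
    yield v
    for c in children.get(v, []):
        yield from subtree(children, c)

def encode(children, v, ID, SC):
    # Returns W(v); ID/SC hold suffixes that ancestors extend by
    # prepending their child HCodes (the global ID G).
    kids = children.get(v, [])
    if not kids:
        ID[v], SC[v] = '', ''
        return 0
    w = [encode(children, c, ID, SC) for c in kids]
    codes = hcodes([0] + w)                    # index 0 is v itself
    l = max(len(h) + wi for h, wi in zip(codes, [0] + w))
    for h, wi, c in zip(codes[1:], w, kids):
        g = h + '0' * (l - len(h) - wi)        # pad so subtrees line up
        for d in subtree(children, c):
            ID[d] = g + ID[d]
            SC[d] = g + SC[d]
    ID[v] = codes[0] + '0' * (l - len(codes[0]))
    SC[v] = ''                                 # v's own suffix is all *'s
    return l

def shadow_encoding(children, root):
    ID, SC = {}, {}
    k = encode(children, root, ID, SC)         # k is the shadow length
    for v in subtree(children, root):
        SC[v] += '*' * (k - len(SC[v]))
    return ID, SC, k

On the deferment tree of Figure 5.4 (children = {'v1': ['v2', 'v3', 'v4'], 'v3': ['v5'], 'v4': ['v6', 'v7']}), shadow_encoding returns a shadow length of 3; the exact codes can differ from the figure because Huffman tie-breaking is arbitrary, but they have the same lengths and satisfy the SEP.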
We now prove two properties of our shadow encoding algorithm using induction on the height n of the deferment tree T. In both proofs, in the inductive case, we let s denote the root node of T, s1 through sc denote the c children of s, and Ti for 1 ≤ i ≤ c denote the subtree rooted at si.

Theorem 5. The state IDs and shadow codes generated by our shadow encoding algorithm satisfy the SEP.

Proof. We prove this by induction on the height n of T. The base case where n = 0 is trivial since there is only a single node. For the inductive case, our inductive hypothesis is that the shadow codes and state IDs generated for Ti for 1 ≤ i ≤ c satisfy the SEP. Note, we do not process the root node s in this assumption. We now consider what happens when we process s. For each node v ∈ Ti for 1 ≤ i ≤ c, HCode(si) is prepended to SC(v) and ID(v). Thus, the SEP still holds for all the nodes within Ti for 1 ≤ i ≤ c. For any nodes p and q from different subtrees Ti and Tj, it follows that ID(p) ∉ SC(q) and ID(q) ∉ SC(p) because HCode(si) and HCode(sj) are not prefixes of each other. Finally, for all nodes v ∈ T, ID(v) ∈ SC(s) because SC(s) contains only ∗'s.

We define a prefix shadow encoding as a shadow encoding where all shadow codes are prefix strings; that is, all ∗'s come after any 0's or 1's. For any prefix shadow encoding E of T, E_Ti denotes the subset of state IDs and shadow codes for all v ∈ Ti. For any state ID or shadow code X, we write the first p characters of X as pX and the last p characters of X as Xp. We define E_Ti^p = {Xp | X ∈ E_Ti}.

Lemma 5. Consider a deferment tree T with a valid length-x prefix shadow encoding E that satisfies the SEP. For every child si, 1 ≤ i ≤ c, of the root of T, there exist two values pi and qi such that:

1. For all i: 0 < pi ≤ x, 0 ≤ qi < x, and pi + qi = x.
2. For all i and all v ∈ Ti: the first pi characters of ID(v), of SC(v), and of SC(si) are all identical.
3. For all i: E_Ti^qi, the encoding formed by the last qi characters of each code in E_Ti, is a valid prefix shadow encoding of Ti.
4. The set E_ID = {the first pi characters of SC(si) | 1 ≤ i ≤ c} is prefix free.

Proof. Since E is a prefix shadow encoding, for any child si, SC(si) must be of the form {0,1}^a {∗}^(x−a). Let pi = a and qi = x − a. Now, pi > 0; otherwise we would have SC(si) = {∗}^x, which is not possible as it would violate the deferment and non-interception properties. This proves (1). Also, since E satisfies the deferment and self-matching properties, we must have (2) and (3). And we must have (4) because of the non-interception property.

Our shadow encoding algorithm produces minimum length encodings.
Theorem 6. For any deferment tree T, our shadow encoding algorithm generates the shortest possible prefix shadow encoding that satisfies the SEP.

Proof. First, our shadow encoding algorithm clearly generates a prefix shadow encoding. We prove by induction on the height n of T that it is the shortest possible prefix shadow encoding. The base case where n = 0 is trivial since the encoding for a single node is empty and thus optimal. For the inductive case, our inductive hypothesis is that the prefix shadow encoding for each Ti, 1 ≤ i ≤ c, is the shortest possible. Let E be the prefix shadow encoding generated by our shadow encoding algorithm and F be an optimal prefix shadow encoding. Let l and m be the lengths of E and F, respectively. Let gi and wi be the values defined by Lemma 5 for E, and let pi and qi be the corresponding values for F. By the inductive hypothesis, we have wi ≤ qi for 1 ≤ i ≤ c. If m < l, then the optimal prefix shadow encoding for T must compute a better set of HCode equivalents for each child node si. In particular, we would have max_i(pi + qi) < max_i(gi + wi). That is, given equal or larger initial lengths {qi}, the optimal prefix shadow encoding computes prefix-free codes F_ID for the children that are shorter than the prefix-free codes E_ID computed by the HCode subroutine. However, this is a contradiction, since the Huffman style encoding used to compute the HCodes minimizes the term max_i(gi + wi) [21]. Therefore, we must have l ≤ m.

Experimentally, we found that our shadow encoding algorithm is effective at minimizing shadow length. No DFA had a shadow length larger than ⌈log2 |Q|⌉ + 3, where ⌈log2 |Q|⌉ is the shortest possible shadow length.

5.2.2.4 Choosing Transitions

Section 5.2.1 describes how the TCAM tables are generated for states with all 256 transitions (i.e., for root states) using 1-dimensional complete classifier minimization. But non-root states do not have complete tables. We now describe how we apply the character bundling technique to generate the TCAM tables for non-root states. For a given DFA and a corresponding deferment forest, we construct a D2FA by choosing which transitions to encode in each transition table as follows. If state p has a default transition to state q, we identify p's deferrable transitions, which are the transitions that are common to both p's transition table and q's transition table. These deferrable transitions are optional for p's transition table; that is, they can be removed to create an incomplete transition table, or included if that results in fewer TCAM entries. Figure 5.2 is an example where including a deferrable transition produces a smaller classifier. The second entry in s2's table in Figure 5.2 can be deferred to state s0's transition table. However, this results in a classifier with at least 4 TCAM entries, whereas specifying the transition allows a classifier with just 3 TCAM entries. This leads us to the following problem, for which we give an optimal solution.
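Identifying the deferrable transitions themselves is straightforward. A one-function sketch, assuming delta[s][c] holds the DFA transition δ(s, c):

def deferrable(delta, p, q):
    # Characters on which p agrees with its deferment target q; each such
    # transition may either be dropped from p's TCAM table (deferred) or
    # kept, whichever yields fewer entries after bundling.
    return {c for c in range(256) if delta[p][c] == delta[q][c]}

The kept-or-dropped choice for each deferrable character is exactly what the dynamic program below optimizes.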
Definition 4 (Partially Deferred Incomplete One-dimensional TCAM Minimization Problem). Given a one-dimensional packet classifier f on {∗}^b and a subset D ⊆ {∗}^b, find the minimum cost prefix classifier f′ such that Cover(f′) ⊇ {∗}^b \ D and f′ is equivalent to f over Cover(f′).

Here b is the field width (in bits), D is the set of packets that can be deferred, and Cover(c) is the union of the predicates of all the rules in c (i.e., all the packets matched by c). For simplicity of description, we assume that f has a flattened rule set (i.e., one rule for each packet, with the packet as the rule predicate). Assuming the packet is a one byte character, this implies f has 256 rules.

We provide a dynamic programming formulation for solving this problem that is similar to the dynamic programming formulations used in [31] and [47] to solve the related problem where all transitions must be specified. In those previous solutions for complete classifiers, for each prefix, the dynamic program maintains an optimal solution for each possible final decision. It then specifies how to combine the optimal solutions for two matching prefixes into an optimal solution for the prefix that is the union of the two matching prefixes; in this step, two final rules, one per prefix, that have the same decision can be replaced by a single final rule for the combined prefix, resulting in a savings of one TCAM entry. The main change here is to also maintain an optimal solution for each prefix where we defer some transitions within the prefix.

We now formally specify this dynamic program, introducing the following notation. Let di, i ≥ 1, denote the actual decisions in a classifier. For a prefix P = {0,1}^k {∗}^(b−k), we use P0 to denote the prefix {0,1}^k 0 {∗}^(b−k−1) and P1 to denote the matching prefix {0,1}^k 1 {∗}^(b−k−1). For a classifier f on {∗}^b and a prefix P ⊆ {∗}^b, f_P denotes a classifier on P that is equivalent to f (i.e., the subset of rules in f with predicates that are in P); so f = f_{∗}^b. For i ≥ 1, f_P^di denotes a classifier on P that is equivalent to f and whose last rule has decision di; note that all packets in P are specified by such classifiers. Classifier f_P^d0 denotes the optimal classifier that is equivalent to f except that it possibly defers some packets within D. We use C(f_P^di) to denote the cost of the minimum classifier equivalent to f_P^di for i ≥ 0. For a statement S, [S] evaluates to 1 when S is true and to 0 otherwise. We use x to represent a packet in the prefix P currently being considered.

Theorem 7. Given a one-dimensional classifier f on {∗}^b with a set of possible decisions {d1, d2, . . . , dz}, a subset D ⊆ {∗}^b, and a prefix P ⊆ {∗}^b, C(f_P^di) is calculated as follows.

(1) For i > 0:

    C(f_P^di) = 1 + [f(x) ≠ di]                                          if f is consistent on P,
    C(f_P^di) = min over j = 1..z of ( C(f_P0^dj) + C(f_P1^dj) − 1 + [j ≠ i] )   otherwise.

(2) For i = 0:

    C(f_P^d0) = 0                                                        if P ⊆ D,
    C(f_P^d0) = min( min over i = 1..z of C(f_P^di), C(f_P0^d0) + C(f_P1^d0) )   otherwise.

Proof. (1) When i > 0, we just build a minimum cost complete classifier. The recursion and the proof are exactly the same as given in Theorem 4.1 of [31] (with decision weights = 1). (2) We consider two possibilities: either the optimal classifier is a complete classifier, or it is an incomplete classifier. If the optimal classifier is incomplete, we consider two cases. If the entire prefix P is contained within D and can be deferred, the minimum cost classifier defers all transitions and has cost 0. Otherwise, the minimum cost classifier for P is the minimum cost classifier for P0 concatenated with the minimum cost classifier for P1; this is represented by the last term in the minimization for case (2). The possibility that the optimal classifier is a complete classifier is handled by the first term in the minimization for case (2).
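The recurrence translates directly into a memoized program. Below is a runnable sketch under simplifying assumptions: f is a flattened list of decisions whose length is a power of two (256 for one byte), deferrable is a parallel list of booleans marking D, and prefixes are represented as aligned half-open ranges.

from functools import lru_cache

def min_incomplete_cost(f, deferrable):
    # Cost, in TCAM entries, of an optimal partially deferred prefix
    # classifier for the flattened table f (Theorem 7).
    decisions = sorted(set(f))

    @lru_cache(maxsize=None)
    def complete(lo, hi, d):
        # C(f_P^di) for i > 0: complete classifier on P = [lo, hi) whose
        # final (default) rule has decision d.
        if len(set(f[lo:hi])) == 1:                 # f is consistent on P
            return 1 + (f[lo] != d)
        mid = (lo + hi) // 2
        return min(complete(lo, mid, dj) + complete(mid, hi, dj) - 1
                   + (dj != d) for dj in decisions)

    @lru_cache(maxsize=None)
    def partial(lo, hi):
        # C(f_P^d0): may leave packets inside D unmatched (deferred).
        if all(deferrable[lo:hi]):
            return 0
        best = min(complete(lo, hi, d) for d in decisions)
        if hi - lo > 1:
            mid = (lo + hi) // 2
            best = min(best, partial(lo, mid) + partial(mid, hi))
        return best

    return partial(0, len(f))

For example, min_incomplete_cost([1, 1, 2, 2], [False, False, False, True]) returns 2: one rule covering the prefix 0∗ with decision 1 and one rule for packet 10 with decision 2, with packet 11 deferred.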
5.3 Table Consolidation

We now present table consolidation, where we combine multiple transition tables for different states into a single transition table such that the combined table takes less TCAM space than the total TCAM space used by the original tables. To define table consolidation, we need two new concepts: the k-decision rule and the k-decision table. A k-decision rule is a rule whose decision is an array of k decisions. A k-decision table is a sequence of k-decision rules following the first-match semantics. Given a k-decision table T and i (0 ≤ i < k), if for every rule r in T we delete all the decisions except the i-th decision, we get a 1-decision table, which we denote as T[i]. In table consolidation, we take a set of k 1-decision tables T0, · · · , Tk−1 and construct a k-decision table T such that for any i (0 ≤ i < k), the condition Ti ≡ T[i] holds, where Ti ≡ T[i] means that Ti and T[i] are equivalent (i.e., they have the same decision for every search key). We call the process of computing the k-decision table T table consolidation, and we call T the consolidated table.

5.3.1 Observations

Table consolidation is based on three observations. First, semantically different TCAM tables may share common entries with possibly different decisions. For example, the three tables for s0, s1, and s2 in Figure 5.1 have three entries in common: 01100000, 0110∗∗∗∗, and ∗∗∗∗∗∗∗∗. Table consolidation provides a novel way to remove such information redundancy. Second, given any set of k 1-decision tables T0, · · · , Tk−1, we can always find a k-decision table T such that for any i (0 ≤ i < k), the condition Ti ≡ T[i] holds. This is easy to prove, as we can use one entry per possible binary search key in T. Third, a TCAM chip typically has a built-in SRAM module that is commonly used to store lookup decisions. For a TCAM with n entries, the SRAM module is arranged as an array of n entries, where SRAM[i] stores the decision of TCAM[i] for every i. A TCAM lookup returns the index of the first matching entry in the TCAM, which is then used as the index to directly find the corresponding decision in the SRAM. In table consolidation, we essentially trade SRAM space for TCAM space because each SRAM entry needs to store multiple decisions. As SRAM is cheaper and more efficient than TCAM, moderately increasing SRAM usage to decrease TCAM usage is worthwhile.

Figure 5.6 shows the TCAM lookup table and the SRAM decision table for a 3-decision consolidated table for states s0, s1, and s2 in Figure 5.1. In this example, table consolidation reduces the number of TCAM entries from 11 to 5 for storing the transition tables of states s0, s1, and s2. This consolidated table has an ID of 0. As both the table ID and a column ID are needed to encode a state, we use the notation <Table ID>@<Column ID> to represent a state.

Figure 5.6: 3-decision table for the 3 states in Figure 5.1. (Five TCAM entries with consolidated table ID 0 and input character patterns 0110 0000, 0110 0010, 0110 0011, 0110 ∗∗∗∗, and ∗∗∗∗ ∗∗∗∗; each SRAM entry stores three decisions, one per column ID 00, 01, and 10.)

We illustrate the processing of an input character stream with table consolidation using this example 3-decision table. Suppose the input character string is "01101111, 01100011". The initial state is state s0, which is represented as 0@00. We prepend s0's table ID of 0 to the first character 01101111 to form the lookup key 001101111. This matches the fourth TCAM entry in the 3-decision table. We now need to find the decision. We use s0's column ID 00 to determine that the first decision is the correct decision. This gives us the state s1, which is represented as 0@01. We then prepend s1's table ID of 0 to the second character 01100011 to form the lookup key 001100011. This matches the third TCAM entry. We use s1's column ID of 01 to determine that the second decision is the correct decision. This gives us the next state s2, which has code 0@10. Because s2 is an accepting state, we would accept the input string. Note that because this DFA has only 3 states, which have all been consolidated together, all three states have the same table ID of 0. In general, with more states than just those consolidated together, we would have more table IDs.
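A sketch of this <table>@<column> lookup convention in Python, with ternary patterns as '0'/'1'/'*' strings and a linear scan standing in for the TCAM:

def consolidated_lookup(tcam, sram, state, char_bits):
    # tcam: list of (table-ID bits, ternary character pattern);
    # sram: per-entry arrays of k decisions, one per column;
    # state: (table_id_bits, column_index).
    match = lambda bits, pat: all(p in ('*', b) for b, p in zip(bits, pat))
    tid, col = state
    for i, (tpat, cpat) in enumerate(tcam):
        if match(tid, tpat) and match(char_bits, cpat):
            return sram[i][col]        # the column ID selects the decision
    raise ValueError('no matching entry')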
There are two key technical challenges in table consolidation. The first challenge is how to consolidate k 1-decision transition tables into a k-decision transition table. The second challenge is deciding which 1-decision transition tables should be consolidated together. Intuitively, the more similar two 1-decision transition tables are, the more TCAM space we save by consolidating them together. However, we have to consider the deferment relationships among states. We now present our solutions to these two challenges.

5.3.2 Computing a k-decision table

In this section, we assume we know which states need to be consolidated together and present a local state consolidation algorithm that takes a k1-decision table for state set Si and a k2-decision table for another state set Sj as its input and outputs a consolidated (k1 + k2)-decision table for state set Si ∪ Sj. For ease of presentation, we first assume that k1 = k2 = 1. Let s1 and s2 be the two input states, which have default transitions to states s3 and s4, respectively. The consolidated table will be assigned a common table ID X. We assign state s1 column ID 0 and state s2 column ID 1. Thus, we encode s1 as X@0 and s2 as X@1. We enforce the constraint that if we do not consolidate s3 and s4 together, then s1 and s2 cannot defer any transitions at all. If we do consolidate s3 and s4 together, then s1 and s2 may have incomplete transition tables due to their default transitions to s3 and s4, respectively.

The key concepts underlying this algorithm are breakpoints and critical ranges. To define breakpoints, it is helpful to view Σ as numbers ranging from 0 to |Σ| − 1; given 8 bit characters, |Σ| = 256. For any state s, we define a character i ∈ Σ to be a breakpoint for s if δ(s, i) ≠ δ(s, i − 1). For the end cases, we define 0 and |Σ| to be breakpoints for every state s. Let b(s) be the set of breakpoints for state s. We then define b(S) = ∪_{s∈S} b(s) to be the set of breakpoints for a set of states S ⊆ Q. Finally, for any set of states S, we define r(S) to be the set of ranges defined by b(S): r(S) = {[0, b2 − 1], [b2, b3 − 1], . . . , [b_{|b(S)|−1}, |Σ| − 1]}, where bi is the ith smallest breakpoint in b(S). Note that 0 = b1 is the smallest breakpoint and |Σ| is the largest breakpoint in b(S). Within r(S), we label the range beginning at breakpoint bi as ri for 1 ≤ i ≤ |b(S)| − 1. If δ(s, bi) is deferred, then ri is a deferred range for s.

When we consolidate s1 and s2 together, we compute b({s1, s2}) and r({s1, s2}). For each r ∈ r({s1, s2}) where r is not a deferred range for both s1 and s2, we create a consolidated transition rule where the decision of the entry is the ordered pair of decisions for states s1 and s2 on r. For each r ∈ r({s1, s2}) where r is a deferred range for one of s1 and s2 but not the other, we fill in r in the incomplete transition table where it is deferred, and we create a consolidated entry where the decision of the entry is the ordered pair of decisions for states s1 and s2 on r. Finally, for each r ∈ r({s1, s2}) where r is a deferred range for both s1 and s2, we do not create a consolidated entry. This produces a non-overlapping set of transition rules that may be incomplete if some ranges do not have a consolidated entry. If the final consolidated transition table is complete, we minimize it using the optimal 1-dimensional TCAM minimization algorithm in [30, 47]. If the table is incomplete, we minimize it using the 1-dimensional incomplete classifier minimization algorithm in [31]. We generalize this algorithm to cases where k1 > 1 or k2 > 1 by simply considering k1 + k2 states when computing the breakpoints and ranges.
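A sketch of the breakpoint computation and rule generation, under simplifying assumptions: dfa[s][c] holds the full DFA transition δ(s, c), table[s][c] holds state s's possibly incomplete table with None marking a deferred character, and each range is either wholly deferred or wholly specified for a given state.

def consolidate(table, dfa, s1, s2):
    # b({s1, s2}): characters where either state's DFA transition changes.
    bp = {0, 256}
    for s in (s1, s2):
        bp.update(c for c in range(1, 256) if dfa[s][c] != dfa[s][c - 1])
    bp = sorted(bp)
    rules = []
    for lo, hi in zip(bp, bp[1:]):              # the ranges of r({s1, s2})
        deferred1 = table[s1][lo] is None       # deferred range for s1?
        deferred2 = table[s2][lo] is None
        if deferred1 and deferred2:
            continue                            # deferred by both: no entry
        # one-sided deferrals are filled in from the DFA transition
        rules.append(((lo, hi - 1), (dfa[s1][lo], dfa[s2][lo])))
    return rules

The resulting list of 2-decision range rules is then handed to the complete or incomplete 1-dimensional minimization algorithm, as described above.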
5.3.3 Choosing States to Consolidate

We now describe our global consolidation algorithm for determining which states to consolidate together. As we observed earlier, if we want to consolidate two states s1 and s2 together, we need to consolidate their parent nodes in the deferment forest as well, or else we lose all the benefits of shadow encoding. Thus, we propose to consolidate two deferment trees together. A consolidated deferment tree must satisfy the following properties. First, each node is consolidated with at most one node in the second tree; some nodes may not be consolidated with any node in the second tree. Second, a level i node in one tree must be consolidated with a level i node in the second tree. The level of a node is its distance from the root; we define the root to be a level 0 node. Third, if two level i nodes are consolidated together, their level i − 1 parent nodes must also be consolidated together. An example legal matching of nodes between two deferment trees is depicted in Figure 5.7.

Figure 5.7: Consolidating two trees. (A legal level-by-level matching between the nodes x0, . . . , x9 of one deferment tree and the nodes y0, . . . , y7 of another.)

Given two deferment trees, we start the consolidation process from the roots. After we consolidate the two roots, we need to decide how to pair their children together. For each pair of nodes that are consolidated together, we again must choose how to pair their children together, and so on. We make an optimal choice using a combination of dynamic programming and matching techniques. Suppose we wish to compute the minimum cost C(x, y), measured in TCAM entries, of consolidating two subtrees rooted at nodes x and y, where x has u children X = {x1, . . . , xu} and y has v children Y = {y1, . . . , yv}. We first recursively compute C(xi, yj) for 1 ≤ i ≤ u and 1 ≤ j ≤ v, using our local state consolidation algorithm as a subroutine. We then construct a complete bipartite graph K_{X,Y} such that each edge (xi, yj) has edge weight C(xi, yj) for 1 ≤ i ≤ u and 1 ≤ j ≤ v. Then C(x, y) is the cost of a minimum weight matching [24, 35] of K_{X,Y} plus the cost of consolidating x and y. When |X| ≠ |Y|, to make the sets equal in size, we pad the smaller set with null states that defer all transitions.

Finally, we must decide which trees to consolidate together. We assume that we produce k-decision tables where k is a power of 2. We describe how we solve the problem for k = 2 first. We create an edge-weighted complete graph where each deferment tree is a node and where the weight of each edge is the cost of consolidating the two corresponding deferment trees together. We find a minimum weight matching [16, 18] of this complete graph to give us an optimal pairing for k = 2. For larger k = 2^l, we repeat this process l − 1 more times. Our matching is not necessarily optimal for k > 2. In some cases, the deferment forest may have only one tree. In such cases, we consider consolidating the subtrees rooted at the children of the root of the single deferment tree. We also consider similar options if we have a few deferment trees but they are not structurally similar. The pseudo-code for the tree consolidation algorithm is given in Figure 5.8.

Input: Deferment forest DF with r tree roots s1, . . . , sr.
Output: An optimal matching of the r roots.

1   For each pair of roots si and sj, compute C(si, sj);
2   Construct the complete graph Kr with the roots as vertices and C(si, sj) as edge weights;
3   return MinimumWeightMatching(Kr);

4   Function C(s1, s2)
    // Base case
5   if s1 and s2 have no children then
6       return ConsolidatedCost(s1, s2);
    // Recursive case
7   Attach NULL children so that both s1 and s2 have the same number of children, q;
8   Construct the complete bipartite graph Kq,q with the children of s1 and s2 as the vertices and C(sx, sy) as the edge weight between vertices sx and sy;
9   M ← MinimumWeightBipartiteMatching(Kq,q), giving the matching of the children;
10  count ← 0;
11  foreach matched pair (sx, sy) ∈ M do
12      count ← count + C(sx, sy);
13  return count + ConsolidatedCost(s1, s2);

Figure 5.8: Algorithm for Consolidating Trees.
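A compact sketch of C(x, y) using SciPy's Hungarian-algorithm routine for the bipartite matching step. Assumptions: SciPy is available, pair_cost stands in for the local state consolidation cost (the pseudo-code's ConsolidatedCost), None acts as a null state that defers everything, and pair_cost(None, None) = 0.

import numpy as np
from scipy.optimize import linear_sum_assignment

def consolidation_cost(x, y, children, pair_cost):
    # Minimum TCAM cost of consolidating the subtrees rooted at x and y.
    cx = list(children.get(x, [])) if x is not None else []
    cy = list(children.get(y, [])) if y is not None else []
    n = max(len(cx), len(cy))
    cx += [None] * (n - len(cx))            # pad with null states
    cy += [None] * (n - len(cy))
    cost = pair_cost(x, y)
    if n:
        w = np.array([[consolidation_cost(a, b, children, pair_cost)
                       for b in cy] for a in cx])
        rows, cols = linear_sum_assignment(w)   # min-weight perfect matching
        cost += int(w[rows, cols].sum())
    return cost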
5.3.3.1 Greedy Matching

Our algorithm using the matching subroutines gives the optimal pairing of deferment trees but can be relatively slow on larger DFAs. When running time is a concern, we use a greedy matching routine instead. When we need to match the children of two nodes x and y, we consider one child at a time from the node with fewer children (say x). First, all children of y are set unmarked. For each child xi of x, we find the best match among the unmarked children of y, match them up, and mark the matched child of y. The best match for xi is given by

argmin over yj ∈ {unmarked children of y} of C(xi, yj) / (C(xi) + C(yj)),

where C(x) is just the cost (in TCAM entries) of the subtree rooted at x. If C(xi) + C(yj) = 0, then we set the ratio to 0.5. All children of y that remain unmarked at the end are matched with null states. We consider the children of x in decreasing order of C(xi) to prioritize the larger children of x. We use the same approach for matching roots: first, all roots are set unmarked; each time, we consider the largest unmarked root, find the best match for it, and then mark the newly matched roots.

In our experiments, this greedy approach runs much faster than the optimal approach, and the resulting classifier size is not much larger. We also observe that another greedy approach, one that uses C(xi, yj) instead of C(xi, yj) / (C(xi) + C(yj)), produces classifiers with much larger TCAM sizes. That approach often matches a large child of x with a small child of y that it does not align well with.

5.3.4 Effectiveness of Table Consolidation

We now explain why table consolidation works well on real-world RE sets. Most real-world RE sets contain REs with wildcard closures '.∗', where the wildcard '.' matches any character and the closure '∗' allows for unlimited repetitions of the preceding character. Wildcard closures create deferment trees with lots of structural similarity. For example, consider the D2FA in Figure 5.9 for the RE set {/abc/, /abd/, /e.∗f/}, where we use dashed arrows to represent the default transitions. The second wildcard closure '.∗' in the RE /e.∗f/ duplicates the entire DFA sub-structure for recognizing REs /abc/ and /abd/. Thus, table consolidation of the subtree (0, 1, 2, 3, 4) with the subtree (5, 6, 7, 8, 9, 10) leads to significant space savings.

Figure 5.9: D2FA for the RE set {/abc/, /abd/, /e.∗f/}. (States 0-4 recognize /abc/ and /abd/ directly, with accepting states 3/1 and 4/2; after an e, states 5-10 replicate that structure, with accepting states 8/1, 9/2, and 10/3 for /abc/, /abd/, and /e.∗f/; dashed arrows are default transitions.)
5.4 Variable Striding

We explore ways to improve RE matching throughput by consuming multiple characters per TCAM lookup. One possibility is a k-stride DFA, which uses k-stride transitions that consume k characters per transition. Although k-stride DFAs can speed up RE matching by up to a factor of k, the number of states and transitions can grow exponentially in k. To limit the state and transition space explosion, we propose variable striding using variable-stride DFAs. A k-var-stride DFA consumes between 1 and k characters in each transition, with at least one transition consuming k characters. Conceptually, each state in a k-var-stride DFA has 256^k transitions, and each transition is labeled with (1) a unique string of k characters and (2) a stride length j (1 ≤ j ≤ k) indicating the number of characters consumed. In TCAM-based variable striding, each TCAM lookup uses the next k consecutive characters as the lookup key, but the number of characters consumed in the lookup varies from 1 to k; thus, the lookup decision contains both the destination state ID and the stride length.

There are many technical challenges in implementing variable striding. First, we need to control the exponential growth in the number of states. Second, we need to control the exponential growth in the number of transitions. Third, we need to carefully choose which transitions to expand from 1-stride to multi-stride given a specific amount of available TCAM space. Fourth, we need to carefully decide on the maximum stride length k. Increasing k can help by increasing average RE matching throughput; however, increasing k can hurt by requiring more TCAM space. Specifically, implementing a k-var-stride DFA in TCAM requires 8k bits for the k input characters in each lookup key. The width of a TCAM chip is configurable, but not arbitrary: commercially available TCAM chips typically can be configured with widths of 36, 72, 144, 288, or 576 bits. We must choose k so that we optimize throughput while not wasting bits in each TCAM entry.

5.4.1 Observations

We use an example to show how variable striding can achieve a significant RE matching throughput increase with a small and controllable space increase. Figure 5.10 shows a 3-var-stride transition table that corresponds to state s0 in Figure 5.1. This table has only 7 entries, as opposed to 116 entries in a full 3-stride table for s0. If we assume that each of the 256 characters is equally likely to occur, the average number of characters consumed per 3-var-stride transition of s0 is 1 × 1/16 + 2 × 15/256 + 3 × 225/256 = 2.82.

                      TCAM                                      SRAM
Src state  Inp char 1   Inp char 2   Inp char 3      Dest state   Stride
s0         0110 0000    ∗∗∗∗ ∗∗∗∗    ∗∗∗∗ ∗∗∗∗   →   s0           1
s0         0110 ∗∗∗∗    ∗∗∗∗ ∗∗∗∗    ∗∗∗∗ ∗∗∗∗   →   s1           1
s0         ∗∗∗∗ ∗∗∗∗    0110 0000    ∗∗∗∗ ∗∗∗∗   →   s0           2
s0         ∗∗∗∗ ∗∗∗∗    0110 ∗∗∗∗    ∗∗∗∗ ∗∗∗∗   →   s1           2
s0         ∗∗∗∗ ∗∗∗∗    ∗∗∗∗ ∗∗∗∗    0110 0000   →   s0           3
s0         ∗∗∗∗ ∗∗∗∗    ∗∗∗∗ ∗∗∗∗    0110 ∗∗∗∗   →   s1           3
s0         ∗∗∗∗ ∗∗∗∗    ∗∗∗∗ ∗∗∗∗    ∗∗∗∗ ∗∗∗∗   →   s0           3

Figure 5.10: 3-var-stride transition table for s0.

5.4.2 Eliminating State Explosion

We first explain how converting a 1-stride DFA to a k-stride DFA causes state explosion. For a source state and destination state pair (s, d), a k-stride transition path from s to d may contain k − 1 intermediate states (excluding d); for each unique combination of accepting states that appears on a k-stride transition path from s to d, we need to create a new destination state, because a unique combination of accepting states implies that the input has matched a unique combination of REs. This can be a very large number of new states. We eliminate state explosion by ending any k-var-stride transition path at the first accepting state it reaches. Thus, a k-var-stride DFA has the exact same state set as its corresponding 1-stride DFA.
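The following sketch pins down that cutoff rule (hypothetical layout: delta[s][c] is the 1-stride transition and accepting[s] a boolean). It enumerates a state's k-var-stride transitions, cutting each path at the first accepting state; the enumeration itself is exponential in k and is shown only to make the definition precise.

def var_stride_transitions(delta, accepting, s, k):
    # Each path from s ends at the first accepting state it reaches, or
    # after k characters, so no new destination states are ever created.
    out = []
    def walk(state, chars):
        for c in range(256):
            t = delta[state][c]
            if accepting[t] or len(chars) + 1 == k:
                out.append((chars + [c], t, len(chars) + 1))
            else:
                walk(t, chars + [c])
    walk(s, [])
    return out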
Ending k-var-stride transitions at accepting states does have subtle interactions with table consolidation and shadow encoding. We end any k-var-stride consolidated transition path at the first accepting state reached on any one of the paths being consolidated, which can reduce the expected throughput increase of variable striding. There is a similar but even more subtle interaction with shadow encoding, which we describe in the next section.

5.4.3 Controlling Transition Explosion

In a k-stride DFA converted from a 1-stride DFA with alphabet Σ, a state has |Σ|^k outgoing k-stride transitions. Although we can leverage our techniques of character bundling and shadow encoding to minimize the number of required TCAM entries, the rate of growth tends to be exponential with respect to the stride length k. We have two key ideas to control transition explosion: self-loop unrolling and k-var-stride transition sharing.

5.4.3.1 Self-Loop Unrolling Algorithm

We now consider root states, all of which are self-looping states. We have two methods to compute the k-var-stride transition tables of root states. The first is direct expansion (stopping transitions at accepting states); since these states do not defer to other states, this results in an exponential increase in table size with respect to k. The second method, which we call self-loop unrolling, scales linearly with k. Self-loop unrolling increases the stride of all the self-loop transitions encoded by the last default TCAM entry. It starts with a root state's j-var-stride transition table encoded as a compressed TCAM table of n entries whose final entry is a default entry representing most of the self-loops of the root state. (Note that given any complete TCAM table where the last entry is not a default entry, we can always replace that last entry with a default entry without changing the semantics of the table.) We generate the (j+1)-var-stride transition table by expanding the last default entry into n new entries, obtained by prepending 8 ∗'s as an extra default field to the beginning of the original n entries. This produces a (j+1)-var-stride transition table with 2n − 1 entries.

We next illustrate the idea of self-loop unrolling using an example. Consider state s0 of Figure 5.1. The default transition in s0's table is a self-loop that is matched by 240 characters; one more self-loop is matched by the first TCAM entry in s0's table. We can "unroll" this self-loop and increase the stride of many but not all 2-stride and 3-stride transitions as follows. First, we leave in place the first two 1-stride transitions. We then make 2-stride copies of these transitions, where we shift the characters over by one position and put a default character in the first position. These 2-stride transitions capture the case where the first character self-loops but is not 01100000, and the second character leaves state s0 or is 01100000. We then make 3-stride copies of these transitions, where we shift the characters over by one position again and put default characters in the first two positions. Finally, we include a stride-3 default transition that self-loops back to state s0. The resulting 7-entry variable-stride table is shown in Figure 5.10. In this example, we could continue applying self-loop unrolling to create even larger stride transitions at an additional cost of only 2 TCAM entries per extra character consumed.
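A sketch of one unrolling round, with entries as (character bit pattern, destination, stride) tuples whose final entry is the all-∗ default; patterns are left-aligned and conceptually padded with trailing ∗'s to the full 8k-bit key width. Repeated rounds regenerate some already-present patterns, which first-match semantics makes redundant, so the sketch drops duplicates; under these assumptions, two rounds on s0's 3-entry 1-stride table yield exactly the 7 entries of Figure 5.10.

def unroll_once(entries, char_bits=8):
    # Expand the final default entry: every entry gets a copy with one
    # character's worth of *'s prepended and its stride increased by one;
    # the shifted default becomes the new final default.
    *body, default = entries
    wild = '*' * char_bits
    shifted = [(wild + pat, dest, stride + 1) for pat, dest, stride in body]
    new_default = (wild + default[0], default[1], default[2] + 1)
    out, seen = [], set()
    for e in body + shifted + [new_default]:
        if e[0] not in seen:        # drop re-created (shadowed) patterns
            seen.add(e[0])
            out.append(e)
    return out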
5.4.3.2 k-var-stride Transition Sharing Algorithm

Similar to 1-stride DFAs, there are many transition sharing opportunities in a k-var-stride DFA. Consider two states s1 and s2 in a 1-stride DFA where s1 defers to s2. The deferment relationship implies that s1 shares many common 1-stride transitions with s2. In the k-var-stride DFA constructed from the 1-stride DFA, all k-var-stride transitions that begin with these common 1-stride transitions are also shared between s1 and s2. Furthermore, two transitions that do not begin with these common 1-stride transitions may still be shared between s1 and s2. For example, in the 1-stride DFA fragment in Figure 5.11, although s1 and s2 do not share a common transition on character a, when we construct the 2-var-stride DFA, s1 and s2 share the same 2-stride transition on the string aa, which ends at state s5.

Figure 5.11: States s1 and s2 share the transition aa. (On a, s1 goes to s3 and s2 goes to s4, but both s3 and s4 go to s5 on a, so the 2-stride transition on aa from either s1 or s2 ends at s5.)

To promote transition sharing among states in a k-var-stride DFA, we first need to decide on the deferment relationship among states. The ideal deferment relationship would be calculated based on the SRG of the final k-var-stride DFA. However, the k-var-stride DFA cannot be finalized before we compute the deferment relationship among states, because the final k-var-stride DFA is subject to many factors such as the available TCAM space. There are two approximation options for the final k-var-stride DFA when calculating the deferment relationship: the 1-stride DFA and the full k-stride DFA. We have tried both options in our experiments, and the difference in the resulting TCAM space is negligible. Thus, we simply use the deferment forest of the 1-stride DFA in computing the transition tables for the k-var-stride DFA.

Second, for any two states s1 and s2 where s1 defers to s2, we need to compute s1's k-var-stride transitions that are not shared with s2, because those transitions will constitute s1's k-var-stride transition table. Although this computation is trivial for 1-stride DFAs, it is a significant challenge for k-var-stride DFAs because each state has too many (256^k) k-var-stride transitions. The straightforward algorithm that enumerates all transitions has a time complexity of O(|Q|^2 |Σ|^k), which grows exponentially with k. We propose a dynamic programming algorithm with a time complexity of O(|Q|^2 |Σ| k), which grows linearly with k. Our key idea is that the non-shared transitions for a k-stride DFA can be quickly computed from the non-shared transitions of a (k−1)-var-stride DFA. For example, consider the two states s1 and s2 in Figure 5.11, where s1 defers to s2. On character a, s1 transits to s3 while s2 transits to s4. Assuming that we have computed all (k−1)-var-stride transitions of s3 that are not shared with the (k−1)-var-stride transitions of s4, if we prepend all these (k−1)-var-stride transitions with character a, the resulting k-var-stride transitions of s1 are all not shared with the k-var-stride transitions of s2, and therefore should all be included in s1's k-var-stride transition table.

Formally, using n(si, sj, k) to denote the number of k-stride transitions of si that are not shared with sj, our dynamic programming algorithm uses the following recursive relationship between n(si, sj, k) and n(si, sj, k − 1):

n(si, sj, 0) = 0 if si = sj, and 1 if si ≠ sj;    (5.1)

n(si, sj, k) = Σ over c ∈ Σ of n(δ(si, c), δ(sj, c), k − 1).    (5.2)
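A memoized sketch of this recursion (assuming delta[s][c] holds δ(s, c)); for simplicity it ignores the accepting-state cutoff discussed next.

from functools import lru_cache

def nonshared(delta, si, sj, k):
    # n(si, sj, k): the number of k-stride transitions of si that are not
    # shared with sj, per Equations (5.1) and (5.2).
    @lru_cache(maxsize=None)
    def n(p, q, depth):
        if depth == 0:
            return 0 if p == q else 1
        return sum(n(delta[p][c], delta[q][c], depth - 1)
                   for c in range(256))
    return n(si, sj, k)

Because the memo table is indexed by state pairs and depths, each of the O(|Q|^2 k) subproblems is solved once at a cost of |Σ| additions, matching the stated O(|Q|^2 |Σ| k) bound.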
Formally, using n(si, sj, k) to denote the number of k-stride transitions of si that are not shared with sj, our dynamic programming algorithm uses the following recursive relationship between n(si, sj, k) and n(si, sj, k − 1):

    n(si, sj, 0) = 0 if si = sj, and 1 if si ≠ sj                    (5.1)

    n(si, sj, k) = Σ_{c ∈ Σ} n(δ(si, c), δ(sj, c), k − 1)            (5.2)

The above formulas assume that the intermediate states on the k-stride paths starting from si or sj are all non-accepting. For state si, we stop increasing the stride length along a path whenever we encounter an accepting state on that path or on the corresponding path starting from sj. The reason is similar to why we stop a consolidated path at an accepting state, but the reasoning is more subtle. Let p be the string that leads sj to an accepting state. The key observation is that any k-var-stride path that starts from sj and begins with p ends at that accepting state. This means that si cannot exploit transition sharing on any string that begins with p.

Figure 5.12 shows the resulting 2-var-stride transition tables for all three states s0, s1, and s2 of the D2FA in Figure 5.3(a). Note that the one transition out of state s1 and the two self-loop transitions of state s2 have stride 1 because they end at s2, an accepting state.

    TCAM                                       SRAM
    Src state  Inp char1    Inp char2          Dest state  Stride
    s1         [c]          ∗                  s2          1
    s2         [b..c]       [c]                s2          2
    s2         [a]          ∗                  s2          1
    s2         [d..o]       ∗                  s2          1
    s0         [a..o]       [0..96]            s0          2
    s0         [a..o]       [a]                s2          2
    s0         [a..o]       [b]                s1          2
    s0         [a..o]       [c..o]             s2          2
    s0         [a..o]       [112..255]         s0          2
    s0         [0..96]      [0..96]            s0          2
    s0         [0..96]      [a..o]             s1          2
    s0         [0..96]      [112..255]         s0          2
    s0         [112..255]   [0..96]            s0          2
    s0         [112..255]   [a..o]             s1          2
    s0         [112..255]   [112..255]         s0          2

Figure 5.12: Uncompressed 2-var-stride transition tables for the D2FA in Figure 5.3(a) (a = 97, o = 111)

The above dynamic programming algorithm produces non-overlapping and incomplete transition tables, which we compress using the 1-dimensional incomplete classifier minimization algorithm in [31].
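The recurrence translates directly into a memoized procedure. The following Python sketch is illustrative only; it assumes the 1-stride transition function is given as a nested mapping delta[state][c], and it omits the accepting-state cutoff described above. Memoization gives the claimed O(|Q|^2 |Σ| k) behavior, since there are at most |Q|^2 k distinct subproblems, each combining |Σ| smaller results.

    from functools import lru_cache

    def make_nonshared_counter(delta, alphabet):
        @lru_cache(maxsize=None)
        def n(si, sj, k):
            # Equation (5.1): a 0-stride "transition" is shared iff the
            # two states are identical.
            if k == 0:
                return 0 if si == sj else 1
            # Equation (5.2): extend each 1-stride prefix c by one stride.
            return sum(n(delta[si][c], delta[sj][c], k - 1) for c in alphabet)
        return n

For example, make_nonshared_counter(delta, alphabet)(s1, s2, k) returns the number of k-stride transitions of s1 that are not shared with s2, assuming states are hashable identifiers.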
5.4.4 Variable Striding Selection Algorithm

We now propose solutions for the third key challenge: which states should have their stride lengths increased, and by how much; that is, how should we compute the transition function δ? Note that each state can independently choose its variable stride length as long as the final transition tables are composed together according to the deferment forest. This is easily proven from the way we generate k-var-stride transition tables. For any two states s1 and s2 where s1 defers to s2, the way we generate s1's k-var-stride transition table may seem to assume that s2's transition table is also k-var-stride; in fact, we make no such assumption. For example, if we choose k-var-stride (k ≥ 2) for s1 and 1-stride for s2, all strings from s1 will be processed correctly; the only issue is that strings deferred to s2 will process only one character. We view this as a packing problem: given a TCAM capacity C, for each state s we select a variable stride length value Ks such that Σ_{s ∈ Q} |T(s, Ks)| ≤ C, where T(s, Ks) denotes the Ks-var-stride transition table of state s. This packing problem has a flavor of the knapsack problem, but an exact formulation of an optimization function is impossible without making assumptions about the input character distribution. We propose the following algorithm for finding a feasible δ that strives to maximize the minimum stride of any state.

First, we use all the 1-stride tables as our initial selection. Second, for each j-var-stride (j ≥ 2) table t of state s, we create a tuple (l, d, |t|), where l denotes the variable stride length, d denotes the distance from state s to the root of the deferment tree that s belongs to, and |t| denotes the number of entries in t. As the stride length l increases, the individual table size |t| may increase significantly, particularly for the complete tables of root states. To balance table sizes, we set limits on the maximum allowed table size for root states and for non-root states. If a root state table exceeds the root state threshold when we create its j-var-stride table, we instead apply self-loop unrolling once to its (j−1)-var-stride table to produce the j-var-stride table. If a non-root state table exceeds the non-root state threshold when we create its j-var-stride table, we simply use its (j−1)-var-stride table as its j-var-stride table. Third, we sort the tables by these tuple values in increasing order, first by l, then by d, then by |t|, and finally by a pseudorandom coin flip to break ties. Fourth, we consider each table t′ in this order. Let t be the table for the same state s in the current selection. If replacing t by t′ does not exceed our TCAM capacity C, we make the replacement.
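The following Python sketch illustrates this greedy selection. The tuple layout and names are our own illustration; we assume the candidate j-var-stride tables (with the root and non-root size thresholds already applied) are given, and we track only table sizes.

    import random

    def select_tables(one_stride_sizes, candidates, capacity):
        # one_stride_sizes: {state: size of its 1-stride table}, the
        # initial selection.  candidates: (stride, depth, size, state)
        # tuples for every j-var-stride table with j >= 2.
        chosen = dict(one_stride_sizes)
        used = sum(chosen.values())
        # Sort by stride length, then deferment-tree depth, then table
        # size, with a pseudorandom coin flip breaking remaining ties.
        candidates.sort(key=lambda t: (t[0], t[1], t[2], random.random()))
        for stride, depth, size, state in candidates:
            # Replace the state's currently selected table if the TCAM
            # capacity C still permits it.
            if used - chosen[state] + size <= capacity:
                used += size - chosen[state]
                chosen[state] = size
        return chosen  # state -> size of the table selected for it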
5.5 Implementation and Modeling

We now describe some implementation issues associated with our TCAM-based RE matching solution. First, the only hardware required to deploy our solution is the off-the-shelf TCAM (and its associated SRAM). Many deployed networking devices already have TCAMs, but these TCAMs are likely being used for other purposes. Thus, to deploy our solution on existing network devices, we would need to share an existing TCAM with another application. Alternatively, new networking devices can be designed with an additional dedicated TCAM chip.

Second, we describe how we update the TCAM when an RE set changes. The first step is to compute the new DFA and its corresponding TCAM representation. For the moment, we recompute the TCAM representation from scratch, but we believe a better solution can be found; this is something we plan to work on in the future. We report some timing results in our experimental section. Fortunately, this is an offline process during which the DFA for the original RE set can still be used. The second step is loading the new entries into the TCAM. If we have a second TCAM to support updates, this rewrite can occur while the first TCAM chip is still processing packet flows. If not, RE matching must halt while the new entries are loaded; this step can be performed very quickly, so the delay will be very short. In contrast, updating FPGA circuitry takes significantly longer.

We have not developed a full implementation of our system. Instead, we have only developed the algorithms that take an RE set and construct the associated TCAM entries. Thus, we can only estimate the throughput of our system using TCAM models. We use Agrawal and Sherwood's TCAM model [3], assuming that each TCAM chip is manufactured with a 0.18µm process, to compute the estimated latency of a single TCAM lookup based on the number of TCAM entries searched. These model latencies are shown in Table 5.1. We recognize that some processing must be done besides the TCAM lookup, such as composing the next state ID with the next input character; however, because the TCAM lookup latency is much larger than that of any other operation, we focus only on this parameter when evaluating the potential throughput of our system.

    Entries   Chip size (36-bit wide)   Chip size (72-bit wide)   Latency (ns)
    1024      0.037 Mb                  0.074 Mb                  0.94
    2048      0.074 Mb                  0.147 Mb                  1.10
    4096      0.147 Mb                  0.295 Mb                  1.47
    8192      0.295 Mb                  0.590 Mb                  1.84
    16384     0.590 Mb                  1.18 Mb                   2.20
    32768     1.18 Mb                   2.36 Mb                   2.57
    65536     2.36 Mb                   4.72 Mb                   2.94
    131072    4.72 Mb                   9.44 Mb                   3.37

Table 5.1: TCAM size and latency

5.6 Experimental Results

In this section, we evaluate our TCAM-based RE matching solution on real-world RE sets, focusing on two metrics: TCAM space and RE matching throughput.

5.6.1 Methodology

We use the same 8 RE sets used in Section 4.5 for the main results. To test the scalability of our algorithms, we use one family of 34 REs from a recent public release of the Snort rules with headers ($EXTERNAL_NET, $HTTP_PORTS, $HOME_NET, any), most of which contain wildcard closures `.*'. We added REs one at a time until the number of DFA states reached 305,339. We name this family Scale.

We calculate TCAM space by multiplying the number of entries by the TCAM width: 36, 72, 144, 288, or 576 bits. For a given DFA, we compute a minimum width by summing the number of state ID bits required and the number of input bits required. In all cases, we needed at most 16 state ID bits. For 1-stride DFAs, we need exactly 8 input character bits, and for 7-var-stride DFAs, we need exactly 56 input character bits. We then calculate the TCAM width by rounding the minimum width up to the smallest legal TCAM width that is at least as large. For all our 1-stride DFAs, we use TCAM width 36. For all our 7-var-stride DFAs, we use TCAM width 72.

We estimate the potential throughput of our TCAM-based RE matching solution by using the model TCAM lookup speeds computed in Section 5.5 to determine how many TCAM lookups can be performed per second for a given number of TCAM entries, and then multiplying this number by the number of characters processed per TCAM lookup. With 1-stride DFAs, the number of characters processed per lookup is 1. For 7-var-stride DFAs, we measure the average number of characters processed per lookup on a variety of input streams. We use Becchi et al.'s network traffic generator [11] to generate a variety of synthetic input streams. This traffic generator includes a parameter pM that models the probability of malicious traffic. With probability pM, the next character is chosen so that it leads away from the start state. With probability (1 − pM), the next character is chosen uniformly at random.

5.6.2 Results on 1-stride DFAs

                        TS                      TS + TC2                TS + TC4
    RE set   #states    tcam   #rows   thru     tcam   #rows   thru     tcam   #rows   thru
                        Mbits  /state  Gbps     Mbits  /state  Gbps     Mbits  /state  Gbps
    Bro217     6533     0.31   1.40    3.64     0.21   0.94    4.35     0.17   0.78    4.35
    C613      11308     0.63   1.61    3.11     0.52   1.35    3.64     0.45   1.17    3.64
    C10       14868     0.61   1.20    3.11     0.31   0.61    3.64     0.16   0.32    4.35
    C7        24750     1.00   1.18    3.11     0.53   0.62    3.64     0.29   0.34    3.64
    C8         3108     0.13   1.20    5.44     0.07   0.62    5.44     0.03   0.33    8.51
    Snort24   13886     0.55   1.16    3.64     0.30   0.64    3.64     0.18   0.38    4.35
    Snort31   20068     1.43   2.07    2.72     0.81   1.17    2.72     0.50   0.72    3.64
    Snort34   13825     0.56   1.18    3.11     0.30   0.62    3.64     0.17   0.36    4.35

Table 5.2: TCAM size and throughput for 1-stride DFAs

Table 5.2 shows our experimental results on the 8 RE sets using 1-stride DFAs. We use TS to denote our transition sharing algorithm, which includes both character bundling and shadow encoding. We use TC2 and TC4 to denote our table consolidation algorithm when we consolidate at most 2 and at most 4 transition tables together, respectively.
For each RE set, we measure the number of states in its 1-stride DFA, the resulting TCAM space in megabits, the average number of TCAM table entries per state, and the projected RE matching throughput; the number of TCAM entries is the number of states times the average number of entries per state. The TS columns show our results when we apply TS alone to each RE set. The TS+TC2 and TS+TC4 columns show our results when we apply both TS and TC, with consolidation limits of 2 and 4, respectively, to each RE set.

We draw the following conclusions from Table 5.2. (1) Our RE matching solution is extremely effective in saving TCAM space. Using TS+TC4, the maximum TCAM size for the 8 RE sets is only 0.50 Mb, which is two orders of magnitude smaller than the current largest commercially available TCAM chip size of 72 Mb. More specifically, the number of TCAM entries per DFA state ranges between 0.32 and 1.17 when we use TC4. We require 16, 32, or 64 SRAM bits per TCAM entry for TS, TS+TC2, and TS+TC4, respectively, as we need to record 1, 2, or 4 16-bit state IDs in each decision. (2) Transition sharing alone is very effective. With the transition sharing algorithm alone, the maximum TCAM size is only 1.43 Mb for the 8 RE sets. Furthermore, we see a relatively tight range of TCAM entries per state, from 1.16 to 2.07. Transition sharing works extremely well on all 8 RE sets, including those with wildcard closures and those consisting primarily of strings. (3) Table consolidation is very effective. On the 8 RE sets, adding TC2 to TS improves compression by an average of 41% (ranging from 16% to 49%), where the maximum possible is 50%; we measure improvement by computing (TS − (TS+TC2))/TS. Replacing TC2 with TC4 improves compression by an average of 36% (ranging from 13% to 47%), where we measure improvement by computing ((TS+TC2) − (TS+TC4))/(TS+TC2). Here we do observe a difference in performance, though. For the two RE sets Bro217 and C613, which are primarily strings, the average improvements from TC2 and TC4 are only 24% and 15%, respectively. For the remaining six RE sets, which have many wildcard closures, the average improvements are 47% and 43%, respectively. The reason, as we touched on in Section 5.3.4, is that wildcard closure creates multiple deferment trees with almost identical structure. Thus wildcard closures, the prime source of state explosion, are particularly amenable to compression by table consolidation. In such cases, doubling our table consolidation limit does not greatly increase the SRAM cost: while the number of SRAM bits per TCAM entry doubles as we double the consolidation limit, the number of TCAM entries required almost halves. (4) Our RE matching solution achieves high throughput even with 1-stride DFAs. For the TS+TC4 algorithm, on the 8 RE sets, the average throughput is 4.60 Gbps (ranging from 3.64 Gbps to 8.51 Gbps).

We use our Scale dataset to assess the scalability of our algorithms' performance, focusing on the number of TCAM entries per DFA state. Figure 5.13(a) shows the number of TCAM entries per state for TS, TS+TC2, and TS+TC4 for the Scale RE sets containing 26 REs (with DFA size 1275) through 34 REs (with DFA size 305,339).
Figure 5.13: TCAM entries per DFA state (a) and compute time per DFA state (b) for Scale 26 through Scale 34.

The DFA size roughly doubled for every RE added. In general, the number of TCAM entries per state is roughly constant and actually decreases with table consolidation. This is because table consolidation performs better as more REs with wildcard closures are added, since there are then more trees with similar structure in the deferment forest.

We now analyze running time. We ran our experiments on the Michigan State University High Performance Computing Center (HPCC). The HPCC has several clusters; most of our experiments were executed on the fastest cluster, whose nodes each have two quad-core Xeons running at 2.3GHz and 8GB of total RAM. Figure 5.13(b) shows the compute time per state in milliseconds. The build times are the times per DFA state required to build the non-overlapping sets of transitions (applying TS and TC); these per-state times increase linearly in the number of states because the algorithms are quadratic in the number of DFA states. For our largest DFA, Scale 34 with 305,339 states, the total time required for TS, TS+TC2, and TS+TC4 is 19.25 minutes, 118.6 hours, and 150.2 hours, respectively. These times are cumulative; that is, going from TS+TC2 to TS+TC4 requires an additional 31.6 hours. This second table consolidation time is roughly one fourth of the first table consolidation time because the first table consolidation cuts the number of DFA states in half and table consolidation has a running time quadratic in the number of DFA states. The BW times are the times per DFA state required to minimize these transition tables using the Bitweaving algorithm in [31]; these times are roughly constant because Bitweaving depends on the size of the transition tables of each state, not on the size of the DFA. For our largest DFA, Scale 34 with 305,339 states, the total Bitweaving optimization time on TS, TS+TC2, and TS+TC4 is 10 hours, 5 hours, and 2.5 hours, respectively. These times are not cumulative and fall by a factor of 2 because each table consolidation step cuts the number of DFA states by a factor of 2.

Figure 5.14: Consolidation times for Scale 26 through Scale 34 for the optimal and greedy consolidation algorithms.

Figure 5.14 shows the time required per state for the greedy and optimal consolidation algorithms on the Scale dataset. The greedy algorithm runs roughly 6 times faster than the optimal algorithm, while the average increase in the number of resulting TCAM rules is only around 4% for TC2 and around 9% for TC4.

The partially deferred algorithm given in Section 5.2.2.4 always performs at least as well as the completely deferred minimization algorithm given in [31]. For the three Snort RE sets and C613, the partially deferred algorithm reduces the number of TCAM entries by 1, 2, 152, and 194, respectively, relative to the completely deferred algorithm. For the other RE sets, both algorithms perform equally well.
The partially deferred algorithm is slower than the completely deferred algorithm because there are more unique decisions during minimization. We therefore use the completely deferred minimization algorithm for computing classifier sizes during consolidation, and the partially deferred minimization algorithm for generating the final TCAM classifiers for each state.

5.6.3 Results on 7-var-stride DFAs

We consider two implementations of variable striding, assuming we have a 2.36 Mb TCAM with TCAM width 72 bits (32,768 entries). From Table 5.1, the latency of a lookup is 2.57 ns. Thus, the potential RE matching throughput of a 7-var-stride DFA with average stride S is 8 × S / 0.00000000257 = 3.11 × S Gbps.

In our first implementation, we only use self-loop unrolling of the root states in the deferment forest. Specifically, for each RE set, we first construct the 1-stride DFA using transition sharing. We then apply self-loop unrolling to each root state of the deferment forest to create a 7-var-stride transition table. Because of the linear increase in transition table size, we know that the resulting TCAM table will increase in size by at most a factor of 7. In all our experiments, the size never increased by more than a factor of 2.25, and the largest DFA (for C7) required only 2.25 Mb. We could decrease the TCAM space by using table consolidation, which was very effective for all RE sets except the string matching RE sets Bro217 and C613; however, this was unnecessary, since all the self-loop unrolled tables fit within our available TCAM space.

Second, we apply full variable striding. Specifically, we first create 1-stride DFAs using transition sharing and then apply variable striding with no table consolidation, with table consolidation using 2-decision tables, and with table consolidation using 4-decision tables. We use the best result that fits within the 2.36 Mb TCAM space. For the RE sets Bro217, C8, C613, Snort24, and Snort34, no table consolidation is used. For C10 and Snort31, we use table consolidation with 2-decision tables. For C7, we use table consolidation with 4-decision tables.

We run both implementations of our 7-var-stride DFAs on traces of length 287484 to compute the average stride. For each RE set, we generate 4 traces using Becchi et al.'s trace generator tool with the default values 35%, 55%, 75%, and 95% for the parameter pM. These generate increasingly malicious traffic that is more likely to move away from the start state towards the distant accepting states of the DFA. We also generate a completely random string to model completely uniform traffic, such as binary traffic patterns, which we treat as pM = 0. We group the 8 RE sets into 3 groups: group (a) contains the two string matching RE sets Bro217 and C613; group (b) contains the three RE sets C7, C8, and C10, whose REs all contain wildcard closures; group (c) contains the three RE sets Snort24, Snort31, and Snort34, in which roughly 40% of the REs contain wildcard closures. Figure 5.15 shows the average stride length and throughput for the three groups of RE sets as a function of the parameter pM (the random string trace is plotted as pM = 0).

Figure 5.15: The throughput and average stride length of the RE sets under self-loop unrolling and under full variable striding.

We make the following observations. (1) Self-loop unrolling is extremely effective on the uniform trace.
For the non-string-matching sets, it achieves average stride lengths of 5.97 and 5.84 and RE matching throughputs of 18.58 and 18.15 Gbps for groups (b) and (c), respectively. For the string matching sets in group (a), it achieves an average stride length of 3.30 and a resulting throughput of 10.29 Gbps. Even though only the root states are unrolled, self-loop unrolling works very well because the non-root states that defer most transitions to a root state still benefit from that root state's unrolled self-loops. In particular, it is likely that there will be long stretches of the input stream that repeatedly return to a root state and take full advantage of the unrolled self-loops. (2) The performance of self-loop unrolling degrades steadily as pM increases for all RE sets except those in group (b). This occurs because as pM increases, we are more likely to move away from any default root state; thus, fewer transitions are able to leverage the unrolled self-loops at root states. (3) For the uniform trace, full variable striding does little to increase RE matching throughput. Of course, for the non-string-matching RE sets, there was little room for improvement. (4) As pM increases, full variable striding does significantly increase throughput, particularly for groups (b) and (c). For example, for groups (b) and (c), the minimum average stride length is 2.91 over all values of pM, which leads to a minimum throughput of 9.06 Gbps. Also, for all groups of RE sets, the average stride length for full variable striding is much higher than that for self-loop unrolling for large pM. For example, when pM = 95%, full variable striding achieves average stride lengths of 2.55, 2.97, and 3.07 for groups (a), (b), and (c), respectively, whereas self-loop unrolling achieves average stride lengths of only 1.04, 1.83, and 1.06 for groups (a), (b), and (c), respectively. These results indicate the following. First, self-loop unrolling is extremely effective at increasing throughput for random traffic traces. Second, other variable striding techniques can mitigate many of the effects of malicious traffic that leads away from the start state.

Chapter 6

Overlay Automata

In this chapter we present our overlay automata model for handling DFA state replication, and the implementation of the overlay automata in both software and hardware.

6.1 Introduction

As discussed in Section 3.2, the main cause of redundancy in a DFA is state replication, which causes the exponential increase in the size of the DFA as multiple REs are combined. Ideally, we would like to build an automaton whose size is proportional to that of an NFA with a matching speed close to that of a DFA. We achieve this goal with our new overlay automata model.

6.1.1 Limitations of Prior Automata Models

DFA-based automata models have been developed to address DFA space explosion. Two representative models are D2FA proposed by Kumar et al. [26] and XFA proposed by Smith et al. [41]. D2FAs reduce the number of transitions stored per state by using deferred transitions to compactly represent common transitions, i.e., transitions with the same input character and destination state. This elegant solution can be automated; however, it only handles transition sharing: it addresses neither state replication nor the resulting replicated transitions. So although there is a huge reduction in the space required, the space is still proportional to the number of DFA states, which grows exponentially with the number of REs in the RE set.
XFAs deal with state replication by using scratch memory and auxiliary code stored at each state that must be executed before or after each transition. This interesting solution models state replication; however, it cannot be fully automated [50]. Furthermore, the code that needs to be executed on each transition limits the throughput that can be achieved.

Our table consolidation technique presented in Section 5.3 actually exploits state replication to reduce the TCAM space required, but it does so accidentally. That is, table consolidation works well because of state replication, but the technique is oblivious to state replication. The algorithm does not explicitly search for replicated states; it only looks for state pairs that are good matches for consolidation. Replicated states are usually good matches for consolidation, so the states that are consolidated together are usually replications of the same NFA state. Several limitations of table consolidation, however, prevent it from fully exploiting state replication. First, there is a practical limit on the number of TCAM tables that can be consolidated; for instance, we only consider consolidating up to 4 tables together. Thus, table consolidation can only yield a constant factor reduction in TCAM storage no matter how much state replication exists in the DFA, so the final TCAM size can still be exponential in the size of the RE set. Ideally, we would like to combine all the replications of an NFA state. Second, table consolidation does not reduce the associated SRAM required to store decisions because, although the TCAM entries are merged, the decisions are not. Furthermore, the SRAM required by table consolidation might even increase due to imperfect merging of tables.

6.1.2 Summary of Overlay Automata Approach

We developed a new overlay automata model that exploits state replication to compress the size of the DFA. The idea is to group the replicated DFA structures together instead of repeating them multiple times. We briefly describe here the overlay automata model and how the automata are implemented in software and hardware.

6.1.2.1 Overlay DFA

We propose the Overlay Deterministic Finite state Automata (ODFA) to model state replication in DFAs. The basic idea is to vertically overlay all the DFA states that are replications of the same NFA state into what we call a super-state. If we view a DFA as a 2-D object, then an ODFA can be viewed as a 3-D object. Figure 6.2 depicts the DFA and ODFA for the RE set {/abc/, /abd/, /e.*f/}.

The ODFA model gives us the following key benefits. First, it allows us to easily identify replications of the same NFA state, as they are all in the same super-state. For example, in Figure 6.2, we merge states 0 and 5 and states 1 and 6 into super-states S0 and S1, respectively. Second, it allows us to represent replications of the same NFA transition by one super-state transition between two super-states. For any NFA transition from s1 to s2 on character σ, in the corresponding ODFA, all replications of state s1 are in the same super-state, say S1, all replications of state s2 are in the same super-state, say S2, and all replications of state s1 have a transition on σ to their corresponding replications of state s2. We merge these replicated transitions into one combined super-state transition from super-state S1 to super-state S2 on character σ.
For example, in Figure 6.2, we merge the two transitions from states 0 and 5 on character `a' into one super-state transition on character `a'.

6.1.2.2 Overlay D2FA

Combining our overlay idea, which models state replication and replicated transitions, with the delayed input idea of D2FA, which models the sharing of non-replicated transitions among non-replicated DFA states (i.e., transition sharing) through a state deferment relationship, we propose the Overlay Delayed Input DFA (OD2FA) to model state replication, replicated transitions, and transition sharing. The relationship among the automata models DFA, D2FA, ODFA, and OD2FA is illustrated in Figure 6.1. A key benefit of OD2FA is that we can represent the deferment relationship among D2FA states more compactly using deferment among OD2FA super-states. From the perspective of transitions, OD2FA optimizes both deferred transitions (i.e., common transitions among states) and replicated transitions.

Figure 6.1: Relationship of automata models: D2FA models transition sharing, ODFA models state replication, and OD2FA models both state replication and transition sharing.

6.1.2.3 Building OD2FA

To build an OD2FA, we propose algorithms for constructing it from a given set of REs incrementally. We first construct the equivalent OD2FA for each RE. We then efficiently merge OD2FAs until only a single OD2FA for the entire set of REs is left. We propose an incremental construction algorithm that builds the OD2FA D for RE set R1 ∪ R2 by merging the OD2FA D1 for R1 with the OD2FA D2 for R2. This algorithm automatically identifies and groups replicated states in D into super-states and replicated transitions into super-state transitions, without having to perform an expensive analysis of the final DFA structure.

6.1.2.4 Implementing OD2FA

We develop techniques for implementing the OD2FA in both software and hardware. We extend the software implementation of a D2FA to OD2FA. The main problem we need to solve is the following: since an OD2FA only stores super-state transitions, how do we efficiently look up state transitions from the super-state transitions? Our efficient encoding of super-state transitions allows us to perform this lookup very quickly. For the hardware implementation, we develop a solution that we call OverlayCAM by extending RegCAM to implement the OD2FA in TCAM. Again, our efficient encoding of super-state transitions allows us to implement each super-state transition using only one TCAM entry. Thus, OverlayCAM not only encodes multiple deferred state transitions using one TCAM entry but also encodes multiple non-deferred state transitions that are replications of the same NFA transition using only one TCAM entry. We also extend the variable striding technique of RegCAM for use with OverlayCAM to increase the matching throughput.

6.2 Overlay DFA

In this section, we formally define a new automaton, the Overlay Deterministic Finite state Automata (ODFA), which we propose to deal with state explosion in DFAs. There are two ideas behind an ODFA. The first is to group all DFA states that are replications of the same NFA state into a single super-state. The second is to merge as many transitions from the replicated states within a super-state as possible. To define ODFA, we use the concepts of super-states, overlays, super-state transitions, and overlay offsets. We begin by informally defining ODFA and these concepts using the ODFA in Figure 6.2 as a running example.
Figure 6.2: Example of DFA, state replication, and Overlay DFA: (a) DFA for RE set {/abc/, /abd/}; (b) DFA for RE set {/abc/, /abd/, /e.*f/}; (c) corresponding ODFA; (d) ODFA with super-state transitions.

Figure 6.2(a) shows the DFA for the RE set {/abc/, /abd/} from Figure 3.1(a). The notation used in the figure is explained in Section 3.2. Figure 6.2(b) shows the DFA after the RE /e.*f/ is added to the RE set (same as Figure 3.1(b)). This DFA illustrates the potential for ODFA, as the entire DFA for the RE set {/abc/, /abd/} is replicated twice. The corresponding ODFA is shown in Figure 6.2(c), in which we overlay the two copies of the DFA for the RE set {/abc/, /abd/} on top of each other. Each pair of replicated DFA states is a super-state in the ODFA. Each layer of states is called an overlay. The ODFA in Figure 6.2(c) has six super-states S0, ..., S5 and two overlays. Each overlay contains a subset of the states in the entire DFA; in Figure 6.2(c), the first overlay does not contain a state from super-state S5.

We now introduce the concept of super-state transitions. One super-state transition represents multiple DFA transitions, much as one super-state represents a group of DFA states. In a standard DFA transition, the source state is a DFA state. In a super-state transition, the source is an ODFA super-state, and the transition represents transitions from all the replicated DFA states within that super-state. The destination is usually an ODFA super-state but can sometimes be a DFA state. The two super-state transition forms are S1 −σ→ (S2, o, 1) and S1 −σ→ (S2, O, 0), distinguished by the last bit value 1/0. In the first form, the semantics are that each DFA state q in super-state S1 transitions on character σ to a DFA state q′ in super-state S2, with o = (overlay of q′ − overlay of q) mod #overlays. We call this difference in the overlay value the overlay offset (or just offset for short). The value of the overlay offset o is usually 0. In the second form, the semantics are that each DFA state q in super-state S1 transitions on character σ to the DFA state located in super-state S2 at overlay O. For example, consider the two DFA state transitions 1 −b→ 2 and 6 −b→ 7 in Figure 6.2(c). These two transitions can be represented by one super-state transition S1 −b→ (S2, 0, 1); the offset 0 denotes no change in overlay. As a second example, consider the two DFA state transitions 3 −e→ 5 and 8 −e→ 5 in Figure 6.2(c). These two transitions can be represented by one super-state transition S3 −e→ (S2, 1, 0).

In the ideal case, all DFA transitions can be replaced by super-state transitions, which reduces the total number of transitions by a factor equal to the number of overlays in the ODFA. In some cases, not all states in a super-state have transitions that can be merged. We therefore generalize super-state transitions to allow a super-state transition to be defined for a specific subset of overlays X within a given super-state.
Technically, traditional transitions from a single state s are super-state transitions where X contains only s's overlay. We refer to these as singleton super-state transitions.

Figure 6.2(d) shows the ODFA for our running example with non-singleton super-state transitions denoted by thick edges. For example, the two transitions 0 −a→ 1 and 5 −a→ 6 from Figure 6.2(c) are represented by one super-state transition S0 −a→ (S1, 0, 1). For super-state transitions of the form S1 −σ→ (S2, o, 1) (i.e., the destination is also a super-state), the number beside the thick edge gives the overlay offset o. Just as we use double arrows to represent multiple transitions, we use thick double arrows to represent multiple non-singleton super-state transitions. For example, the two transitions 0 −e→ 5 and 5 −e→ 5 from Figure 6.2(c) are included in one super-state transition S0 −e→ (S0, 1, 0), which is part of the thick double arrow labeled with `e' ending at state 5. The DFA in Figure 6.2(b) has 11 × 256 = 2816 total transitions; the ODFA in Figure 6.2(d) has 1542 total super-state transitions, which is close to the best possible result of 2816/2 = 1408 total super-state transitions; only a few of these transitions are singleton super-state transitions.

Recall that a DFA is defined as a 5-tuple (Q, Σ, q0, M, δ) (Section 3.1). We now formally define the ODFA.

Definition 5 (Overlay Deterministic Finite state Automata (ODFA)). An ODFA for a set of REs R is defined as a 7-tuple D = (Q, Σ, q0, S, O, M, ∆). The first three terms are the same as those in the above DFA definition. The next two terms define the overlay structure on top of a DFA: S = {S0, ..., S|S|−1} is a set of super-states that partitions Q, while O = {O0, ..., O|O|−1} is a set of overlays that also partitions Q. We treat each overlay as a unique number in the range [0..|O|). We overload notation and define S: Q → S and O: Q → O as functions mapping states to super-states and overlays, respectively. For any two states si ≠ sj, it must be the case that (S(si), O(si)) ≠ (S(sj), O(sj)). For any super-state S and overlay O, S ∩ O is either empty or contains exactly one state s ∈ Q. The term M: S → 2^R gives the subset of REs matched by each super-state; the set of REs matched by any state s ∈ Q is then given by M(S(s)). The final term, ∆: S × 2^O × Σ → S × [0..|O|) × {0, 1}, is a partial function that defines the super-state transition function. For any s ∈ Q and any σ ∈ Σ, all the transitions (S(s), X, σ) ∈ dom(∆) with O(s) ∈ X must have the same value; i.e., if we have two transitions (S(s), X, σ) ∈ dom(∆) and (S(s), Y, σ) ∈ dom(∆) with O(s) ∈ X ∩ Y, then we must have ∆(S(s), X, σ) = ∆(S(s), Y, σ).

We define the derived total state transition function δ′(s, σ) based on this unique transition value, say (S′, o, b), as follows. If b = 0, we call the transition a non-offset transition, and δ′(s, σ) = S′ ∩ o. Otherwise (b = 1), we call the transition an offset transition, and δ′(s, σ) = S′ ∩ ((O(s) + o) mod |O|). The value b is called the offset bit. In either case, it must hold that the resulting overlay does intersect S′. Normally, for offset transitions, o = 0, so the resulting overlay is just O(s). We use the notation (S1, O) −σ→ (S2, o, b) to denote the super-state transition ∆(S1, O, σ) = (S2, o, b).

Even though an ODFA has super-states and overlays, an ODFA processes an input string much like a DFA does. That is, the ODFA is always in a unique state, and each character processed moves the ODFA to a potentially new state.
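To illustrate Definition 5 operationally, the following Python sketch evaluates the derived transition function δ′. The data layout is an assumption for illustration: S and O map each state to its super-state and overlay, Delta maps a (super-state, character) pair to a list of (X, (S2, o, b)) entries, and state_at(S2, overlay) is an assumed helper returning the state at that position of the super-state/overlay grid.

    def odfa_delta(s, ch, S, O, Delta, num_overlays, state_at):
        for X, (S2, o, b) in Delta.get((S[s], ch), []):
            if O[s] in X:
                if b == 1:
                    # Offset transition: destination overlay is relative.
                    return state_at(S2, (O[s] + o) % num_overlays)
                # Non-offset transition: o names the destination overlay.
                return state_at(S2, o)
        return None  # unreachable in a well-formed (total) ODFA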
The main difference is that the ODFA hopefully compresses multiple DFA transitions into a single ODFA super-state transition, and the RE matching information is stored at the super-state level rather than at the state level. For example, given the ODFA in Figure 6.2(d) and the input string abea, the ODFA begins in state 0. After processing character a, the ODFA moves to state 1. After processing character b, the ODFA moves to state 2. After processing character e, the ODFA moves to state 5. Finally, after processing character a, the ODFA moves to state 6. The first and fourth transitions are actually the same super-state transition. The third transition corresponds to the form of super-state transition with a specified destination, state 5. In all cases, M(S(s′)) = ∅ for each state s′ visited, so no RE is matched at any point in time.

Overlays and super-states are two orthogonal partitionings of the states in Q; intuitively, super-states partition Q vertically and overlays partition Q horizontally. There exist many possible ways to partition the states of a DFA into super-states and overlays. The benefits of an ODFA are only realized by a careful partitioning, for example, by grouping replications of the same NFA state together in a super-state.

Note that some super-states may not have DFA states in every overlay. If overlay O in super-state S is empty, we denote this by S ∩ O = ⊥ (i.e., ⊥ denotes an empty location). In Figure 6.2(d), super-state S5 contains only one DFA state, state 10, which belongs to the second overlay.

The compressive power of a super-state transition increases with the number of overlays that it includes. In the best case, all overlays are included in a super-state transition. In Figure 6.2(d), most super-state transitions include all overlays; there are only a few singleton super-state transitions. In more complex ODFAs, there may be cases where a given super-state transition includes more than one overlay but not all overlays.

In an ODFA, the RE matching information is stored at the super-state level (i.e., in M), and state matching is derived from M. So when constructing an ODFA D′ for a given DFA D, we must create the super-states such that the following condition is satisfied:

    ∀S ∈ SD′, ∀s1, s2 ∈ S: MD(s1) = MD(s2).    (C1)

6.3 Overlay D2FA

In this section we present another new automaton, the Overlay Delayed Input DFA (OD2FA), which we propose to deal with both state and transition explosion in DFAs. Recall that, given a DFA D = (Q, Σ, q0, M, δ), its corresponding D2FA is defined as a 6-tuple (Q, Σ, q0, M, ρ, F) (Section 3.3). ODFAs address state explosion and D2FAs address transition explosion; we propose the OD2FA to address both.

Definition 6 (Overlay D2FA (OD2FA)). We define an OD2FA as an 8-tuple (Q, Σ, q0, F, S, O, M, ∆), where the first three terms are the same as in the definition of a D2FA, and the last four terms are the same as in the definition of an ODFA. The only difference is that we derive a partial state transition function ρ′: Q × Σ → Q from ∆; since ∆ is a partial function, we do not require the existence of a covering transition in ∆ for each s ∈ Q and σ ∈ Σ. The term F: S → S is the super-state deferment function, which gives the deferred super-state of each super-state. We overload notation and define the D2FA state deferment function F: Q → Q from it as F(s) = F(S(s)) ∩ O(s). To ensure this is a valid deferment function, F must satisfy the following two conditions. First,

    ∀s ∈ Q, F(S(s)) ∩ O(s) ≠ ⊥.    (C2)

Second, the deferment forest of super-states defined by F must have no cycles other than self-loops.
Finally, ρ′ and F define the derived total state transition function δ′ as follows:

    δ′(s, σ) = ρ′(s, σ) if (s, σ) ∈ dom(ρ′), and δ′(F(s), σ) otherwise.

We say that (s, σ) ∈ dom(ρ′) if there exists a transition (S(s), X, σ) ∈ ∆ with O(s) ∈ X. When (s, σ) ∈ dom(ρ′), ρ′(s, σ) is defined just as δ′ is defined for an ODFA.

We say that super-state S′ overlay covers super-state S if ∀O ∈ O, (S′ ∩ O = ⊥) → (S ∩ O = ⊥); that is, every overlay that is empty in S′ is also empty in S. Condition (C2) then says that for every super-state S, super-state F(S) overlay covers S. The transition function δ′ is computed by finding the transition (S(s), X, σ) ∈ ∆ with O(s) ∈ X if such a transition exists; if it does not, the OD2FA follows the super-state deferment function. As defined, we store the super-state deferment function rather than a per-state deferment function, so deferment information is stored only at the super-state level. Likewise, we store the RE matching information M only at the super-state level. Finally, with ∆, many super-state transitions represent multiple singleton transitions. Combined, these yield significant savings.

Figure 6.3: OD2FA example: (a) D2FA for RE set {/abc/, /abd/, /e.*f/}, with deferment transitions shown as dashed edges; (b) corresponding OD2FA.

Figure 6.3(a) shows the D2FA for the RE set {/abc/, /abd/, /e.*f/}; the dashed edges are deferment transitions. Figure 6.3(b) shows the corresponding OD2FA. The D2FA needs to store 518 actual transitions and 10 deferment transitions, while the OD2FA only needs to store 260 actual transitions, most of which are non-singleton super-state transitions, and 5 super-state deferment transitions. For this example, we achieve near optimal compression relative to the D2FA given that there are only two overlays in the OD2FA.

6.3.1 OD2FA Multiplicative Compression

OD2FA multiplies the compressive effects of D2FA and ODFA to significantly reduce the space required to store transitions. ODFA reduces the storage space for transitions among DFA replicates by storing one super-state transition for each set of replicated transitions; the compression limit for ODFA is the number of DFA replicates. D2FA reduces the storage space for transitions within each DFA replicate using deferment transitions; the compression limit for D2FA is the number of states within each DFA replicate. OD2FA does both simultaneously. Its compression limit is the number of DFA replicates multiplied by the number of states within each replicate, which is essentially the total number of DFA states.

To illustrate this multiplicative compression, consider again the OD2FA in Figure 6.3(b). The original DFA for this RE set requires 11 × 256 = 2816 transitions. The corresponding ODFA in Figure 6.2(d) reduces the number of transitions by almost a factor of 2 by storing one super-state transition for each pair of replicated transitions. The corresponding D2FA in Figure 6.3(a) reduces the number of transitions by more than a factor of 5 using deferment transitions; in particular, in both replicates, almost all of the transitions of all states except the self-looping start states are eliminated. Finally, the OD2FA in Figure 6.3(b) multiplies both effects and ends up with 260 super-state transitions and 5 super-state deferment transitions. This is almost a factor of 11 smaller than the original DFA, where 11 is the compression limit since the DFA has 11 states.
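To summarize Definitions 5 and 6 operationally, the following Python sketch (same assumed data layout as the ODFA sketch above) resolves an OD2FA transition: it first looks for a stored super-state transition covering the current overlay, and otherwise follows super-state deferment on the same input character.

    def od2fa_delta(s, ch, S, O, Delta, F, num_overlays, state_at):
        while True:
            for X, (S2, o, b) in Delta.get((S[s], ch), []):
                if O[s] in X:
                    if b == 1:
                        return state_at(S2, (O[s] + o) % num_overlays)
                    return state_at(S2, o)
            # Condition (C2) guarantees the deferred state exists, and the
            # self-deferring root super-state covers every character, so
            # this loop terminates.
            s = state_at(F[S[s]], O[s])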
Starting from the D2FA, the OD2FA is able to merge all the self-looping transitions out of the two self-looping states in the D2FA into super-state transitions (adding one singleton transition on `f' for state 5). This is critical, since the vast majority of the transitions remaining in many D2FAs are self-looping transitions.

6.3.2 Effectiveness of OD2FA on an Ideal RE Set

We can further demonstrate the effectiveness of OD2FA using an example set of n REs where each RE is of the form /Ai,1 Ai,2 ··· Ai,p .* Bi,1 Bi,2 ··· Bi,p/, 1 ≤ i ≤ n; that is, each RE has p characters followed by `.*' and another p characters, and all 2np characters are unique. This is a simple RE set in the sense that there is no interaction between the REs in the set, and we get a simple exponential increase in the size of the DFA relative to the number of REs n because of state replication. In this case, the NFA has (2p+1)n + 2 = O(pn) states, and the DFA has ((2p−1)n + 2) · 2^(n−1) = O(pn·2^n) states. The D2FA has ((p−1)n + 256) · 2^n = O(pn·2^n) transitions, and our RegCAM presented in Section 5.2 will generate (pn+1) · 2^n = O(pn·2^n) TCAM entries. The OD2FA has only pn + 1 = O(pn) super-states and 2pn + 256 = O(pn) super-state transitions, and a straightforward TCAM implementation of these transitions needs only 2pn + 1 = O(pn) TCAM entries. The number of rules for the OD2FA is thus of the same order as the NFA size, which is a lower bound on the compression that any method can achieve.

6.4 OD2FA Construction

In this section we present our algorithms for constructing an OD2FA for a set of REs. Given a set of REs, we construct its equivalent OD2FA incrementally in two phases. In the first phase, we construct an equivalent individual OD2FA for each RE. In the second phase, we merge the individual OD2FAs in a binary tree fashion; i.e., we merge two OD2FAs into one OD2FA at a time until only one OD2FA for the entire given RE set remains.

Constructing an OD2FA involves three main steps: (1) creating the super-states (i.e., assigning a (super-state, overlay) pair to each DFA state), (2) setting the deferment for each super-state, and (3) creating, for each super-state, the (combined) super-state transitions from the (singleton) state transitions. The algorithms for the first two steps (creating super-states and setting deferment) differ between the two phases; however, the algorithms for the third step (creating super-state transitions) are almost identical for the two phases. We therefore describe the OD2FA construction algorithms in two parts. In this section we demonstrate how the super-states are created and how super-state deferment is set (i.e., steps 1 and 2) during both phases. In the next section we show how super-state transitions are built from state transitions (i.e., step 3).

6.4.1 OD2FA Construction from One RE

Given one RE, we first build its equivalent D2FA using the technique described in Section 4.3.1. The deferment relationship among states in this D2FA defines a deferment forest. The root states in this forest are all self-looping states, meaning that they transit to themselves on more than |Σ|/2 = 128 characters. Most failure transitions end in self-looping states. For example, in the D2FA in Figure 6.4, states 0 and 2 are self-looping states. An important property of the D2FA constructed using the technique described in Section 4.3.1 is that each self-looping state in the DFA is the root of a tree in the deferment forest of the D2FA, and vice versa.
Furthermore, all the states whose failure transitions go to a self-looping state s are in the deferment tree rooted at s.

We now describe our algorithm for constructing the OD2FA from a D2FA, using the example in Figure 6.4 for the RE /ab[^n]*pq/. A key observation is that any D2FA is also a valid OD2FA with only a single overlay, singleton super-states, and singleton super-state transitions. We gradually convert the D2FA into a more compact OD2FA, first creating valid overlays and super-states and then updating the super-state transition function to combine multiple transitions into one super-state transition.

We begin by specifying the number of deferment trees in the super-state deferment forest and the number of overlays in a super-state. We accomplish these tasks by partitioning the self-looping root states of the D2FA into two groups: accepting root states and rejecting root states. If either partition is empty, we create one deferment tree in the OD2FA; otherwise there are two deferment trees. The number of overlays in the OD2FA is the larger of the number of accepting root states and the number of rejecting root states. For each non-empty partition, we merge the root states in that partition into a single root super-state in the OD2FA. Typically, self-looping states are failure states, so the accepting root state partition is empty and the corresponding root super-state is not formed; this observation holds for all of our experimental RE sets. Thus, the deferment forest of the OD2FA typically has one deferment tree rooted at the rejecting root super-state. For example, the OD2FA in Figure 6.4 has one deferment tree with two overlays, 0 and 1, and the rejecting root super-state is the super-state containing states 0 and 2.

Figure 6.4: OD2FA construction from one RE: the D2FA for RE /ab[^n]*pq/, its deferment forest, and the corresponding OD2FA (singleton super-state transitions not shown).

There are two reasons we group root states into super-states even though the self-looping states in the D2FA are usually not replications of the same NFA state. First, all the common self-loops can be merged into super-state transitions; we specify this more precisely in Section 6.5. Second, since self-looping states are typically the "replication points" when combining REs, grouping self-looping states into a common super-state helps us automatically identify the state replications and replicated transitions when we merge two OD2FAs. We elaborate on this in Section 6.4.2. Condition (C2) is satisfied because the root super-state defers to itself.

We now describe how we assign the remaining states to super-states and overlays while ensuring that Condition (C2) is maintained. Given a super-state S that is already in the OD2FA deferment forest, our algorithm groups the children of the states in S into new super-states that defer to S. This grouping is applied recursively to the new super-states until all states are assigned to super-states.

We now specify how the children of the states of S are grouped into super-states. Let n be the number of non-empty overlays in S, and let s1, ..., sn be the states in these overlays. Let Ci = F^(−1)(si) be the set of children of each state si in S, and let U = ∪_{i=1}^{n} Ci be the total set of states to be grouped into super-states. To ensure that all states in a super-state match the same REs, we partition U into accepting states and rejecting states and work with each partition independently.
Without loss of generality, assume U has one partition. We create super-states with the following two goals in mind: group together states u ∈ U from different Ci so as to (1) maximize the number of super-state transitions that can be formed and (2) minimize the total number of super-states formed. We propose the following greedy strategy (see the sketch below). We start with an arbitrary state u from the first non-empty Ci, removing u from Ci and creating a new super-state S′ with just u in overlay O(si). From each of the remaining non-empty Ck, we pick the state uk that has the most common non-deferred transitions with u, delete uk from Ck, and add uk to super-state S′ in overlay O(sk); state uk must have at least one common non-deferred transition with u to be selected. We repeat this process until all the Ci are empty. Condition (C2) is maintained because a state in a new super-state S′ is added to overlay O if and only if the corresponding state in F(S′) is in overlay O.

For the D2FA in Figure 6.4 with the root super-state (0 2) as S, we have C0 = {1} and C1 = {3, 4}, and we create three super-states, (1 ⊥), (⊥ 3), and (⊥ 4), each of which defers to (0 2). No super-states with more than one occupied overlay are formed because states 1 and 3, as well as states 1 and 4, do not have any common non-deferred transitions.

After the super-states have been created, we greedily merge together compatible pairs of super-states. Two super-states are compatible if there is no overlay that is non-empty in both super-states. For our example in Figure 6.4, the super-states (1 ⊥) and (⊥ 3) are merged together, giving us two final super-states, (1 3) and (⊥ 4). The last step is to create the super-state transitions, which is discussed in Section 6.5.

We use greedy algorithms in several of our steps. This does not have much effect on overall compression because most compression opportunities here are accidental; they are not the result of replications of the same NFA state. The key compression results from grouping the root states together and combining the resulting self-loops into super-state transitions; everything else is a bonus.
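The following Python sketch illustrates the greedy grouping step. The inputs are assumptions for illustration: children[k] is the child set Ck of the parent state in overlay k, and shared(u, v) counts the common non-deferred transitions of u and v.

    def group_children(children, shared):
        groups = []
        while any(children.values()):
            # Seed a new super-state with an arbitrary child of the first
            # non-empty C_i, placed in that parent state's overlay.
            i = next(k for k, C in children.items() if C)
            u = children[i].pop()
            group = {i: u}
            for k, Ck in children.items():
                if k == i or not Ck:
                    continue
                best = max(Ck, key=lambda v: shared(u, v))
                # A state must share at least one non-deferred transition
                # with u to join this super-state; otherwise it waits for
                # a later group.
                if shared(u, best) >= 1:
                    Ck.remove(best)
                    group[k] = best
            groups.append(group)
        return groups  # each {overlay: state} dict is one new super-state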
6.4.2 OD2FA Construction from 2 OD2FAs

We now present our OD2FA merge algorithm, which we call OD2FAMerge. Given two OD2FAs, D1 with underlying D2FA D1 for RE set R1 and D2 with underlying D2FA D2 for RE set R2, where R1 ∩ R2 = ∅, OD2FAMerge constructs the OD2FA D3 with underlying D2FA D3 for the RE set R3 = R1 ∪ R2.

Figure 6.5: D2FA and OD2FA for RE /cd[^n]*pr/ (singleton super-state transitions not shown).

The first step is to create the merged D2FA D3 using the D2FA merge algorithm described in Section 4.3.2. For example, Figure 6.6(a) shows the D2FA constructed from the D2FAs in Figure 6.4 and Figure 6.5. For each state, the number below the line is the state ID in D3, and the two numbers above the line are the state IDs of the states in D1 and D2 that this state corresponds to.

Figure 6.6: Merged OD2FA construction example: (a) D2FA merged from the D2FAs in Figures 6.4 and 6.5; (b) OD2FA merged from the OD2FAs in Figures 6.4 and 6.5; (c) corresponding optimized OD2FA.

We now construct the OD2FA D3 = (Q3, Σ, q03, F3, S3, O3, M3, ∆3) from the input OD2FAs D1 = (Q1, Σ, q01, F1, S1, O1, M1, ∆1) and D2 = (Q2, Σ, q02, F2, S2, O2, M2, ∆2) as well as the merged D2FA D3. The first three terms of D3 are derived from D3. We then set S3 = S1 × S2 and O3 = O1 × O2. We reduce S3 to include only reachable super-states (a super-state is reachable if it contains at least one reachable state). We discuss how we handle empty overlays in Section 6.5.4.

Recall that the notation S3 = ⟨S1, S2⟩ means that super-state S3 in D3 corresponds to the pair of super-states S1 from D1 and S2 from D2; both S3 and ⟨S1, S2⟩ refer to the same super-state in D3. For any super-state S3 = ⟨S1, S2⟩ ∈ S3, we set M3(S3) = M1(S1) ∪ M2(S2). Condition (C1) holds because all the states in super-state S1 match the REs in M1(S1) and all the states in super-state S2 match the REs in M2(S2).

Just as each state in D3 corresponds to a pair of states from D1 and D2, each super-state in D3 corresponds to a pair of super-states from D1 and D2, and similarly each overlay in D3 corresponds to a pair of overlays from D1 and D2. Any state in D3 is assigned to a super-state and an overlay as follows. Let u = ⟨v, w⟩ be a state in D3. Then S3(u) ← ⟨S1(v), S2(w)⟩ and O3(u) ← ⟨O1(v), O2(w)⟩. That is, we assign u to the super-state (overlay) that corresponds to the pair of super-states (overlays) that v and w belong to in D1 and D2, respectively.

Figure 6.6(b) shows the OD2FA D3 constructed from the OD2FA D1 in Figure 6.4 and the OD2FA D2 in Figure 6.5. In this figure, for each super-state, the number below the line is the super-state ID in D3, and the pair of numbers above the line are the super-state IDs of the super-states in D1 and D2 that this super-state corresponds to. For instance, consider state 7 in D3, which corresponds to state 1 in D1 and state 2 in D2. As we can see from Figures 6.4 and 6.5, state 1 ∈ D1 belongs to super-state 1 and overlay 0, and state 2 ∈ D2 belongs to super-state 0 and overlay 1. Therefore, in OD2FA D3, we assign state 7 to super-state 3, which corresponds to super-state 1 from D1 and super-state 0 from D2; similarly, we assign state 7 to overlay 1, which corresponds to overlay 0 from D1 and overlay 1 from D2.
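The following Python sketch shows this super-state and overlay assignment for the merged states. The dictionary layout is our own illustration; the overlay pair is flattened to the single index O1(v)·|O2| + O2(w), mirroring line 7 of the pseudocode given in Figure 6.7 below.

    def assign_merged(pairs, S1, O1, S2, O2, num_overlays2):
        # pairs: {u: (v, w)} maps each merged state u in D3 to its
        # constituent states v in D1 and w in D2.
        S3, O3 = {}, {}
        for u, (v, w) in pairs.items():
            S3[u] = (S1[v], S2[w])                 # super-state pair
            O3[u] = O1[v] * num_overlays2 + O2[w]  # flattened overlay index
        return S3, O3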
After defining F3, we need to adjust the deferment relationship F3 of the D2FA D3. Specifically, for each state s in a super-state S, where S defers to super-state S′, we let s defer to the state s′ of S′ in the same overlay as s, provided s′ ≠ ⊥. If s′ = ⊥, we split S into two super-states, S1 = S \ {s} and S2 = {s}, where S2 defers to the super-state that contains the state that s defers to (i.e., F3(S2) := S3(F3(s))). The case s′ = ⊥ rarely happens in our experimental RE sets. This super-state splitting ensures that Condition (C2) holds for D3. We show how the super-state transitions are created for the merged OD2FA in Section 6.5. Pseudo-code for our OD2FAMerge algorithm is given in Algorithm 6.7 (Figure 6.7).

    Input: OD2FAs, D1 and D2, with underlying D2FAs D1 and D2, corresponding to RE sets R1 and R2.
    Output: An OD2FA and its underlying D2FA corresponding to the RE set R1 ∪ R2.
     1  Let D3 ← D2FAMerge(D1, D2)            // algorithm from Section 4.3.2
     2  Set #overlays in D3: |O3| = n ← |O1| × |O2|
        // Create the super-states
     3  foreach Si ∈ S1 × Sj ∈ S2 do
     4    Initialize super-state S = ⟨Si, Sj⟩ with n NULL states
     5    foreach Ok ∈ O1, 0 ≤ k < |O1| × Ol ∈ O2, 0 ≤ l < |O2| do
     6      if state s = ⟨Si ∩ Ok, Sj ∩ Ol⟩ ∈ Q3 then
     7        Assign s to overlay O(k×|O2|+l) in super-state S
     8    if at least one non-NULL state in S then
     9      Add S to S3
    10      M3(S) ← M1(Si) ∪ M2(Sj)
        // Set super-state deferment
    11  foreach S ∈ S3 do
    12    Set F3(S) ← mode({S3(F3(s)) | s ∈ S})
    13    Let P = {s | (s ∈ S) ∧ (F3(S) ∩ O3(s) = ⊥)}
    14    foreach state u ∈ P do
    15      Remove u from super-state S
    16      Create new super-state S′ with just state u in overlay O3(u) and add S′ to S3
    17      Set M3(S′) ← M_D3(u)
    18      Set F3(S′) ← S3(F3(u))
    19    foreach state s ∈ S with F3(s) ≠ F3(S) ∩ O3(s) do
    20      Set F3(s) ← F3(S) ∩ O3(s), and regenerate the non-deferred transitions in ρ3 of D3 for state s
        // Create super-state transitions
    21  foreach S ∈ S3 × c ∈ Σ do
    22    CreateSuperStateTrans(S, c)
    23  Function CreateSuperStateTrans(S, c)
    24    C ← CreateSuperStateTransClassifier(S, F3(S), c)
    25    foreach rule ri ∈ C do add super-state transition ∆3(S, P(ri), c) = D(ri)
    26  Function CreateSuperStateTransClassifier(S, DS, c)
          /* Generate transitions for character c and super-state S when it defers to DS */
    27    Let ODec[n] be the offset decision vector, initialized to the wildcard decision ⊛ (Section 6.5.1.1)
    28    Let NODec[n] be the non-offset decision vector, initialized to ⊛
    29    Let Reqd[n] be the required vector, initialized to False
    30    foreach O ∈ O3 do
    31      if S ∩ O ≠ ⊥ then
    32        u = ⟨u1, u2⟩ ← S ∩ O             // current state
    33        nu ← δ3(u, c)                    // next state
    34        if ρ3(u, c) is defined then      // not deferred
    35          if S ≠ DS ∨ u ≠ nu then Reqd[O] ← True
    36        ODec[O] ← (S3(nu), (O3(nu) − O) mod n, 1)
    37        NODec[O] ← (S3(nu), O3(nu), 0)
    38    if #unique values in ODec ≤ #unique values in NODec then
    39      return CreateOverlayClassifier(ODec, Reqd)
    40    else
    41      return CreateOverlayClassifier(NODec, Reqd)

Figure 6.7: Algorithm OD2FAMerge(D1, D2) for merging two OD2FAs.

We now consider the following optimization for D3. Among the super-states that defer to the same super-state, we merge two compatible super-states into one super-state if merging them lets more state transitions be combined into super-state transitions. This is commonly the case when we lose a D2FA state that we expect to generate from a self-looping state. For example, in the D2FA in Figure 6.6(a), we lost the expected states ⟨2, 3⟩ and ⟨3, 2⟩, getting instead the single state 12 = ⟨3, 3⟩.
As a result, in Figure 6.6(b), the super-states 1₃ = ⟨2, 8, 5, ⊥⟩ and 3₃ = ⟨1, 7, 6, ⊥⟩ have ⊥ in overlay 3, and there is a super-state 4₃ = ⟨⊥, ⊥, ⊥, 12⟩ with just state 12 in overlay 3; super-state 4₃ is compatible with both super-states 1₃ and 3₃. We can create new super-state transitions by merging super-state 4₃ with either 1₃ or 3₃. In Figure 6.6(c), we show the resulting OD2FA when we merge 4₃ from Figure 6.6(b) with 3₃, adding the super-state transition out of super-state 0₃ on 'p' to super-state 3₃ for overlays 2 and 3 with offset o = 0, and the super-state transition out of super-state 3₃ on 'q' to super-state 5₃ (renamed 4₃ in Figure 6.6(c)) for overlays 2 and 3 with offset o = 0. Alternatively, we could have merged super-state 4₃ from Figure 6.6(b) with super-state 1₃ and added a super-state transition out of super-state 0₃ on 'p' to super-state 1₃ for overlays 1 and 3 with offset o = 0, and a super-state transition out of super-state 1₃ on 'r' to super-state 2₃ for overlays 1 and 3 with offset o = 0. After merging super-states, we regenerate the super-state transitions for all the super-states, not just the super-states that were merged, as merging super-states can create additional transition merging opportunities in other super-states too.

Theorem 8. Given as input OD2FAs D1 and D2, with corresponding equivalent D2FAs D1 and D2 for RE sets R1 and R2, the OD2FAMerge algorithm outputs an OD2FA D3 that is equivalent to the D2FA D3 for RE set R1 ∪ R2.

Proof. The D2FA D3 constructed by merging D2FAs D1 and D2 using the D2FAMerge algorithm is equivalent to the RE set R1 ∪ R2 [36]. Line 20 only changes the deferred state of some states, so D3 remains equivalent to RE set R1 ∪ R2. We now show that the generated OD2FA D3 is equivalent to the D2FA D3. To show equivalence, we need to show that for each state s ∈ Q3, the deferred state for s, the non-deferred transitions for s, and the matched REs for s derived from D3 are the same as in D3.

Let s = ⟨s1, s2⟩ ∈ Q3 be any state in D3. First, S3(s) and O3(s) are defined because we take a complete cross product of S1 × S2 and O1 × O2. The super-state transitions are directly generated from the D2FA state transitions, so it is easy to see that, ∀σ ∈ Σ, ρ3(s, σ) is defined in the OD2FA D3 if and only if it is defined in the D2FA D3, and when defined the two agree. Then we have the following two cases.

Case 1: S3(s) was added to S3 on line 16. Then the REs matched by s derived from D3 are M3(S3(s)) = M_D3(s) (set on line 17), and the deferred state of s derived from D3 is F3(S3(s)) ∩ O3(s) = S3(F3(s)) ∩ O3(F3(s)) = F3(s).

Case 2: S3(s) was added on line 9. Let S3(s) = S = ⟨S1, S2⟩. Then the REs matched by s derived from D3 are M3(S) = M1(S1) ∪ M2(S2) = M_D1(s1) ∪ M_D2(s2) = M_D3(s), and the deferred state of s derived from D3 is F3(S) ∩ O3(s) = F3(s).

6.4.3 Direct OD2FA Construction from 2 OD2FAs

Our OD2FA merge algorithm presented in Section 6.4.2 requires the underlying D2FA to be stored along with the OD2FA. This underlying D2FA requirement for merging OD2FAs is problematic for two main reasons. First, in most practical cases, we need to update the RE set over time. If the underlying D2FA is discarded, then when a new RE is added to the RE set, we cannot use the merge algorithm to merge the OD2FA for the new RE into the existing OD2FA; instead, we have to build the entire OD2FA again. This defeats one of the main advantages of the merge approach to building the OD2FA, which is automatic support for updating the RE set.
The second problem is that, because the underlying D2FA is generally orders of magnitude larger than the OD2FA, the size of the D2FA limits the scalability of the algorithm.

We now present our algorithm, called DirectOD2FAMerge, to merge two OD2FAs without storing the underlying D2FA. After the initial OD2FAs have been built for each individual RE, we only store the OD2FA at each merge step. The input is two OD2FAs, D1 = (Q1, Σ, q01, F1, S1, O1, M1, ∆1) for RE set R1 and D2 = (Q2, Σ, q02, F2, S2, O2, M2, ∆2) for RE set R2, where R1 ∩ R2 = ∅, and we construct OD2FA D3 = (Q3, Σ, q03, F3, S3, O3, M3, ∆3) for the RE set R3 = R1 ∪ R2.

Just as in our OD2FAMerge algorithm in Section 6.4.2, each state (super-state) in D3 corresponds to a pair of states (super-states) from D1 and D2. The first step is to compute Q3, i.e., to find which states in the underlying DFA for D3 are reachable. The set Q3 is not stored explicitly but is implicit in the set of non-empty overlays of each super-state. If we stored the set of non-empty overlays of each super-state as a list, the total size would be proportional to |Q3|, which can be very large. So the set of non-empty overlays of each super-state is stored as a ternary classifier (similar to how we store super-state transitions, which is discussed in Section 6.5).

One option for finding the reachable states is to simulate a UCP construction on the underlying DFAs of D1 and D2: we do the UCP construction, but after computing the transitions of each merged state, we do not store them. The UCP construction also gives the state-to-super-state and overlay assignments. The problem with this method is that the queue of unexplored states during the UCP construction can be proportional to |Q3|. To avoid this, we simulate the UCP construction focusing on super-states instead of states.

The construction works as follows. For each discovered super-state in D3, we maintain two sets of overlays: (1) the Explored set, containing the overlays that hold a reachable DFA state that has already been explored, and (2) the Unexplored set, containing the overlays that hold a reachable DFA state that has not yet been explored. We maintain a queue, Queue, of super-states in D3 that currently need to be explored, and explore one super-state from the queue at a time. For the super-state S currently being explored, we explore all the states corresponding to the overlays in S's Unexplored set, and then move all those overlays from the Unexplored set to the Explored set. When a new state, say s′ = S′ ∩ O′, is discovered, it is processed as follows. If S′ is a newly discovered super-state, we add it to Queue and set Explored(S′) = ∅ and Unexplored(S′) = {O′}. Otherwise, S′ has already been discovered and so is in S3; in this case, if O′ ∈ Explored(S′) or O′ ∈ Unexplored(S′), we do not have to do anything, as the state has already been discovered. Otherwise, this is a newly discovered state, so we add O′ to Unexplored(S′), and add S′ to Queue if it is not already there. A super-state may be added to Queue and explored multiple times because the non-empty overlays within a super-state are not all discovered at the same time. As mentioned earlier, the Explored and Unexplored overlay sets are maintained as ternary classifiers; as new overlays are added to the sets, the classifiers are minimized using the bit merging algorithm explained in Section 6.5.3.
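A simplified C++ sketch of this super-state-level exploration follows. It is illustrative only: the Explored/Unexplored sets are kept as plain hash sets rather than the ternary classifiers the chapter actually uses, and SuperStateOf, OverlayOf, and SuccessorsOfState are hypothetical helpers over the two input OD2FAs (SuccessorsOfState returns the merged successor state for each input character).

    #include <deque>
    #include <map>
    #include <set>
    #include <utility>
    #include <vector>

    using SuperStateId = std::pair<int, int>;  // pair of input super-states
    using Overlay = int;
    using MergedState = std::pair<int, int>;   // pair of input states

    struct SuperStateInfo {
      std::set<Overlay> explored;    // overlays already explored
      std::set<Overlay> unexplored;  // overlays discovered, not yet explored
      bool queued = false;
    };

    SuperStateId SuperStateOf(MergedState u);
    Overlay OverlayOf(MergedState u);
    std::vector<MergedState> SuccessorsOfState(SuperStateId S, Overlay O);

    std::map<SuperStateId, SuperStateInfo> DiscoverReachable(MergedState start) {
      std::map<SuperStateId, SuperStateInfo> info;
      std::deque<SuperStateId> queue;

      auto discover = [&](SuperStateId S, Overlay O) {
        SuperStateInfo& si = info[S];
        if (si.explored.count(O) || si.unexplored.count(O)) return;  // known
        si.unexplored.insert(O);
        if (!si.queued) { si.queued = true; queue.push_back(S); }
      };

      discover(SuperStateOf(start), OverlayOf(start));
      while (!queue.empty()) {
        SuperStateId S = queue.front();
        queue.pop_front();
        SuperStateInfo& si = info[S];
        si.queued = false;
        std::set<Overlay> todo;
        std::swap(todo, si.unexplored);  // explore S's pending overlays
        si.explored.insert(todo.begin(), todo.end());
        for (Overlay O : todo)
          for (MergedState nxt : SuccessorsOfState(S, O))
            discover(SuperStateOf(nxt), OverlayOf(nxt));
      }
      return info;
    }

Note that, as in the text, a super-state can re-enter the queue if new overlays of it are discovered after it has already been explored once.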
After computing the reachable states, we have all the terms of D3 constructed except F3 and ∆3. For the OD2FAs in Figures 6.4 and 6.5, this new merge algorithm results in the same OD2FA as shown earlier in Figure 6.6(b).

To set the super-state deferment, we use a method similar to that used in Section 4.3.2 to set state deferment when merging D2FAs. Let S = ⟨S0, T0⟩ be the current super-state in D3 for which we need to compute the deferment. Let S0 → S1 → ··· → Sl be the maximal deferment chain DC1 in D1 starting at S0 (i.e., Sl is the root super-state), and T0 → T1 → ··· → Tm be the maximal deferment chain DC2 in D2 starting at T0. We choose some super-state ⟨Si, Tj⟩, where 0 ≤ i ≤ l and 0 ≤ j ≤ m, to be F3(S). We only consider a candidate super-state pair if it is reachable in D3 and it overlay covers super-state S (so that Condition (C2) holds). Ideally, we want i and j to be as small as possible, though not both 0; our best choices are typically ⟨S0, T1⟩ or ⟨S1, T0⟩. However, it is possible that neither of these super-states is eligible (not reachable, or not overlay covering S), which leads us to consider other possible ⟨Si, Tj⟩.

For any candidate super-state pair ⟨Si, Tj⟩, we build the super-state transitions for super-state S as if it deferred to super-state ⟨Si, Tj⟩ in D3 (we show how to build the super-state transitions in Section 6.5). The number of super-state transitions built gives us a measure of the effectiveness of the deferment: the fewer transitions built, the better. One strategy (the best match method) is to consider all candidate super-state pairs and pick the one that results in the fewest super-state transitions built for super-state S. A faster strategy (the first match method) is to consider the 'distance sum' z = i + j in increasing order, from 1 to l + m. For the current distance sum z, we consider all super-state pairs at that distance, i.e., the set of super-states Z = {⟨Si, Tz−i⟩ | (max(0, z − m) ≤ i ≤ min(l, z)) ∧ (⟨Si, Tz−i⟩ ∈ S3) ∧ (⟨Si, Tz−i⟩ overlay covers S)}. From the set Z, we choose the super-state that results in the fewest super-state transitions built for super-state S. We can always find an eligible super-state to set as F3(S), since the root super-state pair ⟨Sl, Tm⟩ is always reachable in D3 and overlay covers all other super-states. For example, in Figure 6.6(b), for super-state 4 = ⟨1, 1⟩ there are three reachable super-state pairs along the deferment chains: 1 = ⟨0, 1⟩, 3 = ⟨1, 0⟩ and 0 = ⟨0, 0⟩. However, super-states 1 = ⟨0, 1⟩ and 3 = ⟨1, 0⟩ do not overlay cover super-state 4 = ⟨1, 1⟩, leaving super-state 0 = ⟨0, 0⟩ as the only candidate pair, which is chosen as the deferred super-state.

How the super-state transitions are created for the merged OD2FA is shown in Section 6.5. Pseudo-code for our DirectOD2FAMerge algorithm is given in Algorithm 6.8 (Figure 6.8).

    Input: OD2FAs, D1 = (Q1, Σ, q01, F1, S1, O1, M1, ∆1) and D2 = (Q2, Σ, q02, F2, S2, O2, M2, ∆2), corresponding to RE sets R1 and R2.
    Output: An OD2FA corresponding to the RE set R1 ∪ R2.
     1  Initialize D3 to an empty OD2FA
     2  Set #overlays in D3: |O3| = n ← |O1| × |O2|
        // Create the super-states
     3  Initialize queue as an empty queue
     4  queue.push(⟨q01, q02⟩)
     5  while queue not empty do
     6    u = ⟨u1, u2⟩ ← queue.pop()
     7    Q3 ← Q3 ∪ {u}
     8    S1 ← S1(u1); O1 ← O1(u1)
     9    S2 ← S2(u2); O2 ← O2(u2)
    10    if super-state S = ⟨S1, S2⟩ ∉ S3 then
    11      Initialize super-state S = ⟨S1, S2⟩ with n NULL states
    12      Add S to S3
    13      M3(S) ← M1(S1) ∪ M2(S2)
    14    Assign u to overlay (O1 × |O2| + O2) in super-state S
    15    foreach c ∈ Σ do
    16      nxt ← ⟨δ1(u1, c), δ2(u2, c)⟩
    17      if nxt ∉ Q3 ∧ nxt ∉ queue then queue.push(nxt)
        // Set super-state deferment
    18  foreach S ∈ S3 do F3(S) ← FindDefState(S)
        // Create super-state transitions
    19  foreach S ∈ S3 × c ∈ Σ do
    20    CreateSuperStateTrans(S, c)
    21  Function FindDefState(⟨S1, S2⟩)
    22    Let p0 = S1, p1, ..., pl be the super-states on the deferment chain from S1 to the root super-state in D1
    23    Let q0 = S2, q1, ..., qm be the super-states on the deferment chain from S2 to the root super-state in D2
    24    for z = 1 to (l + m) do
    25      S ← {⟨pi, qz−i⟩ | (max(0, z − m) ≤ i ≤ min(l, z)) ∧ (⟨pi, qz−i⟩ ∈ S3)}
    26      if S ≠ ∅ then return argmin_{DS∈S} (Σ_{c∈Σ} Cost(CreateSuperStateTransClassifier(⟨S1, S2⟩, DS, c)))
    27    return ⟨S1, S2⟩
    28  Function CreateSuperStateTrans(S, c)
    29    C ← CreateSuperStateTransClassifier(S, F3(S), c)
    30    foreach rule ri ∈ C do add super-state transition ∆3(S, P(ri), c) = D(ri)
    31  Function CreateSuperStateTransClassifier(S, DS, c)
          /* Generate transitions for character c and super-state S when it defers to DS */
    32    Let ODec[n] be the offset decision vector, initialized to ⊛
    33    Let NODec[n] be the non-offset decision vector, initialized to ⊛
    34    Let Reqd[n] be the required vector, initialized to False
    35    foreach O ∈ O3 do
    36      if S ∩ O ≠ ⊥ then
    37        u = ⟨u1, u2⟩ ← S ∩ O                      // current state
    38        nu1 ← δ1(u1, c); nu2 ← δ2(u2, c)          // next state
    39        if S = DS then                            // for the root super-state
    40          if (u1 ≠ nu1) ∨ (u2 ≠ nu2) then Reqd[O] ← True   // not a self-loop
    41        else
    42          du = ⟨du1, du2⟩ ← DS ∩ O
    43          if (δ1(du1, c) ≠ nu1) ∨ (δ2(du2, c) ≠ nu2) then Reqd[O] ← True   // not deferred
    44        ODec[O] ← (S3(⟨nu1, nu2⟩), (O3(⟨nu1, nu2⟩) − O) mod n, 1)
    45        NODec[O] ← (S3(⟨nu1, nu2⟩), O3(⟨nu1, nu2⟩), 0)
    46    if #unique values in ODec ≤ #unique values in NODec then
    47      return CreateOverlayClassifier(ODec, Reqd)
    48    else
    49      return CreateOverlayClassifier(NODec, Reqd)

Figure 6.8: Algorithm DirectOD2FAMerge(D1, D2) for merging two OD2FAs.

At the end, we apply the same optimization of merging sibling super-states together as in the case of our OD2FAMerge algorithm.
6.5 Building Super-state Transitions

In this section we describe how we combine state transitions to create super-state transitions, after the super-states have been created. The OD2FA captures similarity among states in different overlays within a super-state, so we expect state transitions (which are just singleton super-state transitions) to be combined over the overlay field; i.e., multiple singleton super-state transitions with the same current super-state, current input character, and decision values but different overlay values are combined. The super-state transitions are created for one super-state and one input character at a time. In the rest of the section, S refers to the current super-state and σ refers to the current input character for which we want to build the super-state transitions; T refers to the current (or potential) deferred super-state of S.
6.5.1 Combining State Transitions

To combine the state (singleton super-state) transitions, we first need to identify the (subsets of) overlays that have the same decision, that is, the same next super-state, overlay value, and offset bit.

A trivial way to combine state transitions is to create one super-state transition for each unique decision value among all the overlay decisions; all the overlays (i.e., state transitions) having the same decision are combined into the super-state transition for that decision. In this case, we have the smallest possible number of super-state transitions, equal to the number of unique decisions. The problem with this approach is that a super-state transition may then contain an arbitrary subset of overlays, and we would need to represent arbitrary subsets of overlays. This is problematic because any such representation has a size linear in the size of the overlay set O; the combined memory requirement of such a representation over all super-state transitions of all super-states would essentially be linear in the number of state transitions, which defeats the purpose of combining the state transitions.

To address this issue, we only combine state transitions whose overlay sets can be concisely represented. Specifically, we only create overlay subsets that can be represented as a ternary value; i.e., the set of overlays in each combined super-state transition is equal to the ternary expansion of a ternary value (see the sketch below). Recall that we treat the overlays as integers in the range [0..|O|) and that |O| is always a power of 2. In most cases this restriction does not cost us combining opportunities: in almost all cases, we are able to combine all state transitions with the same decision into a single super-state transition.
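For concreteness, the following C++ fragment (an illustrative sketch, not the dissertation's code) shows one way to store a ternary value over log2|O| bits as a mask/value pair and to enumerate the overlay set it expands to:

    #include <cstdint>
    #include <vector>

    // A ternary value: starMask has a 1 at every * position; valueBits
    // holds the binary bits (ignored at * positions).
    struct Ternary {
      uint32_t starMask;
      uint32_t valueBits;
    };

    // The "ternary expansion": every overlay that agrees with t on all
    // non-* positions. numOverlays must be a power of 2.
    std::vector<uint32_t> Expand(Ternary t, uint32_t numOverlays) {
      std::vector<uint32_t> overlays;
      for (uint32_t o = 0; o < numOverlays; ++o)
        if (((o ^ t.valueBits) & ~t.starMask) == 0)
          overlays.push_back(o);
      return overlays;
    }
    // Example: with |O| = 4, the ternary value 0* (starMask = 0b01,
    // valueBits = 0b00) expands to {0, 1}; ** expands to {0, 1, 2, 3}.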
6.5.1.1 Computing State Transitions

For each overlay O ∈ O, we have one of the following three cases: (a) S ∩ O = ⊥, i.e., the overlay is empty; (b) S ∩ O = s and δ(s, σ) ≠ δ(T ∩ O, σ), i.e., the state transition is not deferred; and (c) S ∩ O = s and δ(s, σ) = δ(T ∩ O, σ), i.e., the state transition is deferred. Of ⊆ O denotes the set of filled overlays, and Or ⊆ Of denotes the set of overlays for which the state transition is not deferred. Note that Of depends on S, while Or depends on S, T and σ. The super-state transitions generated for super-state S need to cover all the overlays in Or. We make the following two observations, which help us combine the state transitions into fewer super-state transitions.

• We never do a lookup on the OD2FA for any overlay O ∈ O \ Of of super-state S. Because of this, empty overlays can have any decision, and so can be 'merged' with any overlay. For example, suppose we have |O| = 4, where overlay 2 = (10)2 is empty, and overlays 0 = (00)2, 1 = (01)2 and 3 = (11)2 all have the same decision. If we try to combine just the filled overlays, we get two super-state transitions, with overlay sets 0∗ and 11. But since we would never do a lookup on the empty overlay, we can include it in the super-state transition, which results in only one transition with overlay set ∗∗. For every empty overlay we designate a special wildcard decision, denoted by ⊛, that matches any actual decision. Also note that if we include empty overlays in super-state transitions, Condition (C2) is necessary and sufficient to ensure that transition deferment works correctly.

• It is not necessary to defer transitions that match the deferred state. When combining state transitions, including transitions that could be deferred can result in fewer super-state transitions. For example, suppose we have |O| = 4, where all four overlays are filled and all have the same decision, but the transition for overlay 2 = (10)2 is deferred, whereas the transitions for overlays 0 = (00)2, 1 = (01)2 and 3 = (11)2 are not. If we require that the transition for overlay 2 be deferred, we need two super-state transitions, with overlay sets 0∗ and 11, to cover the remaining overlays; including the state transition for overlay 2 in the combined super-state transition results in only one super-state transition, with overlay set ∗∗.

Before we can combine state transitions, we first need to compute the state transition and deferment for each overlay. We create a Decision array, which records the decision for each overlay, and a corresponding Boolean Required array, which records whether the decision is necessary (i.e., whether it must be specified or it can be deferred). For empty overlays, the Decision value is set to ⊛ and Required is set to false. For filled overlays, how the state transitions are computed depends on the stage of the OD2FA construction.

During initial OD2FA construction for one RE: the underlying D2FA is available during the initial OD2FA construction, so the state transitions and deferments are determined by the D2FA.

During OD2FAMerge: since the underlying D2FA is stored in OD2FAMerge, the state transitions and deferments are again determined by the stored D2FA. The D2FA lookup from the underlying D2FA corresponds to lines 33 and 34 in Algorithm 6.7.

During DirectOD2FAMerge: we do a lookup on the two input OD2FAs to compute the state transitions and deferments. This lookup corresponds to lines 38 and 43 in Algorithm 6.8.

For the root super-state, we set the Required value to false for self-loop state transitions, even though these transitions are not deferred. As a result, the root super-state does not store the self-looping super-state transitions. If a lookup fails for a non-root super-state, we follow the deferment pointer and do a lookup on its deferred super-state. If a lookup fails for the root super-state, there is no deferment pointer to follow; however, we know that the missing transition is a self-loop (on the root super-state), so the destination super-state is the root super-state and the destination overlay is the current overlay. Since most transitions for the root super-state are self-loops, this greatly reduces the resulting number of super-state transitions.

We need to determine which of the two forms of super-state transitions (offset transitions or non-offset transitions) to create. Clearly we should use the form that results in fewer super-state transitions, so we create a Decision array for both the offset and the non-offset decisions, and use the one with fewer unique values to create the super-state transitions. In most cases, using the offset decisions results in fewer super-state transitions.

We only compute and store the transitions for the states of one super-state at a time. Once the super-state transitions have been constructed, the state transitions are discarded; hence we never store the state transitions for all states of the OD2FA at the same time.

For example, consider super-state 1 and input character d in the OD2FA in Figure 6.6(c). The OD2FA has four overlays, so O = {0, 1, 2, 3}.
In this case we have Of = {0, 1, 2} and Or = {0, 2}. The Decision array will be [(0, 1, 1), (0, 0, 1), (0, 1, 1), ⊛] and the Required array will be [true, false, true, false].

6.5.2 Creating Overlay Classifier

The set of state transitions over the overlays, for super-state S and input character σ, essentially forms a 1-dimensional classifier over the overlay field. The problem of creating a minimum set of covering super-state transitions then boils down to finding an equivalent ternary minimized classifier. We introduce some standard terminology first. A 1-dimensional classifier is defined over a field F and consists of a list of rules. Each rule r has a predicate P(r) ⊆ F and a decision D(r). A packet p ∈ F matches rule r if p ∈ P(r). The decision of the classifier C for a packet p is given by the first rule in C that matches p. For our purpose of using a classifier to build super-state transitions, we define a generalized version of a classifier that we call an overlay classifier.

Definition 7 (Overlay classifier). An overlay classifier, C, is a 1-dimensional classifier over the field O. Each rule r additionally has a Boolean flag, denoted by R(r), that indicates whether the rule is required or not; rules with decision ⊛ have their flag R(r) set to false. The rules in C satisfy the following properties:

• Ternary classifier: for each rule r ∈ C, its predicate P(r) is a ternary value.

• Non-conflicting property: for every packet p ∈ Of, all the rules that match p (if any) have matching decisions (note that ⊛ matches any actual decision).

• Covering property: for every packet p ∈ Or, there is at least one rule r ∈ C that matches p with R(r) true (which also implies D(r) ≠ ⊛).

• Restricted equivalence: two overlay classifiers are equivalent if, for every packet in Of for which both overlay classifiers have a match, they both have the same decision.

Given the Decision and Required values for each overlay, we first construct an overlay classifier with one rule per overlay. Specifically, we create an empty overlay classifier C over O; then for each overlay O, we add the rule Rule(O, Decision[O], Required[O]) to C, where Rule(x, y, z) creates a rule r with P(r) = x, D(r) = y and R(r) = z. Next we minimize the rules in C to get an equivalent overlay classifier C′ (discussed in the next section). After minimizing, each rule r ∈ C′ with R(r) = true gives us a combined super-state transition ∆(S, P(r), σ) = D(r) in the OD2FA. The covering property of overlay classifiers ensures that super-state S has a super-state transition covering every overlay in Or; the non-conflicting property ensures that each overlay in Of has at most one decision. Note that we may have more than one super-state transition covering an overlay, but in that case the non-conflicting property ensures that they all have the same decision.

For example, with super-state 1 and input character d in the OD2FA in Figure 6.6(c), the overlay classifier created has just one required rule, ∗0 → (0, 1, 1), which gives us the super-state transition (1, ∗0) --d--> (0, 1, 1). Figure 6.9 shows the overlay classifiers and corresponding super-state transitions generated for all the super-states in the OD2FA in Figure 6.6(c).

Figure 6.9: Overlay classifiers and corresponding super-state transitions for the super-states in the OD2FA in Figure 6.6(c).
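A concrete C++ rendering of this initial construction (an illustrative sketch mirroring the pseudo-code that follows; the type names and the wildcard flag standing in for ⊛ are this sketch's own):

    #include <cstdint>
    #include <vector>

    struct Decision {
      uint32_t nextSuperState = 0;
      uint32_t nextOverlay = 0;
      bool offsetBit = false;
      bool wildcard = true;  // stands in for the wildcard decision (empty overlay)
    };

    struct Ternary { uint32_t starMask; uint32_t valueBits; };

    struct OverlayRule {
      Ternary predicate;  // P(r)
      Decision decision;  // D(r)
      bool required;      // R(r); always false when the decision is the wildcard
    };

    // Build the initial classifier: one rule per overlay, whose predicate is
    // the overlay's binary value (no * bits yet). Minimization follows.
    std::vector<OverlayRule> CreateOverlayClassifier(
        const std::vector<Decision>& dec, const std::vector<bool>& reqd) {
      std::vector<OverlayRule> C;
      for (uint32_t o = 0; o < dec.size(); ++o)
        C.push_back({Ternary{0, o}, dec[o], reqd[o]});
      return C;  // to be passed to the minimization step of Section 6.5.3
    }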
The pseudo-code for creating the overlay classifier is given in Algorithm 6.10 (Figure 6.10).

    Input: The decision, Dec[], and required value, Reqd[], for each overlay.
    Output: An equivalent ternary minimized overlay classifier.
    1  n ← len(Dec)        // number of overlays; always a power of 2
    2  w ← log2(n)         // number of bits
    3  Create empty overlay classifier C with field width w
    4  foreach overlay o ∈ [0..n) do
    5    Insert Rule(o, Dec[o], Reqd[o]) in C
    6  return MinimizeOverlayClassifier(C)   // minimize the rules and return

Figure 6.10: Algorithm CreateOverlayClassifier(Dec, Reqd).

6.5.3 Minimizing Overlay Classifier

We now explain how we minimize the initial overlay classifier created from the Decision and Required arrays. We generalize the bit merging algorithm proposed in [31] to handle the wildcard decision and optional deferment. We introduce some standard terminology first.

For a ternary value T, the ternary position mask of T, denoted τ(T), is the binary value obtained by replacing all binary bits in T by 0 and all ternary bits (∗) in T by 1; it indicates the positions in T that hold a ternary bit. The binary bit mask of T, denoted β(T), is the binary value obtained by replacing all ternary bits in T by 1. The ternary position mask and binary bit mask together represent a ternary value using two binary values: if bit location b is a 1 bit in τ(T), then T has a ∗ in location b; otherwise T has the same binary bit in location b as β(T). So we can represent a ternary value T as the pair of binary values (τ(T), β(T)).

Two ternary values, T1 and T2, are said to be ternary adjacent if τ(T1) = τ(T2) and β(T1) and β(T2) differ in exactly one bit; in other words, T1 and T2 differ in exactly one location, which holds a binary bit in both T1 and T2. The ternary cover of T1 and T2 is the ternary value (τ(T1) | (β(T1) ^ β(T2)), β(T1) | (β(T1) ^ β(T2))) (here | is the bitwise OR, and ^ is the bitwise XOR); that is, the ternary cover is the ternary value obtained by replacing the differing binary bit location in T1 (or in T2) by the ternary bit ∗. Two rules are said to be ternary adjacent if their predicates are ternary adjacent and their decisions match.
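These definitions translate directly into a few bit operations; the C++ sketch below (illustrative) represents a ternary value by the (τ, β) pair just described.

    #include <cstdint>

    // A ternary value T stored as the pair (tau, beta) defined above:
    // tau has 1s at the * positions; beta is T with every * replaced by 1.
    struct Ternary {
      uint32_t tau;
      uint32_t beta;
    };

    // Ternary adjacent: same * positions, binary bit masks differing in
    // exactly one bit.
    bool TernaryAdjacent(Ternary t1, Ternary t2) {
      if (t1.tau != t2.tau) return false;
      uint32_t diff = t1.beta ^ t2.beta;
      return diff != 0 && (diff & (diff - 1)) == 0;  // exactly one bit differs
    }

    // Ternary cover: replace the differing binary bit by *.
    Ternary TernaryCover(Ternary t1, Ternary t2) {
      uint32_t diff = t1.beta ^ t2.beta;
      return Ternary{t1.tau | diff, t1.beta | diff};
    }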
We first minimize the rules in the overlay classifier and then remove the rules that are not required (i.e., have the R(r) flag set to false). Minimizing the overlay classifier is done in two steps, pre-merging bits and bit merging, which we explain using the example in Figure 6.11.

    Column 1 (initial classifier):
      0000 → A, 0001 → ⊛, 0010 → A?, 0011 → A, 0100 → ⊛, 0101 → ⊛, 0110 → B, 0111 → B,
      1000 → B, 1001 → ⊛, 1010 → ⊛, 1011 → A?, 1100 → ⊛, 1101 → ⊛, 1110 → ⊛, 1111 → B?
    Column 2 (bit 0 eliminated by pre-merging):
      000∗ → A, 001∗ → A, 010∗ → ⊛, 011∗ → B, 100∗ → B, 101∗ → A?, 110∗ → ⊛, 111∗ → B?
    Column 3 (bit merge, first pass):
      00∗∗ → A, ∗01∗ → A?, 0∗0∗ → A?, ∗11∗ → B, 01∗∗ → B?, 1∗0∗ → B, 11∗∗ → B?
    Column 4 (bit merge, second pass):
      00∗∗ → A, ∗01∗ → A?, 0∗0∗ → A?, ∗11∗ → B, 1∗0∗ → B, ∗1∗∗ → B?
    Column 5 (non-required rules removed):
      00∗∗ → A, ∗11∗ → B, 1∗0∗ → B

Figure 6.11: Minimizing overlay classifier example.

The pseudo-code for minimizing the overlay classifier is given in Algorithm 6.12 (Figure 6.12).

    Input: An initial overlay classifier C with n = |O| rules.
    Output: An equivalent overlay classifier with the rules minimized.
     1  w ← log2(n)                         // number of bits
     2  foreach bit k ∈ [0..w) do           // first try pre-merging bits
     3    premerge ← True
     4    foreach pair of rules ri, rj such that P(ri) and P(rj) differ only in bit k do
     5      if ri and rj are not ternary adjacent then   // i.e., their decisions do not match
     6        premerge ← False
     7        break
     8    if premerge then                  // bit k is pre-merged
     9      foreach pair of rules ri, rj such that P(ri) and P(rj) differ only in bit k do
    10        Remove rules ri and rj from C
    11        Insert rule MergedRule(ri, rj) in C
    12  C ← BitMerge(C)                     // then do bit merging
    13  foreach rule ri ∈ C do
          if R(ri) = False then remove ri from C   // remove non-required rules
    14  return C
    15  Function BitMerge(C)
    16    Create empty overlay classifier C′
    17    foreach rule ri ∈ C do initialize covered[i] ← False
    18    PM ← partition of the rules in C based on the ternary position masks of the rule predicates
    19    foreach partition pm ∈ PM do
    20      PD ← partition of the rules in pm based on rule decision
    21      foreach partition pd ∈ PD with corresponding decision d do
    22        foreach pair of rules ri, rj ∈ pd do
    23          if ri and rj are ternary adjacent then
    24            Insert MergedRule(ri, rj) in C′
    25            covered[i] ← covered[j] ← True
    26            R(ri) ← R(rj) ← False
    27        if d ≠ ⊛ then
    28          p⊛ ← partition in PD corresponding to ⊛
    29          foreach pair of rules ri ∈ pd × rj ∈ p⊛ do
    30            if ri and rj are ternary adjacent then
    31              Insert MergedRule(ri, rj) in C′
    32              covered[i] ← covered[j] ← True
    33              R(ri) ← R(rj) ← False
    34    if C′ is empty then return C      // no rules merged
    35    foreach rule ri ∈ C do
    36      if covered[i] = False then insert ri in C′
    37    Remove duplicate rules from C′
    38    return BitMerge(C′)               // recursively call BitMerge and return the result
    39  Function MergedRule(r1, r2)
    40    T ← ternary cover of P(r1) and P(r2)
    41    if D(r1) ≠ ⊛ then D ← D(r1) else D ← D(r2)
    42    reqd ← R(r1) ∨ R(r2)
    43    return Rule(T, D, reqd)

Figure 6.12: Algorithm MinimizeOverlayClassifier(C).

6.5.3.1 Pre-merging Bits

The initial overlay classifier created from the Decision and Required arrays has |O| rules, one rule for each overlay, and the predicate of any rule ri is i (the corresponding overlay value in binary). For our example, the first column in Figure 6.11 shows the initial overlay classifier; we have |O| = 16, there are two unique actual decisions, denoted A and B, and a '?' next to an actual decision indicates that the rule is not required (rules with decision ⊛ are always not required).

At this point we could directly apply the bit merging algorithm, which would produce a minimized set of rules. But in most cases, all but a few overlays have the same decision, so only the few bits that distinguish overlays having different decisions vary in the minimized rules; all the other bits are merged to ∗ in all the minimized rules. We can accelerate the bit merging step by identifying these bits and pre-merging them, so that bit merging only needs to work on the few remaining bits that are not pre-merged.

Pre-merging works as follows. For a binary value p, let 0̂b(p) denote the value obtained by inserting a 0 bit at location b, and 1̂b(p) the value obtained by inserting a 1 bit at location b. Bit location b is pre-merged if the following condition is true: ∀p ∈ [0..|O|/2), D(r at 0̂b(p)) matches D(r at 1̂b(p)). That is, for every pair of rules whose predicates differ only in bit location b, their decisions match. Since the decisions for every such pair of rules match, we merge each such pair: for a pair ri and rj, we create a new merged rule rk with P(rk) set to the ternary cover of P(ri) and P(rj); if D(ri) ≠ ⊛ then we set D(rk) ← D(ri), otherwise D(rk) ← D(rj); and we set R(rk) ← R(ri) ∨ R(rj). Rules ri and rj are replaced with the merged rule rk. We test and pre-merge one bit location at a time; every time a bit is pre-merged, the number of rules is halved. In our example in Figure 6.11, bit location 0 gets pre-merged, and the resulting rules are shown in the second column.
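A sketch of the pre-merge test for one bit location (illustrative C++; DecisionsMatch treats the wildcard decision ⊛ as matching anything):

    #include <cstdint>
    #include <vector>

    struct Rule {
      uint32_t decision;  // stand-in for the (super-state, overlay, offset) triple
      bool wildcard;      // true for the wildcard decision
      bool required;
    };

    bool DecisionsMatch(const Rule& a, const Rule& b) {
      return a.wildcard || b.wildcard || a.decision == b.decision;
    }

    // Insert bit 'bit' (0 or 1) into p at location b: the 0^b / 1^b maps.
    uint32_t InsertBit(uint32_t p, unsigned b, uint32_t bit) {
      uint32_t low = p & ((1u << b) - 1);   // bits below location b
      uint32_t high = (p >> b) << (b + 1);  // bits at/above b, shifted up
      return high | (bit << b) | low;
    }

    // Bit location b can be pre-merged iff, for every p in [0..n/2), the
    // decisions of the two rules whose predicates differ only in bit b match.
    // 'rules' is the initial classifier, rule i having predicate i.
    bool CanPreMerge(const std::vector<Rule>& rules, unsigned b) {
      uint32_t half = rules.size() / 2;
      for (uint32_t p = 0; p < half; ++p)
        if (!DecisionsMatch(rules[InsertBit(p, b, 0)], rules[InsertBit(p, b, 1)]))
          return false;
      return true;
    }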
6.5.3.2 Bit Merging Algorithm

The bit merging algorithm runs in several iterations. The input to each iteration is an overlay classifier C, and the output is an equivalent overlay classifier C′. Each iteration works as follows. We first initialize a Covered flag to false for each rule in C; for rule ri, Covered[ri] indicates whether rule ri is covered by some rule in C′. Then, for every pair of rules ri and rj in C that are ternary adjacent, we insert the merged rule rk in C′ (the merged rule is created in the same way as during the pre-merging step). After inserting rk into C′, we set Covered[ri] and Covered[rj] to true and set R(ri) and R(rj) to false. The idea behind setting the required flags of ri (and rj) to false is that, since a rule covering ri has already been added to C′, any further rules we add to C′ should not be marked required on account of ri. To speed up bit merging, we partition the rules based on the ternary position mask of each rule's predicate and on each rule's decision; this reduces the number of rule pairs we need to check for merging. After all pairs have been checked, any rules left in C with their Covered flag false are added to C′. The bit merging iterations continue as long as at least one merged rule is added to C′; when no pair of rules is merged, we stop and return the current overlay classifier.

For our example in Figure 6.11, we have two iterations of bit merging. After the first iteration, we get the rules in column 3. The first rule in column 3 is obtained by merging the first two rules in column 2; after merging them, both rules are marked as non-required, so when the third rule in column 3 is created by merging the first and third rules in column 2, it is marked as non-required. We get the rules in column 4 after the second iteration of bit merging. No more rules can be merged after that, so bit merging stops. Finally, we remove the non-required rules to get the final overlay classifier shown in column 5.

6.5.4 Overlay Discussion

6.5.4.1 Restricting Overlay Count to Power of 2

We keep the number of overlays in all intermediate OD2FAs and the final OD2FA a power of 2, and number the overlays starting with 0 and ending with |O| − 1. We achieve this by modifying the algorithm that constructs an OD2FA from one RE to pad empty overlays at the end if necessary. The OD2FA merge algorithm requires no modification, since the number of overlays in the merged OD2FA is equal to the product of the numbers of overlays in the two input OD2FAs.

We explain by example the benefit of requiring the number of overlays to be a power of 2. Figure 6.13(a) shows the D2FA for the RE /x.*y.*z/ and Figure 6.13(b) shows two possible overlay structures for the OD2FA. Since there are three self-looping states in the D2FA (0, 1 and 2), our algorithm places them in the root super-state. The overlay structure on the left has three overlays, holding the three self-looping states, with no padding.

Figure 6.13: Overlay padding example. (a) D2FA for RE /x.*y.*z/. (b) Possible overlay structures for the corresponding OD2FA, without and with padding. (c) Merged super-state, without and with padding. (d) TCAM rules, without and with padding.
The OD2 FA merge algorithm requires no modi cation since the number of overlays in the merged OD2 FA is equal to the product of the number of overlays in the two input OD2 FAs. We explain by example the bene t of requiring the number of overlays to be a power of 2. Figure 6.13(a) shows the D2 FA for the RE /x. y. z/ and Figure 6.13(b) shows two possible overlay structures for the OD2 FA. Since there are three self-looping states in the D2 FA, 0, 1 and 2, our algorithm places them in the root super-state. The overlay structure on the left has three overlays, with the three self-looping states in them, with no padding. ‐x 0 ‐y x 1 y 2 z (a) D2 FA for RE /x. y. x/. 3/3 1 2 0 1 2 0 0 ‐z 0 1 2 1   3 Without padding 0 0 1 2 0 1 2  0 1 2 3 1   3 3  With padding (b) Possible overlay structures for the corresponding OD2 FA. Figure 6.13: Overlay Padding Example. 182 0 3,0 2 1 3 1 7 6 12 0 1 2 3 3 X 4 5 0 6 2 0 0 1 1 2 7 8 9 10 11 1,0 1,1 1,2 7,0 7,1 7,2 6,0 6,1 6,2 12,012,112,2 Without padding 0 3,0 0 1 2 1 3 1 7 6 12 2 1,0 1,1 1,2 3  4 3 5 X 6 7 7,0 7,1 7,2  0 8 2 0 0 1 1 2 9 10 6,0 6,1 6,2 11 3  12 13 14 15  12,012,112,2  With padding (c) Merged super-state. Overlay 0 1 2 3 4 5 OID 0000 0001 0010 0011 0100 0101 Overlay 0 1 2 3 4 5 6 7 OID 00 010 Without padding OID 0000 0001 0010 0011 0100 0101 0110 0111 OID 0 With padding (d) TCAM rules. Figure 6.13: Overlay Padding Example (cont'd). In the right overlay structure, we pad one empty overlay, so that the number of overlays is a power of 2. Now consider what happens when this new OD2 FA in Figure 6.13(b) with and without padding, is merged with the OD2 FA in Figure 6.6(c). As an example, we consider the merging of super-state 3 in Figure 6.6(c), which we call S3 and super-state 0 for the new OD2 FA, which we call S0 . For both cases, Figure 6.13(c) shows the resulting 183 super-state in the merged OD2 FA, which we call Sm . In both cases, there will be 12 states in the merged super-state. The rst three of these states are replications of state 1 in S3 , the next three states are replications of state 7 in S3 , and so on. Furthermore, states 1 and 7 in S3 were itself replications of the state 1 of the D2 FA in Figure 6.4. Hence, the rst six states in Sm are replications of the same state (i.e. state 1) of the D2 FA in Figure 6.4. For the case without padding, Sm has 12 overlays, with one state in each overlay. For the case with padding, Sm has 16 overlays, with the overlays 3, 7, 11 and 15 being empty. Now, since the rst six states in Sm are replications of state 1 of the D2 FA in Figure 6.4, in the merged OD2 FA, they all will have one non-deferred transitions on input character a. In both cases, the overlay o sets will also be the same for all six state transitions. So all six overlays will have the same decision, and will bit-merge in the overlay classi er. Figure 6.13(d) shows the (predicates of the) rules in the minimized overlay classi er for both cases. For the case without padding, we can only get down to two rules from six rules. In the case with padding, the overlays 3 = 0011 and 7 = 0111 are empty overlays, and hence will have decision during bit-merging. As a result, we can merge all six rules into a single rule. 6.5.4.2 Eliminating Overlay Bits We modify the OD2 FA merge algorithm to eliminate unnecessary overlay ID bits and thus reduce the required TCAM entry width. The idea behind doing a cross product of overlays while merging is to capture the replication of states. 
Replicated states get assigned to different overlays in the same super-state. However, sometimes there is no replication, and we do not need the extra overlays. For example, consider merging the OD2FAs for the REs /ab.*cd/ and /ab.*ef/. The two input OD2FAs both have two overlays, 0 and 1, so in the merged OD2FA we create four overlays, 0, 1, 2, and 3. In this case, since both REs have a common prefix, there is no state replication, and overlays 1 and 2 are empty in the merged OD2FA. The two filled overlays, 0 and 3, have overlay IDs 00 and 11; since these two overlays differ in both bits, either bit is redundant and can be removed from the overlay ID, producing only two overlays, 0 and 1. In general, after merging two OD2FAs, we eliminate as many overlay ID bits as possible by searching for overlay ID bits i such that, in every pair of overlays whose overlay IDs differ only in bit i, at least one of the two overlays is empty. If bit i is eliminated, one empty overlay from each pair that differs in bit i is removed. We note that the overlay count stays a power of 2.

6.6 OD2FA Software Implementation

In this section we discuss the implementation of OD2FA in software on a general purpose processor. We first review the implementation of DFA and D2FA in software, then present our proposed implementation of OD2FA. Implementing any finite automaton mainly involves choosing a data structure to store the transition function and then implementing the lookup function on that data structure.

In a DFA (Q, Σ, q0, M, δ), each state in Q has |Σ| transitions. The transition function δ can be stored in memory as a 2-dimensional array of next-state values, indexed over Q and Σ; looking up the next state requires just one memory lookup in the array, using the current state and input character as indices. If we assume a 4-byte state ID value, the amount of memory required to implement the transition function is |Q| × |Σ| × 4 bytes.

For a D2FA (Q, Σ, q0, M, ρ, F), each state in Q has 0 to |Σ| transitions plus the deferment pointer, and most states have only a couple of transitions. So the transitions for each state can be stored as a list of (current character, next state) pairs in memory. To do a lookup, we go through the list of transitions for the current state to check whether there is a transition on the current input character; if there is one, we get the next state, otherwise we go to the deferred state of the current state and check its transition list. The amount of memory required to implement the transition function is (# transitions in ρ) × 5 bytes for the transitions plus |Q| × 4 bytes for the deferment pointers.
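The two lookup schemes can be sketched as follows (illustrative C++, assuming 1-byte input characters and the flat layouts just described):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // DFA: one 2-D array access per input character.
    struct Dfa {
      std::vector<int32_t> next;  // |Q| x 256 next-state ids, one row per state
      int32_t Lookup(int32_t state, uint8_t c) const {
        return next[state * 256 + c];
      }
    };

    // D2FA: per-state list of (character, next state) pairs plus a deferment
    // pointer; on a miss, follow the deferment pointer and retry. The loop
    // terminates because the deferment root stores all |Sigma| transitions.
    struct D2fa {
      std::vector<std::vector<std::pair<uint8_t, int32_t>>> trans;  // rho
      std::vector<int32_t> defer;                                   // F
      int32_t Lookup(int32_t state, uint8_t c) const {
        for (;;) {
          for (const auto& t : trans[state])
            if (t.first == c) return t.second;
          state = defer[state];  // transition deferred: follow the pointer
        }
      }
    };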
6.6.1 Implementing OD2FA

We now discuss the implementation of an OD2FA (Q, Σ, q0, F, S, O, M, ∆). All of the fields of an OD2FA are simple to implement except for ∆. To implement ∆, we use a structure similar to that of a D2FA, except that instead of storing next-state values, we store pointers to overlay classifiers. Specifically, for each super-state, we store in memory a list of (current character, pointer to overlay classifier) pairs, one for each character that is not deferred. Note that a character may be deferred for some overlays, but we say it is not deferred if there is at least one overlay where it is not deferred.

Given the current super-state S, current overlay O and current character σ, the lookup is done as follows. We go through the transition list for super-state S to check whether there is an entry for character σ. If there is no entry for σ, we perform the lookup using the deferred super-state of S, F(S). If there is an entry for σ, it gives us the location of the overlay classifier to use, and we do a lookup in this overlay classifier for overlay O (we discuss next how to do this). If we find a match, the decision gives us the next super-state and overlay values. If we do not find a match, then overlay O is deferred for character σ, so we again perform the lookup using the deferred super-state F(S).

6.6.2 Overlay Classifier Storage and Lookup

An overlay classifier is just a list of rules. Each rule has a rule predicate, which is a ternary value, and a rule decision, which is a triple of next super-state, overlay value and offset bit. If we use 4-byte overlay ID values, the rule predicate can be stored using two 4-byte values: one is the ternary position mask of the rule predicate and the other is the binary bit mask of the rule predicate. The rule decision can also be stored as two 4-byte values, one for the next super-state and the other for the overlay value; the single offset bit can be encoded in either of these two values. We simply store the list of rules in memory, requiring 16 bytes per rule.

The lookup for an overlay O is done as follows. We go through the list of rules and check whether any rule matches overlay O. To check whether a rule r matches overlay O, we need to check whether the rule predicate P(r) covers O; P(r) covers O if all the bit locations that contain a binary bit in P(r) have the same bit in both P(r) and O. This check may be done very efficiently using just one bitwise OR, by testing (O | τ(P(r))) = β(P(r)).
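In code, the match test is exactly this one-OR-one-compare check (an illustrative sketch reusing the 16-byte rule layout just described):

    #include <cstdint>

    // One 16-byte overlay classifier rule: ternary predicate as (tau, beta),
    // decision as (next super-state, next overlay), with the offset bit
    // packed here into the top bit of the overlay word.
    struct OverlayRule {
      uint32_t tau;          // ternary position mask of P(r)
      uint32_t beta;         // binary bit mask of P(r)
      uint32_t nextSuper;
      uint32_t nextOverlay;  // bit 31 holds the offset bit in this sketch
    };

    // First-match lookup of overlay O in a classifier of n rules.
    // Returns the matching rule, or nullptr if O is deferred for this
    // character (no rule covers O).
    const OverlayRule* LookupOverlay(const OverlayRule* rules, int n, uint32_t O) {
      for (int i = 0; i < n; ++i)
        if ((O | rules[i].tau) == rules[i].beta)  // P(r) covers O
          return &rules[i];
      return nullptr;
    }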
6.6.3 Space Requirement

For the OD2FA, we need |S| × 4 bytes to store the super-state deferment pointers and roughly |S| bytes to store the super-state match function M. If m = Σ_{S∈S} (number of non-deferred characters for S), then we need m × 5 bytes to store the overlay classifier pointers. We optimize the size required to store the overlay classifiers using the following observation: the same overlay classifier may be used by multiple super-states for multiple characters. Rather than storing the same overlay classifier multiple times, we store one copy of each unique overlay classifier; in each super-state transition list, every entry that points to the same overlay classifier uses the same pointer. The memory required to store the overlay classifiers is then 16 bytes times the total number of rules among all the unique overlay classifiers.

6.7 OD2FA Implementation in TCAM

In this section, we describe how OD2FA can be implemented in TCAM and present our OverlayCAM algorithm for doing so. We extend our RegCAM algorithm described in Chapter 5 to implement the OD2FA in TCAM. The RegCAM implementation uses two tables to represent an automaton: a TCAM lookup table with a source state ID column and an input character column, and a corresponding SRAM decision table that contains the next state ID. To implement OD2FA in TCAM, we use the pair of super-state ID and overlay ID as the source state ID in the TCAM lookup table, and the next state ID (the pair of next super-state ID and next overlay ID) in the SRAM decision table. The super-state ID and overlay ID columns in TCAM are filled with ternary values that together match multiple states rather than a single state, whereas the super-state ID and overlay ID columns in SRAM are binary values that together give a single state. We add an extra bit in the SRAM decision table to hold the offset bit of the super-state transition decision. Just as in RegCAM, we leverage the first-match feature of TCAMs to ensure that the correct transition is found in the TCAM lookup table: specifically, if super-state S defers to super-state S′, then we list all the super-state transitions for super-state S before those of super-state S′. We describe the specific challenges of implementing OD2FA in TCAM, including dealing with super-states, overlays, and super-state transitions, in the remainder of this section.

6.7.1 Generating Super-state IDs and Codes

For the super-states, we apply the shadow encoding algorithm described in Section 5.2.2.3 to the super-state deferment forest of the given OD2FA. This generates a binary super-state ID SSID(S) and a ternary super-state shadow code SSCD(S) for each super-state S, satisfying the Shadow Encoding Properties (SEP). Figure 6.6(c) shows the SSIDs and SSCDs generated for that OD2FA.

6.7.2 Implementing Super-state Transitions

We now address the implementation of super-state transitions in TCAM. Let (S1, X) --σ--> (S2, o, b) be the super-state transition we want to implement. In the TCAM table, we use SSCD(S1) in the super-state ID column; since we restrict the set of overlays in any super-state transition to a ternary value, we can use X directly in the overlay ID column. For the SRAM, we use SSID(S2) in the super-state ID column, the binary representation of the overlay value o in the overlay ID column, and the offset bit b in the offset bit location.

The RE matching process works as follows. Let S be the current super-state, O the current overlay, and σ the current input character, so s = SSID(S) · O denotes the current state; s concatenated with σ is used as the TCAM lookup key. Let uid be the SSID stored in the super-state ID column of the matching SRAM entry, o the value stored in its overlay ID column, and b the value of its offset bit. We compute the next super-state ID and overlay ID as follows. The next super-state ID is uid, and the next overlay ID is (b × O(s) + o) mod |O|: if b = 0, the next overlay ID is simply o; if b = 1, the next overlay ID is (O(s) + o) mod |O|. In the most common case, b = 1 and o = 0, so the next overlay ID is (O(s) + 0) mod |O| = O(s). For example, consider the OD2FA in Figure 6.6(c). We represent the super-state transition ∆(0₃, {0, 1}, a) = (3₃, 0, 1) as follows: the TCAM super-state ID column is filled with SSCD(0₃) = ∗∗∗, the TCAM overlay ID column is 0∗, the SRAM super-state ID column is filled with SSID(3₃) = 011, the overlay ID column is filled with 0, and the offset bit is set to 1.
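The next-state computation reduces to a couple of arithmetic operations (illustrative sketch; numOverlays is |O|, a power of 2, so the mod is a bit mask):

    #include <cstdint>

    struct NextState { uint32_t superId; uint32_t overlay; };

    // Apply one SRAM decision (uid, o, b) to the current overlay, as
    // described above.
    NextState ApplyDecision(uint32_t curOverlay, uint32_t uid, uint32_t o,
                            uint32_t b, uint32_t numOverlays) {
      // b = 0: next overlay is o.  b = 1: next overlay is (curOverlay + o) mod |O|.
      uint32_t overlay = (b * curOverlay + o) & (numOverlays - 1);
      return NextState{uid, overlay};
    }
    // E.g., with |O| = 4: b = 1, o = 0 keeps the current overlay, while
    // b = 0 jumps unconditionally to overlay o.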
6.7.3 TCAM Table Generation

We now explain how we generate the TCAM entries for an OD2FA, one super-state at a time. Say S is the current super-state. We use the overlay classifiers of super-state S to generate its TCAM rules: for each character for which S has an overlay classifier, we add a TCAM entry for each rule in the overlay classifier, as described in the previous section. After building this initial TCAM table for S, we reduce the TCAM entries as follows. We apply the bit merging algorithm explained in Section 6.5.3.2 to the TCAM entries generated for the super-state. The predicate of each rule corresponding to the TCAM entries has three parts: the current super-state code SSCD(S), the overlay set X, and the current input character. The SSCD(S) part is the same in all TCAM rules for S, and the bit merging algorithm was already applied to the overlay field while building the overlay classifiers, so we cannot merge TCAM rules using any bits from these two fields. However, we can merge rules based on the current input character field. This mostly helps with case-insensitive searches, where transitions on the alphabet characters mostly occur in pairs, and such pairs can be merged because they differ in only one bit of their ASCII encoding.

We order the TCAM tables of the super-states according to the super-state deferment relationship (every super-state's table occurs before its deferred super-state's table). The overlay classifiers for the root super-state exclude all the self-looping transitions; all of these transitions are handled by the last rule added to the TCAM, which is all ∗s.

Figure 6.14 shows the final TCAM and SRAM tables for the OD2FA in Figure 6.6 and, for comparison purposes, the TCAM and SRAM tables generated by the RegCAM algorithm for the same RE set {/ab[^n]*pq/, /cd[^n]*pr/}.

Figure 6.14: TCAM rules for RegCAM and OD2FA.

6.7.4 Variable Striding

In this section, we describe how we adapt the technique of variable striding, introduced in Section 5.4, for use with OD2FA. We first explain the basic idea of variable striding in a DFA. Creating a full k-stride DFA leads to space explosion for two reasons. First, each state in a k-stride DFA has |Σ|^k transitions, which leads to transition explosion. Second, any time a k-stride transition passes through an accepting state, we might need to create multiple copies of the destination state in order to record the matching, which leads to state explosion. A k-var-stride DFA handles both of these problems by generating variable-stride transitions (of stride between 1 and k); the transition decision stores the stride length of the transition along with the destination state. The problem of transition explosion is managed by selectively extending the stride of a limited number of transitions. The problem of state explosion is eliminated by never extending a transition past an accepting state. There are two implementations of variable striding that we considered in Section 5.4: self-loop unrolling and full variable striding.

6.7.4.1 Self-loop Unrolling

The self-loop unrolling technique for the OD2FA works in the same way as for the D2FA, as presented in Section 5.4. The basic idea behind self-loop unrolling is as follows. The last rule in the TCAM table for the root super-state is always the self-loop rule, which handles all the self-looping transitions for all the states in the root super-state.
For example, consider the TCAM table for the root super-state (0) in Figure 6.14, which is also shown in Figure 6.15(a), and consider the lookup when the next two input characters are 'xa' and 0 is the current super-state. On the first input character, x, we match the last self-loop rule, which indicates that after processing the current character we return to the same state. We can replace the last self-loop rule with another copy of super-state 0's TCAM table, with the input character over the second stride and ∗s in the first stride; this is shown in Figure 6.15(b), with the second copy of the rules marked as stride-2. If we now do a lookup for 'xa', we match the first stride-2 rule. Thus, instead of performing two lookups in the 1-stride table, we get the same decision by performing one lookup in the unrolled 2-stride table. If we unroll the self-loop rule at the end of the second copy of the TCAM rules one more time, we get the table shown in Figure 6.15(b). We can further unroll the self-loop rule to extend to a k-stride table. If the 1-stride TCAM table has n rules, the self-loop unrolled k-stride table has only (n − 1)k + 1 rules; for example, the 6-rule 1-stride table in Figure 6.15(a) unrolls to (6 − 1) × 3 + 1 = 16 rules in the 3-var-stride table in Figure 6.15(b).

Figure 6.15: Root super-state self-loop unrolling example for the TCAM rules in Figure 6.14. (a) 1-stride table for super-state 0. (b) Super-state 0 table unrolled to 3-var-stride.

6.7.4.2 Full Variable Striding

Adapting the full variable striding technique to the OD2FA is more challenging. The k-var-stride transition sharing algorithm presented in Section 5.4 generates k-var-stride tables that correctly handle state deferment in the D2FA. What we mean by this is the following. Suppose s1 is the current state and it defers to state s2. If we look up a character and match a rule from state s2's TCAM table giving the next state s3, then state s1 also transitions to state s3 on the same input. In general, a match found in the TCAM table of an ancestor of s1 when doing a lookup for s1 is always correct.

We cannot extend the k-var-stride transition sharing algorithm to OD2FA to generate tables that correctly handle deferment. The difficulty arises from the following: in an OD2FA, each super-state has multiple states, and on the same input, different states in the same super-state might transition to states in different super-states. Thus, we propose an alternate technique to generate variable-stride tables. For each super-state S, we generate a k-var-stride table in addition to its 1-stride table. When the k-var-stride table is implemented in TCAM, we use SSID(S) instead of SSCD(S) in the current super-state column of the TCAM; that way, the k-var-stride rules of super-state S only match when doing a lookup for S itself, and never match when doing a lookup for any other super-state.
So the k-var-stride rules only have to be correct for S. The k-var-stride table for S is placed just before its 1-stride table in TCAM, so the k-var-stride rules take priority over the 1-stride rules.

We now explain our algorithm to generate the k-var-stride table for a super-state. We define the variable stride transition function as Γ : S × 2^O × (∪_{1≤i≤k} Σ^i) → S × [0..|O|) × {0, 1}, which is the same as ∆ except that Γ transitions over a string of between 1 and k characters. Let S be the super-state for which we are generating the k-var-stride transitions. For each 1-stride transition of super-state S, we build k-var-stride transitions by extending the transitions of its destination super-state with that transition in two ways: first by composing with the destination's k-var-stride table, then by composing with the destination's 1-stride table. More specifically, let (S, X) −σ→ (S1, o1, 1) ∈ ∆ be any 1-stride transition for S such that S < S1 and M(S1) = ∅. We add the condition S < S1 because we only want to extend forward transitions, and this condition is true for most forward transitions. We add the condition M(S1) = ∅ because we stop a variable stride transition at matching super-states. If we have not already built the k-var-stride transition table for super-state S1, we recursively build it first. Then we first extend the transitions in the k-var-stride table of S1: for each transition (S1, Y) −w→ (S2, o2, 1) in the k-var-stride transition table of S1, if |X ∩ Y| is large enough and len(w) < k, we add the extended transition (S, X ∩ Y) −σ.w→ (S2, (o1 + o2) mod |O|, 1) to the k-var-stride transition table for S. Next we extend the transitions in the 1-stride table of S1: for each transition (S1, Y) −σ2→ (S2, o2, 1) in the 1-stride transition table of S1, if |X ∩ Y| is large enough, we add the extended transition (S, X ∩ Y) −σ.σ2→ (S2, (o1 + o2) mod |O|, 1) to the k-var-stride transition table for S. In our experiments, we use the condition |X ∩ Y| ≥ min(|X|, |Y|)/4 as the measure of "large enough". When we extend one transition by another, the extended transition can only cover the overlays that are common to both initial transitions. Ideally, we would like both transitions to cover the exact same set of overlays (in most cases this is true). But even when the overlay sets differ, if the size of the intersection is significant compared to the number of overlays covered by the two initial transitions, it is worthwhile to add the extended transition. We do not extend 1-stride transitions on whitespace characters. We have found experimentally that extending 1-stride transitions on these characters significantly increases the number of TCAM rules while only marginally (if at all) increasing the average stride.

Figure 6.16 shows the k-var-stride transition table built for super-state 0 from the 1-stride transition tables in Figure 6.9.

[Figure 6.16: Variable stride transitions generated for super-state 0 from the 1-stride transitions in Figure 6.9. Each row pairs a 1-stride rule of super-state 0 with a rule of the next super-state and shows the resulting extended var-stride rule, yielding 2-stride rules for the strings ab, cd, pq, and pr.]

The pseudo-code of our algorithm for building the k-var-stride transition tables is shown in Figure 6.17.

    Input: OD2FA D = (Q, Σ, q0, F, S, O, M, ∆).
    Output: Builds multi-stride transitions (Γ) for D.

    foreach Si ∈ S do Built[Si] ← False;
    foreach Si ∈ S do
        if Built[Si] = False then BuildVarStrideTrans(Si);

    Function BuildVarStrideTrans(S):
        foreach offset transition (S, X) −σ→ (Si, o, 1) ∈ ∆ for super-state S do
            if Si ≤ S then continue;        // skip backward transitions
            if M(Si) ≠ ∅ then continue;     // stop at accepting super-states
            if Built[Si] = False then BuildVarStrideTrans(Si);
            // extend the var-stride transitions of the destination super-state
            foreach transition (Si, Y) −w→ (Sj, o2, 1) ∈ Γ for super-state Si do
                if |X ∩ Y| ≥ min(|X|, |Y|)/4 and len(w) < k then  // stride limit not reached
                    add transition (S, X ∩ Y) −σ.w→ (Sj, (o + o2) mod |O|, 1) to Γ;
            // extend the 1-stride transitions of the destination super-state
            foreach offset transition (Si, Y) −σ2→ (Sj, o2, 1) ∈ ∆ for super-state Si do
                if |X ∩ Y| ≥ min(|X|, |Y|)/4 then
                    add transition (S, X ∩ Y) −σ.σ2→ (Sj, (o + o2) mod |O|, 1) to Γ;
        Built[S] ← True;

Figure 6.17: Algorithm BuildVarStrideOD2FA(D) to build k-var-stride rules.
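To make the composition step concrete, here is a minimal C++ sketch of the core extension test, under the |X ∩ Y| ≥ min(|X|, |Y|)/4 heuristic described above. The type and function names (OverlaySet, VarTrans, extend) are illustrative, and the {0,1} offset-value bit of Γ's decision is elided for brevity.

    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <string>

    using OverlaySet = std::set<int>;

    // A variable-stride transition, mirroring Gamma in the text: covered
    // overlays X, input string w (1..k characters), destination
    // super-state, and overlay offset (taken mod |O|).
    struct VarTrans {
        OverlaySet X;
        std::string w;
        int dest;
        int offset;
    };

    static OverlaySet intersect(const OverlaySet& a, const OverlaySet& b) {
        OverlaySet r;
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::inserter(r, r.begin()));
        return r;
    }

    // Try to extend the 1-stride transition (S, X) -sigma-> (S1, o1, 1)
    // by a transition t2 already in S1's table. Succeeds only if the
    // overlay intersection is "large enough" and the stride limit k is
    // respected.
    bool extend(const OverlaySet& X, char sigma, int o1,
                const VarTrans& t2, int numOverlays, int k, VarTrans& out) {
        OverlaySet common = intersect(X, t2.X);
        // |X ∩ Y| >= min(|X|, |Y|)/4, written without floating point.
        if (4 * common.size() < std::min(X.size(), t2.X.size())) return false;
        if (1 + t2.w.size() > static_cast<std::size_t>(k)) return false;
        out.X = std::move(common);
        out.w = sigma + t2.w;                        // stride grows by one
        out.dest = t2.dest;
        out.offset = (o1 + t2.offset) % numOverlays; // compose overlay offsets
        return true;
    }

Extending by a 1-stride transition of S1 is simply the special case where t2.w is a single character; the recursive construction in Figure 6.17 determines when extend is attempted.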
6.8 Experimental Results

We implemented OverlayCAM using C++ and conducted experiments to evaluate its effectiveness and scalability. We verify our results by confirming that the TCAM table generated by OverlayCAM is equivalent to the original DFA. That is, for every pair of current state and input character, the next state returned by the TCAM lookup matches the next state returned by the DFA.

6.8.1 Effectiveness of OverlayCAM

We use the same 8 RE sets used in Section 4.5 for the main results. We define the following metric for measuring the amount of state replication in the DFA that corresponds to an RE set. For any RE set R, we define SR(R) to be the number of states in the minimum state DFA corresponding to R divided by the number of states in the standard NFA (without ε-transitions) corresponding to R. Based on the characteristics of the REs, these eight sets are partitioned into three groups: STRING = {C613, Bro217}, which contains mostly strings, causing little state replication (SR(Bro217) = 3.0, SR(C613) = 2.1); WILDCARD = {C7, C8, C10}, which contains multiple wildcard closures '.*', causing lots of state replication (SR(C7) = 231, SR(C8) = 43, and SR(C10) = 162); and SNORT = {Snort24, Snort31, Snort34}, which contains a diverse set of REs, roughly 40% of which have wildcard closures, causing moderate state replication (SR(Snort24) = 24, SR(Snort31) = 22, and SR(Snort34) = 16).

We conducted side-by-side comparisons with RegCAM-TC (RegCAM without Table Consolidation) and RegCAM+TC (RegCAM with Table Consolidation) on all 8 real-world RE sets. For RegCAM+TC, we consolidated 4 tables together. The results are shown in Table 6.1. For TCAM space, we report only the number of TCAM entries, because the TCAM width is the same for all TCAM tables generated by RegCAM-TC, RegCAM+TC, and OverlayCAM on all 8 RE sets. Since TCAM width typically can only be configured as 36, 72, or 144 bits, we use a TCAM width of 36 bits in all cases.
(In the three column groups below, the sub-columns are, in order: RegCAM-TC, RegCAM+TC, OverlayCAM.)

    RE set   #NFA     SR     #NFA   #Over- #Super-  #TCAM entries            SRAM size (Kb)           Throughput (Gbps)
             states          trans. lays   states
    C8          72   43.17    2177    72      85     3722   1012    125      47.25   51.39    1.83    5.44  8.51  12.50
    C10         92  161.61    2982   288     133    17824   4739    263     261.09  277.68    4.62    3.11  4.35  12.12
    C7         107  231.31    3261   648     127    29196   8315    234     456.19  519.69    4.57    3.11  3.64  12.31
    Snort24    575   24.15    4054    30     897    16130   5310   1426     236.28  331.88   26.46    3.64  4.35   7.27
    Snort34    891   15.52    4731    48    1151    16297   5026   2293     238.73  294.49   42.55    3.64  4.35   5.44
    Snort31    917   21.88    5738    32    2395    41539  14464   9478     689.61  960.50  185.12    2.72  3.64   3.64
    Bro217    2132    3.06    5424     2    3401     9143   5087   6028     133.93  317.94   88.30    3.64  4.35   4.35
    C613      5343    2.12   14563     1   11308    18256  13182  18256     320.91  978.35  338.73    3.11  3.64   3.11

Table 6.1: Experimental results of OverlayCAM on 8 RE sets in comparison with RegCAM-TC and RegCAM+TC.

TCAM lookup speed is typically higher for smaller TCAM chips. We use the TCAM model discussed in Section 5.5 to calculate RE matching throughput.

For the two string-based RE sets Bro217 and C613, we observe that OverlayCAM does not significantly outperform the two RegCAM algorithms. This is expected, as OverlayCAM is designed to handle state replication, and string-based RE sets have little state replication. For the other RE sets, OverlayCAM significantly outperforms RegCAM and often outperforms NFAs. (1) OverlayCAM uses orders of magnitude less TCAM and SRAM than RegCAM. On average, OverlayCAM uses 41 times less TCAM and 33 times less SRAM than RegCAM-TC, and 12 times less TCAM and 38 times less SRAM than RegCAM+TC. (2) OverlayCAM has significantly higher throughput than RegCAM. On average, OverlayCAM has 2.5 and 1.93 times higher throughput than RegCAM-TC and RegCAM+TC, respectively. (3) The total number of TCAM entries used by OverlayCAM is often (far) smaller than the total number of NFA transitions. For C7, OverlayCAM's number of TCAM entries is 14 times less than the number of NFA transitions.

We now describe why OverlayCAM performs so well. (4) OverlayCAM is very effective in conquering state replication. OverlayCAM effectively and automatically identifies all NFA state replicates and groups them together into super-states. The number of super-states is, on average, 1.55 times the number of NFA states, and is never more than 2.61 times the number of NFA states. Because of this, the larger SR(R) is, the more that OverlayCAM outperforms RegCAM. For C7, OverlayCAM uses 125 times less TCAM and 100 times less SRAM than RegCAM-TC, and 36 times less TCAM and 114 times less SRAM than RegCAM+TC. (5) OverlayCAM effectively multiplies the compression benefits of conquering state replication and transition sharing. That is, OverlayCAM effectively multiplies the benefits of ODFA and D2FA. The average number of TCAM entries per super-state is only 2.14, even when super-states have hundreds of constituent states.

We wanted to conduct a side-by-side comparison with Peng et al.'s scheme [38]; however, we do not have access to their code. Fortunately, Peng et al. have reported their results on the two public RE sets Snort24 and Snort34. For these two sets, OverlayCAM requires 2.15 and 1.44 times less TCAM and SRAM space, respectively.

6.8.2 Results on 7-var-stride

We now compare the results of applying the variable striding technique with k = 7 on OverlayCAM with the results for RegCAM-TC. We compare the average stride values achieved, using the same traces that were used for the experiments in Section 5.6.3, as well as the number of TCAM rules. We only compare using the RE sets in the WILDCARD and SNORT groups, since the RE sets in the STRING group have no (or limited) state replication.
6.8.2.1 Self-loop Unrolling

The root states in both RegCAM-TC and OverlayCAM are exactly the same, since the self-looping states are selected as the root states. As a result, the resulting TCAM rules after unrolling the root states are semantically equivalent. Hence we get the exact same average stride values for both algorithms (shown in Table 6.3). Table 6.2 shows the number of TCAM rules required without self-loop unrolling (i.e., for 1-stride) and with self-loop unrolling for both algorithms.

    RE set      RegCAM-TC                         OverlayCAM
                1-stride   Unroll  7-var-stride   1-stride   Unroll  7-var-stride
    C8             3722     7794       8192           125      310        814
    C10           17824    36336      65536           263      590       1113
    C7            29196    64356      65536           234      442       1381
    Snort24       16130    18627      32768          1426     1482       6942
    Snort34       16297    19825      32768          2293     2577       9654
    Snort31       41539    43920      65536          9478     9819      32243

Table 6.2: Number of TCAM rules for RegCAM-TC and OverlayCAM for 1-stride, with self-loop unrolling, and with 7-var-stride.

Compared to RegCAM-TC, OverlayCAM requires on average 77 times fewer TCAM rules for the WILDCARD group and 8 times fewer TCAM rules for the SNORT group. The average percentage increase in the number of TCAM rules resulting from unrolling the roots for the SNORT group is 14.3% for RegCAM-TC but only 6.6% for OverlayCAM. This is because in RegCAM-TC there are many root states that are unrolled, whereas in OverlayCAM there is only one root super-state that is unrolled.

6.8.2.2 Full Variable Striding

Table 6.2 shows the number of TCAM rules required for full variable striding, and Table 6.3 shows the average stride values for RegCAM-TC and OverlayCAM. As we can see, OverlayCAM requires many fewer TCAM rules than RegCAM-TC. On average, OverlayCAM requires 38.8 times fewer rules for the WILDCARD group and 3.4 times fewer TCAM rules for the SNORT group.

    RE set     Self-loop unroll     7-var-stride
                                    RegCAM-TC         OverlayCAM
                 0    50    95       0    50    95     0    50    95
    C8          6.1   2.9   1.8     6.1   4.1   2.9   6.1   3.8   3.7
    C10         5.9   3.4   1.9     6.0   4.5   3.2   5.9   4.1   3.6
    C7          6.1   1.9   1.8     6.1   3.7   3.8   6.1   2.7   3.8
    Snort24     5.6   1.7   1.1     5.7   2.9   3.6   5.6   2.4   4.0
    Snort34     5.9   1.7   1.1     5.9   3.4   3.7   5.9   2.5   4.1
    Snort31     6.1   1.7   1.1     6.2   2.8   2.3   6.1   2.3   2.9

Table 6.3: Average stride values for self-loop unrolling and 7-var-stride for RegCAM-TC and OverlayCAM for pM = 0, 50, and 95.

In general, OverlayCAM is able to achieve nearly the same average stride values as RegCAM-TC. For random traffic (pM = 0), OverlayCAM has nearly identical average stride values to RegCAM-TC. This is because with random traffic, most of the transitions taken are self-loops around the root state, which are unrolled to 7-stride in both algorithms. For pM = 95, OverlayCAM achieves equal or higher average stride values than RegCAM-TC for all the RE sets. This is because with pM = 95, most of the transitions taken are forward transitions, and OverlayCAM is able to selectively combine longer chains of forward transitions into higher stride transitions than RegCAM-TC. The average ratio of the stride values across all RE sets and pM values is only 1.09.

6.8.3 Scalability of OverlayCAM

We evaluated the scalability of OverlayCAM on synthetic RE sets constructed by adding, one at a time, REs from a set of 13 REs taken from a recent release of the Snort rules. Each RE contains a closure on a wildcard or a range; these closures cause the DFA size to double as each RE is added. The final DFA has 225,040 states. We first define the TCAM Expansion Factor (TEF) of an RE set to be the number of TCAM entries divided by the number of NFA transitions.
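As a quick numeric illustration of this metric (the function name below is illustrative, and the example numbers come from the C7 row of Table 6.1):

    #include <cstdio>

    // TEF = #TCAM entries / #NFA transitions, as defined above. A TEF
    // well below 1 means the TCAM table is smaller than the NFA itself.
    double tcamExpansionFactor(int tcamEntries, int nfaTransitions) {
        return static_cast<double>(tcamEntries) / nfaTransitions;
    }

    int main() {
        // C7 (Table 6.1): 234 OverlayCAM TCAM entries, 3261 NFA transitions.
        double tef = tcamExpansionFactor(234, 3261);
        std::printf("TEF(C7) = %.3f (about 1/14)\n", tef);  // prints ~0.072
        return 0;
    }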
In Figure 6.18(a), we plot the TEF for RegCAM-TC, RegCAM+TC, and OverlayCAM. We omit the first 5 data points because the corresponding 5 DFAs are too small. As expected, the TEF of the RegCAM algorithms grows exponentially with the number of NFA states due to state replication. In contrast, the TEF of OverlayCAM grows linearly, and very slowly, with the number of NFA states. We next define the Super-state Expansion Factor (SEF) of an RE set to be the number of super-states divided by the number of NFA states. Figure 6.18(b) shows that the SEF of OverlayCAM also grows linearly and slowly with the number of NFA states. Note that for any RE set, the number of NFA states is a lower bound on the size of any equivalent automaton.

[Figure 6.18: (a) TEF vs. # NFA states for OverlayCAM, RegCAM-TC, and RegCAM+TC; (b) SEF vs. # NFA states for OverlayCAM.]

Chapter 7

Conclusion

In this dissertation, we consider the problem of RE matching in DPI for networking applications. We survey current solutions for RE matching for DPI and identify their limitations. We then develop several techniques and algorithms for fast and efficient RE matching.

For a software solution to RE matching, we use an existing automata model, the D2FA. We propose a novel Minimize then Union framework and develop efficient algorithms for building D2FAs based on this framework. Our approach requires a fraction of the memory and time required by current algorithms. This allows us to build much larger D2FAs than was possible with previous techniques. Our algorithm naturally supports frequent RE set updates. We conducted experiments on real-world and synthetic RE sets that verify our claims. For example, our algorithm requires an average of 1400 times less memory and 300 times less time than the original D2FA construction algorithm of Kumar et al. We believe our Minimize then Union framework can be incorporated with other alternative automata for RE matching.

We propose the first TCAM-based RE matching solution. We demonstrate that this unexplored direction works very well for RE matching. We implemented our techniques and conducted experiments on real-world RE sets. We show that small TCAMs are capable of storing large DFAs. For example, in our experiments, we were able to store a DFA with 25K states in a 0.5 Mb TCAM chip. We also develop multi-striding techniques to increase matching throughput without significantly increasing the memory requirement. We are able to achieve a matching throughput of nearly 20 Gbps.

The D2FA and our TCAM-based solution only partially handle the problem of state replication in a DFA. We propose a new overlay automata model called the OD2FA, which fully exploits state replication in a DFA. We develop algorithms for efficiently constructing the OD2FA. We also develop techniques to implement the OD2FA in software and in hardware using TCAMs. Our experiments indicate that the OD2FA is able to effectively manage state replication. This results in a memory requirement proportional to that of an NFA while maintaining fast and deterministic matching throughput like that of a DFA.

APPENDICES

Glossary

character redundancy: Redundant/shared transitions within a state.

deferment forest: The directed graph with states as vertices and edges given by the deferment relation F.
deferment pointer: The deferred state, F(s), of a state s.

deferment tree: A tree (connected component) in the deferment forest.

self-looping state: A state with more than |Σ|/2 of its transitions looping back to itself.

state redundancy: Redundant/shared transitions between two states.

state replication: Multiple replications of the same NFA state in a DFA when the DFAs for two REs are combined.

transition sharing: Multiple transitions, within a state or between different states, going to the same next state.

Acronyms

D2FA: Delayed Input DFA.
DFA: Deterministic Finite state Automata.
DPI: Deep Packet Inspection.
NFA: Nondeterministic Finite state Automata.
OD2FA: Overlay Delayed Input DFA.
ODFA: Overlay Deterministic Finite state Automata.
RE: Regular Expression.
SEP: Shadow Encoding Properties.
SRG: Space Reduction Graph.
TCAM: Ternary Content Addressable Memory.

Notation

D: A DFA/D2FA.
D: An ODFA/OD2FA.
Q: Set of states in the DFA/D2FA/ODFA/OD2FA.
Σ: The input alphabet.
S: The set of super-states in an ODFA/OD2FA.
O: The set of overlays in an ODFA/OD2FA.
s, q, u: A DFA/D2FA/ODFA/OD2FA state.
S: An ODFA/OD2FA super-state.
O: An ODFA/OD2FA overlay.
X: A set of overlays in an ODFA/OD2FA.
M(s): Set of REs accepted by state s.
M(S): Set of REs accepted by all states in super-state S.
F(s): Deferred state of state s.
F(S): Deferred super-state of super-state S.
u → v: State u defers to state v.
u ⇝ v: State u is a descendant of state v.
⊥: NULL state/empty location.
δ(s, σ): The state transition function for a DFA.
ρ(s, σ): Partial state transition function for a D2FA.
∆(S, X, σ): Super-state transition function for an ODFA/OD2FA.
ρ′(s, σ): Partial state transition function derived from ∆ for an OD2FA.
δ′(s, σ): Total transition function derived from ρ for a D2FA.
δ′(s, σ): Total transition function derived from ∆ (ρ′) for an ODFA (OD2FA).

BIBLIOGRAPHY

[1] Application layer packet classifier for Linux. http://l7-filter.clearfoundation.com/.

[2] Snort. http://www.snort.org/.

[3] B. Agrawal and T. Sherwood. Modeling TCAM power for next generation network devices. In Proc. IEEE Int. Symposium on Performance Analysis of Systems and Software, pages 120–129, 2006.

[4] A. V. Aho and M. J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.

[5] M. Alicherry, M. Muthuprasanna, and V. Kumar. High speed pattern matching for network IDS/IPS. In Proc. 2006 IEEE International Conference on Network Protocols, pages 187–196. IEEE, 2006.

[6] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proc. INFOCOM. IEEE, 2007.

[7] M. Becchi and P. Crowley. A hybrid finite automaton for practical deep packet inspection. In Proc. ACM Int. Conf. on emerging Networking EXperiments and Technologies (CoNEXT). ACM Press, 2007.

[8] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proc. ACM/IEEE ANCS, 2007.

[9] M. Becchi and P. Crowley. Efficient regular expression evaluation: Theory to practice. In Proc. ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), pages 50–59, 2008.

[10] M. Becchi and P. Crowley.
Extending finite automata to efficiently match Perl-compatible regular expressions. In Proc. ACM Int. Conf. on emerging Networking EXperiments and Technologies (CoNEXT). ACM Press, 2008. Article Number 25.

[11] M. Becchi, M. Franklin, and P. Crowley. A workload for evaluating deep packet inspection architectures. In Proc. IEEE IISWC, 2008.

[12] A. Bremler-Barr, D. Hay, and Y. Koral. CompactDFA: Generic state machine compression for scalable pattern matching. In Proc. IEEE INFOCOM, pages 1–9. IEEE, 2010.

[13] B. C. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular-expression pattern matching. SIGARCH Computer Architecture News, 2006.

[14] C. R. Clark and D. E. Schimmel. Efficient reconfigurable logic circuits for matching complex network intrusion detection patterns. In Proc. Field-Programmable Logic and Applications, pages 956–959, 2003.

[15] C. R. Clark and D. E. Schimmel. Scalable pattern matching for high speed networks. In Proc. 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Washington, DC, 2004.

[16] J. Edmonds. Paths, trees, and flowers. Canad. J. Math., 17:449–467, 1965.

[17] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. D. Pietro. An improved DFA for fast regular expression matching. Computer Communication Review, 38(5):29–40, 2008.

[18] H. N. Gabow. An efficient implementation of Edmonds' algorithm for maximum matching on graphs. J. ACM, 23:221–234, April 1976.

[19] J. E. Hopcroft. The Theory of Machines and Computations, chapter An n log n algorithm for minimizing the states in a finite automaton, pages 189–196. Academic Press, 1971.

[20] J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 2000.

[21] D. E. Knuth. Huffman's algorithm via algebra. Journal of Combinatorial Theory, Series A, 32(2):216–224, 1982.

[22] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proc. 4th Int. Conf. on Security and Privacy in Communication Networks (SecureComm), page 1. ACM Press, 2008.

[23] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. American Mathematical Society, 7:48–50, 1956.

[24] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.

[25] S. Kumar, B. Chandrasekaran, J. Turner, and G. Varghese. Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia. In Proc. ACM/IEEE ANCS, pages 155–164, 2007.

[26] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proc. SIGCOMM, pages 339–350, 2006.

[27] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proc. IEEE/ACM ANCS, pages 81–92, 2006.

[28] T. Liu, Y. Yang, Y. Liu, Y. Sun, and L. Guo. An efficient regular expressions compression algorithm from a new perspective. In Proc. IEEE INFOCOM, pages 2129–2137, 2011.

[29] Y. Liu, L. Guo, M. Guo, and P. Liu. Accelerating DFA construction by hierarchical merging. In Proc. IEEE 9th Int. Symposium on Parallel and Distributed Processing with Applications, 2011.

[30] C. R. Meiners, A. X. Liu, and E. Torng. TCAM Razor: A systematic approach towards minimizing packet classifiers in TCAMs. In Proc. 15th IEEE Conf. on Network Protocols (ICNP), pages 266–275, October 2007.

[31] C. R. Meiners, A. X. Liu, and E.
Torng. Bit weaving: A non-prefix approach to compressing packet classifiers in TCAMs. In Proc. 17th IEEE Conf. on Network Protocols (ICNP), October 2009.

[32] C. R. Meiners, J. Patel, E. Norige, E. Torng, and A. X. Liu. Fast regular expression matching using small TCAMs for network intrusion detection and prevention systems. In Proc. 19th USENIX Security Symposium (USENIX Security), pages 111–126, Washington, DC, August 2010.

[33] A. Mitra, W. Najjar, and L. Bhuyan. Compiling PCRE to FPGA for accelerating SNORT IDS. In Proc. 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS). ACM Press, 2007.

[34] J. Moscola, J. Lockwood, R. P. Loui, and M. Pachos. Implementation of a content-scanning module for an internet firewall. In Proc. 11th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 31–38. IEEE Computer Society, 2003.

[35] J. Munkres. Algorithms for the assignment and transportation problems. Journal of the Society of Industrial and Applied Mathematics, 5(1):32–38, March 1957.

[36] J. Patel, A. X. Liu, and E. Torng. Bypassing space explosion in regular expression matching for network intrusion detection and prevention systems. In Proc. Network and Distributed System Security Symposium (NDSS'12), February 2012.

[37] V. Paxson. Bro: a system for detecting network intruders in real-time. Computer Networks, 31(23-24):2435–2463, 1999.

[38] K. Peng, S. Tang, M. Chen, and Q. Dong. Chain-based DFA deflation for fast and scalable regular expression matching using TCAM. In Proc. ACM ANCS, pages 24–35, 2011.

[39] M. Roesch. Snort: Lightweight intrusion detection for networks. In Proc. 13th Systems Administration Conference (LISA), USENIX Association, pages 229–238, November 1999.

[40] R. Sidhu and V. K. Prasanna. Fast regular expression matching using FPGAs. In Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 227–238, 2001.

[41] R. Smith, C. Estan, and S. Jha. XFA: Faster signature matching with extended automata. In Proc. IEEE Symposium on Security and Privacy, pages 187–201, 2008.

[42] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proc. SIGCOMM, pages 207–218, 2008.

[43] R. Sommer and V. Paxson. Enhancing byte-level network intrusion detection signatures with context. In Proc. 10th ACM Conf. on Computer and Communications Security (CCS), pages 262–271, 2003.

[44] I. Sourdis and D. Pnevmatikatos. Fast, large-scale string match for a 10Gbps FPGA-based network intrusion detection system. In Proc. Int. Conf. on Field Programmable Logic and Applications, pages 880–889, 2003.

[45] I. Sourdis and D. Pnevmatikatos. Pre-decoded CAMs for efficient and high-speed NIDS pattern matching. In Proc. 12th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 258–267. IEEE, 2004.

[46] J.-S. Sung, S.-M. Kang, Y. Lee, T.-G. Kwon, and B.-T. Kim. A multi-gigabit rate deep packet inspection algorithm using TCAM. In Proc. IEEE GLOBECOM, pages 453–457, 2005.

[47] S. Suri, T. Sandholm, and P. Warkhede. Compressing two-dimensional routing tables. Algorithmica, 35:287–300, 2003.

[48] L. Tan and T. Sherwood. A high throughput string matching architecture for intrusion detection and prevention. In Proc. 32nd Annual Int. Symposium on Computer Architecture (ISCA), pages 112–122, 2005.

[49] N. Tuck, T. Sherwood, B. Calder, and G. Varghese. Deterministic memory-efficient string matching algorithms for intrusion detection.
In Proc. IEEE INFOCOM, pages 333–340, 2004.

[50] L. Yang, R. Karim, V. Ganapathy, and R. Smith. Fast, memory-efficient regular expression matching with NFA-OBDDs. Computer Networks, 55(55):3376–3393, 2011.

[51] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proc. ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS), pages 93–102, 2006.

[52] F. Yu, R. H. Katz, and T. V. Lakshman. Gigabit rate packet pattern-matching using TCAM. In Proc. 12th IEEE Int. Conf. on Network Protocols (ICNP), pages 174–183, 2004.